The new polyglots
Choosing the right language for your data science capability
Thursday, 4 October 2018
- The choice of language(s) and technology platform is a key decision for organisations as they develop their data science resources and now begin to build out AI capabilities
- This will impact not only the data science function’s success within the organisation, but also its ability to hire the best talent
- The community around open-source data science languages is a key resource for capability building, recruitment, and even business development
How many languages do you speak? English? French? Maybe whatever you happen to remember from GCSE German? What about R? SQL? Python? Julia? Perl or Java? Let me know if I am speaking gobbledygook.
With 80% of medium and large businesses in the UK reporting plans to hire a data scientist in 2019, people who are fluent in communication, analytics, technology and business will be the ones best able to steer their organisations.
The last few decades have seen massive growth in the range of programming languages, many of them open-source, each suited to particular use cases. Java is used for web and application development, R for statistical analysis, SQL for database management, Scala for distributed computing and Python for almost everything. While many of these languages were previously confined to academia, in recent times they have entered the mainstream of business with a bang, and boy have they!
Like traditional languages, coding languages have grammar and syntax, transmit meaning, and yes... can even be beautiful. There is nothing like a well-timed "for loop", expertly crafted and executed with flair, to bring a tear to the eye. Jokes aside, the choice of coding language can have very real consequences, both for the business and the data science professional.
For instance, the choice of coding language affects the effectiveness of the team and its ability to hire data scientists. I recently worked with a client who was unable to move from an enterprise to an open-source data science platform, due to ill-founded concerns around data security. As a result, they faced significant barriers to hiring data scientists into their team. For a start, enterprise languages are difficult to learn outside an enterprise environment, while a wealth of resources has sprung up to teach aspiring data scientists the principles of a multitude of open-source languages. The pool of talent is therefore constrained. Some of these enterprise languages also have less of a buzz than, say, R or Python. They are not seen as equally “cool” and, from a data scientist's perspective, offer less attractive career prospects.
One of the most amazing things about open-source languages is the rich community of developers and users that forms around them. Any data scientist who has had to rely on Stack Overflow at 11pm the night before an important deadline will be able to attest to this. In many cities around the world there are meetups for those passionate about data science. Furthermore, the lack of open-source package development for enterprise languages means their functionality for certain purposes is limited. Support for advanced machine learning in R and Python is streets ahead of certain enterprise languages.
So R and Python are the tools of choice for the majority of data scientists. But how to choose between the two? Just as you wouldn't stroll into a French restaurant in a trendy arrondissement of Paris and start speaking Dutch to the waiters, the choice of data science language depends on what you intend to use it for. If you decide to build a website in R, you are likely to end up pulling your hair out. Python, which is more suited to full-stack development, allows you to program back-end servers, build machine learning models, program graphical user interfaces and create fully fledged web applications. Examples of businesses built on Python include YouTube, Spotify, Netflix and Dropbox. R, which was developed solely as a tool for statistical analysis but has since acquired more and more sophisticated capabilities, is still more suited to number crunching, visualisation and advanced analytics.
We are seeing a lot of buzz in the media and business community about AI. While a consensus definition of AI is yet to emerge, many people’s understanding of an AI system is a computer program built using machine learning techniques, which is not only able to learn, but also to take action without human intervention. In this regard, Python is more suited to AI development than R, as its full-stack characteristics mean it is better able to interact with enterprise systems. Effectively, Python may offer a smoother ramp-up to developing a true AI capability.
With a slightly steeper learning curve and a broader range of functionality, Python also appears to be increasingly seen as the more advanced option for data scientists. But when it comes to the ability to hit the ground running and conduct rapid data exploration, statistical analysis and machine learning, R still outpaces the competition.
People who are able to speak data as well as the business side of things are often described as "unicorns", that rare mythical creature that can do it all. They are said to be experts in many traditionally distinct disciplines, combining commercial insight with mathematics, statistics, computer science, artificial intelligence, and more. Perhaps we will soon see the rise of the "super-unicorn", who grew up in France, can speak R, Python and Hadoop, throw a presentation together at short notice and fly across the globe to deliver it in fluent Chinese. Time will tell.
Jos van der Boom, Consultant