Saturday 22 Sep, 2018

Starting at 9am

Etc Venues

One Drummond Gate, Victoria, London SW1V 2QQ

Choosing between R and Python: A Digital Analyst’s Guide


“R or Python? That would be an ecumenical matter!” was the original title of a past data meetup in Dublin where the topic was debated (if you are not familiar with Father Ted, it’s never too late).

Apparently making the choice between R and Python is not the most straightforward decision. A web search will return a huge number of articles trying to answer which one is better or which one to learn first. After examining facts and figures about each of the two, however, the typical conclusion of those articles is one of the following…

  • It doesn’t matter which one to learn – both languages are great
  • Why not learn both? – they will be handy sooner or later
  • We won’t explicitly recommend one – it’s ultimately up to you to choose

In other words, there is no clear-cut, one-size-fits-all answer, and that’s actually to be expected. Think about it: the practical applications can range from classification of medical images, to software development for self-driving cars, to time series forecasting for key business metrics. To make things simpler, this blog post will look at the question exclusively from the perspective of a digital analyst and will consider the workflows and types of tasks that are typically involved in this field. Of course, digital analysts can serve different roles, so we will look at a couple of different scenarios.

R and Python in digital analytics

Disclosure: I learnt programming with Python. When I started working in digital analytics, I soon switched to R, which has been my primary programming language ever since. I still enjoy using Python and I make sure to keep up to date with developments in the language.

First of all, let’s reduce any unnecessary stress about potentially failing to choose the “right” language. In the context of digital analytics, the two languages have far more similarities than differences. Essentially, no matter what choice you make, you should not expect to be at a significant advantage or disadvantage. I have done no statistical analysis to support this, but in my experience R and Python offer equivalent functionality for over 90% of the analytical tasks in digital analytics. Take the common task of importing, transforming and exploring data: simply comparing the equivalent R and Python code makes the similarities in logic and expression between the two fairly obvious.
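As a rough illustration of that import, transform and explore workflow, here is a minimal Python sketch. It uses only the standard library so that it runs anywhere; in practice most analysts would reach for pandas in Python or dplyr in R. The data, column names and function names are all made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export of daily sessions and conversions per channel.
RAW_CSV = """date,channel,sessions,conversions
2018-09-01,organic,1200,36
2018-09-01,paid,800,32
2018-09-02,organic,1100,33
2018-09-02,paid,900,27
2018-09-02,email,300,15
"""


def load_rows(text):
    """Import: parse CSV text into a list of dicts with numeric fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["sessions"] = int(row["sessions"])
        row["conversions"] = int(row["conversions"])
        rows.append(row)
    return rows


def conversion_rate_by_channel(rows):
    """Transform: sum sessions and conversions per channel, then derive the rate.

    This is roughly the step an R user would express with dplyr's
    group_by() followed by summarise().
    """
    totals = defaultdict(lambda: {"sessions": 0, "conversions": 0})
    for row in rows:
        totals[row["channel"]]["sessions"] += row["sessions"]
        totals[row["channel"]]["conversions"] += row["conversions"]
    return {
        channel: t["conversions"] / t["sessions"]
        for channel, t in totals.items()
    }


rows = load_rows(RAW_CSV)
rates = conversion_rate_by_channel(rows)

# Explore: list channels ordered by conversion rate, highest first.
for channel, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{channel:<8} {rate:.2%}")
```

The shape of the code (read, group, summarise, sort) is the same whichever language you pick; only the vocabulary changes.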

It’s true that there are some conceptual differences between the two languages, e.g. Python is primarily object-oriented whereas R is primarily a functional programming language. These differences, however, are hardly noticeable in the most common digital analytics tasks.

Advantages of R

In digital analytics much of the analysis is “consumed” by humans, and therefore there is a strong emphasis on the communication, interpretation, visualisation and reporting of the analysis. This plays to R’s strengths. R was developed by statisticians with a natural interest – just like digital analysts – in answering the what, how and why behind the processes that generate data, with an emphasis on interpretability. This is reflected in the way the R language and its libraries approach problems and communicate solutions. R’s visualisation capability, for example, is a favourite among digital and business analysts, as it allows users to create elegant visualisations following the principles of tidy data and the grammar of graphics.

Another advantage is simply that, as a digital analyst who uses R, you can find support, resources and answers faster. I am speaking from my own experience, but I have always found that there is more code and content related to digital analytics written for R – including packages that are specifically developed for marketing analysis.

Is there a reason why the digital analytics community seems to be more geared towards using R? I think this is partly because many digital analysts come from non-technical and non-computer science backgrounds. If all they mean to do is analyse data, these analysts look for a programming environment in which they can get up and running fast, without the need to acquire software development skills first. In this respect R, as a domain-specific language for statistics and data analysis, can offer a smoother transition. It allows a digital analyst to go from zero to completing a first data analysis faster and with fewer dependencies compared to other environments.

Advantages of Python

Python has a growing number of advantages on its side. Even though these advantages might not directly affect digital analytics today, they are still very relevant – and are likely to become even more so in the future.

Python is the primary language when it comes to working with cloud services, data and systems at scale, distributed environments and production environments.

But Python also has an “unfair” advantage over R by virtue of being a so-called “glue” language. Python is used not just by data analysts and data scientists but also by database engineers, web developers, system administrators etc. It has the reputation of being the second best language for… almost anything. This has led many organisations and teams to adopt Python as a common framework that minimises friction and avoids having to translate code from one language to another.

How relevant are the above points for the day-to-day work of a digital analyst today? Probably not very (for most of us anyway), but I think few would disagree that they will likely become much more relevant in the future, as analysts increasingly interact with cloud services, manage larger datasets, work with more interdisciplinary data etc. These are all areas where Python excels.

So, which language should a digital analyst choose?

To answer the question, let’s first assume that everything else is equal. If that’s not the case, if for example you have colleagues, partners or even a local community that can support you in learning language x, then you already have a very strong reason to select that one, regardless of what you’ll read below.

So, with the above assumption in mind, let’s now attempt to address the question. Even though choosing between R and Python is… obviously an ecumenical matter, I would argue that for the majority of digital analysts today, R is the most suitable language to learn.

As a digital analyst, your standard workflow probably involves working with structured/tabular data. Typically you first want to access the data, e.g. via an internal database or an external web UI or API, then transform, visualise, potentially model, and finally report and present to your team. Does this sound like you?

If so, you probably already know that most of those tasks can be accomplished using a combination of tools like Excel, SQL and others (including Python of course). However, it’s hard to think of a more efficient and reproducible way to perform this type of analysis and reporting than R – especially with the help of a set of R libraries like dplyr for data manipulation, ggplot2 for visualisation, rmarkdown for reporting and shiny for interactive web applications. These R libraries allow the user to work with the data in a very easy and streamlined way, by bringing all aspects together into one place.

Of course not every analyst and team has the same needs and there is no doubt that there are plenty of cases where Python would be more appropriate or useful.

Preparing an ad hoc analysis, using R as described above for example, is perfectly suitable for most processes, but it might not be the best option if you have to automate and scale it at a later stage. For example, your organisation might decide to develop infrastructure to run A/B tests at scale, or to use the results of an ongoing analysis to improve the customer experience in real time. Python is typically the preferred language for this type of use case.
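To give a feel for what automating such an analysis might look like, here is a minimal sketch of a reusable A/B test check in Python. It assumes a two-sided two-proportion z-test with samples large enough for the normal approximation; the function name, experiment names and counts are all hypothetical, and a production setup would involve far more (data pipelines, scheduling, monitoring).

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test on conversion counts.

    Returns (rate_a, rate_b, z, p_value). Assumes large samples and a
    pooled conversion rate strictly between 0 and 1.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return rate_a, rate_b, z, p_value


# Hypothetical experiments: (conversions A, visitors A, conversions B, visitors B).
experiments = {
    "checkout_button": (200, 5000, 260, 5000),
    "hero_image": (150, 4000, 155, 4000),
}

# The same function is applied to every experiment, which is what makes
# the analysis easy to schedule and scale.
results = {name: ab_test(*counts) for name, counts in experiments.items()}

for name, (rate_a, rate_b, z, p_value) in results.items():
    verdict = "significant" if p_value < 0.05 else "not significant"
    print(f"{name}: A={rate_a:.2%} B={rate_b:.2%} z={z:.2f} p={p_value:.3f} ({verdict})")
```

Once the analysis is a plain function like this, it can be called from a scheduled job or a web service rather than being re-run by hand.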

There’s also the type of analytics professional who prefers to move beyond data analysis as the main area of focus and use programming skills to accomplish a variety of other tasks such as web crawling, natural language processing, developing web apps or automating various other tasks. Again, Python, being a general purpose language, is recommended for these use cases, many of which fall within the broader data science area.

What about machine learning? In the digital analytics world, machine learning and AI are currently something that mainly happens behind the scenes on the side of the platform providers (Google, Adobe etc.) rather than in-house. But if there is scope in your organisation for machine learning to become a significant part of your role, then Python with scikit-learn offers a very solid and consistent API for machine learning work, which is often quoted as one of Python’s strengths.

How about learning both languages?

Even though I wouldn’t recommend learning the two languages simultaneously (unless you are in college, of course), I do believe that being able to navigate code in both R and Python is a useful skill to have. If you choose R, then becoming familiar with Python and being able to read and use Python code could help you solve a broader range of problems faster. Open platforms like the powerful JupyterLab allow users to combine R, Python and in fact more languages within a single environment. In the long term, being able to use the right tool for the task at hand every time could be the winning strategy.

Closing thoughts

It is fascinating how open source and open knowledge have allowed many individuals, regardless of where they are located or where they work, to access powerful tools like Python and R and to create great impact within their teams and organisations. Let’s remember though that this openness wasn’t always available, and that until recently the use of advanced analytics was a privilege of those large enterprises that could afford the high costs associated with proprietary technology.

So, no matter whether you choose R or Python, now is a great time to embark on this journey – the tools have developed so much and there is no shortage of opportunities to learn. Last but not least, there are very active local and global communities for both R and Python, like #pydata and #rstats, which can be great sources of support and inspiration. Similarly, the #data-science channel on Measure Slack is the home of many interesting discussions between digital analysts about R, Python and beyond.

Alexandros Papageorgiou

 

 


© 2018 MeasureCamp. All rights reserved.