A few weeks ago I had the opportunity to speak at SciPy about how we use both Python and R at Civis. Why go all the way to a Python conference to talk about R? Was I fanning the flames of yet another Python vs R language war? No! It turns out arguing which language is “better” is not a very good use of our time. At Civis, we happily work in both languages—not only for our daily work solving data science problems, but also in writing tools. The how and why of this is what I discussed in my SciPy talk.
My colleagues at Civis come from many different academic backgrounds. The R&D team I work on is composed of a physicist, an economist, two statisticians, and a civil engineer. Everyone at Civis learned the tools and techniques of data science in different contexts: some were formally trained in fields where R is the popular language, others in fields where Python is, and some even used Matlab. Officially supporting a single language seems a silly choice in this context. Developing expertise in a new language takes quite a bit of time, and requiring it would throw away skills gained over many years in academia or industry. Allowing people to use the tools they are most comfortable with makes us more productive overall.
The other reason we use both languages is the availability of statistical tools and packages. In the course of solving a data science problem, we often encounter steps in the pipeline that require a tool that happens to be available in only one language. Our survey pipeline is a perfect example. Ensuring a random sample is representative of a desired universe requires a process called raking. Traditionally, Python has not been popular in the social sciences, so the weighting tools we use are only implemented in R. Surveys also include free-text response questions on occasion. You can see where this is going: R has not been popular with the NLP community, so the tools we use here are implemented in Python. And analyzing survey data is just one of the many steps in solving problems at Civis.
Stringing together workflows implemented in many different languages is challenging. Our data science platform helps in that we can submit a sequence of jobs or steps, and the infrastructure takes care of calling each step in turn, passing along the data emitted by the previous steps. But this is not ideal for two reasons. First, it is a bit fragile: more moving parts always mean more opportunities for failure, as anyone who has worked on (or used) distributed systems knows only too well. Second, it is inefficient. Switching between languages in the middle of a pipeline requires writing the data to a format that can be read by both, and CSV files have been the least bad option. Not only are these expensive to parse, they also lose type information.
I’ve outlined the problem at Civis, but what is the ideal situation? In a perfect world, we could use any tool from any language. People comfortable with Python could do all their data analysis in Python, and likewise for R users. It turns out this is entirely possible to achieve. Quite a number of projects started out as cross-language tools: TensorFlow, XGBoost, and Stan, to name a few of the popular ones at Civis. Porting or generalizing an existing tool is also possible, and we successfully did this with the R package glmnet.
For readers who write data science tools for others, consider making them cross-language from the start. There are a few ways to do this, but my personal favorite (and the bulk of my SciPy talk) is to write the core in C and then write bindings/wrappers for Python and R using their respective C APIs. The reference implementations of both Python and R are written largely in C, making this the path of least resistance. Although C is a very old language, the community has made tremendous strides in tooling over the past few years. Gone are the days of obscure compiler error messages: both GCC and Clang (the most popular compilers) give nice messages (see the clang website for examples). There are also various “sanitizers” that help catch common bugs like memory leaks or undefined behavior (llvm docs).
Below we’ll work through a small example, writing a function in C and making this callable from both Python and R. The code as well as the slides from my SciPy talk are on GitHub.
We are going to convert the following Python function to C:
This is what the same function looks like in C:
Notice it does not look all that different from the Python function. Of course, there are some type annotations and extra syntactic noise like curly braces and we have to keep track of the length of the array, but the overall logic is the same.
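As a minimal sketch, assuming `tally` simply sums the elements of an array of doubles (an assumption made for illustration; the exact signature is likewise hypothetical), the C version might look something like this:

```c
#include <stddef.h>

/* Sum the n elements of s.
 * Assumption: tally is a plain sum over an array of doubles.
 * The caller must pass the array length explicitly, because a C
 * pointer carries no length information of its own. */
double tally(const double *s, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += s[i];
    }
    return total;
}
```

Note that nothing in this function refers to Python or R. That is what lets the same compiled code sit behind both bindings.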
Next, we need to implement a Python binding, allowing users to call this function as they would any other Python function.
There’s a lot going on here, but most of it is just boilerplate code that’s part of any Python module. At the very top, we have a function that takes a Python object, checks that it’s an array of the proper type, calls our tally function, and then returns the result. The rest of the code is the module definition, telling the Python interpreter the name of our tally function and its argument types.
The process for R is very similar:
There is a bit less code here because R doesn’t have as extensive a type system as Python. There are not really scalar types, so we do not need to do the same level of checking and verifying of user input as we do in the Python example above. The remainder of the code is roughly the same: we define a table of functions we want to be available in the R interpreter.
A real-world example will necessarily be more complex, but the overall process is not all that difficult. Here are a few things to keep in mind that will make writing cross-language tools easier:
- Don’t rely on APIs from the host language (R or Python) in the code you intend to be shared between the two.
- Use error codes to communicate exceptions. Don’t call exit or abort, as this will take down the process running the host language.
- It’s best to make the host language responsible for memory allocation and deallocation. This means your C/C++ code should be passed pre-allocated space for writing output.
- Trust the compiler: treat compiler warnings the same as errors. If your code does not compile without warnings, it’s not finished.
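The points about error codes and memory ownership can be sketched together. The function below is hypothetical (`running_tally` and the `TALLY_*` status codes are names invented for this illustration, not part of the code from the talk). It writes its result into a buffer supplied by the caller and reports problems through its return value instead of terminating the process:

```c
#include <stddef.h>

/* Hypothetical status codes, returned instead of calling exit() or abort(). */
enum tally_status {
    TALLY_OK = 0,
    TALLY_ERR_NULL = 1,      /* a required pointer was NULL */
    TALLY_ERR_TOO_SMALL = 2  /* output buffer cannot hold the result */
};

/* Write the running totals of in[0..n-1] into out[0..n-1].
 * Both buffers are owned by the caller (the Python or R binding);
 * nothing is allocated or freed here. */
int running_tally(const double *in, size_t n, double *out, size_t out_len) {
    if (in == NULL || out == NULL) {
        return TALLY_ERR_NULL;
    }
    if (out_len < n) {
        return TALLY_ERR_TOO_SMALL;
    }
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += in[i];
        out[i] = total;
    }
    return TALLY_OK;
}
```

Each binding can then translate a non-zero status into whatever is idiomatic on its side, such as raising an exception in Python or signaling an error in R, and the host language's own memory management remains responsible for the buffers.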
Python and R are both going to remain important in the data science world; neither language is going to “win” the language war. What this means for tool builders and package authors is that we can’t ignore the other language: building useful tools that have a big impact requires making them usable in both languages. One easy route is writing the bulk of the code in C or C++ and using the C API of both languages to provide native bindings.
You can watch my full talk below or view the slides on GitHub.
We’ll be sharing my colleague Katie’s talk from SciPy in the coming weeks. Be sure to check back to learn how to give your data an entrance exam.