This post is part of our Bookshelf series organized by the Data Science R&D department at Civis Analytics. In this series, Civis data scientists share links to interesting software tools, blog posts, scientific articles, and other things that they have read about recently, along with a little commentary about why these things are worth checking out. Are you reading anything interesting? We’d love to hear from you on Twitter.
For this week’s Bookshelf post, I thought I’d do something a little different and dive into a different kind of reading material than a book or blog: source code! As data scientists, we’re expected to be as comfortable with a wide variety of tools and software as we are with data and algorithms. So I’m taking a closer look at some corners of the Python data science stack that teach important lessons, along with a couple of gotchas. I hope to convince you that you should read more source code: it’s fun, interesting, and incredibly useful!
I’m sure you’re all familiar with the awesome scikit-learn package and its wide range of statistical models and algorithms. Often, however, we need to write our own algorithm (preferably following the scikit-learn API to leverage useful native features) that combines other algorithms in some way. In the language of scikit-learn, this is called a “meta-estimator.” The tricky part in these situations is handling random state so that our experiments are reproducible. Two considerations matter when setting seeds. First, we should set the seed once and only once per estimator fit. Second, every subsequent random call should draw from that one random number generator. Scikit-learn handles this intelligently with a “_set_random_states” function, which dispatches different, deterministic random states to each sub-estimator. This nifty piece of code is an excellent example of how to keep your algorithms reproducible.
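To make the “seed once, then dispatch” idea concrete, here is a minimal sketch using only the standard library. It is not scikit-learn’s implementation: “ToyEstimator” and “fit_ensemble” are hypothetical names I made up for illustration, and the sub-estimator just draws one random number instead of fitting a model.

```python
import random


class ToyEstimator:
    """Hypothetical sub-estimator whose fit depends on its random_state."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self):
        # Each sub-estimator builds its own generator from the seed it was given.
        rng = random.Random(self.random_state)
        self.draw_ = rng.random()
        return self


def fit_ensemble(n_estimators, seed):
    """Seed once, then derive a distinct integer seed for each sub-estimator."""
    parent_rng = random.Random(seed)  # the only place the user's seed is used
    estimators = []
    for _ in range(n_estimators):
        child_seed = parent_rng.randrange(2**32)  # deterministic given `seed`
        estimators.append(ToyEstimator(random_state=child_seed).fit())
    return estimators


# Two fits with the same seed give identical results.
first = [est.draw_ for est in fit_ensemble(3, seed=42)]
second = [est.draw_ for est in fit_ensemble(3, seed=42)]
other = [est.draw_ for est in fit_ensemble(3, seed=43)]
```

Because every child seed comes from the single parent generator, the whole ensemble is reproducible from one number, yet no two sub-estimators share the same stream.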
A significant change in Python 3.6 was that dictionaries became ordered by default! In 3.6 this was an implementation detail of CPython; insertion order only became a guarantee of the language in Python 3.7. The (very sensible) proposal appeared on the Python-Dev mailing list and ended up in the Python GitHub repo in this patch. It’s a fun read (if you’re willing to dive deep into the Python internals) because it’s almost all changes to the C code. It’s worth noting that if the change had gone the other way, from ordered to unordered, it could have constituted a major, API-breaking change. Crucially for us, if you’re writing software in Python 3.6 and relying on this ordering, Python 3.5 users won’t get ordered dictionaries by default, though of course the “OrderedDict” class can always be imported from the amazing collections module. This can cause some gnarly bugs if you’re not paying close attention.
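If your code depends on key order and might run on an older interpreter, one option is to say so explicitly with “OrderedDict”. The snippet below is a small sketch with made-up column names; it also shows a difference that is easy to miss, namely that plain dictionaries ignore order when compared while “OrderedDict” instances do not.

```python
from collections import OrderedDict

# Hypothetical column spec where the order of keys carries meaning.
columns = OrderedDict([("id", "int"), ("name", "str"), ("score", "float")])
column_names = list(columns)  # keys come back in insertion order

# Plain dicts compare equal regardless of insertion order...
plain_equal = {"a": 1, "b": 2} == {"b": 2, "a": 1}

# ...but two OrderedDicts with the same items in a different order do not.
ordered_equal = (
    OrderedDict([("a", 1), ("b", 2)]) == OrderedDict([("b", 2), ("a", 1)])
)
```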
Multiprocessing in Python is often a great way to speed up serial computations. The scikit-learn library makes heavy use of a library called joblib to dispatch computations across processes. The library is so useful we’ve even extended it for our platform here at Civis. It does, however, have an interesting property that isn’t found in the native Python multiprocessing libraries. We don’t need to go further than the docstring of the “Parallel” class to see that, if we set the number of jobs to one, no new processes or threads are created (you can also check out the code for this here). This model is different from the “concurrent.futures” processing model (see this for a contrast) and can, if you’re not careful, lead to surprising results like your model training on your laptop instead of on a server in the cloud as you intended.
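Here is a rough illustration of the contrast using only the standard library. It does not import joblib; the plain function call at the end merely stands in for the “run it right here” behavior the “Parallel” docstring describes for a single job. I use threads rather than processes to keep the sketch lightweight, and “worker_thread_id” is a hypothetical helper.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def worker_thread_id():
    """Return the identifier of whichever thread runs this function."""
    return threading.get_ident()


main_id = threading.get_ident()

# Even with a single worker, the executor hands the task to a separate thread.
with ThreadPoolExecutor(max_workers=1) as pool:
    pooled_id = pool.submit(worker_thread_id).result()

# A direct call runs in the calling thread, which is the kind of behavior
# the joblib docstring describes when the number of jobs is one.
inline_id = worker_thread_id()
```

The practical upshot is that “one worker” does not mean the same thing everywhere, so it pays to check where your code actually executes before assuming it was shipped off somewhere else.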
Understanding the data model behind any programming language can be tricky for data scientists. Many of us come from statistics or other scientific disciplines, and it’s often only through trial and error that we build up an accurate mental model of the structure of the language. Recently, we were puzzling over Python’s internal data model over lunch, and after some digging I found this incredibly useful comment in the CPython source code. This dense description touches on everything from how Python handles types to reference counting and garbage collection. It’s worth taking some time to read up on the language first if you’re new to it, but what struck me was the quality and brevity of its description of the core parts of the Python language: objects and types. You won’t find this in too many places on the web, and it is yet another great reason to read up on some awesome source code on GitHub!
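You can poke at a few of these ideas from the interpreter itself. The sketch below assumes CPython (reference counts are an implementation detail, and the absolute numbers vary between versions), and “Book” is a hypothetical class used only for illustration.

```python
import sys


class Book:
    """Hypothetical class used only for illustration."""


book = Book()

# Every object has a type, and classes are themselves objects with a type.
kind = type(book)  # the class the instance was created from
meta = type(Book)  # the type of the class itself

# Binding a second name to the same object adds one reference to it.
before = sys.getrefcount(book)
alias = book
after = sys.getrefcount(book)
```

Here “alias” is not a copy: both names point at the same object, which is why the reference count goes up by one rather than a second object appearing.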