This post is part of our Bookshelf series organized by the Data Science R&D department at Civis Analytics. In this series, Civis data scientists share links to interesting software tools, blog posts, scientific articles, and other things that they have read about recently, along with a little commentary about why these things are worth checking out. Are you reading anything interesting? We’d love to hear from you on Twitter.
Following up on their previous post about Python’s exploding popularity, the folks over at Stack Overflow put together another analysis to see which types of users are driving Python’s growth. The short answer is that it’s mostly data scientists: the Python-related question tags that show the most growth are ‘pandas’, ‘numpy’, and ‘matplotlib’, all of which are libraries used frequently in data science. One particularly noteworthy tidbit: pandas is responsible for almost 1% of all Stack Overflow traffic, even though the library is almost entirely the result of volunteer labor. We use pandas a lot here at Civis, which is why we’re proud to be a corporate partner of NumFOCUS, a 501(c)(3) nonprofit that supports the development of pandas and other projects.
Adam Pearce of the NYTimes graphics department wrote up a step-by-step description of how he and his colleagues created a compelling graphic showing rainfall rates and accumulation during Hurricane Harvey. The write-up is a neat summary of the mix of tools that went into making the graphic, from old-school Bash scripting to fancy web visualization tools like D3 and canvas. It’s also a fun behind-the-scenes look at what it’s like to work on a newspaper deadline.
Courses on fairness in machine learning
We’ve talked before about the importance of considering questions of fairness in data science, and it’s heartening to see the topic gaining traction in the public consciousness. In particular, we came across two courses on this topic, one at UC Berkeley (which we mentioned before) and another at Cornell, both of which have nicely curated and organized reading lists. The readings are a mix of academic research articles and Medium-style blog posts, so there’s something there for everyone.
The hip-ly named Count Bayesie blog has a nice, minimally mathematical introduction to the concepts of dataset entropy and Kullback-Leibler divergence. Hang around data scientists for long enough and you’re likely to hear these terms thrown around, but if you haven’t encountered the concepts before, they can be intimidatingly mathy. This post does a good job of presenting the intuition behind these ideas without getting bogged down in too much technical detail.
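If you'd like to poke at these two quantities yourself, here is a minimal sketch in plain Python. It is our own illustration, not code from the Count Bayesie post: the function names and the two coin distributions are made up for the example, and we assume discrete distributions given as lists of probabilities that sum to 1, with `q` nonzero wherever `p` is nonzero.

```python
import math


def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    # Terms with zero probability contribute nothing to the sum.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q), in bits.

    Assumes p and q have the same length and that q[i] > 0
    wherever p[i] > 0 (otherwise the divergence is infinite).
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical example distributions: a fair coin and a heavily biased coin.
fair = [0.5, 0.5]
biased = [0.9, 0.1]

print("entropy(fair)      =", entropy(fair))
print("entropy(biased)    =", entropy(biased))
print("KL(biased || fair) =", kl_divergence(biased, fair))
print("KL(fair || biased) =", kl_divergence(fair, biased))
```

The fair coin comes out at exactly one bit of entropy, while the biased coin is more predictable and so scores lower. The two KL values differ from each other, which is a handy reminder that KL divergence is not symmetric and so is not a true distance.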