This post is part of our Bookshelf series organized by the Data Science R&D department at Civis Analytics. In this series, Civis data scientists share links to interesting software tools, blog posts, scientific articles, and other things that they have read about recently, along with a little commentary about why these things are worth checking out. Are you reading anything interesting? We’d love to hear from you on Twitter.
There is a growing body of evidence suggesting that food insecurity is correlated with poor educational outcomes. This paper is an interesting addition. It looks at the lag time between the receipt of Supplemental Nutrition Assistance Program (SNAP) benefits, also known as “food stamps,” and the administration of standardized tests to young students. Because food stamp receipt dates shifted at random, the authors were able to track how this lag is linked to testing outcomes for individual children over time. They find that test scores tend to be lower towards the end of the benefits cycle or when the benefits are distributed over the weekend. We care deeply about this issue at Civis. For info on how we built models for the Robin Hood Foundation (an organization that advocates for low-income New Yorkers) to get SNAP and Earned Income Tax Credit benefits to those who are eligible, check out this report.
We use the Python data analysis library pandas extensively at Civis. We even sponsor its development (in addition to other open source projects) through NumFOCUS. Wes McKinney, the main author of pandas, has not been shy about discussing its limitations over the past few years. This post is rather long, but it’s an interesting dive into the work that McKinney and others have been doing on Apache Arrow to fix what he dubs the 10 (well, really 11) things he hates about pandas. These include issues we’ve had to solve here, such as the need to “have 5 to 10 times as much RAM as the size of your dataset.” If you’re interested in supporting their work, consider donating or contributing to the pandas project.
Sometimes simple approaches can yield impressive results. This paper presents a new pipeline for reconstructing images from incomplete or low-resolution inputs. The pipeline applies the extremely simple nearest neighbors algorithm after a first-stage convolutional neural network to overcome some of the cited drawbacks of state-of-the-art generative adversarial networks. In particular, the authors claim that their results are more interpretable.
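To make the nearest neighbors idea concrete, here is a minimal sketch of a lookup step. It is not the paper's implementation: the feature vectors stand in for whatever the first-stage network would produce, and the names `library`, `nearest_neighbor`, and `reconstruct` are hypothetical. The toy values are invented for illustration only.

```python
import math

def nearest_neighbor(query, library):
    """Return the index of the library entry whose features are closest to `query`."""
    best_index, best_dist = -1, math.inf
    for i, (features, _) in enumerate(library):
        dist = math.dist(query, features)  # Euclidean distance
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

def reconstruct(query_features, library):
    """Copy the stored output of each query's nearest library entry.

    Also returns the index of the entry used for each query, so every
    output can be traced back to the example it was copied from.
    """
    outputs, matches = [], []
    for query in query_features:
        idx = nearest_neighbor(query, library)
        matches.append(idx)
        outputs.append(library[idx][1])
    return outputs, matches

# Hypothetical library of (feature vector, stored output) pairs.
library = [
    ((0.0, 0.0), "dark"),
    ((1.0, 1.0), "bright"),
    ((0.0, 1.0), "edge"),
]

# Hypothetical features, standing in for a first-stage network's output.
queries = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]

outputs, matches = reconstruct(queries, library)
print(outputs)
print(matches)
```

Because each output is copied from a specific stored example, the `matches` list shows where every piece of the result came from. That kind of traceability is one plausible reading of why a nearest neighbors stage could be easier to inspect than a purely generative one.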
Complementary to our work at Civis building tools to study posts and interests on Twitter, FiveThirtyEight has built a tool to track the popularity of terms on Reddit, the Internet home of some of Donald Trump’s “more rabid followers.” The data set goes through July of this year, thus including the first six months of the new presidency. The link above provides several examples, all showing the relative popularity of several different keywords (for example “covfefe” vs “bigly” or “MAGA” vs “build the wall”) as a function of time over the course of about a year.
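As a rough illustration of what “relative popularity as a function of time” can mean, the sketch below computes the share of comments in each month that mention a term. This is an assumed definition, not necessarily the one FiveThirtyEight uses, and the function name `term_share_by_month`, the month labels, and the sample comments are all made up for the example.

```python
from collections import defaultdict

def term_share_by_month(comments, terms):
    """Return {month: {term: fraction of that month's comments containing term}}.

    `comments` is an iterable of (month_label, text) pairs. Matching is a
    case-insensitive substring test, which is a simplifying assumption.
    """
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for month, text in comments:
        totals[month] += 1
        lowered = text.lower()
        for term in terms:
            if term.lower() in lowered:
                hits[month][term] += 1
    return {
        month: {term: hits[month][term] / totals[month] for term in terms}
        for month in sorted(totals)
    }

# Invented sample data, for illustration only.
comments = [
    ("m1", "covfefe all day"),
    ("m1", "nothing to see here"),
    ("m2", "bigly covfefe"),
    ("m2", "bigly"),
    ("m2", "another unrelated comment"),
    ("m2", "more unrelated text"),
]

shares = term_share_by_month(comments, ["covfefe", "bigly"])
for month, by_term in shares.items():
    print(month, by_term)
```

Dividing by the number of comments in each month keeps terms comparable across months with very different overall activity, which a raw count would not.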