This post is part of our Bookshelf series organized by the Data Science R&D department at Civis Analytics. In this series, Civis data scientists share links to interesting software tools, blog posts, scientific articles, and other things that they have read about recently, along with a little commentary about why these things are worth checking out. Are you reading anything interesting? We’d love to hear from you on Twitter.
In this blog post, François Chollet (author of the Keras neural network library) offers some thoughts on designing user-friendly APIs. His motivating claim that “every design decision should be made with the user in mind” is a strong reminder that the value of data science tools comes from how they’re used. This feels like an extension of the old adage that code is read more often than it is written; when the code in question is an API, the point is even more relevant. The post also contains good, concrete suggestions for improving user experience when designing your own APIs.
This fantastically cool paper that was recently released by scholars at the National Bureau of Economic Research (and summarized in this Washington Post article) is an awesome combination of data archaeology and literal archaeology. The authors used transcriptions of cuneiform tablets to build up a record of trade between ancient Assyrian cities, then used the relative trade volumes between different pairs of cities to estimate the distance between those cities, and finally used these distances along with the known locations of a few cities to triangulate the locations of the remaining, hitherto lost cities. The authors note that their work is built on twenty years of work transcribing the tablets—and I thought my ETL was time-consuming!
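The final triangulation step is, at its core, a multilateration problem: given distances from an unknown point to several known points, find the location that best fits those distances. The paper's actual estimation procedure is more sophisticated, but a toy least-squares version (all coordinates and distances below are made up for illustration) can be sketched as:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical coordinates (arbitrary units) of three cities with
# known locations, plus the estimated distances from a "lost" city
# to each of them (here generated noise-free for illustration).
known = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
true_lost = np.array([4.0, 3.0])
dists = np.linalg.norm(known - true_lost, axis=1)

def residual(p):
    # Sum of squared differences between the distances implied by a
    # candidate location p and the observed (estimated) distances.
    return np.sum((np.linalg.norm(known - p, axis=1) - dists) ** 2)

# Minimize the residual starting from a rough initial guess.
est = minimize(residual, x0=np.array([5.0, 5.0])).x
print(est)  # close to [4.0, 3.0]
```

With noisy, trade-derived distance estimates (as in the paper), the recovered location is of course only approximate, which is why having several known anchor cities matters.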
Leland McInnes (author of the hdbscan implementation in scikit-learn-contrib) recently released UMAP, a dimensionality reduction library that is a promising alternative to t-SNE for large datasets. The project is still young, so documentation is sparse and there are a few rough edges, but the performance is impressive: on the 70,000-sample MNIST digit dataset, UMAP ran in about two and a half minutes, compared to about 45 minutes for most t-SNE implementations. There is also a neat interactive tool by Leon Fadden that compares the results of running UMAP, t-SNE, and PCA on a couple of example datasets.
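One nice property of UMAP is that it follows the familiar scikit-learn estimator pattern, so swapping it in for PCA or t-SNE is a one-line change. A minimal sketch of that pattern, run here with scikit-learn's PCA on its small built-in digits dataset (the commented `umap.UMAP` call is based on the umap-learn library's documented interface and assumes that package is installed):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small stand-in for MNIST: scikit-learn's built-in 8x8 digits dataset.
X, y = load_digits(return_X_y=True)

# Reduce to two dimensions via the standard fit_transform pattern.
embedding = PCA(n_components=2).fit_transform(X)
print(embedding.shape)  # (1797, 2)

# UMAP exposes the same estimator API, so the equivalent call would be:
#   import umap
#   embedding = umap.UMAP(n_components=2).fit_transform(X)
```

Because the interface is shared, it is easy to benchmark UMAP, t-SNE, and PCA side by side on the same dataset, which is essentially what the interactive comparison tool does.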