When it comes to data science code, there are two sides to the story. There’s the part where you do the exploratory science — and there’s the part where you put it in production. Last month, we announced that Civis Platform now allows data scientists to work in Jupyter notebooks, which is all about making data science exploration scalable, shareable, and secure. But we know from our experience at Civis that exploration is only half of the story.
The best data science teams know that the job of the data scientist isn’t done until the exploratory work affects decision making in the organization. The challenge is that the code needs to move into production—it needs to make it out of the sandbox and into the rest of the organization. That means it needs to run at scale and be accessible to the people who are counting on it.
So today, we want to talk about the other half of the story: Scripts in Civis Platform.
One of the coolest features of Scripts is that data scientists can turn them into tools for non-technical users. By adding parameters to Scripts, data scientists can surface just those parameters in a form template that others can run without having to interact with the underlying code.
(We use Docker to make this possible. You can watch a Civis data scientist explain Docker at Strata+Hadoop World in 2016.)
We use Scripts inside our own walls to move innovations from notebooks into production. April Chen is a data scientist on Civis’s Data Science Research and Development team, and she used Scripts to build an Outlier Analysis tool, which has become an easy way for other analysts to make reports on the outliers in their datasets. Here is how April describes the project:
I was working on a project where I needed to repeatedly check for outliers in a dataset, so I wrote some functions in my Python notebook to help. It turned out that other people on my team also wanted an effective way to find outliers for a bunch of different reasons. Some common ones included checking for data issues, model preprocessing, and finding potentially interesting records or events.
Instead of rerunning the same functions in my Python notebook to continually check for outliers, I built a reusable tool that automatically detects outliers in a dataset. The tool uses Kernel Density Estimation (KDE), which is a non-parametric approach that estimates the density of the data around each point, so that points sitting in low-density regions stand out as candidate outliers. Unlike simpler methods, such as counting standard deviations from the mean or using the interquartile range, a KDE approach does not assume that your data follows a normal distribution, and it can handle data that falls into clusters, such as data with a bimodal distribution.
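To make the idea concrete, here is a minimal sketch of density-based outlier detection in plain Python. It is not the code behind the Outlier Analysis tool. The function names, the fixed bandwidth, and the rule of flagging the lowest-density fraction of points are all assumptions chosen for illustration.

```python
import math


def gaussian_kde_density(points, bandwidth):
    """Estimate the density at each point with a 1-D Gaussian kernel."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    for x in points:
        # Sum the kernel contribution of every sample (including x itself).
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in points)
        densities.append(norm * total)
    return densities


def find_outliers(points, bandwidth=1.0, fraction=0.05):
    """Return the lowest-density `fraction` of points (illustrative rule)."""
    densities = gaussian_kde_density(points, bandwidth)
    n_outliers = max(1, int(round(fraction * len(points))))
    # Rank point indices from least dense to most dense.
    ranked = sorted(range(len(points)), key=lambda i: densities[i])
    return sorted(points[i] for i in ranked[:n_outliers])


# Bimodal sample: one cluster near 0, one near 10, and a stray point between.
sample = [-0.4, -0.2, 0.0, 0.1, 0.3, 5.0, 9.7, 9.9, 10.0, 10.2, 10.4]
outliers = find_outliers(sample, bandwidth=0.5, fraction=0.1)
print(outliers)
```

The stray point at 5.0 sits almost exactly on the sample mean, so a rule based on standard deviations from the mean would treat it as perfectly typical. The density estimate flags it because nothing else lies nearby.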
I moved my Python code from a notebook into a Container Script and then parameterized it. The result is a template that other analysts can use to automatically build an Outlier Analysis report that visually shows them the outliers in their dataset. They just fill in which table to analyze, set a couple of parameters, and press Run.
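Parameterizing a script mostly means moving hard-coded values out of the code and reading them at run time. The sketch below assumes the values entered in the form reach the script as environment variables. The variable names (TABLE_NAME, BANDWIDTH, OUTLIER_FRACTION) and the defaults are hypothetical, not the real tool's parameters.

```python
import os


def read_params(env=None):
    """Collect run parameters from environment variables, with defaults."""
    env = os.environ if env is None else env
    return {
        # Hypothetical parameter names; a real template defines its own.
        "table": env.get("TABLE_NAME", "schema.example_table"),
        "bandwidth": float(env.get("BANDWIDTH", "1.0")),
        "fraction": float(env.get("OUTLIER_FRACTION", "0.05")),
    }


def validate(params):
    """Fail early with a readable message if a value is out of range."""
    if params["bandwidth"] <= 0:
        raise ValueError("BANDWIDTH must be positive")
    if not 0 < params["fraction"] < 1:
        raise ValueError("OUTLIER_FRACTION must be between 0 and 1")
    return params


# Simulate a run where the user filled in two of the three form fields.
params = validate(read_params({"TABLE_NAME": "sales.orders", "BANDWIDTH": "0.5"}))
print(params)
```

Validating up front matters more in a template than in a notebook, because the person filling in the form may never see the code and needs an error message that points at the field to fix.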
The data shown here comes from the UCI Machine Learning Repository.
April’s project shows one example of how Civis Platform can help a data scientist go from an exploratory environment, like Jupyter Notebooks, into a production environment to get their results into the hands of the people who need to use them.
There are a million ways to use Scripts. We’ve seen data scientists use Scripts to back interactive applications, to parallelize model building with CivisML, to distribute automated reports, and even to build and kick off other Scripts. Stay tuned to read about more ways that data scientists are making an impact by getting their work into production.