A couple of weeks ago, I had the opportunity to host a workshop at the Open Data Science Conference in San Francisco, where I shared my process of rapid prototyping and then iterating on the model I've built. When I'm building a machine learning model in scikit-learn, I usually don't know at the outset exactly what my final model will look like. Instead, I've developed a workflow that focuses on getting a quick-and-dirty model up and running first, and then going back to iterate on the weak points until the model seems to be converging on an answer.
This process has three phases, which I'll highlight with an example I created to predict failures of wells in Africa. In this blog post, I'll show how I got the raw data machine-learning ready and built a few quick models. In subsequent posts, I'll revisit some of the choices made in the first model, effectively cleaning up some messes that I made in the interest of moving quickly. Lastly, I'll introduce scikit-learn Pipelines and GridSearchCV, a pair of tools for chaining together pieces of data science machinery and comprehensively searching for the best model.
The example problem is the “Pump it Up: Mining the Water Table” challenge on drivendata.org, which provides data on wells in Africa: their characteristics and whether they are functional, non-functional, or functional but in need of repair. My goal is to build a model that takes the characteristics of a well and correctly predicts which category that well falls into. A quick print statement shows that the labels are strings:
```python
import pandas as pd
import numpy as np

features_df = pd.read_csv("well_data.csv", index_col="id")
labels_df = pd.read_csv("well_labels.csv", index_col="id")
print( labels_df.head(20) )
```

```
                  status_group
id
69572               functional
8776                functional
34310               functional
67743           non functional
19728               functional
9944                functional
19816           non functional
54551           non functional
53934           non functional
46144               functional
49056               functional
50409               functional
36957               functional
50495               functional
53752               functional
61848               functional
48451           non functional
58155           non functional
34169  functional needs repair
18274               functional
```
The machine learning algorithms downstream are not going to handle it well if the class labels used for training are strings; instead, I’ll want to use integers. The mapping that I’ll use is that “non functional” will be transformed to 0, “functional needs repair” will be 1, and “functional” becomes 2. When I want a specific mapping between strings and integers, like here, doing it manually is usually the way I go. In cases where I’m more flexible, there’s also the sklearn LabelEncoder.
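For illustration, here's what LabelEncoder would do with these labels. Note that it assigns integers in alphabetical order of the class names, which here happens to differ from my manual mapping:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
labels = ["functional", "non functional", "functional needs repair", "functional"]
encoded = enc.fit_transform(labels)

# classes_ is sorted alphabetically, so "functional" -> 0, not 2
print(list(enc.classes_))  # ['functional', 'functional needs repair', 'non functional']
print(list(encoded))       # [0, 2, 1, 0]
```

Since I want "functional" to map to 2 rather than 0, the manual dictionary approach gives me the control I need here.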
There are a number of ways to do the transformation here; the approach below uses applymap() in pandas (see the pandas documentation for applymap() for details). In the code below, I've filled in the function body of label_map(y) so that it returns 2 if y is “functional”, 1 if y is “functional needs repair”, and 0 if it's “non functional”.
As an aside, I could also use apply() here if I like. The difference between apply() and applymap() is that applymap() operates on a whole dataframe while apply() operates on a series (or you can think of it as operating on one column of your dataframe). Since labels_df only has one column (aside from the index column), either one will work here.
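As a sketch of the apply() route, operating on the single column as a Series (the two-row dataframe here is a stand-in for labels_df):

```python
import pandas as pd

df = pd.DataFrame({"status_group": ["functional", "non functional"]})

def label_map(y):
    if y == "functional":
        return 2
    elif y == "functional needs repair":
        return 1
    else:
        return 0

# apply() on a single column (a Series) maps the function element-wise
df["status_group"] = df["status_group"].apply(label_map)
print(df["status_group"].tolist())  # [2, 0]
```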
```python
def label_map(y):
    if y == "functional":
        return 2
    elif y == "functional needs repair":
        return 1
    else:
        return 0

labels_df = labels_df.applymap(label_map)
print( labels_df.head() )
```

```
       status_group
id
69572             2
8776              2
34310             2
67743             0
19728             2
```
Now that the labels are ready, I’ll turn my attention to the features. Many of the features are categorical, where a feature can take on one of a few discrete values, which are not ordered. In transform_feature( df, column ), I take features_df and the name of a column in that dataframe, and return the same dataframe but with the indicated feature encoded with integers rather than strings. This is something I’ll revisit in the next post, where I talk about dummying out categorical features with OneHotEncoder in sklearn or get_dummies() in pandas.
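The body of transform_feature() isn't shown above, so here's a minimal sketch of one way it could be written, using a value-to-integer dictionary (the “basin” demo column is a made-up stand-in, not the real dataset):

```python
import pandas as pd

def transform_feature(df, column_name):
    # hypothetical sketch: replace each string category with an arbitrary integer code
    unique_values = set(df[column_name].tolist())
    transformer_dict = {value: idx for idx, value in enumerate(unique_values)}
    df[column_name] = df[column_name].map(transformer_dict)
    return df

demo = pd.DataFrame({"basin": ["Lake Victoria", "Rufiji", "Lake Victoria"]})
demo = transform_feature(demo, "basin")
print(demo["basin"].nunique())  # 2 distinct integer codes
```

Note that the integer codes this produces are arbitrary and imply an ordering that isn't really there, which is exactly the issue the next post addresses with one-hot encoding.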
Just a couple last steps to get everything ready for sklearn. The features and labels are taken out of their dataframes and put into a numpy.ndarray and list, respectively.
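A minimal sketch of that conversion, using a tiny stand-in dataframe (the real features_df and labels_df are much larger):

```python
import pandas as pd
import numpy as np

features_df = pd.DataFrame({"f1": [1, 2], "f2": [3, 4]})
labels_df = pd.DataFrame({"status_group": [2, 0]})

X = features_df.to_numpy()              # numpy.ndarray, shape (n_samples, n_features)
y = labels_df["status_group"].tolist()  # plain list of integer labels

print(type(X).__name__, X.shape, y)  # ndarray (2, 2) [2, 0]
```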
The cheapest and easiest way to train on one portion of my dataset and test on another, while getting a measure of model quality at the same time, is sklearn.model_selection.cross_val_score(). With cv=3, this splits my data into three equal portions, trains on two of them, and tests on the third, repeating the process three times so that each portion serves once as the test set. That's why three numbers get printed in the code block below.

```python
import sklearn.linear_model
import sklearn.model_selection

clf = sklearn.linear_model.LogisticRegression()
score = sklearn.model_selection.cross_val_score( clf, X, y, cv=3 )
print( score )
```

```
[ 0.65363636  0.6569697   0.6560101 ]
```
I now have a baseline logistic regression model for well failures. There's an implicit assumption in this model (and the other classifiers below) that classification is the correct approach to take here. Classification is designed for unordered categorical tasks, like predicting whether my favorite ice cream flavor is chocolate, vanilla, or strawberry. Regression gives a continuous output, which implies a built-in ordering of the answers; an example would be predicting my age or my income. The task of predicting well failures could be modeled either way: it has discrete categories for answers (functional / functional needs repair / non functional), but there's also an ordering to the categories that a classifier isn't necessarily going to pick up on. I have the choice of modeling with a classifier and potentially getting slightly worse performance, or building a regression but needing to add a post-processing step that turns my continuous (i.e. float) predictions into integer category labels. I've decided to go with the classification approach for this example, but this is a decision made for convenience that I could revisit when improving my model down the road.
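If I had gone the regression route instead, the post-processing step could be as simple as rounding and clipping the continuous predictions into the three integer labels. A minimal sketch (the prediction values here are made up for illustration):

```python
import numpy as np

# hypothetical continuous outputs from a regressor
continuous_preds = np.array([1.7, 0.2, 2.4, -0.3, 1.1])

# round to the nearest label, then clip into the valid range {0, 1, 2}
class_preds = np.clip(np.rint(continuous_preds), 0, 2).astype(int)
print(class_preds.tolist())  # [2, 0, 2, 0, 1]
```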
I started with a simple logistic regression above (despite the name, this is a classification algorithm) and now I’ll compare to a couple of other classifiers, a decision tree classifier and a random forest classifier, to see which one seems to do the best.
```python
import sklearn.tree
import sklearn.ensemble
import sklearn.model_selection

clf = sklearn.tree.DecisionTreeClassifier()
score = sklearn.model_selection.cross_val_score( clf, X, y, cv=3 )
print( score )

clf = sklearn.ensemble.RandomForestClassifier()
score = sklearn.model_selection.cross_val_score( clf, X, y, cv=3 )
print( score )
```

```
[ 0.73590909  0.73691919  0.73005051]
[ 0.78777778  0.7889899   0.78409091]
```
And the winner appears to be the random forest. That's not really a surprise, but you'll have to wait for the next post to learn why the random forest is such a strong algorithm.
This brings me to the end of the “getting started” portion of this analysis. I now have a working data science setup, in which I have:
- read in data
- transformed features and labels to make the data amenable to machine learning
- picked a modeling strategy (classification)
- made a train/test split (this was done implicitly when I called cross_val_score)
- evaluated several models for identifying wells that are failed or in danger of failing
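For comparison, the explicit version of that train/test split would use sklearn.model_selection.train_test_split (the toy X and y below are placeholders for the real feature matrix and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = [0, 1] * 5

# hold out 30% of the rows as an explicit test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```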
In the next post I’ll clean up some of the technical debt that I’ve accrued by moving so quickly toward getting a model working.