The last two posts in this series have been about getting a data science analysis quickly up and running, and then circling back to improve it or understand the patterns I find, for example, which algorithms are working best and why. The upshot was a better handle on my workflow, but I’m left with a lot of free parameters of my algorithms to tune, and messing around with my workflow often leads to spaghetti code that becomes less and less understandable/easy to experiment with as I go. Enter the scikit-learn Pipeline and GridSearchCV objects: two tools that effectively allow me to pour gasoline on my data science fire, tightening up the code and doing parameter scans in just a few lines of code.
First up is Pipeline. There are a number of tools that I’ve chained together to get where I am now, like SelectKBest and RandomForestClassifier. After selecting the 100 best features, the natural next step is to run my random forest again to see if it does a little better with fewer features. In this case, I have SelectKBest doing selection, with the output of that process going straight into a classifier. Pipeline packages the transformation step of SelectKBest with the estimation step of RandomForestClassifier into a coherent workflow.
Why might I want to use Pipeline instead of keeping the steps separate?
- It makes code more readable (or, if you like, it makes the intent of the code clearer).
- I don’t have to worry about keeping track of data during intermediate steps, for example between transforming and estimating.
- It makes it trivial to change the ordering of the pipeline pieces, or to swap pieces in and out.
- It allows me to do GridSearchCV on the whole workflow.
This last point is, in my opinion, the most important. I will get to that point very soon, but first I’ll get a Pipeline up and running that does SelectKBest followed by RandomForestClassifier.
I make a list of steps, each of which is a transformer (like SelectKBest) or, for the last one in the list, an estimator (RandomForestClassifier), and then turn that list into a Pipeline. Then the Pipeline is a single coherent workflow, with the transformed data from SelectKBest being seamlessly passed along to the RandomForestClassifier. Depending on exactly what I want to do in a given case, I could have many transformers strung together, with or without an estimator at the end.
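A minimal sketch of that construction, assuming illustrative step names and a synthetic dataset standing in for the real data:

```python
# Sketch: SelectKBest feeding into RandomForestClassifier via a Pipeline.
# The step names ("feature_selection", "random_forest") and the toy data
# are illustrative assumptions, not taken from a particular dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=200, random_state=0)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(f_classif, k=100)),     # transformer
    ("random_forest", RandomForestClassifier(random_state=0)),  # estimator
])

# Fitting the pipeline fits the selector, transforms X, then fits the forest.
pipeline.fit(X, y)
print(pipeline.score(X, y))
```

Calling `fit` on the Pipeline handles the hand-off automatically: the selector is fit and applied, and only the selected columns reach the classifier.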
By the way, I’ve slightly changed the way that I am evaluating my model, using classification_report. It gives me more information than cross_val_score, which I was using before, although it’s a little more involved to use (I am responsible for doing the training/testing split now, whereas cross_val_score did that automatically).
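The evaluation change looks roughly like this; the synthetic data is a stand-in, and the manual split is the extra step that cross_val_score used to handle:

```python
# Sketch of the evaluation change: a manual train/test split followed by
# classification_report, instead of cross_val_score. Toy data is illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

# Per-class precision, recall, and F1, rather than a single accuracy number.
report = classification_report(y_test, clf.predict(X_test))
print(report)
```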
Now to GridSearchCV. When I decided to select the 100 best features, setting that number to 100 was kind of a hand-wavey decision. Similarly, the RandomForestClassifier that I’m using right now has all its parameters set to their default values, which might not be optimal.
So, a straightforward thing to do now is to try different values of k (the number of features being used in the model) and any RandomForestClassifier parameters I want to tune (for the sake of concreteness, I’ll play with n_estimators and min_samples_split). Trying lots of values for each of these free parameters is tedious, and there can sometimes be interactions between the choices I make in one step and the optimal value for a downstream step. In other words, to avoid local optima, I should try all the combinations of parameters, and not just vary them independently. If I want to try 5 different values each for k, n_estimators and min_samples_split, that means 5 x 5 x 5 = 125 different combinations to try. Not something I want to do by hand.
GridSearchCV constructs a grid of all the combinations of parameters, tries each combination, and then reports back the best combination/model.
GridSearchCV seems a little scary at first, because the parameter grid is easy to mess up. There’s a particular convention being followed in the way that the parameters are named in the parameters dictionary; I need to have the name of the Pipeline step (e.g. feature_selection, not select; or random_forest, not clf), followed by two underscores, followed by the name of the parameter (in sklearn parlance) that I want to vary. To put this all together in a painfully simple example:
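Here is that naming convention in action. The step names match the Pipeline steps above, and the candidate values are illustrative placeholders, not tuned choices:

```python
# Parameter grid keys follow the "<step name>__<parameter name>" convention:
# the Pipeline step name, two underscores, then the sklearn parameter name.
# The candidate values below are illustrative, not recommendations.
param_grid = {
    "feature_selection__k": [50, 100, 150],
    "random_forest__n_estimators": [10, 50, 100],
    "random_forest__min_samples_split": [2, 5, 10],
}
```

If a key doesn’t match a real step name and parameter, GridSearchCV will raise an error at fit time, which is the usual way to discover a typo here.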
But once I’ve got the parameter grid set up properly, the power of GridSearchCV is that it multiplies out all the combinations of parameters and tries each one, making a 3-fold cross-validated model for each combination. Then I can ask for predictions from my GridSearchCV object and it will automatically return to me the “best” set of predictions (that is, the predictions from the best model that it tried), or I can explicitly ask for the best model/best parameters using methods associated with GridSearchCV. Of course, trying tons of models can be kind of time-consuming, but the outcome is a much better understanding of how my model performance depends on parameters.
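Putting it together, a sketch of the full search over the Pipeline; the grid is kept deliberately tiny so it runs quickly, and the data and step names are illustrative:

```python
# Sketch: GridSearchCV over the whole Pipeline, 3-fold cross-validation
# for every parameter combination. Data and grid values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(f_classif)),
    ("random_forest", RandomForestClassifier(random_state=0)),
])

param_grid = {
    "feature_selection__k": [10, 25],
    "random_forest__n_estimators": [10, 50],
}

grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X, y)  # tries every combination, 3-fold CV each

print(grid.best_params_)        # the winning parameter combination
predictions = grid.predict(X)   # predictions from the refit best model
```

After fitting, `predict` transparently uses the best model found, and `best_params_` / `best_estimator_` expose the winning configuration explicitly.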
I should also mention that I can use GridSearchCV on just a single object, rather than a full Pipeline. For example, I can optimize SelectKBest or the RandomForestClassifier on their own and that will work just fine. But since there can sometimes be interactions between various steps in the analysis, being able to optimize over the full Pipeline is really useful. It’s also trickier to do, which makes it a good example for teaching. Last, GridSearchCV will automatically cross validate all steps of the analysis, such as the feature selection: it’s not just the final algorithm that should be cross-validated, but the upstream transforms as well!
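For comparison, the single-object case is simpler, since a bare estimator needs no step-name prefix on its parameter names (data and values again illustrative):

```python
# Sketch: GridSearchCV on a single estimator rather than a Pipeline.
# With a bare estimator, parameter names need no "step__" prefix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 50], "min_samples_split": [2, 5]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```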
This brings me to the end of this series, about end-to-end data analysis in scikit-learn and pandas. My goal in these posts is not to show a perfect analysis, or even one that demonstrates all the steps one might try, but instead to focus on the process. If I can get something up and running quickly, even if it’s imperfect, I’m in a much better position to understand later on how much my refinements are indeed improving the analysis. At the same time, there are definitely best practices and tools (like Pipeline and GridSearchCV) that will make my life much easier as my work expands. Having a great set of tools in the python data science stack, and knowing when and how to deploy them, leaves me free to spend my time and energy on the most interesting, important, and difficult-to-automate tasks, like trying to find the uninsured.