This is the second post in a series about end-to-end data analysis in Python using scikit-learn Pipeline and GridSearchCV. In the first post, I got my data formatted for machine learning by encoding string features as integers, and then used the data to build several different models. I got things running really fast, which is great, but at the cost of being a little quick-and-dirty about some details. First, I got the features encoded as integers, but they really should be dummy variables. Second, it’s worth going through the models a little more thoughtfully, to try to understand their performance and if there’s any more juice I can get out of them.
I’ll start by revisiting the way that I transformed my features from strings to integers. Recall that the strings were generally identifying categorical data, like the water source of a well or the village where the well is built. A problem with representing categorical variables as integers is that integers are ordered, while categories are not. The standard way to deal with this is to use dummy variables, where each possible category becomes a new boolean feature. For example, if my dataframe had a single column encoding the country, and that column took three distinct values, dummying would replace it with three boolean columns, one per country.
This type of dummying is called one-hot encoding, because the categories are expanded over several boolean columns, only one of which is true (hot). I’ll write a one-hot-encoder function that takes the data frame and the title of a column, and returns the same data frame but with one-hot encoding performed on the indicated feature. I’m using the scikit-learn OneHotEncoder object, but pandas also has a function called get_dummies() that does effectively the same thing. In fact, I find get_dummies() easier to use in many cases, but I still find it worthwhile to see a more “manual” version of the transformation at least once.
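Here’s a rough sketch of what a function like that could look like. The helper name `one_hot_encode`, the toy `country` and `depth` columns, and the naming scheme for the new columns are assumptions made for illustration, not the code originally used on the wells data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def one_hot_encode(df, column):
    """Return a copy of df with `column` replaced by one boolean column per category."""
    encoder = OneHotEncoder()
    # fit_transform expects 2-D input, hence the double brackets;
    # the default output is a sparse matrix, so convert it to a dense array
    encoded = encoder.fit_transform(df[[column]]).toarray()
    # categories_[0] holds the category labels found in this one column
    new_names = [f"{column}_{category}" for category in encoder.categories_[0]]
    dummies = pd.DataFrame(encoded.astype(bool), columns=new_names, index=df.index)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)


# hypothetical toy data, not the wells dataset
toy = pd.DataFrame({
    "country": ["A", "B", "C", "B"],
    "depth": [20, 40, 60, 30],
})
encoded_toy = one_hot_encode(toy, "country")
print(encoded_toy)
```

Each row ends up with exactly one true value among the new `country_*` columns, which is the “hot” in one-hot.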
Now I’ll iterate through a list of the columns that I want to one-hot encode, transforming each one as I go, with the final output of that process being a dataframe where all the categorical features are encoded as booleans.
One note before I code that up: one-hot encoding comes with the baggage that it makes my dataset bigger, sometimes a lot bigger. In the countries example above, one column that encoded the country has now been expanded out to three columns. You can imagine that this can sometimes get really, really big (imagine a column encoding all the counties in the United States, for example).
There are some columns in this example that will really blow up the dataset, so I’ll remove them before proceeding with the one-hot encoding.
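The trimming and the loop might look something like the sketch below. The toy columns and the `MAX_CATEGORIES` cutoff are assumptions for illustration, and I’m using pandas `get_dummies()` here to keep the example short.

```python
import pandas as pd

# hypothetical toy frame; "village" stands in for a column with many distinct values
df = pd.DataFrame({
    "village": [f"village_{i}" for i in range(8)],
    "source": ["spring", "river", "spring", "lake", "river", "spring", "lake", "river"],
    "quality": ["good", "salty", "good", "good", "salty", "good", "good", "salty"],
    "depth": [20, 40, 60, 30, 25, 45, 50, 35],
})

columns_to_encode = ["village", "source", "quality"]
MAX_CATEGORIES = 5  # assumed cutoff, chosen only for this illustration

# columns with too many distinct values get dropped instead of encoded
too_big = [c for c in columns_to_encode if df[c].nunique() > MAX_CATEGORIES]
to_encode = [c for c in columns_to_encode if c not in too_big]

trimmed = df.drop(columns=too_big)
for column in to_encode:
    # swap the string column for one boolean-style column per category
    dummies = pd.get_dummies(trimmed[column], prefix=column)
    trimmed = pd.concat([trimmed.drop(columns=[column]), dummies], axis=1)

print(too_big)
print(trimmed.shape)
```

Checking `nunique()` before encoding is a cheap way to see in advance which columns are going to blow up.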
In practice, I found that dummying my features didn’t make a huge difference in performance when I got to the modeling stage, although this is the kind of thing you generally don’t know before trying it.
I found that dummying had a huge effect on my dataset size in this case: I went from 39 features to over 3 thousand! And that takes into account aggressive trimming of the features that blew up the most. Having so many features invites overfitting and makes training slow and memory-intensive, and I almost certainly don’t need all 3 thousand features to capture the patterns in my dataset. This is a perfect use case for feature selection, which is supported in scikit-learn by e.g. SelectKBest(), which does univariate feature selection to keep the k highest-scoring features (where k is a number that I have to supply). Making a guess, I can ask for the top 100 features, which doesn’t make my performance much worse and speeds things up a lot.
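As a sketch of how that selection step works, here is SelectKBest applied to a synthetic dataset standing in for the dummied wells data. The dataset, its size, and the choice of `f_classif` as the scoring function (scikit-learn’s default) are assumptions; for boolean dummy features, `chi2` is another common choice.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-in for the one-hot-encoded data (an assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=300, n_informative=20, random_state=0
)

# score every feature against the label on its own, then keep the 100 best
selector = SelectKBest(score_func=f_classif, k=100)
X_selected = selector.fit_transform(X, y)

# indices of the columns that survived the cut
kept = selector.get_support(indices=True)
print(X.shape, "->", X_selected.shape)
```

Because the selection is univariate, each feature is scored independently of the others, so it can miss features that only matter in combination.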
Now I’ll turn my attention back to the machine learning algorithms. There can be theoretical reasons to suspect that a particular algorithm will do better or worse. I found in the last post that a random forest classifier did the best of all the models I tried, beating a logistic regression (by a lot) and a decision tree classifier (by a slimmer margin). This doesn’t come as a surprise to me, and here are a few reasons why:
- A logistic regression is an example of a linear model, which (unless you make special adaptations, which I’ll detail in a moment) assumes that the relationship between each of my features and the output class is a linear one. For example, if one of the features is the depth of a well, a linear model will assume that (all other things being equal) the difference in functionality between a 20-foot-deep well and a 40-foot-deep one will be the same as the difference between 40 feet and 60 feet. This isn’t always a valid assumption. One way to address it is to add extra features like depth squared and the logarithm of depth, which helps a linear model capture nonlinearities, but still might not allow me to get all the nuances of nonlinear relationships.
- A logistic regression also doesn’t capture interactions between features, for example that deep wells might be largely functional and wells drilled in rock are largely functional but deep wells in rocky places are largely non-functional. Again, I can explicitly add interaction terms to the logistic regression, but this gets unwieldy fast when I have many features.
- A decision tree can capture interactions and nonlinearities much more naturally than logistic regression, because of the binary tree structure of the decision tree algorithm itself. The downside of decision trees is that they can be harder to interpret, and it’s harder to assign uncertainties to their predictions.
- A random forest is a collection of decision trees, each of which is trained on a random subset of the rows/columns of the training data. The randomness in the training set means that the individual trees in a random forest are high-variance but low-bias. The final prediction is made by having each tree classify a given event and then using their predictions as “votes,” with the majority opinion being assigned as the label. The individual trees capture the nonlinearities and interactions, and ensembling many trees into a random forest tends to average out the errors of any one tree, so I get a stronger predictor overall.
- In empirical studies of many algorithms being applied to many supervised learning problems, random forests often come out on top overall. So when in doubt, or if I only have the time/resources to try one model, a random forest is likely to get at or near the peak performance of all the algorithms on the market.
- If it was tricky to interpret or compute errors for a decision tree, a random forest is only going to be worse because there are now 50-100 decision trees to worry about.
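To make the point about interactions concrete, here’s a small sketch on synthetic data (not the wells dataset). The label depends only on whether two features share a sign, which is a pure interaction effect. All the names and settings below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(2000, 2))
# the label is 1 when the two features have the same sign: a pure interaction
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# give the linear model the interaction explicitly, as an extra product column
X_train_int = np.column_stack([X_train, X_train[:, 0] * X_train[:, 1]])
X_test_int = np.column_stack([X_test, X_test[:, 0] * X_test[:, 1]])
interaction_model = LogisticRegression().fit(X_train_int, y_train)
scores["logistic + interaction"] = interaction_model.score(X_test_int, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On data like this, the plain logistic regression should hover near chance, while the tree-based models and the logistic regression with the hand-built product term should do far better. With two features, one product term is easy to add; with dozens of features, the number of candidate interaction terms grows quickly, which is the unwieldiness mentioned above.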
With these points in mind, it makes sense that my random forest did so well on this task, although one of the catches with random forests is that they have lots of parameters to optimize. How many trees should there be? How does each tree get trained? How many features get used in training each tree? There usually aren’t formulaic answers to these questions, and part of the craft of machine learning is tuning these parameters to get the best performance that I can out of my model. But with so many parameters, which sometimes interact with each other in complex ways, parameter tuning can be a huge hassle. In the next post, I’ll talk about an extremely powerful pair of tools in scikit-learn, the Pipeline and GridSearchCV, that allow crazy powerful parameter tuning in just a few lines of code.
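For a sense of what those knobs look like in scikit-learn, the questions above map roughly onto RandomForestClassifier arguments such as `n_estimators` (how many trees) and `max_features` (how many features each split considers). The sketch below tries a few combinations by hand on synthetic data. The data and the candidate values are assumptions for illustration, and this manual loop is exactly the kind of hassle that gets tedious as the number of parameters grows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in data (an assumption for illustration)
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

results = {}
for n_estimators in (10, 50):
    for max_features in ("sqrt", 0.5):
        forest = RandomForestClassifier(
            n_estimators=n_estimators,
            max_features=max_features,
            random_state=0,
        )
        # mean accuracy over 3 cross-validation folds
        results[(n_estimators, max_features)] = cross_val_score(
            forest, X, y, cv=3
        ).mean()

for params, score in results.items():
    print(params, round(score, 3))
```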