I’ve learned many things since I joined Civis. Least expected though is a new appreciation for simple linear regression and classification models. Shortly after I started, I was asked to evaluate a collection of modeling pipelines on a sample of typical prediction problems at Civis. Glancing at the list, I distinctly remember thinking the real task here was to measure how much better the tree-based models like XGBoost would be compared to some of the “lesser” models like logistic regression.
I set to work installing and running each of the tools on the list, collecting accuracy and runtime metrics. While most were written in Python, using Scikit-Learn, NumPy and SciPy, there was a two-stage model that used the R package glmnet and Scikit-Learn. Thinking it would be easier to have a tool that was written in a single language, I started looking for the Scikit-Learn analog of glmnet, specifically the cv.glmnet function in this R package. Unfortunately, I could not find anything written in Python that emulated the functionality we needed from glmnet.
Still, calling R from Python bothered me, so I started reading the glmnet code to see if I could copy and paste my way to success. As it turns out, the majority of the code was written in FORTRAN, a language so old it’s from a time when computers had only uppercase letters. Believe it or not, this was actually good news: I just needed to write some Python code to convert the data to the shape and format expected by the FORTRAN glmnet functions, and everything would just work. A few weeks later, we had a Python implementation of glmnet. And today, we’re open sourcing the code.
So what is glmnet and why do we like it at Civis? From the package documentation:
“glmnet is a package that fits a generalized linear model via penalized maximum likelihood.”
In less formal terms, glmnet fits lasso, ridge, and elastic net versions of linear and logistic regression. For the moment, we are just going to discuss the lasso and its feature selection properties, but for a more complete explanation, see two of our favorite textbooks, ESL (The Elements of Statistical Learning) and ISL (An Introduction to Statistical Learning).
Feature selection – the process of identifying which features to include in a modeling task – is a challenging problem. Often, the problem is sufficiently complex that the solution to this is not at all obvious. My colleague Katie discussed feature selection in depth while building a model to predict whether water wells need maintenance. In this particular example, she used SelectKBest which selects the best features based on univariate statistical tests—a measure of how much each feature is related to the outcome of interest.
The lasso is another tool we can use to accomplish the same objective. The lasso works by fitting an ordinary logistic regression while imposing a penalty on the sum of the absolute values of the model’s coefficients. This has the effect of shrinking the coefficients, some all the way to zero, which effectively excludes those features and produces a sparse model. We could stop here if our task was simply producing a classification model, or we could use the selected features (those with non-zero coefficients) in a Scikit-Learn pipeline as one step in a modeling process.
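In symbols, the objective being minimized can be sketched as follows. The notation is assumed here for illustration rather than taken from the package: $n$ observations $(x_i, y_i)$ with $y_i \in \{0, 1\}$, $p$ features, an intercept $\beta_0$, coefficients $\beta$, and a regularization strength $\lambda \ge 0$.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr)\Bigr]
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The first term is the usual logistic regression loss and the second is the lasso penalty. Larger values of $\lambda$ push more of the $\beta_j$ to exactly zero.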
I omitted one important detail above, and it turns out to be one of the key reasons we use glmnet instead of other lasso solvers. I mentioned that the lasso imposes a cost on including features in the model, but not how we balance this against the desire to find an accurate model. This balance, the regularization strength, is a parameter of the model that we must choose. Typically we run a grid search, fitting a separate model for each candidate value of the regularization strength. As you can imagine, this isn’t particularly fast. One of the innovations made by the glmnet authors was making this process fast and efficient through some clever math tricks: a single call to the glmnet solver returns model solutions for a whole range of values of the regularization strength (referred to as the regularization path).
Scikit-Learn has a few solvers that are similar to glmnet, ElasticNetCV and LogisticRegressionCV, but they have some limitations. The former only works for linear regression, and the latter does not handle the elastic net penalty. They also require the user to supply the full sequence of regularization parameters, whereas glmnet will determine a suitable sequence from the input data.
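To give a feel for how a sequence of regularization strengths can be derived from the data, here is a minimal sketch for the simpler squared-error lasso. It is not glmnet’s own code: the helper name `lambda_sequence` and its default arguments are hypothetical, chosen only for illustration. The idea is that there is a smallest penalty at which every coefficient is zero, and the grid is log-spaced downward from there.

```python
import numpy as np


def lambda_sequence(X, y, n_lambda=100, min_ratio=1e-4):
    """Log-spaced lasso penalties derived from the data (hypothetical helper).

    Assumes the objective (1 / 2n) * ||y - X b||^2 + lambda * ||b||_1
    with an unpenalized intercept.
    """
    n_samples = X.shape[0]
    # Center the data so the unpenalized intercept drops out.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Smallest penalty at which all coefficients are zero: the largest
    # absolute inner product between a feature and the response, over n.
    lambda_max = np.max(np.abs(Xc.T @ yc)) / n_samples
    # Decrease geometrically from lambda_max to a small fraction of it.
    return np.geomspace(lambda_max, lambda_max * min_ratio, n_lambda)


# Small synthetic regression problem (arbitrary illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

lambdas = lambda_sequence(X, y)
print(lambdas[0], lambdas[-1], len(lambdas))
```

Because neighboring values on such a grid are close together, the solution at one value is a good starting point for the next, which is part of what makes fitting the whole path cheap.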
A brief example with synthetic data:
Generate some synthetic data:
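One way to generate such data is sketched below using only NumPy. The sample size, number of features, seed, and coefficient range are arbitrary choices for illustration, not values from the original analysis. Only a handful of features influence the label, which is the situation where the lasso’s feature selection is useful.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes: many features, only a few of them informative.
n_samples, n_features, n_informative = 1000, 20, 5

# Covariates: independent standard-normal features.
X = rng.normal(size=(n_samples, n_features))

# Only the first few features affect the outcome; the rest are pure noise.
true_coef = np.zeros(n_features)
true_coef[:n_informative] = rng.uniform(1.0, 3.0, size=n_informative)

# Binary labels drawn from a logistic model of the informative features.
prob = 1.0 / (1.0 + np.exp(-(X @ true_coef)))
y = rng.binomial(1, prob)

print(X.shape, y.shape, y.mean())
```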
Fitting a model should be familiar to anyone who has used Scikit-Learn. First, we instantiate the estimator, supplying any data-independent parameters. In our case, a few relevant options are:
alpha: the mix between the lasso and ridge penalties, with 1 being pure lasso and 0 pure ridge
n_folds: the number of cross-validation folds used to compute model performance
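For reference, the role of alpha can be written out as the elastic net penalty in the parameterization used by the R glmnet documentation. The symbols are assumed here for illustration: $\lambda$ is the regularization strength and $\beta_1, \dots, \beta_p$ are the model coefficients.

```latex
\lambda \Bigl[\, \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
  \;+\; \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^{2} \Bigr]
```

Setting $\alpha = 1$ leaves only the absolute-value (lasso) term, and $\alpha = 0$ leaves only the squared (ridge) term.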
Additional options are documented in the class docstring. Typing ?LogitNet will display it if you happen to be using the IPython interpreter or a Jupyter notebook. The LogitNet and ElasticNet estimators in the Python package are similar to cv.glmnet in the R package in that they run k-fold cross-validation to evaluate the model performance for each value of the regularization parameter and automatically select the best.
Next, we call the fit method of the estimator, passing our covariates X and our labels y.
We can also plot the coefficient path. This is a plot of the coefficient of each feature in the model as a function of the regularization parameter. When this parameter is very large, all the coefficients are zero; as it is lowered, features start entering the model. Note that the x-axis here is reversed.
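The behavior described above can be reproduced numerically with a small sketch. This is a plain NumPy coordinate-descent solver for the squared-error lasso, written for illustration only: it is not the FORTRAN routine that glmnet wraps, and the function names, grid size, and tolerances are assumptions. Each solution is used as the starting point for the next, smaller penalty.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_path(X, y, n_lambda=50, min_ratio=1e-3, n_sweeps=200, tol=1e-8):
    """Coefficient path for (1 / 2n) * ||y - X b||^2 + lambda * ||b||_1."""
    n, p = X.shape
    # Standardize features and center the response (intercept drops out).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    lambda_max = np.max(np.abs(Xs.T @ yc)) / n
    lambdas = np.geomspace(lambda_max, lambda_max * min_ratio, n_lambda)

    coef = np.zeros(p)
    resid = yc.copy()  # residual for the current coefficients
    path = np.zeros((n_lambda, p))
    for k, lam in enumerate(lambdas):
        # Warm start: coef and resid carry over from the previous penalty.
        for _ in range(n_sweeps):
            max_change = 0.0
            for j in range(p):
                # Each standardized column satisfies x_j @ x_j / n == 1,
                # so the coordinate update is a plain soft-threshold.
                rho = Xs[:, j] @ resid / n + coef[j]
                new = soft_threshold(rho, lam)
                delta = new - coef[j]
                if delta != 0.0:
                    resid -= Xs[:, j] * delta
                    coef[j] = new
                max_change = max(max_change, abs(delta))
            if max_change < tol:
                break
        path[k] = coef
    return lambdas, path


# Synthetic regression data where three of ten features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2]
     + rng.normal(scale=0.5, size=200))

lambdas, path = lasso_path(X, y)
# Number of features in the model at each penalty, largest penalty first.
n_active = np.sum(np.abs(path) > 1e-8, axis=1)
print(n_active[0], n_active[-1])
```

Plotting each column of `path` against `lambdas` would give a coefficient path plot of the kind described here, with each line starting at zero and leaving it as the penalty shrinks.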
glmnet may be installed from PyPI and is coming soon to conda-forge. The source code is also available on GitHub. We encourage you to try it for your projects, and we welcome issues or pull requests.