Here at Civis, we love models. But not the ones that wear swimsuits or walk down runways: the kind of curves we care about are mathematical representations of human behavior, and we employ models to predict everything from political beliefs to consumer preferences.
Imagine that it’s your friend’s birthday, but you don’t know what present to buy for her.
If you’re anything like us, you might first try to find data on what gifts other adults like and, from that, make a prediction about what your friend will like. A simple way to do that is through a cross-tabulation, which compares responses across two or more dimensions:
| Age group | Most popular gift | Second most popular |
|-----------|-------------------|---------------------|
| 18-29 | Electronics | Books and Music |
| 30-44 | Books and Music | Babycare / Children's Toys |
| 45-65 | Sports and Outdoors | Home and Garden |
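Under the hood, a cross-tab is just a count over every combination of the dimensions you pick. As a minimal sketch (with made-up survey rows, not real data), here is one built from scratch:

```python
from collections import Counter

# Hypothetical survey responses: (age_group, preferred_gift).
# These rows are invented for illustration only.
responses = [
    ("18-29", "Electronics"), ("18-29", "Electronics"),
    ("18-29", "Books and Music"),
    ("30-44", "Books and Music"), ("30-44", "Books and Music"),
    ("30-44", "Babycare / Children's Toys"),
    ("45-65", "Sports and Outdoors"), ("45-65", "Home and Garden"),
    ("45-65", "Sports and Outdoors"),
]

# A cross-tab counts how often each (row, column) combination occurs.
crosstab = Counter(responses)

for (age, gift), n in sorted(crosstab.items()):
    print(f"{age:6} | {gift:26} | {n}")
```

In practice you would use a library routine (e.g. `pandas.crosstab`) rather than counting by hand, but the idea is the same, and it shows why adding dimensions explodes the number of cells you have to fill.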
While cross-tabs are great for comparing a few dimensions, they become overwhelming if you have dozens. (What if preferences differ by age, marital status, region, income, and occupation?) More importantly, once you add many dimensions, how can you make sure that there are enough people in each combination of dimensions to be confident about your prediction?
Instead of having to interpret many complicated crosstabs, we can use models to estimate how likely someone is to prefer a given gift. More flexible and precise than crosstabs, models take into account hundreds of dimensions at once and reveal hidden relationships across them. When you build a cross-tab, you have to select exactly which dimensions you will analyze; by contrast, models rely on a combination of human and machine intelligence to “pick out” which characteristics have the strongest relationship to the outcome of interest (in our case, someone’s preferred gifts) and leverage mathematical relationships to generate predictions from those characteristics.
Modeling for Clean Energy
One of our clients is a company that connects consumers with green energy providers. Recently, they asked us to help them build a model to predict how likely someone is to sign up for clean energy if asked. The company had a huge mailing list, and prior experience had shown that the vast majority of people who received their mail weren’t going to sign up. In a world of limited resources, how could they identify the most likely people to sign up and make sure to reach out to them?
That’s where modeling comes in. Looking at records from past mail campaigns, Civis analyzed which people were most likely to respond to their mailings. The idea is that if we can identify patterns in the data about who has responded, we can use those patterns to predict who will respond in the future.
A conventional approach to this problem is to segment existing data into groups by age, gender, or other demographic information and then look at response rates across those groups. That can be a good first step, but to build a more precise model, we want to consider a greater number of characteristics. In our case, we looked at academic and business literature to see what traits might be associated with energy preferences. This led us to gather data on everything from weather patterns to average commute time to home ownership, in addition to the typical demographic data. As we gathered this data, we also had to clean all of it and conduct checks to make sure it had been captured accurately.
Once we had the data, we used our custom-built software to construct a few different models and compare their performances. In the end, our best models confounded some of our initial intuitions: for example, we were pretty sure that household income would be related to energy preferences, but we found out this relationship changes based on where you live. Also, it’s not just about household income: education and marital status also play a large role. Our final model leveraged these insights and more.
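Our model-building software is custom-built, but the comparison step can be sketched in a few lines: fit several candidate models, then score each one on the same held-out records it never saw during training. The toy data and rules below are invented for illustration; the second “model” mimics the income-by-region interaction described above.

```python
# Hypothetical held-out records: (income_bracket, region, responded).
holdout = [
    ("high", "urban", 1), ("high", "rural", 0),
    ("low",  "urban", 1), ("low",  "rural", 0),
    ("high", "urban", 1), ("low",  "rural", 1),
]

# Candidate model A: household income alone predicts a response.
def model_a(income, region):
    return 1 if income == "high" else 0

# Candidate model B: the income/response relationship depends on region.
def model_b(income, region):
    return 1 if region == "urban" else 0

def accuracy(model, data):
    """Fraction of held-out records the model predicts correctly."""
    hits = sum(model(income, region) == y for income, region, y in data)
    return hits / len(data)

print(accuracy(model_a, holdout))
print(accuracy(model_b, holdout))
```

Whichever candidate scores best on held-out data (by accuracy here, though real evaluations usually use richer metrics) becomes the final model.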
After we had a model, we applied it to all potential mail targets and created a score that assigned each of them a likelihood of responding to a future mailing. Then, we divided our client’s mailing list into 10 groups based on how likely people were to respond: Group 1 contained the people we predicted were least likely to respond, and Group 10 the most likely. By sorting their list with the model scores, our client could prioritize their top targets.
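The scoring-and-grouping step is mechanical once you have a model: sort everyone by score and cut the sorted list into ten equal groups. A minimal sketch, with invented names and scores:

```python
# Hypothetical model scores for 20 mail targets (higher = more likely to respond).
scores = {f"person_{i}": i / 20 for i in range(1, 21)}

# Rank targets from least to most likely to respond.
ranked = sorted(scores, key=scores.get)

# Split the ranked list into 10 equal groups: Group 1 holds the least
# likely responders, Group 10 the most likely.
group_size = len(ranked) // 10
groups = {g: ranked[(g - 1) * group_size : g * group_size] for g in range(1, 11)}

print(groups[10])  # the top targets to prioritize
```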
Working with our client, we tested our model by sending mail and recording the response rates. We didn’t send mail to every group (we’re just testing, after all), but we did send mail to five of the ten.
Exactly as our model predicted, the individuals in Group 10 responded at the highest rate—double those in Group 8 and five times those in Group 4 (unfortunately, we can’t show the actual response rates, but rest assured that our bar graph starts at 0!).
This next plot shows how much higher the response rate is compared to the expected response rate of a “control” list that any company might use. Our top decile responded about 3 times more often than recipients in the control list.
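“Lift” here is just each group’s response rate divided by the control list’s rate. The real rates are confidential, so the counts and control rate below are invented purely to match the relative pattern described above:

```python
# Hypothetical mail counts per tested group (real figures are confidential).
mailed = {4: 1000, 8: 1000, 10: 1000}
responded = {4: 8, 8: 20, 10: 40}
control_rate = 0.013  # assumed response rate of a generic "control" list

def lift(group):
    """Response rate of a group relative to the control list."""
    rate = responded[group] / mailed[group]
    return rate / control_rate

for g in sorted(mailed):
    print(f"Group {g}: lift = {lift(g):.1f}x over control")
```

With these made-up numbers, Group 10 responds at double Group 8’s rate, five times Group 4’s, and about 3x the control rate, which is the shape of the result described above.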
Ultimately, our model helped the renewable energy company identify those most likely to respond to their mail and, indirectly, promote the use of clean energy in the US! Another victory for modeling!
Written by Annie Wang. This post was co-authored with Elaine Lee.