At Civis, we are working more and more with Apache Spark, a tool that is almost synonymous with #BigData these days. With Thanksgiving coming up, let’s use Spark to analyze some tweets about Thanksgiving from last year (2016).
Goals of this post:
- Do some fun and interesting analyses of Twitter data in time for Turkey Day.
- Provide an introductory example for Spark that’s a bit more complex than simple word counting.
- Bonus: show how to make state-level choropleth maps in Python (surprisingly, this was almost as tricky as Spark).
First, take a look at this map because choropleth maps are pure eye-candy, possibly even more delicious than pumpkin pie.
This map shows, for each state, the rate of Twitter users tweeting explicitly about turkey (specifically, using the word “turkey” or the hashtag “#turkey”) on last Thanksgiving, adjusted (or “normalized”) by an estimate of the number of generally active Twitter users in each state. Note that the coloring for these maps is on a linear scale, where light gray corresponds to zero activity and deep blue corresponds to the maximum adjusted activity level for a topic.
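If you’re curious what that adjustment looks like in practice, here’s a minimal sketch with made-up numbers (the actual per-state baselines and column names aren’t shared here):

```python
import pandas as pd

# Made-up per-state counts; the actual baseline is an estimate of
# generally active Twitter users in each state.
df = pd.DataFrame({
    "state": ["MI", "TX", "KS"],
    "topic_users": [1200, 3400, 800],            # users tweeting the topic
    "active_users": [250_000, 900_000, 90_000],  # estimated active users
})

# Adjusted ("normalized") rate: topic tweeters per active user.
df["rate"] = df["topic_users"] / df["active_users"]

# Linear color scale: 0 -> light gray, max rate -> deep blue.
df["color_value"] = df["rate"] / df["rate"].max()
print(df)
```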
Next, let’s look at a map for users who explicitly mentioned Thanksgiving (“thanksgiving”, “#thanksgiving” or “#happythanksgiving”):
The map looks broadly similar in that both turkey and Thanksgiving were tweeted about all over the country. However, a lot more people tweeted “Happy Thanksgiving” and the like than tweeted about turkey: about 500,000 versus about 80,000, respectively. Note: these numbers are for users with U.S. profile locations from midnight to midnight Pacific time last Thanksgiving, not including retweets.
If you want to see some clearly regional topics, look no further than the following maps for a couple of the football teams that play on Thanksgiving:
The approximately 15,000 tweets about the Detroit Lions (“lions”, “@lions”, “#lions”) were largely from folks in Michigan, and the approximately 34,000 tweets about the Dallas Cowboys (“cowboys”, “@dallascowboys”, “#cowboys”) were largely from folks in Texas. Cowboys-related tweets had a slightly broader geographic distribution, which might be why some people (Cowboys fans, in particular) call them “America’s team”. For reference, the Lions played against the Vikings, and the Cowboys played against the Redskins, which explains the activity around Minnesota and Virginia, respectively.
Moving on to dessert, here’s a map of tweets about pie (“pie” or “#pie”), a topic roughly 24,000 users mentioned. Folks around Kansas and Nebraska seem really into pie (and rightly so!).
Finally, let’s not forget that right after Thanksgiving is Black Friday (“black friday” or “#blackfriday”). Here’s a map of where the approximately 50,000 users tweeting about that were from:
Now, let’s get back to Apache Spark and talk about how those maps were created!
A lot of intro materials for Spark use counting words as a first example, and that’s essentially what I did for this post, but I needed some slightly more interesting transformations and aggregations than just splitting strings and summing up counts. Being a Python programmer, I used PySpark, which has a really nice DataFrame API that seems like it’d be intuitive for anybody who knows how to use pandas or SQLAlchemy. For this post, I didn’t need to run my code on a Spark cluster, but running on my laptop didn’t really seem slower or more difficult than using pandas or built-in Python tools. I suspect that would not have been the case even a year ago: PySpark seems to have come a long way very quickly.
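To give a flavor of what that looks like, here’s a minimal PySpark sketch of the core pattern: filter out retweets, match tweets against a list of terms, and count distinct users per state. The toy rows and field names are assumptions, not the actual GNIP schema:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("thanksgiving-tweets").getOrCreate()

# Toy rows standing in for the real data (field names are assumptions).
tweets = spark.createDataFrame(
    [
        ("u1", "MI", "Happy #thanksgiving! turkey time", False),
        ("u2", "TX", "go cowboys", False),
        ("u3", "TX", "RT @pal: turkey turkey turkey", True),
    ],
    ["user_id", "state", "text", "is_retweet"],
)

terms = F.array(F.lit("turkey"), F.lit("#turkey"))

topic_counts = (
    tweets
    .filter(~F.col("is_retweet"))                      # drop retweets
    .withColumn("tokens", F.split(F.lower("text"), r"\s+"))
    .filter(F.arrays_overlap("tokens", terms))         # any matching token
    .groupBy("state")
    .agg(F.countDistinct("user_id").alias("n_users"))  # users, not tweets
)
topic_counts.show()
```

If you’ve used pandas groupby/agg or written SQL, the shape of this should feel familiar, and the same code runs unchanged on a laptop or a cluster.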
Although I can’t share the data, which was downloaded through Twitter’s enterprise data service GNIP, I hope the following notebook is useful for people trying to learn Spark. It also includes code for producing the maps above in Python, which took a little while to figure out (tl;dr: right now, I’d recommend geopandas for static plots or bqplot for dynamic ones).
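For the choropleths, the geopandas route boils down to merging per-state rates onto state geometries and plotting. Here’s a sketch under a few assumptions: the shapefile name and its “STUSPS” postal-code column are from the Census Bureau’s TIGER/Line states file, and the toy rates stand in for the real ones:

```python
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

# Toy per-state rates; "state" holds two-letter postal abbreviations.
rates = pd.DataFrame({"state": ["MI", "TX", "KS"], "rate": [0.8, 1.0, 0.6]})

# Any US-states shapefile works; in the Census TIGER/Line file, the
# "STUSPS" column holds the postal abbreviation.
states = gpd.read_file("tl_2016_us_state.shp")
merged = states.merge(rates, left_on="STUSPS", right_on="state")

ax = merged.plot(
    column="rate",   # color each state by its rate
    cmap="Blues",    # light at zero, deep blue at the max
    vmin=0,          # anchor the bottom of the scale at zero activity
    edgecolor="gray",
    linewidth=0.5,
    figsize=(12, 7),
)
ax.set_axis_off()
plt.savefig("choropleth.png", bbox_inches="tight")
```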
#HappyThanksgiving from Civis!