Some of our work on social media analytics was highlighted in a recent Wall Street Journal article, and it gives us a great opportunity to talk about some of the methods we use for making sense of the Twitter firehose.
Civis Analytics takes a different approach from many companies in that we have a general methodology for imposing interpretable structure onto the unruly beast that is a Twitter conversation. We don’t just monitor keywords or hashtags that relate to themes we already know (or assume) are important. Instead, we let our algorithms decide what the most important topics of conversation are, based on how often they come up and how strongly they cluster together. Letting the data speak for itself helps ensure that we aren’t missing important themes in the conversation or biasing our findings with our prior assumptions. Before I get into some of the methodology and findings, let’s take a look at one of the most eye-opening topics that emerged.
A Trending Topic: Threats to Move
When we set our algorithm loose on millions of tweets about the US presidential primaries, it learned that there was a “topic” of conversation that was strongly associated with the words below:
This topic piqued our interest especially because it hit an all-time high after the Iowa caucuses:
What this topic is actually about becomes clear once you start looking at tweets that are strongly associated with it, for example, this one. That’s right, it’s all about people threatening to leave the country (if their nightmare candidate is elected). That these sorts of tweets make up such a big part of the conversation this election season is interesting, but I’ll leave it to you to determine whether it says more about the candidates or the mood of the electorate.
Some follow-up analysis may help to sway your decision. For your consideration, here’s the breakdown by candidate of people’s threats to move [1]:
And, since Mr. Trump is by far the biggest driver of this topic, here is the breakdown of locations to which people threaten to move, should he be elected:
Predictably, given its geographical and linguistic convenience, Canada is the most common refuge. Other North American and/or English-speaking countries are also well-represented on the list. Perhaps more surprisingly, a fair number of people threaten to move to other US states such as Alaska or Hawaii if Trump is elected; it’s an open question why exactly they think this would improve their personal situation. And a few Twitter users are contemplating more extreme measures (moving to Mars, Pluto, or North Korea). Only time will tell if they have to make good on those threats!
Major Themes in the Discussion of US Presidential Candidates
Our methodology for analyzing Twitter conversations can be applied in almost any domain, but let me go into a bit more detail about our analysis of the 2015-16 US presidential primary as an example. We analyzed about 15 million tweets, spanning the time frame from May 1, 2015 until today, in which one or more of the declared presidential candidates was referenced. We then used unsupervised learning to identify the most salient themes in the social media conversation and the most important segments of users participating in the discussion.
Civis uses neural networks to identify latent topics in the conversation; some tweets may be associated with multiple topics, others with just one, or even with no general topic that is important enough to track independently. While the model doesn’t tell us the “name” of a topic, it does tell us what words and tweets are most strongly associated with it, which makes it fairly straightforward to assign interpretable labels to them.
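To make the labeling step a little more concrete, here is a minimal sketch of how a topic model's output might be read. It does not reproduce our neural network. It assumes the model has already produced a weight for each (topic, word) pair and an activation for each (tweet, topic) pair. The topic names, words, weights, and the 0.5 threshold are all made up for illustration.

```python
from typing import Dict, List

# Hypothetical model output: one weight per (topic, word) pair.
# These numbers are invented for illustration, not taken from our model.
TOPIC_WORD_WEIGHTS: Dict[str, Dict[str, float]] = {
    "topic_a": {"moving": 0.9, "canada": 0.8, "leave": 0.6, "poll": 0.1},
    "topic_b": {"poll": 0.9, "lead": 0.7, "iowa": 0.5, "moving": 0.05},
}


def top_words(topic: str, k: int = 3) -> List[str]:
    """Return the k words most strongly associated with a topic."""
    weights = TOPIC_WORD_WEIGHTS[topic]
    # Sort by descending weight, breaking ties alphabetically.
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]


def assign_topics(activations: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every topic whose activation for a tweet meets the threshold.

    The result can hold several topics, exactly one, or none at all.
    """
    return sorted(t for t, a in activations.items() if a >= threshold)
```

Reading off `top_words("topic_a")` gives a human enough context to attach a label such as "threats to move", and `assign_topics` shows how a single tweet can land in several topics or in none, depending on where the threshold sits.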
For example, in the presidential campaign analysis, many of the 44 topics identified by our model had to do with the candidates themselves. This is clear from the high activation that candidates’ names and hashtags have with the topic (and the other words that show up in the list can be revealing about candidate perceptions!).
Ted Cruz’ birthplace clearly figures prominently in tweets about him, as do his Simpsons impressions and his non-traditional bacon preparation. Trump’s topic is dominated by supportive hashtags and contains less specific terms (in part because he is discussed so frequently and in so many contexts). Rubio’s topic shows evidence of interest in his family’s Cuban background and traffic citations, as well as an incomplete pass he threw in Iowa. And when people talk about Chris Christie on Twitter, they typically aren’t focusing on his experience as a federal prosecutor.
Some other topics identified by the model have to do with major issues that have been debated by the candidates or events that happened during the campaign. Of course, immigration policy and border security have been major themes of the Republican primary. Funding for Planned Parenthood has been discussed in both contests, as has Hillary Clinton’s use of a private server for e-mail during her tenure at the State Department. The focus of these topics is immediately apparent from the sort of words associated with each:
Among the other major themes in the presidential primary conversation are one about general campaign announcements, one on poll results, one on the primary debates, and one consisting entirely of profanity. (We’ll omit the list of associated words for that last one.)
Who is Talking about this Stuff?
Figuring out what people are talking about on Twitter is helpful, but we can take things a step further by segmenting the users who are actually engaging in the discussion. Civis uses a method for automatic clustering of Twitter users based on their relationships with others in the social network that defines a conversation. As with the identification of topics, actually assigning labels to these user groups (or communities) is typically fairly straightforward.
The diagram below illustrates the three user communities that our algorithm identified as most separable in the discussion of US presidential candidates. In the diagram, each node represents a user account, the color indicates the user’s community, and the size indicates how many followers the user has. Nodes that are closer together are generally more similar in the types of users that follow them and the types of users they follow themselves.
Since the topic under discussion is political, it is probably not surprising that the user groupings that emerge are politically based. There is a Conservative cluster including accounts such as the NRA, the Cato Institute, and most of the GOP presidential candidates; a Progressive cluster with accounts like the Nation, Occupy Wall Street, and Bernie Sanders; and a group of accounts we’ve labeled as the Media. The Media cluster includes accounts like The Wall Street Journal, POLITICO, and (yes) The Onion, but also some more centrist organizations and politicians (including Hillary Clinton and Jeb Bush).
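For readers who want a feel for how accounts can be grouped by their follow relationships, here is a rough sketch. It is a deliberately simple stand-in (Jaccard overlap of follower sets, then linking any pair above a threshold), not the clustering method we actually use. The account names, follower IDs, and the 0.5 threshold are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of followers two accounts have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_accounts(followers: Dict[str, Set[str]], threshold: float = 0.5) -> List[List[str]]:
    """Group accounts whose follower overlap meets the threshold.

    Accounts are merged transitively with a small union-find structure,
    so A and C share a group if A overlaps B and B overlaps C.
    """
    parent = {name: name for name in followers}

    def find(x: str) -> str:
        # Walk up to the group's root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sorted(followers), 2):
        if jaccard(followers[a], followers[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in followers:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Made-up accounts and follower IDs, purely for illustration.
SAMPLE_FOLLOWERS: Dict[str, Set[str]] = {
    "acct_a": {"u1", "u2", "u3"},
    "acct_b": {"u1", "u2", "u4"},
    "acct_c": {"u7", "u8"},
    "acct_d": {"u7", "u8", "u9"},
    "acct_e": {"u10"},
}
```

On the sample data, `cluster_accounts(SAMPLE_FOLLOWERS)` puts `acct_a` with `acct_b` and `acct_c` with `acct_d`, and leaves `acct_e` on its own. A real conversation graph is far larger and noisier, which is one reason a simple threshold like this would not be enough in practice.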
Now that we have categories associated with some of the most important and active users in the conversation, we could do all sorts of crosstabbing and analysis of these communities across topics, time and space.
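As a small illustration of the kind of crosstab we have in mind, the sketch below counts tweets by community and topic. It assumes each tweet has already been tagged with its author's community and a topic label. The pairs used here are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def crosstab(rows: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count tweets per (community, topic) pair.

    Each row is a hypothetical (community, topic) tag for one tweet.
    """
    table: Dict[str, Counter] = {}
    for community, topic in rows:
        table.setdefault(community, Counter())[topic] += 1
    return table


# Made-up tagged tweets, purely for illustration.
SAMPLE_ROWS = [
    ("Conservative", "immigration"),
    ("Conservative", "immigration"),
    ("Conservative", "debates"),
    ("Progressive", "debates"),
    ("Media", "polls"),
]
```

With the sample rows, `crosstab(SAMPLE_ROWS)["Conservative"]["immigration"]` is 2. Adding a date or location to each row would extend the same idea to time and space.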
The presidential primary campaign has been a wild ride so far, and we’ve had fun applying our models in this area, but the methodology can be applied to any topic or brand to gain insights and strategically join conversations.
Written by Derrick Higgins
[1] Tweets were filtered to ensure they included a reference to a particular candidate and matched the regular expression pattern “m moving to ([A-Z][a-z]+(?: [A-Z]+[a-z]+)?)” indicating a threat to move to a specific location.
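Here is a short sketch of how a filter like this could be applied. The regular expression is the one quoted above. Everything else is an assumption made for illustration: the case-insensitive substring check for the candidate's name, the function names, and the sample tweets, which are invented.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# The pattern quoted in the footnote. It starts at the "m" of "I'm", so it
# matches with either a straight or a curly apostrophe.
MOVE_PATTERN = re.compile(r"m moving to ([A-Z][a-z]+(?: [A-Z]+[a-z]+)?)")


def extract_destination(tweet: str) -> Optional[str]:
    """Return the capitalized place name after 'm moving to', if any."""
    match = MOVE_PATTERN.search(tweet)
    return match.group(1) if match else None


def destinations_for(tweets: Iterable[str], candidate: str) -> Counter:
    """Count destinations in tweets that mention the candidate.

    Assumption: a candidate "reference" is a case-insensitive substring
    match on the name. The actual matching rule is not specified above.
    """
    counts: Counter = Counter()
    for tweet in tweets:
        if candidate.lower() not in tweet.lower():
            continue
        destination = extract_destination(tweet)
        if destination:
            counts[destination] += 1
    return counts


# Invented example tweets, not real data.
SAMPLE_TWEETS = [
    "I'm moving to Canada if Trump wins",
    "If Trump is elected I’m moving to New Zealand",
    "Trump again? I'm moving to Canada.",
    "I'm moving to Canada next year for school",
    "trump wins and im moving to canada",
]
```

On the sample tweets, `destinations_for(SAMPLE_TWEETS, "Trump")` counts Canada twice and New Zealand once. The fourth tweet is skipped because it names no candidate, and the last one is missed because the pattern requires a capitalized place name.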