Barack Obama recently gave his final State of the Union address, and since we’re interested in analyzing text data at Civis Analytics, I figured I ought to see if I could discover anything interesting. Rather than trying to understand the conversation on social media as we’ve done in previous work, I decided to take a somewhat longer view, comparing the text of this year’s speech to the texts of all of the previous addresses, starting with George Washington’s first address in 1790.
It turns out that one can use data science to get some pretty interesting insights out of State of the Union addresses with just some very simple text analysis methods.
- There appear to be a few major turning points in the State of the Union timeline where there are large, lasting shifts in the language used. In particular, the period from 1815 until World War I and the period after World War II form coherent blocks of time during which addresses are similar to each other but dissimilar from other time periods.
- Speeches near each other in time tend to be more similar, but the words that make them similar differ: for example, Barack Obama’s 2016 speech was similar to George W. Bush’s because of discussion of terrorism, but it was similar to Clinton’s speeches because of discussion about jobs, young people, and technology.
- The salient topics in Barack Obama’s addresses are jobs, kids, college, clean energy, terrorism, Iraq, and Afghanistan.
- Looking at a few key topics from the past 40 years, we see that Bill Clinton spoke a lot about kids and families compared to other recent presidents, George W. Bush spoke a lot about terrorism, and Barack Obama spoke a lot about jobs and businesses. Also, mentions of oil and energy fell off after Jimmy Carter’s addresses but have increased in the past few years.
Here’s a quick visualization for that last point. I’ll explain it in more detail later.
In the rest of the post, I’ll explain a simple method for computing similarities between addresses before presenting a broad historical overview with analyses of topical and linguistic change over time. I’ll then focus in on Barack Obama’s addresses and how they related to some trends in the past 40 years, before offering a few parting thoughts.
Computing similarities between addresses
A lot of this post is about comparing different State of the Union addresses to see how similar they are. Here’s the (relatively simple) text analysis methodology for doing that.
I took the data set of State of the Union address texts and performed the following steps:
- split each address into sentences, and each sentence into words;
- combined the list of sentences for each address into a bag of words;
- removed stopwords (e.g., “the”, “a”, “an”), very rare words, numbers, and punctuation;
- created a sparse matrix with the word counts for each speech (number of addresses by number of words);
- weighted the words using log entropy weighting;
- and finally computed the cosine similarity between each pair of address row vectors.
This results in a matrix of similarity scores between addresses, which can be visualized as follows.
In this similarity matrix visualization, each address corresponds to a row and a column. The similarity between address i and address j is shown at the cell at row i and column j (or row j and column i). Darker cells indicate higher similarities, with the diagonal being maximally dark because it corresponds to an address’s similarity to itself. Note that the cells above the diagonal are a reflection of those below it. This makes it easier to compare a single address across time by looking across a single row or column.
Click here for a version with a label for each address.
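For readers who want to see the similarity computation from this section in code, here is a minimal sketch using only the Python standard library. It is not the code used for this post: the three toy "addresses", the tiny stopword list, and the function names are made up for illustration, and the log entropy formula shown (local weight log(1 + count), global weight 1 plus the normalized entropy of the word across documents) is one common formulation that may differ in detail from the one actually used.

```python
import math
from collections import Counter

# Toy stand-ins for address texts (hypothetical, not the real corpus).
docs = {
    "A": "jobs and college and jobs for our kids",
    "B": "jobs for kids and clean energy",
    "C": "treaties and territories and colonial powers",
}
STOPWORDS = {"and", "for", "our", "the", "a", "an"}  # tiny illustrative list


def bag_of_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def log_entropy_weights(counts_by_doc):
    """Weight each count by log(1 + tf) times a global entropy factor.

    Assumes at least two documents (log(1) would make the divisor zero).
    """
    n_docs = len(counts_by_doc)
    totals = Counter()
    for counts in counts_by_doc.values():
        totals.update(counts)
    # Global weight per word: 1 + sum_j p_j * log(p_j) / log(n_docs),
    # where p_j is the share of the word's occurrences found in document j.
    global_w = {}
    for word, total in totals.items():
        entropy = 0.0
        for counts in counts_by_doc.values():
            if counts[word]:
                p = counts[word] / total
                entropy += p * math.log(p)
        global_w[word] = 1.0 + entropy / math.log(n_docs)
    return {
        name: {w: global_w[w] * math.log(1 + tf) for w, tf in counts.items()}
        for name, counts in counts_by_doc.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


counts = {name: bag_of_words(text) for name, text in docs.items()}
weighted = log_entropy_weights(counts)
names = sorted(weighted)
# Full (symmetric) similarity matrix, stored as a dict keyed by name pairs.
similarity = {(a, b): cosine(weighted[a], weighted[b]) for a in names for b in names}
```

In this toy example, A and B share "jobs" and "kids" and so get a positive score, while A and C share no words and score zero. Words spread evenly across all documents receive a global weight near zero, which is what lets the weighting downplay uninformative vocabulary. The gensim library listed in the notes provides a `LogEntropyModel` class for the weighting step, though its exact formula may differ slightly from the sketch.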
Looking across the history of the State of the Union
There are probably many observations to be made about the similarity matrix above, but the thing that seems most salient is that there are large blocks of similar speeches, perhaps representing important eras of American history: 1815 to 1912 (pre-WWI), 1923 to 1932 (Coolidge and Hoover), 1946 to 2016 (post-WWII). (Note: these blocks probably also to some extent represent changes in dialect, both as American English changed over time and since the State of the Union changed from being frequently written to being delivered orally.)
The pre-1815 speeches varied quite a lot from each other, though each presidency makes up a little block. Jefferson’s addresses in particular form a dark block of similar cells along the diagonal in the upper left of the plot.
Focusing on post-WWII addresses, the post-Reagan addresses appear to make up a block, perhaps because they shift away from talking about international issues such as the Cold War and focus a lot on jobs and families. (I’ll try to interpret this shift a bit more below.)
There are also some outliers in the above plots that are interesting to explore. For example, Bush’s post-9/11 speech, largely about terrorism, is dissimilar to everything except Bush’s subsequent speeches. Carter’s 1981 address is the longest address at over 33,000 words, many times longer than most speeches since 1900 (e.g., Barack Obama’s 2016 address had about 5,200 words). The 1981 address was written rather than spoken (as were many State of the Union addresses in America’s early years), and though I normalized for the length of speeches in our analyses, its extreme length probably resulted in partial overlap with a lot of other addresses.
We can also “zoom in” by restricting the matrix plot to only include addresses from a particular time period. Note that this causes the mapping from similarity scores to colors to change a bit because the general level of similarity is a bit higher for smaller time periods. This may allow finer-grained distinctions to be made. In the plot below, we’ll zoom in on the post-WWII period.
Examining language and topical change over time
To help better understand the structure of the matrix visualization, I computed the mean of the log entropy scores for each word during various time periods (e.g., pre-WWI). I then ranked words for several time periods in an attempt to get the most salient or interesting words for those time periods. For lack of a better term, we’ll call these the most “salient” words.
For example, there is a stark contrast in the salient words before and after Franklin Delano Roosevelt’s (FDR) presidency, which spanned some of the most difficult years the nation has faced because of the Great Depression and WWII. There appears to be a turning point in the language of addresses around WWI or WWII. Much of the language that shows up as particular to the period of time prior to this turning point pertains to the growth of the country, its relationship with colonial powers in Europe, treaties, territories, etc. During and following FDR’s presidency, the language shifts to focus on government programs, jobs, the economy, the Cold War, energy. There also appears to be a larger focus on statistics (e.g., “millions” and “billions” show up as salient words), which at a glance appears related to increased discussion about jobs and government revenues and expenditures.
Salient words for addresses by era
|Pre-FDR period (1790 – 1932)||FDR’s presidency (1934 – 1945)||Post-FDR period (1946 – 2016)|
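As a rough illustration of how salient words for an era can be found, the sketch below averages per-word weights over the addresses in a date range and ranks words by that mean. Ranking directly by the mean is an assumption (the exact ranking criterion is not spelled out above), and the years, words, weights, and the `salient_words` function name are invented for the example.

```python
from collections import defaultdict

# Hypothetical log entropy-weighted vectors, keyed by year (made-up values).
weighted_by_year = {
    1800: {"treaty": 0.9, "territory": 0.7, "government": 0.2},
    1850: {"treaty": 0.8, "territory": 0.5, "government": 0.3},
    1950: {"jobs": 0.9, "billion": 0.6, "government": 0.3},
    2000: {"jobs": 0.8, "energy": 0.5, "government": 0.2},
}


def salient_words(weighted_by_year, start, end, top_n=2):
    """Rank words by their mean weight over addresses from start to end (inclusive).

    A word missing from an address contributes a weight of zero to the mean.
    """
    in_era = [vec for year, vec in weighted_by_year.items() if start <= year <= end]
    if not in_era:
        return []
    sums = defaultdict(float)
    for vec in in_era:
        for word, weight in vec.items():
            sums[word] += weight
    means = {word: total / len(in_era) for word, total in sums.items()}
    # Sort by descending mean weight, breaking ties alphabetically.
    return sorted(means, key=lambda w: (-means[w], w))[:top_n]


pre_fdr = salient_words(weighted_by_year, 1790, 1932)
post_fdr = salient_words(weighted_by_year, 1946, 2016)
```

With these toy numbers, "treaty" and "territory" rank highest for the earlier period and "jobs" and "billion" for the later one, while "government" stays low in both because its weight is small everywhere.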
Barack Obama’s State of the Union addresses
We can also zoom in on just President Obama’s speeches.
Obama’s addresses are all relatively similar to each other. However, not surprisingly, similarity is generally highest between addresses in consecutive years. Looking at the words that are strongly associated with Obama’s addresses, we see a focus on jobs, kids, college, clean energy, terrorism, Iraq, and Afghanistan. Comparing the top words for his first term and second term, the most notable thing seems to be a shift away from talking about Iraq and toward talking about terrorism and Afghanistan. (Note that “al” and “qaeda” show up as different words because I didn’t do any detection of multiword expressions.)
Focusing just on Obama’s 2016 speech, note the use of the word “voices”, which appeared 10 times in singular or plural form (e.g., “democracy breaks down when the average person feels their voice doesn’t matter”). The string “voice” only appears 94 times total in the other 229 addresses.
Salient words for Obama’s addresses
|Obama’s presidency (2009 – 2016)||Obama’s first term (2009 – 2012)||Obama’s second term (2013 – 2016)||Obama’s final address (2016)|
We can also look at which specific words make Obama’s 2016 speech similar to previous speeches. To do this, I took the log entropy-weighted word vector for the 2016 speech and computed the elementwise product with the vectors for each previous speech, respectively. I then found the words for each speech with the largest magnitude for that product. These are essentially the salient or interesting words that overlapped between the 2016 speech and the previous speech. To avoid information overload (if we aren’t there already), the table below just shows the results going back to President Jimmy Carter’s addresses, and just the top three overlapping salient words. One thing to note is that similarity generally decreases as we go back in time, as can be seen in the similarity matrices plotted above (e.g., the 2016 speech was most similar to Obama’s other speeches as well as the other speeches in the last 20 years or so).
|Address||Top overlapping words with 2016|
|Carter 1978||jobs, oil, hardworking|
|Carter 1979||tonight, jobs, commitment|
|Carter 1980||oil, iran, afghanistan|
|Carter 1981||oil, sector, solar|
|Reagan 1982||voices, jobs, sits|
|Reagan 1983||jobs, job, sector|
|Reagan 1984||voices, tougher, tonight|
|Reagan 1985||jobs, pushing, tonight|
|Reagan 1986||tonight, planet, commitment|
|Reagan 1987||syria, tonight, kids|
|Reagan 1988||fighters, tonight, talk|
|Bush 1989||tonight, voices, kids|
|Bush 1990||kids, tonight, got|
|Bush 1991||voices, iraq, tonight|
|Bush 1992||big, jobs, tonight|
|Clinton 1993||got, jobs, cuts|
|Clinton 1994||kids, everybody, got|
|Clinton 1995||voices, kids, got|
|Clinton 1996||harder, businesses, voices|
|Clinton 1997||internet, tonight, college|
|Clinton 1998||got, internet, college|
|Clinton 1999||tonight, computer, iraq|
|Clinton 2000||internet, big, college|
|Bush 2001||tonight, big, energy|
|Bush 2001 #2||terrorists, terrorist, tonight|
|Bush 2002||terrorist, terrorists, coalition|
|Bush 2003||al, terrorist, terrorists|
|Bush 2004||terrorists, iraq, terrorist|
|Bush 2005||terrorists, iraq, got|
|Bush 2006||terrorist, qaeda, terrorists|
|Bush 2007||qaeda, al, terrorists|
|Bush 2008||qaeda, al, terrorists|
|Obama 2009||jobs, college, businesses|
|Obama 2010||kids, businesses, jobs|
|Obama 2011||internet, kids, qaeda|
|Obama 2012||kids, jobs, got|
|Obama 2013||qaeda, kids, al|
|Obama 2014||kids, businesses, jobs|
|Obama 2015||kids, terrorists, networks|
This suggests that the similarity between Obama’s 2016 speech and his previous speeches was because of his discussion of kids, college, jobs, and terrorism. Its (more moderate) similarity to George W. Bush’s addresses was due to discussion of terrorism, whereas the similarity to Bill Clinton’s addresses was due to college, jobs, and technology. We can even go back and compare Obama’s 2016 speech to Carter’s speeches. Though there is less similarity there compared to more recent speeches, there is some interesting overlap in discussion about energy and the Arab world. Across this time period, we also see some overlap in the use of modern colloquial language (e.g., “got”, as in Obama’s 2016 statement that “we’ve actually got to cut the cost of college”).
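The overlapping-word computation behind the table in this section can be sketched in a few lines. This is an illustrative version, not the original code: the weights and the `top_overlap` function name are hypothetical, and the vectors are stored as plain dicts rather than rows of a sparse matrix.

```python
def top_overlap(u, v, top_n=3):
    """Return the words with the largest elementwise product of weights.

    Only words present in both vectors can have a nonzero product.
    """
    products = {w: u[w] * v[w] for w in u.keys() & v.keys()}
    # Sort by descending magnitude, breaking ties alphabetically.
    return sorted(products, key=lambda w: (-abs(products[w]), w))[:top_n]


# Made-up weighted vectors for two addresses.
speech_2016 = {"jobs": 0.6, "kids": 0.8, "terrorists": 0.5, "voices": 0.9}
earlier = {"jobs": 0.7, "kids": 0.2, "terrorists": 0.9, "oil": 0.8}

overlap = top_overlap(speech_2016, earlier)
```

Here `overlap` lists "terrorists" first, then "jobs", then "kids". The reason this is a sensible way to explain a similarity score is that the sum of these products is the dot product of the two vectors, which is the numerator of their cosine similarity, so the largest products are the words contributing most to the score.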
Recent trends in selected topics
We can also find some interesting trends in the discussion of particular topics. From looking at the tables of words above as well as the words that were salient in addresses from the past 40 years, a few topics jumped out at me, so I decided to take a closer look by plotting topical frequency. Individual words are rare, and so plots of word frequencies can show a lot of variance, but grouping related words together can give us a clearer picture. While unsupervised learning techniques such as topic modeling can be used to automatically find groups of words, here I decided for simplicity to manually assemble small groups of closely related words, as follows.
|Topic||Words|
|families||kid(s), parent(s), child(ren), family, families|
|terrorism||terror, terrorism, terrorist(s)|
|jobs||job(s), business(es), worker(s)|
|energy||oil, gas, solar, energy, coal, petroleum, fuel(s)|
The results of these analyses are the plots at the beginning of the post. They show the percentage of the total number of words in each address that belong to each topical group. Note that this analysis doesn’t use the log entropy weighting or stopword removal described above.
Please also note that a lot has happened since Jimmy Carter became president, and so for brevity I’m going to omit some really important trends (e.g., the end of the Cold War). The plots also take the words out of context, and in an effort to keep them focused and avoid polysemy, I have omitted potentially related words (e.g., the word “power” could be put in the energy-related topic, but it would also include the sense of “power” related to influence).
Despite the shortcomings of the simple methodology here, we see some interesting trends. Each topic has a fairly distinct peak during one of the presidencies on the timeline. Bill Clinton devoted a relatively large fraction of his speeches to the “families” topic compared to other presidents. George W. Bush spoke a lot about terrorism, and Barack Obama spoke the most about the jobs topic.
The energy topic in particular shows an interesting trend: a few of Carter’s addresses focused a lot on energy (e.g., due to the oil crisis), and then it was mentioned less frequently for many years until Obama’s recent addresses. We also saw this trend to some extent in the previous section, where the salient words that led to similarity between Obama’s 2016 address and Carter’s addresses included the words “oil” and “solar”.
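The topical percentages described in this section are simple enough to sketch directly. The word groups below come from the table above, with the parenthesized plurals written out; the regular-expression tokenizer, the sample sentence, and the `topic_percentages` function name are assumptions made for the example (the post's own tokenization may differ).

```python
import re

# Word groups from the table above, with plural forms written out.
TOPICS = {
    "families": {"kid", "kids", "parent", "parents", "child", "children",
                 "family", "families"},
    "terrorism": {"terror", "terrorism", "terrorist", "terrorists"},
    "jobs": {"job", "jobs", "business", "businesses", "worker", "workers"},
    "energy": {"oil", "gas", "solar", "energy", "coal", "petroleum",
               "fuel", "fuels"},
}


def topic_percentages(text):
    """Percentage of all words in the text that fall in each topical group.

    No stopword removal or weighting is applied, matching the description above.
    """
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    if not words:
        return {topic: 0.0 for topic in TOPICS}
    return {
        topic: 100.0 * sum(w in vocab for w in words) / len(words)
        for topic, vocab in TOPICS.items()
    }


# Made-up sentence, not a quotation from any address.
sample = "We will create jobs for workers and invest in solar energy for our kids"
percentages = topic_percentages(sample)
```

The sample has 14 words, two of which fall in the jobs group, two in the energy group, and one in the families group. Because matching is on exact word forms, related words outside the lists (such as "power") are deliberately not counted.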
In this post, I’ve presented some pretty simple but hopefully compelling (hey, you read this far) analyses of State of the Union addresses. The State of the Union addresses represent the president’s outlook on the past, present, and future of America, and historical analyses provide us with a glimpse of how the country is evolving. The analyses also help to put President Obama’s recent address in a broader context.
Of course, while this is super interesting, an analysis of political addresses over many scores of years isn’t particularly “actionable” (e.g., “Mr. President, other presidents who mentioned jobs and families were also interested in this product”). However, similar historical analyses can be performed on other text data, which may help us gain insights about other types of conversations (e.g., analyses of tweets over days or hours instead of years, or analyses of customer feedback over time) and thereby help us make practical decisions. Each new data set brings new challenges (especially when there is text), and whether the data involves the future of America or a more practical goal, we at Civis Analytics are interested in using the latest and greatest data science methods for modeling, visualization, etc. to address those challenges.
Update (Jan. 25, 2016): A similar, recently published analysis by Rule, Cointet, and Bearman (2015) was brought to our attention after releasing this post. If you’re interested in a really nice deep dive into the history of the State of the Union, please check it out.
- The corpus of addresses was adapted from this site, whose author gathered the texts from Project Gutenberg and updated them with more recent addresses from the White House and various news sources, as discussed here. I added the 2016 address text from the White House website.
- A few addresses to Congress with different titles (e.g., “Address on Administration Goals”, “Address to Joint Session of Congress”) were included in these analyses.
- It is likely that there are many other such analyses of State of the Union addresses in the academic literature and elsewhere (e.g., here, and here). I apologize if I am missing references to related work. We’ve come across a few other similar analyses since the initial release of this post: this one from 2013, this one from early 2016, this 2016 Washington Post infographic, and the work of Rule, Cointet, and Bearman mentioned above.
- In writing this post, as in just about all of my work, I relied heavily on open source projects. The main ones I used here: the SciPy stack, gensim, seaborn, matplotlib, and segtok.