Corpus Structure

An Introduction to the Corpus

In Getting Started, I gave a brief tour of how you can use text_data to identify potential problems in your analysis. Now, I want to go over how you can address them.

In addition to allowing you to enter a list of text documents, Corpus objects allow you to enter tokenizers when you initialize them. These tokenizers — so called because they convert strings into a list of “tokens” — are fairly picky in how they have to be initialized because of some of the search features in Corpus. For full details, see the text_data.tokenize module.

But the configuration you should generally be able to get away with is illustrated in text_data.tokenize.corpus_tokenizer(). This function accepts a regular expression pattern that will split your text into a list of strings and a list of postprocessing functions that will alter each item in that list of strings, including by removing them. In our case, we noticed that the default tokenizer Corpus uses, which just splits on the regular expression r"\w+", held onto a bunch of numbers that we didn’t want. So let’s change the regular expression to only hold onto alphabetic words.

In addition, there were a few 1- or 2-letter words that didn’t really seem to convey much meaning and that felt to me like they were possibly artifacts of bad tokenizing. (Specifically, the default tokenizer will often handle apostrophes poorly.) I’m going to address those by removing them from the data. If any of your postprocessing functions returns None for an item, text_data.tokenize.corpus_tokenizer() will simply remove that item from the final list of words.
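To make the pattern-plus-postprocessing idea concrete, here is a minimal sketch written with only the standard library. It is not text_data's implementation: sketch_tokenizer is a hypothetical stand-in, and I'm assuming the pattern is used to match tokens and that a None result drops a token, as described above.

```python
import re
from typing import Callable, List, Optional

# A postprocessing function takes a token and returns a new token, or None to drop it.
Postprocessor = Callable[[str], Optional[str]]


def sketch_tokenizer(
    pattern: str, postprocess: List[Postprocessor]
) -> Callable[[str], List[str]]:
    """Hypothetical stand-in for a regex tokenizer with postprocessing steps."""
    compiled = re.compile(pattern)

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        for match in compiled.finditer(text):
            token: Optional[str] = match.group()
            for func in postprocess:
                token = func(token)
                if token is None:
                    break  # a None result removes the token entirely
            if token is not None:
                tokens.append(token)
        return tokens

    return tokenize


# Letters only, lowercased, with 1- and 2-letter tokens removed.
tokenize = sketch_tokenizer(
    r"[A-Za-z]+", [str.lower, lambda x: x if len(x) > 2 else None]
)
example_tokens = tokenize("In 1861, it's the Union we must preserve.")
print(example_tokens)
```

The number disappears because the pattern only matches letters, and the fragments left over from splitting "it's" at the apostrophe are dropped by the length filter.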

So now, we’re going to re-tokenize that original database, using a custom tokenizer:

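sotu_tokenizer = text_data.tokenize.corpus_tokenizer(
    r"[A-Za-z]+",  # letters only (pattern assumed from the description above)
    [str.lower, lambda x: x if len(x) > 2 else None],  # lowercase, drop 1-2 letter tokens
)
sotu_corpus = Corpus(list(sotu_data.speech), sotu_tokenizer)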

Now, we can do the same thing we did before, looking at the TF-IDF values across the corpus.

tfidf = sotu_corpus.tfidf_matrix()
top_words, top_scores = sotu_corpus.get_top_words(tfidf, top_n=5)

There’s still more tinkering that you could do (in a real project, I might consider using a WordNet-based lemmatizer, a tool that reduces words like “dogs” and “cats” to their root forms), but the results I got look pretty decent, and are certainly better than what we had before.

So with that in mind, I want to get started on the analysis task I have. In particular, I want to see how Abraham Lincoln’s speeches differed from those of his predecessor, James Buchanan. In order to do this, we’re going to use two functions offered by text_data that help you morph your index into another index that’s more suitable for your analysis task.

This is really useful in text analysis, because you’re often dealing with vague and changing definitions of what counts as a corpus. Sometimes, you want to compare a document to all other documents in a corpus; sometimes, you want to compare it to just one other document. And other times, as we’re going to do, you want to group a bunch of documents together and treat them as if they’re a single document.

We’re going to use one function called text_data.index.WordIndex.slice() and another called text_data.multi_corpus.flat_concat() to do this. text_data.index.WordIndex.slice() creates a new Corpus object with the indexes we specify, while text_data.multi_corpus.flat_concat() combines and flattens a bunch of Corpus objects.

To start, let’s find all of the speeches that either Lincoln or Buchanan gave:

lincoln = sotu_data[sotu_data.president == "Lincoln"]
buchanan = sotu_data[sotu_data.president == "Buchanan"]

We could technically just instantiate these corpuses, much as we did to get our entire corpus. But doing so would require tokenizing the corpuses again, which would be slow. So let’s instead create them using text_data.index.WordIndex.slice():

buchanan_corpus = sotu_corpus.slice(set(buchanan.index))
lincoln_corpus = sotu_corpus.slice(set(lincoln.index))

And finally, let’s combine these into a class called a WordIndex. Essentially, this is the same thing as a Corpus, with the caveat that we can’t use the search functions I’ll write about later.

both = text_data.multi_corpus.flat_concat(lincoln_corpus, buchanan_corpus)
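As a rough mental model of what slicing and flat concatenation accomplish, here is a toy sketch using plain dictionaries and Counter objects. The helpers slice_docs, flatten, and flat_concat_sketch are hypothetical names of my own, and the three tiny documents are invented; the library's real data structures are different.

```python
from collections import Counter
from typing import Dict, List, Set

# Toy tokenized corpus (invented data): document index -> list of tokens.
toy_docs: Dict[int, List[str]] = {
    0: ["union", "war", "union"],
    1: ["treasury", "revenue"],
    2: ["war", "emancipation"],
}


def slice_docs(docs: Dict[int, List[str]], keep: Set[int]) -> List[List[str]]:
    """Keep only the requested documents, without re-tokenizing anything."""
    return [docs[i] for i in sorted(keep) if i in docs]


def flatten(group: List[List[str]]) -> Counter:
    """Treat a whole group of documents as if it were one document."""
    counts: Counter = Counter()
    for tokens in group:
        counts.update(tokens)
    return counts


def flat_concat_sketch(*groups: List[List[str]]) -> List[Counter]:
    """Flatten each group, then line the results up as a new set of documents."""
    return [flatten(group) for group in groups]


group_a = slice_docs(toy_docs, {0, 2})
group_b = slice_docs(toy_docs, {1})
combined = flat_concat_sketch(group_a, group_b)
print(combined)
```

The result has one entry per group rather than one per speech, which is what lets us compare two presidents as if each had written a single long document.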

Now, we can see what words distinguish Lincoln’s State of the Union speeches from Buchanan’s.

To conduct the analysis, I’m going to use something called a log-odds ratio. It’s explained really well in this paper. (That paper also conveys its limits; specifically, log-odds ratios do a poor job of representing variance, but they’re a decent metric for an introductory analysis.)

There’s a bit more explanation of what a log-odds ratio is in the documentation for text_data.index.WordIndex.odds_word(). But making the computation itself is easy:

log_odds = both.odds_matrix(sublinear=True)
log_odds_ratio = log_odds[:,0] - log_odds[:,1]
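If the subtraction above seems opaque, this small sketch works through the same idea by hand. It assumes the odds of a word are p / (1 - p), where p is the word's share of all tokens in a group, and it uses the natural logarithm; the library's exact base and edge-case handling may differ. The counts are invented toy numbers, and log_odds and log_odds_ratio are hypothetical helper names.

```python
import math
from typing import Dict


def log_odds(count: int, total: int) -> float:
    """Log of the odds p / (1 - p), where p = count / total."""
    p = count / total
    return math.log(p / (1 - p))


def log_odds_ratio(
    counts_a: Dict[str, int], counts_b: Dict[str, int]
) -> Dict[str, float]:
    """Difference in log odds for every word that appears in both groups."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    shared = counts_a.keys() & counts_b.keys()
    return {
        word: log_odds(counts_a[word], total_a) - log_odds(counts_b[word], total_b)
        for word in shared
    }


# Invented toy counts for two groups of documents.
group_a_counts = {"war": 6, "union": 3, "treasury": 1}
group_b_counts = {"war": 1, "union": 3, "treasury": 6}

ratios = log_odds_ratio(group_a_counts, group_b_counts)
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked)
```

Positive values mark words the first group uses disproportionately, negative values mark words the second group favors, and a word used at the same rate in both lands at zero. The sketch skips words missing from either group because their log odds would be undefined.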

And from there, we can visualize our findings by viewing the top 10 scoring results for each president:

words, sorted_log_odds = both.get_top_words(log_odds_ratio)
lincoln_words, top_lincoln = words[:10], sorted_log_odds[:10]
buchanan_words, top_buchanan = words[-10:], sorted_log_odds[-10:]

[Figure: Words Buchanan Used Disproportionately]

[Figure: Words Lincoln Used Disproportionately]

You can see the difference between the two presidents immediately. One of the words Buchanan uses disproportionately is “paraguay,” likely a reference to the naval expedition Buchanan sent to Paraguay. Meanwhile, one of Lincoln’s most disproportionately used words is “emancipation,” for obvious reasons.

But we can extend this analysis further by looking at bigrams. In natural language processing, a “bigram” is a two-word phrase that’s treated like a word.
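As a quick illustration of the idea, here is a minimal sketch that turns a list of tokens into bigrams. The ngrams helper is a hypothetical function of my own, not part of text_data, and joining the words with a space is an assumption.

```python
from typing import List


def ngrams(tokens: List[str], n: int = 2, sep: str = " ") -> List[str]:
    """Slide a window of n tokens across the list and join each window."""
    return [sep.join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


example_bigrams = ngrams(["the", "fiscal", "year", "ending", "june"])
print(example_bigrams)
```

Each bigram overlaps its neighbor by one word, so a document with five tokens produces four bigrams.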

Using text_data, we can create indexes for any ngram we want from within a Corpus object. We can then access the n-grams from within the corpus’s ngram_indexes attribute.

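# Assumed reconstruction: build bigram indexes on each corpus, then combine them.
lincoln_corpus.add_ngram_index(n=2)
buchanan_corpus.add_ngram_index(n=2)
both_bigram = text_data.multi_corpus.flat_concat(
    lincoln_corpus.ngram_indexes[2], buchanan_corpus.ngram_indexes[2]
)
log_odds_bigram = both_bigram.odds_matrix(sublinear=True)
log_odds_ratio_bigram = log_odds_bigram[:,0] - log_odds_bigram[:,1]
bigrams, sorted_log_odds_bigram = both_bigram.get_top_words(log_odds_ratio_bigram)
lincoln_bigrams, top_lincoln_bigrams = bigrams[:10], sorted_log_odds_bigram[:10]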
buchanan_bigrams, top_buchanan_bigrams = bigrams[-10:], sorted_log_odds_bigram[-10:]

Bigrams Lincoln Used Disproportionately

1. the measure: 2.159969068846104
2. […] colored: 2.159969068846104
3. population and: 2.159969068846104
4. the railways: 2.159969068846104
5. which our: 2.159969068846104
6. the price: 2.159969068846104
7. the foreign: 2.074724307333117
8. agriculture the: 2.074724307333117
9. products and: 2.074724307333117
10. white labor: 2.074724307333117

Bigrams Buchanan Used Disproportionately

1. president and: -2.3591582598672822
2. june the: -2.3591582598672822
3. the ordinary: -2.3591582598672822
4. three hundred: -2.407380077115034
5. the capital: -2.4491916115755874
6. hundred and: -2.4859986620392682
7. present fiscal: -2.5188003424008407
8. the constitutional: -2.548330416191556
9. the island: -2.5751425427204833
10. ending june: -2.6976458268603114

Now, we can see the influence of the Civil War in the differences between the two presidents’ speeches, with Lincoln making repeated references to the war.


This illustrates how you can analyze text data to compare the language across two sets of documents. text_data offers a large number of tools for concatenating and slicing data, making it easy to explore data and compare the language used in a document set between different groups of people.

In the next section, I’ll talk about how you can search through results to get a better sense of the context in which certain language was used.