Corpus Structure¶
An Introduction to the Corpus¶
In Getting Started, I gave a brief tour of how you can use text_data to identify potential problems in your analysis. Now, I want to go over how you can address them.
In addition to allowing you to enter a list of text documents, Corpus objects allow you to enter tokenizers when you initialize them. These tokenizers, so called because they convert strings into lists of “tokens”, are fairly picky about how they have to be initialized because of some of the search features in Corpus. For full details, see the text_data.tokenize module. But the configuration you should generally be able to get away with is illustrated in text_data.tokenize.corpus_tokenizer().
This function accepts a regular expression pattern that will split your text into a list of strings, along with a list of postprocessing functions that will alter each item in that list, including by removing items. In our case, we noticed that the default tokenizer Corpus uses, which just splits on the regular expression r"\w+", kept a bunch of numbers that we didn’t want. So let’s change the regular expression to only hold onto alphabetic words. In addition, there were a few 1- or 2-letter words that didn’t really seem to convey much meaning and that felt to me like they were possibly artifacts of bad tokenizing. (Specifically, the default tokenizer will often handle apostrophes poorly.) I’m going to address those by removing them from the data. If any of your postprocessing functions returns None, text_data.tokenize.corpus_tokenizer() will simply remove that word from the final list of words.
So now, we’re going to re-tokenize the original dataset using a custom tokenizer:
sotu_tokenizer = text_data.tokenize.corpus_tokenizer(
    # capture only alphabetic words
    r"[A-Za-z]+",
    # lowercase each word, then drop any word of fewer than 3 characters
    [str.lower, lambda x: x if len(x) > 2 else None]
)
sotu_corpus = Corpus(list(sotu_data.speech), sotu_tokenizer)
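You can chain as many of these postprocessing functions as you like. As a sketch of how you might extend this (my own variant, not something we’ll use below), here’s a tokenizer that also filters out a few stopwords:
# A hypothetical variant of the tokenizer above. Because any
# postprocessing function that returns None removes the word entirely,
# a stopword filter is just one more function in the chain.
STOPWORDS = {"the", "and", "for", "that"}
no_stopword_tokenizer = text_data.tokenize.corpus_tokenizer(
    r"[A-Za-z]+",
    [
        str.lower,
        lambda x: x if len(x) > 2 else None,
        lambda x: None if x in STOPWORDS else x,
    ],
)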
Now, we can do the same thing we did before, looking at the TF-IDF values across the corpus.
import numpy as np

# compute TF-IDF scores for every word and pull out the 5 top-scoring
# words in each document
tfidf = sotu_corpus.tfidf_matrix()
top_words, top_scores = sotu_corpus.get_top_words(tfidf, top_n=5)
list(np.unique(top_words.flatten()))
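(If you haven’t seen TF-IDF before: it rewards words that appear often in one document but rarely across the rest of the corpus. The textbook formula looks roughly like the sketch below; text_data’s tfidf_matrix() may apply a different weighting variant, so treat this as illustrative.)
import math

# the classic TF-IDF formula, as a rough sketch; text_data's exact
# weighting may differ
def tfidf_score(term_count: int, doc_freq: int, n_docs: int) -> float:
    # term frequency, scaled up when few documents contain the word
    return term_count * math.log(n_docs / doc_freq)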
There’s still more tinkering that you could do. In a real project, I might consider using WordNet, a tool that helps you reduce words like “dogs” and “cats” to their root forms “dog” and “cat”. But the results I got look pretty decent, and are certainly better than what we had before.
So with that in mind, I want to get started on the analysis task I have. In particular, I want to see how Abraham Lincoln’s speeches differed from those of his predecessor, James Buchanan. In order to do this, we’re going to use two functions offered by text_data that help you morph your index into another index that’s more suitable for your analysis task.
This is really useful in text analysis, because you’re often dealing with vague and changing definitions of what counts as a corpus. Sometimes, you want to compare a document to all other documents in a corpus; sometimes, you want to compare it to just one other document. And other times, as we’re going to do, you want to group a bunch of documents together and treat them as if they’re a single document.
We’re going to use one function called text_data.index.WordIndex.slice() and another called text_data.multi_corpus.flat_concat() to do this. text_data.index.WordIndex.slice() creates a new Corpus object containing just the document indexes we specify, while text_data.multi_corpus.flat_concat() combines and flattens a bunch of Corpus objects.
To start, let’s find all of the speeches that either Lincoln or Buchanan gave:
# filter the pandas DataFrame down to each president's speeches
lincoln = sotu_data[sotu_data.president == "Lincoln"]
buchanan = sotu_data[sotu_data.president == "Buchanan"]
We could technically just instantiate these corpuses, much as we did to get our entire corpus. But doing so would require tokenizing the text again, which would be slow. So let’s instead create them using text_data.index.WordIndex.slice():
# slice() re-uses the already tokenized index, so nothing gets re-tokenized
buchanan_corpus = sotu_corpus.slice(set(buchanan.index))
lincoln_corpus = sotu_corpus.slice(set(lincoln.index))
And finally, let’s combine these into an instance of a class called WordIndex. Essentially, this is the same thing as a Corpus, with the caveat that we can’t use the search functions I’ll write about later:
both = text_data.multi_corpus.flat_concat(lincoln_corpus, buchanan_corpus)
Now, we can see what words distinguish Lincoln’s State of the Union speeches from Buchanan’s.
To conduct the analysis, I’m going to use something called a log-odds ratio. It’s explained really well in this paper. (That paper also discusses its limitations; specifically, log-odds ratios do a poor job of representing variance, but they’re a decent metric for an introductory analysis.)
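As a rough sketch of the idea, in plain Python rather than anything from text_data: if a word makes up proportion p of a corpus, its log-odds are log(p / (1 - p)), and the log-odds ratio is the difference between the two corpora’s log-odds for that word.
import math

# a toy illustration of the metric, not text_data code: the log-odds of
# a word that accounts for proportion p of a corpus
def word_log_odds(p: float) -> float:
    return math.log(p / (1 - p))

# a word making up 1% of one president's speeches but only 0.1% of the
# other's gets a strongly positive log-odds ratio
word_log_odds(0.01) - word_log_odds(0.001)  # roughly 2.31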
There’s a bit more explanation of what a log-odds ratio is in the documentation for text_data.index.WordIndex.odds_word(). But making the computation itself is easy:
log_odds = both.odds_matrix(sublinear=True)
# column 0 holds Lincoln's log-odds and column 1 holds Buchanan's,
# matching the order we passed to flat_concat; positive ratios lean
# Lincoln, negative ones lean Buchanan
log_odds_ratio = log_odds[:,0] - log_odds[:,1]
And from there, we can visualize our findings by viewing the 10 top-scoring results for each president:
words, sorted_log_odds = both.get_top_words(log_odds_ratio)
# the most Lincoln-leaning scores sort to the top of the array, the most
# Buchanan-leaning ones to the bottom
lincoln_words, top_lincoln = words[:10], sorted_log_odds[:10]
buchanan_words, top_buchanan = words[-10:], sorted_log_odds[-10:]
text_data.display.display_score_table(
    buchanan_words,
    top_buchanan,
    "Words Buchanan Used Disproportionately"
)
Words Buchanan Used Disproportionately
Order | Word | Score |
---|---|---|
1. | applied | -2.357841079138746 |
2. | conferred | -2.357841079138746 |
3. | silver | -2.357841079138746 |
4. | estimates | -2.4060753023089525 |
5. | company | -2.4060753023089525 |
6. | five | -2.4060753023089525 |
7. | employ | -2.4478979344312997 |
8. | whilst | -2.4847150241025275 |
9. | gold | -2.4847150241025275 |
10. | paraguay | -2.5738843074966002 |
text_data.display.display_score_table(
    lincoln_words,
    top_lincoln,
    "Words Lincoln Used Disproportionately"
)
Words Lincoln Used Disproportionately
Order | Word | Score |
---|---|---|
1. | emancipation | 2.570928440185467 |
2. | space | 2.4761184382773784 |
3. | agriculture | 2.4465802463270254 |
4. | production | 2.4137697411322385 |
5. | forward | 2.335130804296506 |
6. | wages | 2.335130804296506 |
7. | above | 2.335130804296506 |
8. | run | 2.2868970418386674 |
9. | propose | 2.2868970418386674 |
10. | length | 2.2868970418386674 |
You can see the difference between the two presidents immediately. One of the words Buchanan uses disproportionately is “paraguay,” likely a reference to the naval expedition he ordered against Paraguay. Meanwhile, one of Lincoln’s most disproportionately used words is “emancipation,” for obvious reasons.
But we can extend this analysis further by looking at bigrams. In natural language processing, a “bigram” is a two-word phrase that’s treated like a single word.
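For instance, here’s what bigrams look like in plain Python (just an illustration; text_data computes these for you):
# a quick illustration of bigrams, not a text_data API
toy_words = "four score and seven years ago".split()
toy_bigrams = [" ".join(pair) for pair in zip(toy_words, toy_words[1:])]
# ['four score', 'score and', 'and seven', 'seven years', 'years ago']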
Using text_data, we can create indexes for any n-gram we want from within a Corpus object. We can then access the n-grams from within the corpus’s ngram_indexes attribute.
# build bigram indexes for each president, then combine and flatten them
lincoln_corpus.add_ngram_index(n=2)
buchanan_corpus.add_ngram_index(n=2)
both_bigram = text_data.multi_corpus.flat_concat(
    lincoln_corpus.ngram_indexes[2],
    buchanan_corpus.ngram_indexes[2]
)
# the same log-odds computation as before, now over bigrams
log_odds_bigram = both_bigram.odds_matrix(sublinear=True)
log_odds_ratio_bigram = log_odds_bigram[:,0] - log_odds_bigram[:,1]
bigrams, sorted_log_odds_bigram = both_bigram.get_top_words(log_odds_ratio_bigram)
lincoln_bigrams, top_lincoln_bigrams = bigrams[:10], sorted_log_odds_bigram[:10]
buchanan_bigrams, top_buchanan_bigrams = bigrams[-10:], sorted_log_odds_bigram[-10:]
text_data.display.display_score_table(
    lincoln_bigrams,
    top_lincoln_bigrams,
    "Bigrams Lincoln Used Disproportionately"
)
Bigrams Lincoln Used Disproportionately
Order | Word | Score |
---|---|---|
1. | the measure | 2.159969068846104 |
2. | free colored | 2.159969068846104 |
3. | population and | 2.159969068846104 |
4. | the railways | 2.159969068846104 |
5. | which our | 2.159969068846104 |
6. | the price | 2.159969068846104 |
7. | the foreign | 2.074724307333117 |
8. | agriculture the | 2.074724307333117 |
9. | products and | 2.074724307333117 |
10. | white labor | 2.074724307333117 |
text_data.display.display_score_table(
    buchanan_bigrams,
    top_buchanan_bigrams,
    "Bigrams Buchanan Used Disproportionately"
)
Bigrams Buchanan Used Disproportionately
Order | Word | Score |
---|---|---|
1. | president and | -2.3591582598672822 |
2. | june the | -2.3591582598672822 |
3. | the ordinary | -2.3591582598672822 |
4. | three hundred | -2.407380077115034 |
5. | the capital | -2.4491916115755874 |
6. | hundred and | -2.4859986620392682 |
7. | present fiscal | -2.5188003424008407 |
8. | the constitutional | -2.548330416191556 |
9. | the island | -2.5751425427204833 |
10. | ending june | -2.6976458268603114 |
Now, we can clearly see the influence of the Civil War in the differences between the two presidents’ speeches, with Lincoln making repeated references to the war.
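And there’s no need to stop at bigrams. Although we won’t run it here, the same pattern extends to any n; here’s a sketch for trigrams, mirroring the bigram code above:
# trigram indexes work exactly the same way as bigram ones; this mirrors
# the code above and isn't run in this tutorial
lincoln_corpus.add_ngram_index(n=3)
buchanan_corpus.add_ngram_index(n=3)
both_trigram = text_data.multi_corpus.flat_concat(
    lincoln_corpus.ngram_indexes[3],
    buchanan_corpus.ngram_indexes[3]
)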
Conclusion¶
This illustrates how you can analyze text data to compare the language across two sets of documents. text_data offers a large number of tools for concatenating and slicing data, making it easy to explore a document set and compare the language used by different groups of people.
In the next section, I’ll talk about how you can search through results to get a better sense of the context in which certain language was used.