.. _corpus:

Corpus Structure
================

An Introduction to the :class:`~text_data.index.Corpus`
--------------------------------------------------------

In :ref:`getting_started`, I gave a brief tour of how you can use
:code:`text_data` to identify potential problems in your analysis. Now, I
want to go over how you can address those problems.

In addition to allowing you to enter a list of text documents,
:class:`~text_data.index.Corpus` objects allow you to enter tokenizers when
you initialize them. These tokenizers (so called because they convert
strings into lists of "tokens") are fairly picky about how they have to be
initialized because of some of the search features in
:class:`~text_data.index.Corpus`. For full details, see the
:mod:`text_data.tokenize` module. But the configuration you should generally
be able to get away with is illustrated in
:func:`text_data.tokenize.corpus_tokenizer`. This function accepts a regular
expression pattern that will split your text into a list of strings, along
with a list of postprocessing functions that alter each item in that list of
strings, including by removing items.

In our case, we noticed that the default tokenizer
:class:`~text_data.index.Corpus` uses, which just splits on the regular
expression :code:`r"\w+"`, kept a bunch of numbers that we didn't want. So
let's change the regular expression to only hold onto alphabetic words. In
addition, there were a few 1- or 2-letter words that didn't really seem to
convey much meaning and that felt to me like they were possibly artifacts of
bad tokenizing. (Specifically, the default tokenizer will often handle
apostrophes poorly.) I'm going to address those by removing them from the
data: if any of your postprocessing functions returns :code:`None`,
:func:`text_data.tokenize.corpus_tokenizer` will simply remove that token
from the final list of words. (There's a toy illustration of this behavior
below.)

So now, we're going to re-tokenize the original database, using a custom
tokenizer:

.. code-block:: python

    sotu_tokenizer = text_data.tokenize.corpus_tokenizer(
        r"[A-Za-z]+",
        [str.lower, lambda x: x if len(x) > 2 else None]
    )
    sotu_corpus = Corpus(list(sotu_data.speech), sotu_tokenizer)

Now, we can do the same thing we did before, looking at the TF-IDF values
across the corpus:

.. code-block:: python

    tfidf = sotu_corpus.tfidf_matrix()
    top_words, top_scores = sotu_corpus.get_top_words(tfidf, top_n=5)
    list(np.unique(top_words.flatten()))

There's still more tinkering you could do (in a real project, I might
consider using WordNet, a tool that helps you reduce words like "dogs" and
"cats" into their root forms), but the results I got look pretty decent, and
they're certainly better than what we had before.

So with that in mind, I want to get started on the analysis task I have. In
particular, I want to see how Abraham Lincoln's speeches differed from those
of his predecessor, James Buchanan. In order to do this, we're going to use
two tools offered by :code:`text_data` that help you morph your index into
another index that's more suitable for your analysis task. This is really
useful in text analysis, because you're often dealing with vague and
changing definitions of what counts as a corpus. Sometimes, you want to
compare a document to all of the other documents in a corpus; sometimes, you
want to compare it to just one other document. And other times, as we're
going to do here, you want to group a bunch of documents together and treat
them as if they're a single document.
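
Before we get to that, here's that toy illustration of the tokenizer rules
we just set up. To be clear, this is a sketch of the same postprocessing
pipeline in plain Python, not :code:`text_data`'s actual internals (the
hypothetical :code:`toy_tokenize` function below is mine); it just shows
concretely how the lowercasing and the :code:`None`-based filtering play
out:

.. code-block:: python

    import re

    # A toy version of the pipeline corpus_tokenizer builds from our
    # arguments: split on the regex, then run each postprocessor in order,
    # dropping any token that a postprocessor maps to None.
    def toy_tokenize(text):
        postprocessors = [str.lower, lambda word: word if len(word) > 2 else None]
        tokens = []
        for token in re.findall(r"[A-Za-z]+", text):
            for func in postprocessors:
                token = func(token)
                if token is None:
                    break
            if token is not None:
                tokens.append(token)
        return tokens

    toy_tokenize("We've spent $40 billion since 1986")
    # ['spent', 'billion', 'since']

Note that the number and the stray "we"/"ve" fragments from the apostrophe
are gone, which is exactly the behavior we wanted.
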
To do this grouping and comparison, we're going to use a method called
:meth:`text_data.index.WordIndex.slice` and a function called
:func:`text_data.multi_corpus.flat_concat`.
:meth:`text_data.index.WordIndex.slice` creates a new
:class:`~text_data.index.Corpus` object holding only the documents whose
indexes we specify, while :func:`text_data.multi_corpus.flat_concat`
combines and flattens a bunch of :class:`~text_data.index.Corpus` objects.

To start, let's find all of the speeches that either Lincoln or Buchanan
gave:

.. code-block:: python

    lincoln = sotu_data[sotu_data.president == "Lincoln"]
    buchanan = sotu_data[sotu_data.president == "Buchanan"]

We could technically just instantiate these corpuses, much as we did to get
our entire corpus. But doing so would require tokenizing the documents
again, which would be slow. So let's instead create them using
:meth:`text_data.index.WordIndex.slice`:

.. code-block:: python

    buchanan_corpus = sotu_corpus.slice(set(buchanan.index))
    lincoln_corpus = sotu_corpus.slice(set(lincoln.index))

And finally, let's combine these into a class called a
:class:`~text_data.index.WordIndex`. Essentially, this is the same thing as
a :class:`~text_data.index.Corpus`, with the caveat that we can't use the
search functions I'll write about later.

.. code-block:: python

    both = text_data.multi_corpus.flat_concat(lincoln_corpus, buchanan_corpus)

Now, we can see which words distinguish Lincoln's State of the Union
speeches from Buchanan's. To conduct the analysis, I'm going to use
something called a log-odds ratio. It's explained really well in `this paper `_.
(That paper also conveys its limits; specifically, log-odds ratios do a poor
job of representing variance. But it's a decent metric for an introductory
analysis.) There's a bit more explanation of what a log-odds ratio is in the
documentation for :meth:`text_data.index.WordIndex.odds_word`, and I'll
sketch the formula itself after the first table below. But making the
computation is easy:

.. code-block:: python

    log_odds = both.odds_matrix(sublinear=True)
    log_odds_ratio = log_odds[:,0] - log_odds[:,1]

And from there, we can visualize our findings by viewing the top 10 scoring
results for each president:

.. code-block:: python

    words, sorted_log_odds = both.get_top_words(log_odds_ratio)
    lincoln_words, top_lincoln = words[:10], sorted_log_odds[:10]
    buchanan_words, top_buchanan = words[-10:], sorted_log_odds[-10:]
    text_data.display.display_score_table(
        buchanan_words,
        top_buchanan,
        "Words Buchanan Used Disproportionately"
    )

.. table:: Words Buchanan Used Disproportionately

    =====  =========  ===================
    Order  Word       Score
    =====  =========  ===================
    1      applied    -2.357841079138746
    2      conferred  -2.357841079138746
    3      silver     -2.357841079138746
    4      estimates  -2.4060753023089525
    5      company    -2.4060753023089525
    6      five       -2.4060753023089525
    7      employ     -2.4478979344312997
    8      whilst     -2.4847150241025275
    9      gold       -2.4847150241025275
    10     paraguay   -2.5738843074966002
    =====  =========  ===================
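
As promised, here's a rough sketch of what each score in that table means.
Ignoring the log scaling that :code:`sublinear=True` applies to the
underlying frequencies (see :meth:`~text_data.index.WordIndex.odds_matrix`
for the exact computation, which will differ in the details), the score for
a word :math:`w` is approximately the difference of the two groups'
log-odds:

.. math::

    \log \frac{p_{w,\mathrm{Lincoln}}}{1 - p_{w,\mathrm{Lincoln}}}
    - \log \frac{p_{w,\mathrm{Buchanan}}}{1 - p_{w,\mathrm{Buchanan}}}

where :math:`p_{w,i}` is the relative frequency of word :math:`w` in group
:math:`i`'s speeches. Because we passed Lincoln's corpus to
:func:`~text_data.multi_corpus.flat_concat` first, his log-odds sit in
column 0 of the matrix, so positive scores mark words Lincoln favored and
negative scores mark words Buchanan favored.
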
.. code-block:: python

    text_data.display.display_score_table(
        lincoln_words,
        top_lincoln,
        "Words Lincoln Used Disproportionately"
    )

.. table:: Words Lincoln Used Disproportionately

    =====  ============  ==================
    Order  Word          Score
    =====  ============  ==================
    1      emancipation  2.570928440185467
    2      space         2.4761184382773784
    3      agriculture   2.4465802463270254
    4      production    2.4137697411322385
    5      forward       2.335130804296506
    6      wages         2.335130804296506
    7      above         2.335130804296506
    8      run           2.2868970418386674
    9      propose       2.2868970418386674
    10     length        2.2868970418386674
    =====  ============  ==================
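
(A quick aside on the slicing we used to build these tables: the code above
grabbed :code:`words[:10]` for Lincoln and :code:`words[-10:]` for Buchanan,
which only works because — as the tables themselves confirm —
:meth:`~text_data.index.WordIndex.get_top_words` hands scores back in
descending order. If you ever want to double-check that, it's a one-liner:)

.. code-block:: python

    # Sanity check: scores should run from highest (most Lincoln-leaning)
    # to lowest (most Buchanan-leaning).
    assert list(sorted_log_odds) == sorted(sorted_log_odds, reverse=True)
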
You can see the difference between the two presidents immediately. One of
the words Buchanan used disproportionately is "paraguay," likely a reference
to Buchanan's attempt to annex Paraguay. Meanwhile, one of Lincoln's most
disproportionately used words is "emancipation," for obvious reasons.

But we can extend this analysis further by looking at bigrams. In natural
language processing, a "bigram" is a two-word phrase that's treated like a
word. Using :code:`text_data`, we can create indexes for any n-gram we want
from within a :class:`Corpus` object. We can then access the n-grams from
within the corpus's :code:`ngram_indexes` attribute. (The same pattern works
for any :code:`n`, as I'll show after the tables.)

.. code-block:: python

    lincoln_corpus.add_ngram_index(n=2)
    buchanan_corpus.add_ngram_index(n=2)
    both_bigram = text_data.multi_corpus.flat_concat(
        lincoln_corpus.ngram_indexes[2],
        buchanan_corpus.ngram_indexes[2]
    )
    log_odds_bigram = both_bigram.odds_matrix(sublinear=True)
    log_odds_ratio_bigram = log_odds_bigram[:,0] - log_odds_bigram[:,1]
    bigrams, sorted_log_odds_bigram = both_bigram.get_top_words(log_odds_ratio_bigram)
    lincoln_bigrams, top_lincoln_bigrams = bigrams[:10], sorted_log_odds_bigram[:10]
    buchanan_bigrams, top_buchanan_bigrams = bigrams[-10:], sorted_log_odds_bigram[-10:]
    text_data.display.display_score_table(
        lincoln_bigrams,
        top_lincoln_bigrams,
        "Bigrams Lincoln Used Disproportionately"
    )

.. table:: Bigrams Lincoln Used Disproportionately

    =====  ===============  =================
    Order  Word             Score
    =====  ===============  =================
    1      the measure      2.159969068846104
    2      free colored     2.159969068846104
    3      population and   2.159969068846104
    4      the railways     2.159969068846104
    5      which our        2.159969068846104
    6      the price        2.159969068846104
    7      the foreign      2.074724307333117
    8      agriculture the  2.074724307333117
    9      products and     2.074724307333117
    10     white labor      2.074724307333117
    =====  ===============  =================
.. code-block:: python

    text_data.display.display_score_table(
        buchanan_bigrams,
        top_buchanan_bigrams,
        "Bigrams Buchanan Used Disproportionately"
    )

.. table:: Bigrams Buchanan Used Disproportionately

    =====  ==================  ===================
    Order  Word                Score
    =====  ==================  ===================
    1      president and       -2.3591582598672822
    2      june the            -2.3591582598672822
    3      the ordinary        -2.3591582598672822
    4      three hundred       -2.407380077115034
    5      the capital         -2.4491916115755874
    6      hundred and         -2.4859986620392682
    7      present fiscal      -2.5188003424008407
    8      the constitutional  -2.548330416191556
    9      the island          -2.5751425427204833
    10     ending june         -2.6976458268603114
    =====  ==================  ===================
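
(As promised, nothing about this workflow is specific to bigrams.
:meth:`~text_data.index.Corpus.add_ngram_index` takes any :code:`n`, so
swapping in trigrams, for example, is a mechanical change — a sketch,
reusing only the calls we've already seen:)

.. code-block:: python

    # The same comparison with trigrams (n=3) in place of bigrams.
    lincoln_corpus.add_ngram_index(n=3)
    buchanan_corpus.add_ngram_index(n=3)
    both_trigram = text_data.multi_corpus.flat_concat(
        lincoln_corpus.ngram_indexes[3],
        buchanan_corpus.ngram_indexes[3]
    )
    log_odds_trigram = both_trigram.odds_matrix(sublinear=True)
    log_odds_ratio_trigram = log_odds_trigram[:,0] - log_odds_trigram[:,1]
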
Looking at these bigram tables, we can clearly see the influence of the
Civil War in the differences between the two presidents' speeches, with
Lincoln making repeated references to the war.

Conclusion
----------

This illustrates how you can analyze text data to compare the language
across two sets of documents. :code:`text_data` offers a large number of
tools for concatenating and slicing data, making it easy to explore a
document set and compare the language used by different groups of people.

In the next section, I'll talk about how you can search through results to
get a better sense of the context in which certain language was used.