Getting Started

To get started, I’m going to assume you’ve followed the Installation guide, installing both text_data and its optional dependencies and downloading the State of the Union Corpus. If you plan on following along, I’m also going to assume you’re using Jupyter or some other interactive Python environment that lets you visually render the results of your code.

Setting up the Data

Let’s get started by loading the State of the Union data into pandas. This isn’t strictly necessary for text_data, but it will make our lives a little bit easier.

The State of the Union speeches are each held in separate files within the directory, named in the format "name_year.txt". Here’s what the code for loading the files looks like:

import glob
import re

import pandas as pd

def load_data():
    files = glob.glob("sotu-data/*.txt")
    # Each filename encodes the president's name and the four-digit year of the speech.
    path_desc = re.compile(r"sotu-data/([A-Za-z]+)_([0-9]{4})\.txt")
    for filepath in files:
        with open(filepath, "r") as f:
            raw_text = f.read()
        # Pair the metadata from the filename with the raw text of the speech.
        president, year = path_desc.search(filepath).groups()
        yield {"president": president, "year": year, "speech": raw_text}

sotu_data = pd.DataFrame(load_data())
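
Before going any further, it’s worth a quick sanity check that the load worked as expected. A couple of ordinary pandas calls will do; nothing here is specific to text_data.

# In a notebook, this displays the first few rows; we expect to see
# "president", "year", and "speech" columns.
sotu_data.head()

# The number of rows should match the number of speech files on disk.
len(sotu_data)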

From here, we can get started using text_data.

Creating a Corpus

Over the rest of this tutorial, I’ll be going over what a Corpus is in more detail. But essentially, it operates as an index that stores our text data in a way that makes it easy to quickly compute statistics about how language is used across a set of documents and to search through the documents themselves, hopefully surfacing interesting patterns.

The only requirement for instantiating a Corpus is a list of documents. Now that you have one, you can build the corpus:

import text_data

sotu_corpus = text_data.Corpus(list(sotu_data.speech))

This indexed the State of the Union speeches. From here, you can conduct some introductory exploratory analysis.

To start off, let’s compute a couple of simple statistics to learn more about our data. It will be helpful, for instance, to know how many words our corpus contains:

>>> sotu_corpus.num_words
1786621

Similarly, here’s how many unique words there are:

>>> sotu_corpus.vocab_size
24927

And here are the five most common words:

>>> sotu_corpus.most_common(5)
[('the', 149615), ('of', 96394), ('and', 60703), ('to', 60642), ('in', 38521)]

All of this checks out: our most common words are, in fact, common words, and the corpus contains far more total words than unique words.
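
One small derived number you can get from those two attributes is the ratio of unique words to total words, which gives a rough sense of how repetitive the vocabulary is. This is just ordinary arithmetic on the attributes above, not a separate library feature:

# Roughly 25,000 unique words over roughly 1.8 million total words
# works out to a type/token ratio of about 1.4 percent.
type_token_ratio = sotu_corpus.vocab_size / sotu_corpus.num_words
print(f"{type_token_ratio:.2%}")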

But the core of this library, and the core of text analysis, lies in analyzing more than a couple of words.

text_data offers a number of ways to analyze this much text, but I’ll start with one of its graphical tools. If you have the optional dependencies installed, you can create a histogram. This code builds a histogram showing the number of words in each of the documents:

>>> text_data.display.histogram(
...     list(sotu_corpus.doc_lengths.values()),
...     x_label="Document Length"
... )
[A histogram showing the lengths of documents across the corpus.]

There’s a lot to go over in this graphic, but one thing that should stick out is that one of the values appears to be 0, meaning one of the documents contains no words at all.

You can further validate this and pin down the document causing the problem:

>>> sorted(sotu_corpus.doc_lengths.items(), key=lambda x: x[1])[:3]
[(80, 0), (214, 1374), (62, 1505)]

There’s a document at index 80 that has 0 words in it. If you go to the original data on Kaggle, you can see that the underlying file is blank.
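
If you’re curious which speech that is, the pandas DataFrame we built earlier still holds the metadata. Because we passed the speeches to the Corpus in the DataFrame’s row order, document 80 in the corpus should line up with row 80 in sotu_data:

# Look up the president and year for the empty document.
sotu_data.loc[80, ["president", "year"]]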

Since there’s nothing we can do to fix this issue, let’s just remove the record from the corpus. We should also drop the corresponding row from the pandas DataFrame:

>>> sotu_corpus.split_off({80})
>>> sotu_data = sotu_data[~sotu_data.index.isin({80})]
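
It’s worth checking that both deletions took effect. Assuming, as the code above does, that split_off removes the listed documents from the original corpus, neither structure should contain the empty record anymore:

# The corpus should no longer contain any zero-length documents...
assert 0 not in sotu_corpus.doc_lengths.values()
# ...and row 80 should be gone from the DataFrame.
assert 80 not in sotu_data.index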

But as we’ll soon see, there are problems on our end, as well. In order to illustrate those, I’m going to compute something called a term-document matrix of TF-IDF scores across the corpus. Roughly speaking, this finds how frequently words occur in each of the documents in our corpus and normalizes those frequencies based on how often the words appear in other documents. By doing this, we can generally gauge what makes each document distinct from the rest of the documents in the corpus.
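
Before running this on the full corpus, here’s a small sketch of the idea on a toy three-document example, using plain term counts and a logarithmic inverse document frequency. It’s meant to build intuition; it isn’t necessarily the exact weighting that tfidf_matrix uses under the hood.

import numpy as np

toy_docs = [
    "the economy grew this year",
    "the economy slowed this year",
    "space exploration inspired the nation",
]
vocab = sorted({word for doc in toy_docs for word in doc.split()})

# Term frequencies: how often each word appears in each document.
tf = np.array([[doc.split().count(word) for word in vocab] for doc in toy_docs])

# Inverse document frequencies: a word that appears in every document
# (like "the") gets a weight of zero, while rarer words keep theirs.
doc_freq = (tf > 0).sum(axis=0)
idf = np.log(len(toy_docs) / doc_freq)

tfidf_toy = tf * idf  # one row per document, one column per word

In the toy matrix, a word like "the" scores zero everywhere because it appears in every document, while words confined to a single document stand out. That’s the effect we want from the real matrix, too: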

>>> import numpy as np
>>> tfidf = sotu_corpus.tfidf_matrix()
>>> top_words, _top_scores = sotu_corpus.get_top_words(tfidf, top_n=5)
>>> list(np.unique(top_words.flatten()))

I’m not going to show the entire list, because it’s very long. But suffice it to say there are a lot of words that look like this:

  • 1924

  • 1958

  • 2005

In other words, the tokenizer we’re using to split up words is keeping far too many bare years as tokens. If we’re trying to figure out what makes one president’s speeches different from another’s, what distinguishes one speech from another, or even what makes two documents similar, tokens like these risk getting in our way.
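
To see just how many of those top terms are bare years, you can run a quick filter over the unique top words from the session above (this reuses the top_words array that get_top_words returned):

>>> import re
>>> unique_top_words = np.unique(top_words.flatten())
>>> [word for word in unique_top_words if re.fullmatch(r"[0-9]{4}", word)]

Every item in that list is a token our current word splitting kept that we would probably rather filter out or handle differently.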

Conclusion

This kind of exploratory analysis is an important step in analyzing text data: running quick spot checks to catch problems in how you’ve tokenized your text, or places where, say, a document just turns out to be blank. text_data tries to make this process as easy as possible by providing graphical tools to help you visualize your findings, statistical calculations to help you conduct your analysis, and search tools to help you make sense of the text you’re reading.

In the next part, I’ll go over how you can write tokenizers to better handle your text data and how you can split up your corpus so you can analyze parts of it separately.