How text_data is organized

In the next few parts of the tutorial, I’m going to introduce a bunch of the tools that text_data has to offer within the context of a mock project. My hope is that they’ll give you a sense of how you might introduce the library into your analysis.

But before I do that, I want to introduce the structure of the library.

text_data is built upon two classes, WordIndex and Corpus. The WordIndex class indexes your documents in an efficient data structure and provides a number of ways for you to perform statistical lookups. You can see how often words are used across your entire set of documents or within a single document. It also supports matrix and vector operations that let you compute those same statistics across many words and documents at once. Because it uses an efficient data structure and parallelized Rust code, those matrix and vector operations run quickly.
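
To make the idea of a statistical lookup concrete, here is a minimal pure-Python sketch of the two kinds of counts described above. It does not use text_data and is not how WordIndex is implemented; the ToyIndex class and its method names are hypothetical and exist only for illustration.

```python
from collections import Counter


class ToyIndex:
    """Hypothetical stand-in for a word index; not text_data's WordIndex."""

    def __init__(self, documents):
        # One Counter per document, keyed by lowercase whitespace-split word.
        self.per_doc = [Counter(doc.lower().split()) for doc in documents]
        # Corpus-wide totals are the sum of the per-document counts.
        self.total = Counter()
        for counts in self.per_doc:
            self.total.update(counts)

    def count(self, word):
        """Occurrences of `word` across every document."""
        return self.total[word]

    def count_in(self, doc_id, word):
        """Occurrences of `word` in the document at position `doc_id`."""
        return self.per_doc[doc_id][word]


index = ToyIndex(["the cat sat", "the dog sat on the mat"])
print(index.count("the"))        # 3
print(index.count_in(1, "the"))  # 2
```

A real index would also handle tokenization and scale to far more documents, but the distinction between corpus-wide and per-document counts is the same.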

In addition, WordIndex offers a number of ways to split up or concatenate your indexes so you can compare one portion of your documents with another. This can be useful when you’re conducting a standalone analysis, or when you’re trying to debug why a machine learning model is incorrectly classifying some of your documents.
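
As a rough illustration of why splitting and concatenating is handy, the sketch below compares word counts from two portions of a tiny, made-up document set. It uses only the standard library; the word_counts helper and the sample documents are assumptions for this example, not part of text_data.

```python
from collections import Counter


def word_counts(documents):
    """Count lowercase whitespace-split words over a list of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


# Hypothetical example: documents a classifier got right versus wrong.
correct = ["great film", "great acting"]
wrong = ["not great", "not good"]

correct_counts = word_counts(correct)
wrong_counts = word_counts(wrong)

# Concatenating two portions amounts to adding their counts.
combined = correct_counts + wrong_counts

# Words that appear only in the misclassified portion.
only_wrong = sorted(set(wrong_counts) - set(correct_counts))
print(only_wrong)         # ['good', 'not']
print(combined["great"])  # 3
```

Here the words unique to the misclassified portion hint that negation might be what the imaginary model is missing.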

Corpus builds upon the WordIndex to support searching through documents. You can look up arbitrarily long phrases and conduct boolean AND, NOT, and OR queries. In addition, Corpus offers an easy way to index multi-word phrases.
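
The sketch below shows, in plain Python, one common way phrase and boolean searches can be built on top of an index that records word positions. It is an assumption-laden illustration rather than Corpus's actual implementation; build_positions, docs_with, and phrase_docs are hypothetical names.

```python
from collections import defaultdict


def build_positions(documents):
    """Map each word to {doc_id: [positions]} using whitespace tokens."""
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in enumerate(documents):
        for pos, word in enumerate(doc.lower().split()):
            positions[word][doc_id].append(pos)
    return positions


def docs_with(positions, word):
    """Set of document ids that contain `word`."""
    return set(positions.get(word, {}))


def phrase_docs(positions, phrase):
    """Set of document ids containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    matches = set()
    for doc_id, starts in positions.get(words[0], {}).items():
        for start in starts:
            # Every later word must sit exactly `offset` places after `start`.
            if all(
                start + offset in positions.get(word, {}).get(doc_id, [])
                for offset, word in enumerate(words)
            ):
                matches.add(doc_id)
                break
    return matches


docs = ["the quick brown fox", "the brown fox", "a quick red fox"]
positions = build_positions(docs)

print(sorted(phrase_docs(positions, "brown fox")))                            # [0, 1]
print(sorted(docs_with(positions, "quick") & docs_with(positions, "fox")))    # AND: [0, 2]
print(sorted(docs_with(positions, "brown") | docs_with(positions, "red")))    # OR: [0, 1, 2]
print(sorted(docs_with(positions, "fox") - docs_with(positions, "quick")))    # NOT: [1]
```

Boolean queries reduce to set operations on document ids, while phrase queries need the stored positions to check that the words are adjacent.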

For the most part, the other portions of the library work around these two classes. text_data.query holds the internal support for building queries. It’s mainly meant as an internal data structure, although there are cases when you might want to use it to debug search results that don’t seem to be working. text_data.tokenize provides easy ways to write tokenizers (that is, functions that split up a string into a list of words, or “tokens”) that you can plug directly into a Corpus. And text_data.multi_corpus offers two simple functions for building Corpus or WordIndex objects from a list of other Corpus or WordIndex objects.
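
If the idea of a tokenizer is new to you, the following sketch shows the general shape: a function that takes a string and returns a list of tokens. It is written with Python's standard re module rather than the helpers in text_data.tokenize, and simple_tokenizer is a hypothetical name used only for this example.

```python
import re


def simple_tokenizer(text):
    """Lowercase `text` and return its runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


print(simple_tokenizer("Don't panic: it's 42!"))
# ["don't", 'panic', "it's", '42']
```

How you define a token (whether to keep apostrophes, digits, or case, for instance) changes every count and search result downstream, which is why it is worth being able to swap in your own function.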

The only partial exception to this rule is the text_data.display module, which offers features for displaying data visualizations and the top values in numpy matrices and arrays. For the most part, this, too, is designed to work with WordIndex and Corpus. But it’s flexible and accepts other numpy matrices and arrays as its inputs.