How text_data
is organized¶
In the next few parts of the tutorial, I’m going to introduce
a bunch of the tools that text_data
has to offer
within the context of a mock project. My hope is that they’ll
give you a sense of how you might introduce the library
into your analysis.
But before I do that, I want to introduce the structure of the library.
text_data
is built upon two classes, WordIndex
and Corpus
. The WordIndex
class indexes your documents in an efficient data structure
and provides a number of ways for you to perform statistical lookups.
You can see how often words are used across your entire set of documents
or within a single option. And it supports matrix and vector operations
that allow you to see those same statistics at a much broader scale.
Because it uses an efficient data structure and parallelized
Rust code, those matrix and vector operations run rapidly.
In addition, WordIndex
offers a number of ways
to split up or concatenate individual indexes you have
so you can compare how a portion of documents compares to another portion
of documents. This can be useful as you’re conducting a standalone
analysis, or it can be useful if you’re trying to debug
why a machine learning model is incorectly classifying
some of your documents.
Corpus
builds upon the WordIndex
to support searching through documents. You can look up arbitrarily
long phrases and conduct boolean AND
, NOT
, and OR
queries. In addition, Corpus
offers an easy
way to index multi-word phrases.
For the most part, the other portions of the library work around these
two classes. text_data.query
holds the internal support for
building queries. It’s mainly meant as an internal data structure,
although there are cases when you might want to use it to
debug search results that don’t seem to be working. text_data.tokenize
provides easy ways to write tokenizers (or functions that split up
a string into a list of words, or “tokens”) that you can plug
directly into a Corpus
. And text_data.multi_corpus
offers two simple functions for building Corpus
or WordIndex
objects from a list of other Corpus
or WordIndex
objects.
The only partial exception to this rule is the text_data.display
module, which offers features for displaying data visualizations
and top values along numpy matrixes and arrays. For the most part,
this, too, is designed to work with the WordIndex
and Corpus
.
But it’s flexible and accepts other numpy matrixes and arrays as its inputs.