text_data is organized¶
In the next few parts of the tutorial, I’m going to introduce
a bunch of the tools that
text_data has to offer
within the context of a mock project. My hope is that they’ll
give you a sense of how you might introduce the library
into your analysis.
But before I do that, I want to introduce the structure of the library.
text_data is built upon two classes,
class indexes your documents in an efficient data structure
and provides a number of ways for you to perform statistical lookups.
You can see how often words are used across your entire set of documents
or within a single option. And it supports matrix and vector operations
that allow you to see those same statistics at a much broader scale.
Because it uses an efficient data structure and parallelized
Rust code, those matrix and vector operations run rapidly.
WordIndex offers a number of ways
to split up or concatenate individual indexes you have
so you can compare how a portion of documents compares to another portion
of documents. This can be useful as you’re conducting a standalone
analysis, or it can be useful if you’re trying to debug
why a machine learning model is incorectly classifying
some of your documents.
Corpus builds upon the
to support searching through documents. You can look up arbitrarily
long phrases and conduct boolean
queries. In addition,
Corpus offers an easy
way to index multi-word phrases.
For the most part, the other portions of the library work around these
text_data.query holds the internal support for
building queries. It’s mainly meant as an internal data structure,
although there are cases when you might want to use it to
debug search results that don’t seem to be working.
provides easy ways to write tokenizers (or functions that split up
a string into a list of words, or “tokens”) that you can plug
directly into a
offers two simple functions for building
objects from a list of other
The only partial exception to this rule is the
module, which offers features for displaying data visualizations
and top values along numpy matrixes and arrays. For the most part,
this, too, is designed to work with the
But it’s flexible and accepts other numpy matrixes and arrays as its inputs.