text_data package¶
Subpackages¶
Submodules¶
text_data.display module¶
Renders data visualizations on text_data.index.WordIndex
objects.
The graphics in this module are designed to work across different metrics. You just have to pass them 1- or 2-dimensional numpy arrays.
This enables you to take the outputs from any functions inside of
text_data.index.WordIndex
and visualize them.
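For instance, a minimal sketch (the corpus here is invented; the functions are the ones documented below) might pull a term vector out of a Corpus and hand the top scores to a renderer:
>>> from text_data.index import Corpus
>>> from text_data.display import display_score_table
>>> corpus = Corpus(["a few documents", "a few more documents"])
>>> words, scores = corpus.get_top_words(corpus.word_count_vector(), top_n=3)
>>> table = display_score_table(words, scores, table_name="Most common words")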
-
text_data.display.
display_score_table
(words, scores, table_name='Top Scores')[source]¶ Returns the top (or bottom) scores as a table.
It requires a 1-dimensional numpy array of the scores and the words, much as you would receive from text_data.index.WordIndex.get_top_words(). For a 2-dimensional equivalent, use display_score_tables().
- Parameters
words (array) – A 1-dimensional numpy array of words.
scores (array) – A 1-dimensional numpy array of corresponding scores.
table_name (str) – The name to give your table.
- Raises
ValueError – If you did not use a 1-dimensional array, or if the two arrays don’t have identical shapes.
- Return type
str
-
text_data.display.
display_score_tables
(words, scores, table_names=None)[source]¶ Renders two score tables.
This is the 2-dimensional equivalent of display_score_table(); see that function for details.
- Parameters
words (array) – A 2-dimensional matrix of words.
scores (array) – A 2-dimensional matrix of scores.
table_names (Optional[List[str]]) – A list of names for your corresponding tables.
- Raises
ValueError – If words and scores aren’t both 2-dimensional arrays of the same shape, or if table_names isn’t of the same length as the number of documents.
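As a hedged sketch (invented corpus, imports assumed), the per-document output of text_data.index.WordIndex.get_top_words() on any term matrix is a natural input:
>>> corpus = Corpus(["this is a document", "this is another document"])
>>> words, scores = corpus.get_top_words(corpus.count_matrix(), top_n=2)
>>> tables = display_score_tables(words, scores, table_names=["Doc 0", "Doc 1"])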
-
text_data.display.
frequency_map
(index, word_vector, x_label='Word Frequency', y_label='Score')[source]¶ A scatter plot mapping scores over a corpus to their underlying word frequencies.
I cribbed this idea from Monroe et al 2008, a great paper that uses it to show distributional problems in metrics that are trying to compare two things.
The basic idea is that by creating a scatter plot mapping the frequencies of words to scores, you can both figure out which scores are disproportionately high or low and identify bias in whether your metric is excessively favoring common or rare words.
In order to render this graphic, your word vector has to conform to the number of words in your index. If you feel the need to remove words to make the graphic manageable to look at, consider using
text_data.index.WordIndex.skip_words()
.
- Parameters
index (WordIndex) – A text_data.index.WordIndex object. This is used to get the overall frequencies.
word_vector (array) – A 1-dimensional numpy array with floating point scores.
x_label (str) – The name of the x label for your graphic.
y_label (str) – The name of the y label for your graphic.
- Raises
ValueError – If the word_vector doesn’t have 1 dimension or if the vector isn’t the same length as your vocabulary.
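For instance, a sketch along these lines (invented corpus, imports assumed) plots each word’s log-odds against its frequency; because the vector comes from the same index, its length matches the vocabulary:
>>> corpus = Corpus(["the cat sat", "the dog sat down"])
>>> chart = frequency_map(corpus, corpus.odds_vector(sublinear=True), y_label="Log odds")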
-
text_data.display.
heatmap
(distance_matrix, left_indexes=None, right_indexes=None, left_name='Left', right_name='Right', metric_name='Similarity')[source]¶ Displays a heatmap of scores across a 2-dimensional matrix.
The purpose of this is to visually gauge which documents are closest to each other given two sets of documents. (If you only have one set of documents, the left and right can be the same.) The visual rendering here is inspired by tensorflow’s Universal Sentence Encoder documentation. But, while you can use a universal sentence encoder to create the heatmap, you can also easily use any of the metrics in scikit’s pairwise_distances function. Or, indeed, any other 2-dimensional matrix of floats will do the trick.
Note that left_name and right_name must be different. To account for this, this function automatically adds a suffix to both names if they are the same.
- Parameters
distance_matrix (array) – A distance matrix of size M x N, where M is the number of documents on the left side and N is the number of documents on the right side.
left_indexes (Optional[List[Any]]) – Labels for the left side (the Y axis).
right_indexes (Optional[List[Any]]) – Labels for the right side (the X axis).
left_name (str) – The Y axis label.
right_name (str) – The X axis label.
- Raises
ValueError – If the size of the indexes doesn’t match the shape of the matrix, or if the distance matrix does not have 2 dimensions.
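One hedged way to build the distance matrix is with scikit-learn's pairwise distances over this library's TF-IDF matrix (scikit-learn is an assumption here; any M x N matrix of floats works):
>>> from sklearn.metrics import pairwise_distances
>>> corpus = Corpus(["the cat sat", "a dog sat", "the cat meowed"])
>>> distances = pairwise_distances(corpus.tfidf_matrix().T, metric="cosine")  # transpose so rows are documents
>>> chart = heatmap(distances, left_indexes=[0, 1, 2], right_indexes=[0, 1, 2], left_name="Documents", right_name="Documents (again)")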
-
text_data.display.
histogram
(values, x_label='Score', y_label='Number of Documents', x_scale='linear', y_scale='linear', max_bins=100)[source]¶ Displays a histogram of values.
This can be really useful for debugging the lengths of documents.
- Parameters
values (array) – A numpy array of quantitative values.
x_label (str) – A label for the x-axis.
y_label (str) – A label for the y-axis.
x_scale (str) – A continuous scale type, defined by altair.
y_scale (str) – A continuous scale type, defined by altair.
max_bins (int) – The maximum number of histogram bins.
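For example, a quick sketch for eyeballing document lengths (invented corpus, imports assumed):
>>> import numpy as np
>>> corpus = Corpus(["a short doc", "a slightly longer document here", "tiny"])
>>> chart = histogram(np.array(list(corpus.doc_lengths.values())), x_label="Document length")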
-
text_data.display.
render_bar_chart
(labels, vector_data, x_label='Score', y_label='Word')[source]¶ Renders a bar chart given a 1-dimensional numpy array.
- Parameters
vector_data (array) – A 1-dimensional numpy array of floating point scores.
labels (array) – A 1-dimensional numpy array of labels for the bar chart (e.g. words).
x_label (str) – The label for your x-axis (the score).
y_label (str) – The label for the y-axis (the words).
- Raises
ValueError – If the numpy arrays have more than 1 dimension.
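A hedged sketch tying this to get_top_words() (invented corpus, imports assumed):
>>> corpus = Corpus(["text data is fun", "text data is useful"])
>>> words, scores = corpus.get_top_words(corpus.word_count_vector(), top_n=4)
>>> chart = render_bar_chart(words, scores, x_label="Count")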
-
text_data.display.
render_multi_bar_chart
(labels, matrix_scores, document_names, y_label='Score')[source]¶ This renders a bar chart, grouped by document, showing word-document statistics.
It’s essentially the 2-dimensional matrix equivalent of
render_bar_chart()
.
- Parameters
labels (array) – A 2-dimensional numpy array of words, like those passed from text_data.index.WordIndex.get_top_words().
matrix_scores (array) – A 2-dimensional numpy array of scores, like those passed from text_data.index.WordIndex.get_top_words().
document_names (Optional[List[str]]) – A list of names for the documents. If None, this will display numbers incrementing from 0.
y_label (str) – The name for the y label (where the scores go).
- Raises
ValueError – If your labels or your axes aren’t 2 dimensional or aren’t of the same size.
text_data.index module¶
This module handles the indexing of text_data.
Its two classes — WordIndex
and Corpus
— form the central part
of this library.
text_data.index.WordIndex
indexes lists of documents — which themselves form
lists of words or phrases — and offers utilities for performing
statistical calculations on your data.
Using the index, you can find out how many times a given word appeared in a
document or do more complicated things, like finding the TF-IDF values
for every single word across all of the documents in a corpus. In addition
to offering a bunch of different ways to compute statistics, WordIndex
also offers capabilities for creating new WordIndex
objects — something
that can be very helpful if you’re trying to figure out what
makes a set of documents different from some other documents.
The text_data.index.Corpus
, meanwhile, is a wrapper over WordIndex
that offers tools for searching
through sets of documents. In addition, it offers tools for visually seeing the results of search queries.
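As a brief, hedged sketch of how the pieces fit together (the documents are invented; every method used is documented below):
>>> from text_data.index import Corpus
>>> corpus = Corpus(["the cat sat on the mat", "the dog sat on the log"])
>>> corpus.search_documents("cat")
{0}
>>> corpus.word_count("the")
4
>>> words, scores = corpus.get_top_words(corpus.tfidf_matrix(), top_n=2)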
-
class
text_data.index.
Corpus
(documents, tokenizer=functools.partial(<function postprocess_positions>, [<method 'lower' of 'str' objects>, ]functools.partial(<function tokenize_regex_positions>, '\\\\w+', inverse_match=False)), sep=None, prefix=None, suffix=None)[source]¶ Bases:
text_data.index.WordIndex
This is probably going to be your main entrypoint into text_data.
The corpus holds the raw text, the index, and the tokenized text of whatever you’re trying to analyze. Its primary role is to extend the functionality of WordIndex to support searching. This means that you can use the Corpus to search for arbitrarily long phrases using boolean search methods (AND, OR, NOT). In addition, it allows you to add indexes so you can calculate statistics on phrases. By using add_ngram_index(), you can figure out the frequency or TF-IDF values of multi-word phrases while still being able to search through your normal index.
Initializing Data
To instantiate the corpus, you need to include a list of documents, where each document is a string of text, and a tokenizer. There is a default tokenizer, which simply lowercases words and splits documents on r"\w+". For most tasks, this will be insufficient. But text_data.tokenize offers convenient ways that should make building the vast majority of tokenizers easy.
The Corpus can be instantiated using __init__ or by using chunks(), which yields a generator of smaller Corpus objects (mini-indexes). This allows you to technically perform calculations in-memory on larger databases.
You can also initialize a Corpus object by using the slice(), copy(), split_off(), or concatenate() methods. These methods work identically to their equivalent methods in text_data.index.WordIndex while updating extra data that the corpus has, updating n-gram indexes, and automatically re-indexing the corpus.
Updating Data
There are two methods for updating or adding data to the Corpus. update() allows you to add new documents to the corpus. add_ngram_index() allows you to add multi-word indexes.
Searching
There are a few methods devoted to searching. search_documents() allows you to find all of the individual documents matching a query. search_occurrences() shows all of the individual occurrences that matched your query. ranked_search() finds all of the individual occurrences and sorts them according to a variant of their TF-IDF score.
Statistics
Three methods allow you to get statistics about a search. search_document_count() allows you to find the total number of documents matching your query. search_document_freq() shows the proportion of documents matching your query. And search_occurrence_count() finds the total number of matches you have for your query.
Display
There are a number of functions designed to help you visually see the results of your query. display_document() and display_documents() render your documents in HTML. display_document_count(), display_document_frequency(), and display_occurrence_count() all render bar charts showing the number of query results you got. And display_search_results() shows the results of your search.
-
documents
¶ A list of all the raw, non-tokenized documents in the corpus.
-
tokenizer
A function that converts a string (one of the documents from documents) into a list of words and a list of the character-level positions where the words are located in the raw text. See
text_data.tokenize
for details.
-
tokenized_documents
¶ A list of the tokenized documents (each a list of words)
-
ngram_indexes
¶ A list of
WordIndex
objects for multi-word (n-gram) indexes. Seeadd_ngram_index()
for details.
-
ngram_sep
¶ A separator in between words. See
add_ngram_index()
for details.
-
ngram_prefix
¶ A prefix to go before any n-gram phrases. See
add_ngram_index()
for details.
-
ngram_suffix
¶ A suffix to go after any n-gram phrases. See
add_ngram_index()
for details.
- Parameters
documents (List[str]) – A list of the raw, un-tokenized texts.
tokenizer (Callable[[str], Tuple[List[str], List[Tuple[int, int]]]]) – A function to tokenize the documents. See text_data.tokenize for details.
sep (Optional[str]) – The separator you want to use for computing n-grams. See add_ngram_index() for details.
prefix (Optional[str]) – The prefix you want to use for n-grams. See add_ngram_index() for details.
suffix (Optional[str]) – The suffix you want to use for n-grams. See add_ngram_index() for details.
-
add_documents
(tokenized_documents, indexed_locations=None)[source]¶ This overrides the
add_documents()
method.
Because Corpus objects can have n-gram indices, simply running add_documents would cause the n-gram indices to go out of sync with the overall corpus. In order to prevent that, this function raises an error if you try to run it.
- Raises
NotImplementedError – Warns you to use text_data.index.Corpus.update() instead.
-
add_ngram_index
(n=1, default=True, sep=None, prefix=None, suffix=None)[source]¶ Adds an n-gram index to the corpus.
This creates a WordIndex object that you can access by typing self.ngram_indexes[n].
There are times when you might want to compute TF-IDF scores, word frequency scores, or similar scores over a multi-word index. For instance, you might want to know how frequently someone said ‘United States’ in a speech, without caring how often they used the word ‘united’ or ‘states’.
This function helps you do that. It automatically splits up your documents into an overlapping set of n-length phrases. Internally, this takes each of your tokenized documents, merges them into lists of n-length phrases, and joins each of those lists by a space. However, you can customize this behavior. If you set prefix, each of the n-grams will be prefixed by that string; if you set suffix, each of the n-grams will end with that string. And if you set sep, each of the words in the n-gram will be separated by the separator.
Example
Say you have a simple four word corpus. If you use the default settings, here’s what your n-grams will look like:
>>> corpus = Corpus(["text data is fun"]) >>> corpus.add_ngram_index(n=2) >>> corpus.ngram_indexes[2].vocab_list ['data is', 'is fun', 'text data']
By altering sep, prefix, or suffix, you can alter that behavior. But be careful to set default to False if you want to change the behavior from something you set up in __init__. If you don’t, this will use whatever settings you instantiated the class with.
>>> corpus.add_ngram_index(n=2, sep="</w><w>", prefix="<w>", suffix="</w>", default=False) >>> corpus.ngram_indexes[2].vocab_list ['<w>data</w><w>is</w>', '<w>is</w><w>fun</w>', '<w>text</w><w>data</w>']
- Parameters
n (int) – The number of n-grams (defaults to unigrams).
default (bool) – If true, will keep the values stored in init (including defaults).
sep (Optional[str]) – The separator in between words (if storing n-grams).
prefix (Optional[str]) – The prefix before the first word of each n-gram.
suffix (Optional[str]) – The suffix after the last word of each n-gram.
-
classmethod
chunks
(documents, tokenizer=functools.partial(<function postprocess_positions>, [<method 'lower' of 'str' objects>, ]functools.partial(<function tokenize_regex_positions>, '\\\\w+', inverse_match=False)), sep=None, prefix=None, suffix=None, chunksize=1000000)[source]¶ Iterates through documents, yielding a
Corpus with chunksize documents.
This is designed to allow you to technically use Corpus on large document sets. However, you should note that searching for documents will only work within the context of the current chunk. The same is true for any frequency metrics. As such, you should probably limit metrics to raw counts or aggregations you’ve derived from raw counts.
Example
>>> for docs in Corpus.chunks(["chunk one", "chunk two"], chunksize=1): ... print(len(docs)) 1 1
- Parameters
documents (Iterator[str]) – A list of raw text items (not tokenized).
tokenizer (Callable[[str], Tuple[List[str], List[Tuple[int, int]]]]) – A function to tokenize the documents.
sep (Optional[str]) – The separator you want to use for computing n-grams.
prefix (Optional[str]) – The prefix for n-grams.
suffix (Optional[str]) – The suffix for n-grams.
chunksize (int) – The number of documents in each chunk.
- Return type
Generator[Corpus, None, None]
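One hedged way to keep to raw counts while chunking (documents and chunk size invented):
>>> from collections import Counter
>>> total_counts = Counter()
>>> for chunk in Corpus.chunks(["one doc", "two docs", "three docs"], chunksize=2):
...     total_counts.update(dict(chunk.most_common()))
>>> total_counts["docs"]
2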
-
concatenate
(other)[source]¶ This combines two
Corpus objects into one, much like text_data.index.WordIndex.concatenate().
However, the new Corpus has data from this corpus, including n-gram data. Because of this, the two Corpus objects must have the same keys for their n-gram dictionaries.
Example
>>> corpus_1 = Corpus(["i am an example"]) >>> corpus_2 = Corpus(["i am too"]) >>> corpus_1.add_ngram_index(n=2) >>> corpus_2.add_ngram_index(n=2) >>> combined_corpus = corpus_1.concatenate(corpus_2) >>> combined_corpus.most_common() [('am', 2), ('i', 2), ('an', 1), ('example', 1), ('too', 1)] >>> combined_corpus.ngram_indexes[2].most_common() [('i am', 2), ('am an', 1), ('am too', 1), ('an example', 1)]
-
copy
()[source]¶ This creates a shallow copy of a
Corpus
object.
It extends the contents of Corpus to also store data about the objects themselves.
- Return type
Corpus
-
display_document
(doc_idx)[source]¶ Print an entire document, given its index.
- Parameters
doc_idx (int) – The index of the document.
- Return type
HTML
-
display_document_count
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Returns a bar chart (in altair) showing the queries with the largest number of documents.
Note
This method requires that you have altair installed. To install, type pip install text_data[display] or poetry add text_data -E display.
- Parameters
queries (List[str]) – A list of queries (in the same form you use to search for things).
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
-
display_document_frequency
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Displays a bar chart showing the percentages of documents with a given query.
Note
This method requires that you have altair installed. To install, type pip install text_data[display] or poetry add text_data -E display.
- Parameters
queries (List[str]) – A list of queries.
query_tokenizer (Callable[[str], List[str]]) – A tokenizer for each query.
-
display_documents
(documents)[source]¶ Display a number of documents, at the specified indexes.
- Parameters
documents (List[int]) – A list of document indexes.
- Return type
HTML
-
display_occurrence_count
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Display a bar chart showing the number of times a query matches.
Note
This method requires that you have altair installed. To install, type pip install text_data[display] or poetry add text_data -E display.
- Parameters
queries (List[str]) – A list of queries.
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
-
display_search_results
(search_query, query_tokenizer=<method 'split' of 'str' objects>, max_results=None, window_size=None)[source]¶ Shows the results of a ranked query.
This function runs a query and then renders the result in human-readable HTML. For each result, you will get a document ID and the count of the result.
In addition, all of the matching occurrences of phrases or words you searched for will be highlighted in bold. You can optionally decide how many results you want to return and how long you want each result to be (up to the length of the whole document).
- Parameters
search_query (str) – The query you’re searching for.
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
max_results (Optional[int]) – The maximum number of results. If None, returns all results.
window_size (Optional[int]) – The number of characters you want to return around the matching phrase. If None, returns the entire document.
- Return type
HTML
-
ranked_search
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ This produces a list of search responses in ranked order.
More specifically, the documents are ranked in order of the sum of the TF-IDF scores for each word in the query (with the exception of words that are negated using a NOT operator).
To compute the TF-IDF scores, I simply have computed the dot products between the raw query counts and the TF-IDF scores of all the unique words in the query. This is roughly equivalent to the
ltn.lnn
normalization scheme described in Manning. (The catch is that I have normalized the term frequencies in the document to the length of the document.)
Each item in the resulting list is a list referring to a single document. The items inside each of those lists are of the same format you get from search_occurrences(). The first item in each list is either the item having the largest number of words in it or the item that’s the nearest to another match within the document.
- Parameters
query_string (str) – The query string.
query_tokenizer (Callable[[str], List[str]]) – Function for tokenizing the query.
- Return type
List[List[PositionResult]]
- Returns
A list of tuples, each in the same format as
search_occurrences()
.
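A small sketch (invented corpus); each inner list groups the matches for one document, ranked best-first:
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"])
>>> results = corpus.ranked_search("cow grass")
>>> best_match = results[0][0]  # a PositionResult for the top-ranked document
>>> snippet = corpus.documents[best_match.doc_id][best_match.raw_start:best_match.raw_end]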
-
search_document_count
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the total number of documents matching a query.
By entering a search, you can get the total number of documents that match the query.
Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_document_count("cow") 2 >>> corpus.search_document_count("grass") 1 >>> corpus.search_document_count("the") 2
- Parameters
query_string (str) – The query you’re searching for.
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
- Return type
int
-
search_document_freq
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the percentage of documents that match a query.
Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_document_freq("cow") 1.0 >>> corpus.search_document_freq("grass") 0.5 >>> corpus.search_document_freq("the grass") 0.5 >>> corpus.search_document_freq("the OR nonsense") 1.0
- Parameters
query_string (str) – The query you’re searching for.
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
- Return type
float
-
search_documents
(query, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Search documents from a query.
In order to figure out the intricacies of writing queries, you should view text_data.query.Query. In general, standard boolean (AND, OR, NOT) searches work perfectly reasonably. You should generally not need to set query_tokenizer to anything other than the default (string split).
This produces a set of unique documents, where each item is the index of a matching document. To view the documents by their ranked importance (ranked largely using TF-IDF), use
ranked_search()
.Example
>>> corpus = Corpus(["this is an example", "here is another"]) >>> assert corpus.search_documents("is") == {0, 1} >>> assert corpus.search_documents("example") == {0}
- Parameters
query (str) – A string boolean query (as defined in text_data.query.Query).
query_tokenizer (Callable[[str], List[str]]) – A function to tokenize the words in your query. This allows you to optionally search for words in your index that include spaces (since it defaults to string.split).
- Return type
Set
[int
]
-
search_occurrence_count
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the total number of occurrences you have for the given query.
This just gets the number of items in
search_occurrences()
. As a result, searching for occurrences where two separate words occur will find the total number of places where either word occurs within the set of documents where both words appear.Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_occurrence_count("the") 3 >>> corpus.search_occurrence_count("the cow") 5 >>> corpus.search_occurrence_count("'the cow'") 2
- Parameters
query_string (str) – The query you’re searching for.
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
- Return type
int
-
search_occurrences
(query, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Search for matching positions within a search.
This allows you to figure out all of the occurrences matching your query. In addition, this is used internally to display search results.
Each matching position comes in the form of a tuple where the first field doc_id refers to the position of the document, the second field first_idx refers to the starting index of the occurrence (among the tokenized documents), last_idx refers to the last index of the occurrence, raw_start refers to the starting index of the occurrence within the raw, non-tokenized documents, and raw_end refers to the index after the last character of the matching result within the non-tokenized documents. There is not really a reason behind this decision.
Example
>>> corpus = Corpus(["this is fun"]) >>> result = list(corpus.search_occurrences("'this is'"))[0] >>> result PositionResult(doc_id=0, first_idx=0, last_idx=1, raw_start=0, raw_end=7) >>> corpus.documents[result.doc_id][result.raw_start:result.raw_end] 'this is' >>> corpus.tokenized_documents[result.doc_id][result.first_idx:result.last_idx+1] ['this', 'is']
- Parameters
query (str) – The string query. See text_data.query.Query for details.
query_tokenizer (Callable[[str], List[str]]) – The tokenizing function for the query. See text_data.query.Query or search_documents() for details.
- Return type
Set
[PositionResult
]
-
slice
(indexes)[source]¶ This creates a
Corpus object only including the documents listed.
This overrides the method in text_data.index.WordIndex, which does the same thing (but without making changes to the underlying document set). This also creates slices of any of the n-gram indexes you have created.
Note
This also changes the indexes for the new corpus so they all go from 0 to len(indexes).
- Parameters
indexes (Set[int]) – A set of document indexes you want to have in the new index.
Example
>>> corpus = Corpus(["example document", "another example", "yet another"]) >>> corpus.add_ngram_index(n=2) >>> sliced_corpus = corpus.slice({1}) >>> len(sliced_corpus) 1 >>> sliced_corpus.most_common() [('another', 1), ('example', 1)] >>> sliced_corpus.ngram_indexes[2].most_common() [('another example', 1)]
- Return type
Corpus
-
split_off
(indexes)[source]¶ This operates like
text_data.index.WordIndex.split_off().
But it additionally maintains the state of the Corpus data, similar to how slice() works.
Example
>>> corpus = Corpus(["i am an example", "so am i"]) >>> sliced_data = corpus.split_off({0}) >>> corpus.documents ['so am i'] >>> sliced_data.documents ['i am an example'] >>> corpus.most_common() [('am', 1), ('i', 1), ('so', 1)]
- Return type
Corpus
-
to_index
()[source]¶ Converts a
Corpus
object into a WordIndex object.
Corpus objects are convenient because they allow you to search across documents, in addition to computing statistics about them. But sometimes, you don’t need that, and the added convenience comes with extra memory requirements.
- Return type
WordIndex
-
-
class
text_data.index.
PositionResult
(doc_id, first_idx, last_idx, raw_start, raw_end)¶ Bases:
tuple
This represents the position of a word or phrase within a document.
See
text_data.index.Corpus.search_occurrences()
for more details and an example.
- Parameters
doc_id (int) – The index of the document within the index
first_idx (int) – The index of the first word within the tokenized document at
corpus.tokenized_documents[doc_id]
.last_idx (int) – The index of the last word within the tokenized document at
corpus.tokenized_documents[doc_id]
.raw_start (Optional[int]) – The starting character-level index within the raw string document at
corpus.documents[doc_id]
.raw_end (Optional[int]) – The index after the ending character-level index within the raw string document at
corpus.documents[doc_id]
.
-
doc_id
¶ Alias for field number 0
-
first_idx
¶ Alias for field number 1
-
last_idx
¶ Alias for field number 2
-
raw_end
¶ Alias for field number 4
-
raw_start
¶ Alias for field number 3
-
class
text_data.index.
WordIndex
(tokenized_documents, indexed_locations=None)[source]¶ Bases:
object
An inverted, positional index containing the words in a corpus.
This is designed to allow people to be able to quickly compute statistics about the language used across a corpus. The class offers a couple of broad strategies for understanding the ways in which words are used across documents.
Manipulating Indexes
These functions are designed to allow you to create new indexes based on ones you already have. They operate kind of like slices and filter functions in pandas, where your goal is to be able to create new data structures that you can analyze independently from ones you’ve already created. Most of them can also be used with method chaining. However, some of these functions remove positional information from the index, so be careful.
copy() creates an identical copy of a WordIndex object.
slice(), slice_many(), and split_off() all take sets of document indexes and create new indexes with only those documents.
add_documents() allows you to add new documents into an existing WordIndex object.
concatenate() similarly combines WordIndex objects into a single WordIndex.
flatten() takes a WordIndex and returns an identical index that only has one document.
skip_words() takes a set of words and returns a WordIndex that does not have those words.
reset_index() changes the document indexes.
Corpus Information
A number of functions are designed to allow you to look up information about the corpus. For instance, you can collect a sorted list or a set of all the unique words in the corpus. Or you can get a list of the most commonly appearing elements:
vocab and vocab_list both return the unique words or phrases appearing in the index.
vocab_size gets the number of unique words in the index.
num_words gets the total number of words in the index.
doc_lengths gets a dictionary mapping documents to the number of tokens, or words, they contain.
Word Statistics
These allow you to gather statistics about single words or about word, document pairs. For instance, you can see how many words there are in the corpus, how many unique words there are, or how often a particular word appears in a document.
The statistics generally fit into four categories. The first category computes statistics about how often a specific word appears in the corpus as a whole. The second category computes statistics about how often a specific word appears in a specific document. The third and fourth categories echo those first two categories but perform the statistics efficiently across the corpus as a whole, creating 1-dimensional numpy arrays in the case of the word-corpus statistics and 2-dimensional numpy arrays in the case of the word-document statistics. Functions in these latter two categories all end in
_vector
and _matrix respectively.
Here’s how those statistics map to one another:
[Table not fully recoverable from extraction. Its columns were Word-Corpus, Word-Document, Vector, and Matrix: each word-corpus statistic (e.g. word_count, word_frequency, __contains__, odds_word) maps to a word-document equivalent (e.g. term_count), a vector equivalent (e.g. word_count_vector), and a matrix equivalent (e.g. count_matrix).]
In the case of the vector and matrix calculations, the arrays represent the unique words of the vocabulary, presented in sorted order. As a result, you can safely run element-wise calculations over the matrices.
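For instance (a small sketch), the order of the entries in any _vector result lines up with vocab_list:
>>> corpus = Corpus(["a cat and a dog"])
>>> corpus.vocab_list
['a', 'and', 'cat', 'dog']
>>> corpus.word_count_vector()
array([2., 1., 1., 1.])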
In addition to the term vector and term-document matrix functions, there is
get_top_words()
, which is designed to allow you to find the highest or lowest scores and their associated words along any term vector or term-document matrix you please.Note
For the most part, you will not want to instantiate WordIndex directly. Instead, you will likely use Corpus, which subclasses WordIndex.
That’s because Corpus offers utilities for searching through documents. In addition, with the help of tools from text_data.tokenize, instantiating Corpus objects is a bit simpler than instantiating WordIndex objects directly.
I particularly recommend that you do not instantiate indexed_locations directly (i.e. outside of Corpus). The only way you can do anything with indexed_locations from outside of Corpus is by using an internal attribute and hacking through poorly documented Rust code.
tokenized_documents (List[List[str]]) – A list of documents where each document is a list of words.
indexed_locations (Optional[List[Tuple[int, int]]]) – A list of documents where each document contains a list of the start and end positions of the words in tokenized_documents.
-
add_documents
(tokenized_documents, indexed_locations=None)[source]¶ This function updates the index with new documents.
It operates similarly to
text_data.index.Corpus.update()
, taking new documents and mutating the existing one.Example
>>> tokenized_words = ["im just a simple document".split()] >>> index = WordIndex(tokenized_words) >>> len(index) 1 >>> index.num_words 5 >>> index.add_documents(["now im an entire corpus".split()]) >>> len(index) 2 >>> index.num_words 10
-
concatenate
(other, ignore_index=True)[source]¶ Creates a
WordIndex
object with the documents of both this object and the other.See
text_data.multi_corpus.concatenate()
for more details.- Parameters
ignore_index (bool) – If set to True, which is the default, the document indexes will be re-indexed starting from 0.
- Raises
ValueError – If ignore_index is set to False and some of the indexes overlap.
- Return type
WordIndex
-
count_matrix
()[source]¶ Returns a matrix showing the number of times each word appeared in each document.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.count_matrix().tolist() == [[0., 1.], [1., 2.], [0., 1.], [0., 1.]] True
- Return type
array
-
doc_contains
(word, document)[source]¶ States whether the given document contains the word.
Example
>>> corpus = Corpus(["words", "more words"]) >>> corpus.doc_contains("more", 0) False >>> corpus.doc_contains("more", 1) True
- Parameters
word (str) – The word you’re looking up.
document (int) – The index of the document.
- Raises
ValueError – If the document you’re looking up doesn’t exist.
- Return type
bool
-
doc_count_vector
()[source]¶ Returns the total number of documents each word appears in.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.doc_count_vector() array([1., 2.])
- Return type
array
-
doc_freq_vector
()[source]¶ Returns the proportion of documents each word appears in.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.doc_freq_vector() array([0.5, 1. ])
- Return type
array
-
property
doc_lengths
¶ Returns a dictionary mapping the document indices to their lengths.
Example
>>> corpus = Corpus(["a cat and a dog", "a cat", ""]) >>> assert corpus.doc_lengths == {0: 5, 1: 2, 2: 0}
- Return type
Dict
[int
,int
]
-
docs_with_word
(word)[source]¶ Returns a list of all the documents containing a word.
Example
>>> corpus = Corpus(["example document", "another document"]) >>> assert corpus.docs_with_word("document") == {0, 1} >>> assert corpus.docs_with_word("another") == {1}
- Parameters
word (
str
) – The word you’re looking up.- Return type
Set
[int
]
-
document_count
(word)[source]¶ Returns the total number of documents a word appears in.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.document_count("example") 2 >>> corpus.document_count("another") 1
- Parameters
word (
str
) – The word you’re looking up.- Return type
int
-
document_frequency
(word)[source]¶ Returns the percentage of documents that contain a word.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.document_frequency("example") 1.0 >>> corpus.document_frequency("another") 0.5
- Parameters
word (
str
) – The word you’re looking up.- Return type
float
-
flatten
()[source]¶ Flattens a multi-document index into a single-document corpus.
This creates a new
WordIndex
object stripped of any positional information that has a single document in it. However, the list of words and their indexes remain.Example
>>> corpus = Corpus(["i am a document", "so am i"]) >>> len(corpus) 2 >>> flattened = corpus.flatten() >>> len(flattened) 1 >>> assert corpus.most_common() == flattened.most_common()
- Return type
WordIndex
-
frequency_matrix
()[source]¶ Returns a matrix showing the frequency of each word appearing in each document.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.frequency_matrix().tolist() == [[0.0, 0.2], [1.0, 0.4], [0.0, 0.2], [0.0, 0.2]] True
- Return type
array
-
get_top_words
(term_matrix, top_n=None, reverse=True)[source]¶ Get the top values along a term matrix.
Given a matrix where each row represents a word in your vocabulary, this returns a numpy matrix of those top values, along with an array of their respective words.
You can choose the number of results you want to get by setting
top_n
to some positive value, or you can leave it be and return all of the results in sorted order. Additionally, by setting reverse to False (instead of its default of True), you can return the scores from smallest to largest.
- Parameters
term_matrix (array) – A matrix of floats where each row represents a word.
top_n (Optional[int]) – The number of values you want to return. If None, returns all values.
reverse (bool) – If true (the default), returns the N values with the highest scores. If false, returns the N values with the lowest scores.
- Return type
Tuple
[array
,array
]- Returns
A tuple of 2-dimensional numpy arrays, where the first item is an array of the top-scoring words and the second item is an array of the top scores themselves. Both arrays are of the same size, that is
min(self.vocab_size, top_n)
by the number of columns in the term matrix.- Raises
ValueError – If
top_n
is less than 1, if there are not the same number of rows in the matrix as there are unique words in the index, or if the numpy array doesn’t have 1 or 2 dimensions.
Example
The first thing you need to do in order to use this function is create a 1- or 2-dimensional term matrix, where the number of rows corresponds to the number of unique words in the corpus. Any of the functions within
WordIndex that end in _matrix(**kwargs) (for 2-dimensional arrays) or _vector(**kwargs) (for 1-dimensional arrays) will do the trick here. I’ll show an example with both a word count vector and a word count matrix:
>>> corpus = Corpus(["The cat is near the birds", "The birds are distressed"]) >>> corpus.get_top_words(corpus.word_count_vector(), top_n=2) (array(['the', 'birds'], dtype='<U10'), array([3., 2.])) >>> corpus.get_top_words(corpus.count_matrix(), top_n=1) (array([['the', 'the']], dtype='<U10'), array([[2., 1.]]))
Similarly, you can return the scores from lowest to highest by setting
reverse=False
. (This is not the default.):>>> corpus.get_top_words(-1. * corpus.word_count_vector(), top_n=2, reverse=False) (array(['the', 'birds'], dtype='<U10'), array([-3., -2.]))
-
idf
(word)[source]¶ Returns the inverse document frequency.
If the number of documents in your
WordIndex
index
is \(N\) and the document frequency fromdocument_frequency()
is \(df\), the inverse document frequency is \(\frac{N}{df}\).Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.idf("example") 1.0 >>> corpus.idf("another") 2.0
- Parameters
word (
str
) – The word you’re looking for.- Return type
float
-
idf_vector
()[source]¶ Returns the inverse document frequency vector.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.idf_vector() array([2., 1.])
- Return type
array
-
max_word_count
()[source]¶ Returns the most common word and the number of times it appeared in the corpus.
Returns
None
if there are no words in the corpus.Example
>>> corpus = Corpus([]) >>> corpus.max_word_count() is None True >>> corpus.update(["a bird a plane superman"]) >>> corpus.max_word_count() ('a', 2)
- Return type
Optional
[Tuple
[str
,int
]]
-
most_common
(num_words=None)[source]¶ Returns the most common items.
This is nearly identical to
collections.Counter.most_common
. However, unlike collections.Counter.most_common, values with the same count are returned in alphabetical order.
>>> corpus = Corpus(["i walked to the zoo", "i bought a zoo"]) >>> corpus.most_common() [('i', 2), ('zoo', 2), ('a', 1), ('bought', 1), ('the', 1), ('to', 1), ('walked', 1)] >>> corpus.most_common(2) [('i', 2), ('zoo', 2)]
- Parameters
num_words (
Optional
[int
]) – The number of words you return. If you enter None or you enter a number larger than the total number of words, it returns all of the words, in sorted order from most common to least common.- Return type
List
[Tuple
[str
,int
]]
-
property
num_words
¶ Returns the total number of words in the corpus (not just unique).
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.num_words 5
- Return type
int
-
odds_document
(word, document, sublinear=False)[source]¶ Returns the odds of finding a word in a document.
This is the equivalent of
odds_word()
. But instead of calculating items at the word-corpus level, the calculations are performed at the word-document level.
Example
>>> corpus = Corpus(["this is a document", "document two"]) >>> corpus.odds_document("document", 1) 1.0 >>> corpus.odds_document("document", 1, sublinear=True) 0.0
- Parameters
word (str) – The word you’re looking up.
document (int) – The index of the document.
sublinear (bool) – If True, returns the log-odds of finding the word in the document.
- Raises
ValueError – If the document doesn’t exist.
- Return type
float
-
odds_matrix
(sublinear=False, add_k=None)[source]¶ Returns the odds of finding a word in a document for every possible word-document pair.
Because not all words are likely to appear in all of the documents, this implementation adds
1
to all of the numerators before taking the frequencies. So\(O(w) = \frac{c_{i} + 1}{N + \vert V \vert}\)
where \(\vert V \vert\) is the total number of unique words in each document, \(N\) is the total number of total words in each document, and \(c_i\) is the count of a word in a document.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.odds_matrix() array([[0.33333333, 1. ], [1. , 0.33333333], [1. , 1. ]]) >>> corpus.odds_matrix(sublinear=True) array([[-1.5849625, 0. ], [ 0. , -1.5849625], [ 0. , 0. ]])
- Parameters
sublinear (bool) – If True, computes the log-odds.
add_k (Optional[float]) – This adds k to each of the non-zero elements in the matrix. Since \(\log{1} = 0\), this prevents 50 percent probabilities from appearing to be the same as elements that don’t exist.
- Return type
array
-
odds_vector
(sublinear=False)[source]¶ Returns a vector of the odds of each word appearing at random.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.odds_vector() array([0.2, 1. , 0.2, 0.2]) >>> corpus.odds_vector(sublinear=True) array([-2.32192809, 0. , -2.32192809, -2.32192809])
- Parameters
sublinear (
bool
) – If true, returns the log odds.- Return type
array
-
odds_word
(word, sublinear=False)[source]¶ Returns the odds of seeing a word at random.
In statistics, the odds of something happening are the probability of it happening, versus the probability of it not happening, that is \(\frac{p}{1 - p}\). The “log odds” of something happening — the result of using
self.log_odds_word
— is similarly equivalent to \(log_{2}{\frac{p}{1 - p}}\).(The probability in this case is simply the word frequency.)
Example
>>> corpus = Corpus(["i like odds ratios"]) >>> np.isclose(corpus.odds_word("odds"), 1. / 3.) True >>> np.isclose(corpus.odds_word("odds", sublinear=True), np.log2(1./3.)) True
- Parameters
word (str) – The word you’re looking up.
sublinear (bool) – If true, returns the log odds.
- Return type
float
-
one_hot_matrix
()[source]¶ Returns a matrix showing whether each given word appeared in each document.
For these matrices, all cells contain a floating point value of either a 1., if the word is in that document, or a 0. if the word is not in the document.
These are sometimes referred to as ‘one-hot encoding matrices’ in machine learning.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> np.array_equal( ... corpus.one_hot_matrix(), ... np.array([[0., 1.], [1., 1.], [0., 1.], [0., 1.]]) ... ) True
- Return type
array
-
reset_index
(start_idx=None)[source]¶ An in-place operation that resets the document indexes for this corpus.
When you reset the index, all of the documents change their values, starting at
start_idx
(and incrementing from there). For the most part, you will not need to do this, since most of the library does not give you the option to change the document indexes. However, it may be useful when you’re using slice() or split_off().
- Parameters
start_idx (
Optional
[int
]) – The first (lowest) document index you want to set. Values must be positive. Defaults to 0.
-
skip_words
(words)[source]¶ Creates a
WordIndex
without any of the skipped words.This enables you to create an index that does not contain rare words, for example. The index will not have any positions associated with them, so be careful when implementing it on a
text_data.index.Corpus
object.Example
>>> skip_words = {"document"} >>> corpus = Corpus(["example document", "document"]) >>> "document" in corpus True >>> without_document = corpus.skip_words(skip_words) >>> "document" in without_document False
- Return type
-
slice
(indexes)[source]¶ Returns an index that just contains documents from the set of words.
- Parameters
indexes (
Set
[int
]) – A set of index values for the documents.
Example
>>> index = WordIndex([["example"], ["document"], ["another"], ["example"]]) >>> sliced_idx = index.slice({0, 2}) >>> len(sliced_idx) 2 >>> sliced_idx.most_common() [('another', 1), ('example', 1)]
- Return type
-
slice_many
(indexes_list)[source]¶ This operates like
slice()
but creates multipleWordIndex
objects.Example
>>> corpus = Corpus(["example document", "another example", "yet another"]) >>> first, second, third = corpus.slice_many([{0}, {1}, {2}]) >>> first.documents ['example document'] >>> second.documents ['another example'] >>> third.documents ['yet another']
- Parameters
indexes_list (
List
[Set
[int
]]) – A list of sets of indexes. Seetext_data.index.WordIndex.slice()
for details.- Return type
List
[WordIndex
]
-
split_off
(indexes)[source]¶ Returns an index with just a set of documents, while removing them from the index.
- Parameters
indexes (
Set
[int
]) – A set of index values for the documents.
Note
This removes words from the index in place. So make sure you want to do that before using this function.
Example
>>> index = WordIndex([["example"], ["document"], ["another"], ["example"]]) >>> split_idx = index.split_off({0, 2}) >>> len(split_idx) 2 >>> len(index) 2 >>> split_idx.most_common() [('another', 1), ('example', 1)] >>> index.most_common() [('document', 1), ('example', 1)]
- Return type
WordIndex
-
term_count
(word, document)[source]¶ Returns the total number of times a word appeared in a document.
Assuming the document exists, returns 0 if the word does not appear in the document.
Example
>>> corpus = Corpus(["i am just thinking random thoughts", "am i"]) >>> corpus.term_count("random", 0) 1 >>> corpus.term_count("random", 1) 0
- Parameters
word (str) – The word you’re looking up.
document (int) – The index of the document.
- Raises
ValueError – If the document you selected doesn’t exist.
- Return type
int
-
term_frequency
(word, document)[source]¶ Returns the proportion of words in document
document
that areword
.Example
>>> corpus = Corpus(["just coming up with words", "more words"]) >>> np.isclose(corpus.term_frequency("words", 1), 0.5) True >>> np.isclose(corpus.term_frequency("words", 0), 0.2) True
- Parameters
word (str) – The word you’re looking up.
document (int) – The index of the document.
- Raises
ValueError – If the document you’re looking up doesn’t exist
- Return type
float
-
tfidf_matrix
(norm='l2', use_idf=True, smooth_idf=False, sublinear_tf=True, add_k=1)[source]¶ This creates a term-document TF-IDF matrix from the index.
In natural language processing, TF-IDF is a mechanism for finding out which words are distinct across documents. It’s used particularly widely in information retrieval, where your goal is to rank documents that you know match a query by how relevant you think they’ll be.
The basic intuition goes like this: If a word appears particularly frequently in a document, it’s probably more relevant to that document than if the word occurred more rarely. But, some words are simply common: If document X uses the word ‘the’ more often than the word ‘idiomatic,’ that really tells you more about the words ‘the’ and ‘idiomatic’ than it does about the document.
TF-IDF tries to balance these two competing interests by taking the ‘term frequency,’ or how often a word appears in the document, and normalizing it by the ‘document frequency,’ or the proportion of documents that contain the word. This has the effect of reducing the weights of common words (and even setting the weights of some very common words to 0 in some implementations).
It should be noted that there are a number of different implementations of TF-IDF. Within information retrieval, TF-IDF is part of the ‘SMART Information Retrieval System’. Although the exact equations can vary considerably, they typically follow the same approach: First, they find some value to represent the frequency of each word in the document. Often (but not always), this is just the raw number of times in which a word appeared in the document. Then, they normalize that based on the document frequency. And finally, they normalize those values based on the length of the document, so that long documents are not weighted more favorably (or less favorably) than shorter documents.
The approach that I have taken to this is shamelessly cribbed from scikit’s TfidfTransformer. Specifically, I’ve allowed for some customization of the specific formula for TF-IDF while not including methods that require access to the raw documents, which would be computationally expensive to perform. This allows for the following options:
You can set the term frequency to either take the raw count of the word in the document (\(c_{t,d}\)) or by using
sublinear_tf=True
and taking \(1 + \log_{2}{c_{t,d}}\)You can skip taking the inverse document frequency \(df^{-1}\) altogether by setting
use_idf=False
or you can smooth the inverse document frequency by settingsmooth_idf=True
. This adds one to the numerator and the denominator. (Note: Because this method is only run on a vocabulary of words that are in the corpus, there can’t be any divide by zero errors, but this allows you to replicate scikit’sTfidfTransformer
.)You can add some number to the logged inverse document frequency by setting
add_k
to something other than 1. This is the only difference between this implementation and scikit’s, as scikit automatically sets
at 1.Finally, you can choose how to normalize the document lengths. By default, this takes the L-2 norm, or \(\sqrt{\sum{w_{i,k}^{2}}}\), where \(w_{i,k}\) is the weight you get from multiplying the term frequency by the inverse document frequency. But you can also set the norm to
'l1'
to get the L1-norm, or \(\sum{\vert w_{i,k} \vert}\). Or you can set it toNone
to avoid doing any document-length normalization at all.
Examples
To get a sense of the different options, let’s start by creating a pure count matrix with this method. To do that, we’ll set
norm=None
so we’re not normalizing by the length of the document,use_idf=False
so we’re not doing anything with the document frequency, andsublinear_tf=False
so we’re not taking the logged counts:>>> corpus = Corpus(["a cat", "a"]) >>> tfidf_count_matrix = corpus.tfidf_matrix(norm=None, use_idf=False, sublinear_tf=False) >>> assert np.array_equal(tfidf_count_matrix, corpus.count_matrix())
In this particular case, setting
sublinear_tf
toTrue
will produce the same result since all of the counts are 1 or 0 and \(\log{1} + 1 = 1\):>>> assert np.array_equal(corpus.tfidf_matrix(norm=None, use_idf=False), tfidf_count_matrix)
Now, we can incorporate the inverse document frequency. Because the word ‘a’ appears in both documents, its inverse document frequency is 1; the inverse document frequency of ‘cat’ is 2, since ‘cat’ appears in half of the documents. We’re additionally taking the base-2 log of the inverse document frequency and adding 1 to the final result. So we get:
>>> idf_add_1 = corpus.tfidf_matrix(norm=None, sublinear_tf=False, smooth_idf=False) >>> assert idf_add_1.tolist() == [[1., 1.], [2.,0.]]
Or we can add nothing to the logged values:
>>> idf = corpus.tfidf_matrix(norm=None, sublinear_tf=False, smooth_idf=False, add_k=0) >>> assert idf.tolist() == [[0.0, 0.0], [1.0, 0.0]]
The L-1 norm normalizes the results by the sum of the absolute values of their weights. In the case of the count matrix, this is equivalent to creating the frequency matrix:
>>> tfidf_freq_mat = corpus.tfidf_matrix(norm="l1", use_idf=False, sublinear_tf=False) >>> assert np.array_equal(tfidf_freq_mat, corpus.frequency_matrix())
- Parameters
norm (
Optional
[str
]) – Set to ‘l2’ for the L2 norm (square root of the sums of the square weights), ‘l1’ for the l1 norm (the summed absolute value, or None for no normalization).use_idf (
bool
) – If you set this to False, the weights will only include the term frequency (adjusted however you like)smooth_idf (
bool
) – Adds a constant to the numerator and the denominator.sublinear_tf (
bool
) – Computes the term frequency in log space.add_k (
int
) – This adds k to every value in the IDF. scikit adds 1 to all documents, but this allows for more variable computing (e.g. adding 0 if you want to remove words appearing in every document)
- Return type
array
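One hedged follow-up: the resulting matrix plugs straight into get_top_words() to surface the most distinctive word in each document (corpus invented):
>>> corpus = Corpus(["a cat", "a"])
>>> words, scores = corpus.get_top_words(corpus.tfidf_matrix(), top_n=1)
>>> words.shape
(1, 2)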
-
property
vocab
¶ Returns all of the unique words in the index.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab == {"a", "cat", "and", "dog"} True
- Return type
Set
[str
]
-
property
vocab_list
¶ Returns a sorted list of the words appearing in the index.
This is primarily intended for use in matrix or vector functions, where the order of the words matters.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab_list ['a', 'and', 'cat', 'dog']
- Return type
List
[str
]
-
property
vocab_size
¶ Returns the total number of unique words in the corpus.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab_size 4
- Return type
int
-
word_count
(word)[source]¶ Returns the total number of times the word appeared.
Defaults to 0 if the word never appeared.
Example
>>> corpus = Corpus(["this is a document", "a bird and a plane"]) >>> corpus.word_count("document") 1 >>> corpus.word_count("a") 3 >>> corpus.word_count("malarkey") 0
- Parameters
word (
str
) – The string word (or phrase).- Return type
int
-
word_count_vector
()[source]¶ Returns the total number of times each word appeared in the corpus.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.word_count_vector() array([1., 3., 1., 1.])
- Return type
array
-
word_counter
(word)[source]¶ Maps the documents containing a word to the number of times the word appeared.
Examples
>>> corpus = Corpus(["a bird", "a bird and a plane", "two birds"]) >>> corpus.word_counter("a") == {0: 1, 1: 2} True
- Parameters
word (
str
) – The word you’re looking up- Return type
Dict
[int
,int
]- Returns
- A dictionary mapping the document index of the word to the number of times
it appeared in that document.
-
word_freq_vector
()[source]¶ Returns the frequency in which each word appears over the corpus.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.word_freq_vector() array([0.16666667, 0.5 , 0.16666667, 0.16666667])
- Return type
array
-
word_frequency
(word)[source]¶ Returns the frequency in which the word appeared in the corpus.
Example
>>> corpus = Corpus(["this is fun", "or is it"]) >>> np.isclose(corpus.word_frequency("fun"), 1. / 6.) True >>> np.isclose(corpus.word_frequency("is"), 2. / 6.) True
- Parameters
word (
str
) – The string word or phrase.- Return type
float
text_data.multi_corpus module¶
Tools and displays for handling multiple document sets.
These are primarily designed to provide features for merging sets of documents so you can easily compute statistics on them.
-
text_data.multi_corpus.
concatenate
(*indexes, ignore_index=True)[source]¶ Concatenates an arbitrary number of
text_data.index.WordIndex
objects.- Parameters
ignore_index (bool) – If set to True, which is the default, the resulting index has a reset index beginning at 0.
- Raises
ValueError – If ignore_index is set to False and there are overlapping document indexes.
Example
>>> corpus_1 = WordIndex([["example"], ["document"]]) >>> corpus_2 = WordIndex([["second"], ["document"]]) >>> corpus_3 = WordIndex([["third"], ["document"]]) >>> concatenate().most_common() [] >>> concatenate(corpus_1).most_common() [('document', 1), ('example', 1)] >>> concatenate(corpus_1, corpus_2).most_common() [('document', 2), ('example', 1), ('second', 1)] >>> concatenate(corpus_1, corpus_2, corpus_3).most_common() [('document', 3), ('example', 1), ('second', 1), ('third', 1)]
- Return type
WordIndex
-
text_data.multi_corpus.
flat_concat
(*indexes)[source]¶ This flattens a sequence of
text_data.index.WordIndex
objects and concatenates them.This does not preserve any information about
text_data.index.Corpus
objects.Example
>>> corpus_1 = WordIndex([["example"], ["document"]]) >>> corpus_2 = WordIndex([["another"], ["set"], ["of"], ["documents"]]) >>> len(corpus_1) 2 >>> len(corpus_2) 4 >>> len(concatenate(corpus_1, corpus_2)) 6 >>> len(flat_concat(corpus_1, corpus_2)) 2
- Parameters
indexes (
WordIndex
) – A sequence oftext_data.index.Corpus
ortext_data.index.WordIndex
objects.- Return type
WordIndex
text_data.query module¶
This builds and runs search queries for text_data.index.Corpus
.
For the most part, you won’t be using this directly. Instead, you’ll likely
be using text_data.index.Corpus
. However, viewing the __repr__
for the query you’re running can be helpful for debugging or validating
queries.
-
class
text_data.query.
Query
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Bases:
object
Represents a query. This is used internally by
text_data.index.Corpus
to handle searching.The basic formula for writing queries should be familiar; all of the queries are simple boolean phrases. But here are more complete specifications:
In order to search for places where two words appeared, you simply need to type the two words:
Query("i am")
Searches using this query will look for documents where the words “i” and “am” both appeared. To have them look for places where either word appeared, use an “OR” query:
Query("i OR am")
Alternatively, you can look for documents where one word occurred but the other didn’t using a NOT query:
Query("i NOT am")
To search for places where the phrase “i am” appeared, use quotes:
Query("'i am'")
You can use AND queries to limit the results of previous sets of queries. For instance:
Query("i OR am AND you")
will find places where “you” and either “I” or “am” appeared.
In order to search for the literal words ‘AND’, ‘OR’, or ‘NOT’, you must encapsulate them in quotes:
Query("'AND'")
Finally, you may customize the way your queries are parsed by passing a tokenizer. By default,
Query
identifies strings of text that it needs to split and usesstr.split
to split the strings. But you can change how to split the text, which can be helpful/necessary if the words you’re searching for have spaces in them. For instance, this will split the words you’re querying by spaces, unless the words are ‘united states’:>>> import re >>> us_phrase = re.compile(r"(united states|\S+)") >>> Query("he is from the united states", query_tokenizer=us_phrase.findall) <Query ([[QueryItem(words=['he', 'is', 'from', 'the', 'united states'], exact=False, modifier='OR')]])>
- Parameters
query_string (
str
) – The human-readable queryquery_tokenizer (
Callable
[[str
],List
[str
]]) – A function to tokenize phrases in the query (Defaults to string.split). Note: This specifically tokenizes individual phrases in the query. As a result, the function does not need to handle quotations.
-
class
text_data.query.
QueryItem
(words, exact, modifier)¶ Bases:
tuple
This represents a set of words you want to search for.
Each query item has attached to it a set of words, an indicator of whether the query terms form an exact phrase (i.e. whether the order matters), and the kind of boolean query (AND, OR, or NOT) being performed.
- Parameters
words (List[str]) – A list of words representing all of the words that will be searched for.
exact (bool) – Whether the search terms are part of an exact phrase match
modifier (str) – The boolean query (AND, OR, or NOT)
-
exact
¶ Alias for field number 1
-
modifier
¶ Alias for field number 2
-
words
¶ Alias for field number 0
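Since QueryItem is a namedtuple, you can build one directly and read its fields by name; this is just a sketch of the documented field layout (words, exact, modifier):
>>> from text_data.query import QueryItem
>>> item = QueryItem(words=["united", "states"], exact=True, modifier="OR")
>>> item.words
['united', 'states']
>>> item.exact
True
>>> item.modifier
'OR'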
text_data.tokenize module¶
This is a module for tokenizing data.
The primary motivation behind this module is that effectively
presenting search results revolves around knowing the positions
of the words prior to tokenization. In order to handle these raw
positions, the index that text_data.index.Corpus
uses stores the
original character-level positions of words.
This module offers a default tokenizer that you can use
for text_data.index.Corpus
. However, you’ll likely need to customize
them for most applications. That said, doing so should not be difficult.
One of the functions in this module, corpus_tokenizer()
,
is designed specifically to create tokenizers that can be used
directly by text_data.index.Corpus
. All you have to do
is create a regular expression that splits words from nonwords
and then create a series of postprocessing functions to clean the
text (including, optionally, removing tokens). If possible,
I would recommend taking this approach, since it allows you
to mostly ignore the picky preferences of the underlying API.
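For instance, a tokenizer built with corpus_tokenizer() (documented below) can be passed straight to a Corpus; the output shown here is what you would expect from a lowercasing, word-splitting tokenizer, and is an illustration rather than a verbatim doctest from the library:
>>> from text_data import Corpus
>>> from text_data.tokenize import corpus_tokenizer
>>> lowercase_words = corpus_tokenizer(r"\w+", [str.lower])
>>> corpus = Corpus(["A cat and a dog"], tokenizer=lowercase_words)
>>> corpus.most_common(1)
[('a', 2)]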
-
text_data.tokenize.
corpus_tokenizer
(regex_patten, postprocess_funcs, inverse_match=False)[source]¶ This is designed to make it easy to build a custom tokenizer for
text_data.index.Corpus
.It acts as a combination of
tokenize_regex_positions()
andpostprocess_positions()
, making it simple to create tokenizers fortext_data.index.Corpus
.In other words, if you pass the tokenizer a regular expression pattern, set
inverse_match
as you would fortokenize_regex_positions()
, and add a list of postprocessing functions as you would forpostprocess_positions()
, this tokenizer will return a function that you can use directly as an argument intext_data.index.Corpus
.Examples
Let’s say that we want to build a tokenizing function that splits on vowels or whitespace. We also want to lowercase all of the remaining words:
>>> split_vowels = corpus_tokenizer(r"[aeiou\s]+", [str.lower], inverse_match=True) >>> split_vowels("Them and you") (['th', 'm', 'nd', 'y'], [(0, 2), (3, 4), (6, 8), (9, 10)])
You can additionally use this function to remove stopwords, although I generally would recommend against it. The postprocessing functions optionally return a string or a
NoneType
, andNone
values simply don’t get tokenized:>>> skip_stopwords = corpus_tokenizer(r"\w+", [lambda x: x if x != "the" else None]) >>> skip_stopwords("I ran to the store") (['I', 'ran', 'to', 'store'], [(0, 1), (2, 5), (6, 8), (13, 18)])
- Return type
Callable
[[str
],Tuple
[List
[str
],List
[Tuple
[int
,int
]]]]
-
text_data.tokenize.
default_tokenizer
(document: str) → Tuple[List[str], List[Tuple[int, int]]]¶ This is the default tokenizer for
text_data.index.Corpus
.It simply splits on words (
"\w+"
) and lowercases words.
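Based on that documented behavior, you would expect output along these lines (the positions follow the same start and end character convention as tokenize_regex_positions()):
>>> from text_data.tokenize import default_tokenizer
>>> default_tokenizer("Hello, world")
(['hello', 'world'], [(0, 5), (7, 12)])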
-
text_data.tokenize.
postprocess_positions
(postprocess_funcs, tokenize_func, document)[source]¶ Runs postprocessing functions to produce final tokenized documents.
This function allows you to take
tokenize_regex_positions()
(or something that has a similar function signature) and run postprocessing on it. It requires that you also give it a document, which it will tokenize using the tokenizing function you give it.These postprocessing functions should take a string (i.e. one of the individual tokens), but they can return either a string or None. If they return None, the token will not appear in the final tokenized result.
- Parameters
postprocess_funcs (
List
[Callable
[[str
],Optional
[str
]]]) – A list of postprocessing functions (e.g.str.lower
)tokenize_func (
Callable
[[str
],Tuple
[List
[str
],List
[Tuple
[int
,int
]]]]) – A function that takes raw text and converts it into a list of strings and a list of character-level positions (e.g. the output oftext_data.tokenize.tokenize_regex_positions()
)document (
str
) – The (single) text you want to tokenize.tokenized_docs – The tokenized results (e.g. the output of
text_data.tokenize.tokenize_regex_positions()
)
- Return type
Tuple
[List
[str
],List
[Tuple
[int
,int
]]]
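As a rough sketch of how the pieces fit together (mirroring the way the default Corpus tokenizer is assembled), you might pair it with tokenize_regex_positions() like this; the exact output is inferred from the documented behavior:
>>> import functools
>>> from text_data.tokenize import postprocess_positions, tokenize_regex_positions
>>> split_words = functools.partial(tokenize_regex_positions, r"\w+")
>>> postprocess_positions([str.lower], split_words, "Hello World")
(['hello', 'world'], [(0, 5), (6, 11)])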
-
text_data.tokenize.
tokenize_regex_positions
(pattern, document_text, inverse_match=False)[source]¶ Finds all of the tokens matching a regular expression.
Returns the positions of those tokens along with the tokens themselves.
- Parameters
pattern (
str
) – A raw regular expression stringdocument_text (
str
) – The raw document textinverse_match (
bool
) – If true, tokenizes the text between matches.
- Return type
Tuple
[List
[str
],List
[Tuple
[int
,int
]]]- Returns
A tuple consisting of the list of words and a list of tuples, where each tuple represents the start and end character positions of the phrase.
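A small illustration of the expected output (note that no postprocessing happens here, so the original casing is preserved):
>>> from text_data.tokenize import tokenize_regex_positions
>>> tokenize_regex_positions(r"\w+", "The cat naps.")
(['The', 'cat', 'naps'], [(0, 3), (4, 7), (8, 12)])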
Module contents¶
Top-level package for Text Data.
-
class
text_data.
Corpus
(documents, tokenizer=functools.partial(<function postprocess_positions>, [<method 'lower' of 'str' objects>, ]functools.partial(<function tokenize_regex_positions>, '\\\\w+', inverse_match=False)), sep=None, prefix=None, suffix=None)[source]¶ Bases:
text_data.index.WordIndex
This is probably going to be your main entrypoint into
text_data
.The corpus holds the raw text, the index, and the tokenized text of whatever you’re trying to analyze. Its primary role is to extend the functionality of
WordIndex
to support searching. This means that you can use theCorpus
to search for arbitrarily long phrases using boolean search methods (AND, OR, NOT).In addition, it allows you to add indexes so you can calculate statistics on phrases. By using
add_ngram_index()
, you can figure out the frequency or TF-IDF values of multi-word phrases while still being able to search through your normal index.Initializing Data
To instantiate the corpus, you need to include a list of documents where each document is a string of text and a tokenizer. There is a default tokenizer, which simply lowercases words and splits documents on
r"\w+"
. For most tasks, this will be insufficient. Buttext_data.tokenize
offers convenient ways that should make building the vast majority of tokenizers easy.The
Corpus
can be instantiated using__init__
or by usingchunks()
, which yields a generator of Corpus objects, each holding a mini-index. This lets you work with document sets that are too large to comfortably fit in memory, one chunk at a time.You can also initialize a
Corpus
object by using theslice()
,copy()
,split_off()
, orconcatenate()
methods. These methods work identically to their equivalent methods intext_data.index.WordIndex
while updating extra data that the corpus has, updating n-gram indexes, and automatically re-indexing the corpus.Updating Data
There are two methods for updating or adding data to the
Corpus
.update()
allows you to add new documents to the corpus.add_ngram_index()
allows you to add multi-word indexes.Searching
There are a few methods devoted to searching.
search_documents()
allows you to find all of the individual documents matching a query.search_occurrences()
shows all of the individual occurrences that matched your query.ranked_search()
finds all of the individual occurrences and sorts them according to a variant of their TF-IDF score.Statistics
Three methods allow you to get statistics about a search.
search_document_count()
allows you to find the total number of documents matching your query.search_document_freq()
shows the proportion of documents matching your query. Andsearch_occurrence_count()
finds the total number of matches you have for your query.Display
There are a number of functions designed to help you visually see the results of your query.
display_document()
anddisplay_documents()
render your documents in HTML.display_document_count()
,display_document_frequency()
, anddisplay_occurrence_count()
all render bar charts showing the number of query results you got. Anddisplay_search_results()
shows the result of your search.-
documents
¶ A list of all the raw, non-tokenized documents in the corpus.
-
tokenizer
¶ A function that converts a list of strings (one of the documents from documents into a list of words and a list of the character-level positions where the words are located in the raw text). See
text_data.tokenize
for details.
-
tokenized_documents
¶ A list of the tokenized documents (each a list of words)
-
ngram_indexes
¶ A list of
WordIndex
objects for multi-word (n-gram) indexes. Seeadd_ngram_index()
for details.
-
ngram_sep
¶ A separator in between words. See
add_ngram_index()
for details.
-
ngram_prefix
¶ A prefix to go before any n-gram phrases. See
add_ngram_index()
for details.
-
ngram_suffix
¶ A suffix to go after any n-gram phrases. See
add_ngram_index()
for details.
- Parameters
documents (
List
[str
]) – A list of the raw, un-tokenized texts.tokenizer (
Callable
[[str
],Tuple
[List
[str
],List
[Tuple
[int
,int
]]]]) – A function to tokenize the documents. Seetext_data.tokenize
for details.sep (
Optional
[str
]) – The separator you want to use for computing n-grams. Seeadd_ngram_index()
for details.prefix (
Optional
[str
]) – The prefix you want to use for n-grams. Seeadd_ngram_index()
for details.suffix (
Optional
[str
]) – The suffix you want to use for n-grams. Seeadd_ngram_index()
for details.
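Putting the pieces described above together, a minimal end-to-end sketch (using only methods documented on this page; the outputs are what the documented behavior implies rather than copied doctests) might look like:
>>> from text_data import Corpus
>>> corpus = Corpus(["the cat sat", "the dog sat", "a bird flew"])
>>> assert corpus.search_documents("sat") == {0, 1}
>>> corpus.search_document_count("the OR a")
3
>>> corpus.most_common(2)
[('sat', 2), ('the', 2)]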
-
add_documents
(tokenized_documents, indexed_locations=None)[source]¶ This overrides the
add_documents()
method.Because
Corpus()
objects can have n-gram indices, simply runningadd_documents
would cause n-gram indices to go out of sync with the overall corpus. In order to prevent that, this function raises an error if you try to run it.- Raises
NotImplementedError – Warns you to use
text_data.index.Corpus.update()
instead.
-
add_ngram_index
(n=1, default=True, sep=None, prefix=None, suffix=None)[source]¶ Adds an n-gram index to the corpus.
This creates a
WordIndex
object that you can access by typingself.ngram_indexes[n]
.There are times when you might want to compute TF-IDF scores, word frequency scores or similar scores over a multi-word index. For instance, you might want to know how frequently someone said ‘United States’ in a speech, without caring how often they used the word ‘united’ or ‘states’.
This function helps you do that. It automatically splits up your documents into an overlapping set of
n
-length phrases.Internally, this takes each of your tokenized documents, merges them into lists of
n
-length phrases, and joins each of those lists by a space. However, you can customize this behavior. If you setprefix
, each of the n-grams will be prefixed by that string; if you setsuffix
, each of the n-grams will end with that string. And if you setsep
, each of the words in the n-gram will be separated by the separator.Example
Say you have a simple four word corpus. If you use the default settings, here’s what your n-grams will look like:
>>> corpus = Corpus(["text data is fun"]) >>> corpus.add_ngram_index(n=2) >>> corpus.ngram_indexes[2].vocab_list ['data is', 'is fun', 'text data']
By altering
sep
,prefix
, orsuffix
, you can alter that behavior. But, be careful to setdefault
toFalse
if you want to change the behavior from something you set up in__init__
. If you don’t, this will use whatever settings you instantiated the class with.>>> corpus.add_ngram_index(n=2, sep="</w><w>", prefix="<w>", suffix="</w>", default=False) >>> corpus.ngram_indexes[2].vocab_list ['<w>data</w><w>is</w>', '<w>is</w><w>fun</w>', '<w>text</w><w>data</w>']
- Parameters
n (
int
) – The number of n-grams (defaults to unigrams)default (
bool
) – If true, will keep the values stored in init (including defaults)sep (
Optional
[str
]) – The separator in between words (if storing n-grams)prefix (
Optional
[str
]) – The prefix before the first word of each n-gramsuffix (
Optional
[str
]) – The suffix after the last word of each n-gram
-
classmethod
chunks
(documents, tokenizer=functools.partial(<function postprocess_positions>, [<method 'lower' of 'str' objects>, ]functools.partial(<function tokenize_regex_positions>, '\\\\w+', inverse_match=False)), sep=None, prefix=None, suffix=None, chunksize=1000000)[source]¶ Iterates through documents, yielding a
Corpus
withchunksize
documents.This is designed to allow you to technically use
Corpus
on large document sets. However, you should note that searching for documents will only work within the context of the current chunk.The same is true for any frequency metrics. As such, you should probably limit metrics to raw counts or aggregations you’ve derived from raw counts.
Example
>>> for docs in Corpus.chunks(["chunk one", "chunk two"], chunksize=1): ... print(len(docs)) 1 1
- Parameters
documents (
Iterator
[str
]) – A list of raw text items (not tokenized)tokenizer (
Callable
[[str
],Tuple
[List
[str
],List
[Tuple
[int
,int
]]]]) – A function to tokenize the documentssep (
Optional
[str
]) – The separator you want to use for computing n-grams.prefix (
Optional
[str
]) – The prefix for n-grams.suffix (
Optional
[str
]) – The suffix for n-grams.chunksize (
int
) – The number of documents in each chunk.
- Return type
Generator
[~CorpusClass,None
,None
]
-
concatenate
(other)[source]¶ This combines two
Corpus
objects into one, much liketext_data.index.WordIndex.concatenate()
.However, the new
Corpus
has data from this corpus, including n-gram data. Because of this, the twoCorpus
objects must have the same keys for their n-gram dictionaries.Example
>>> corpus_1 = Corpus(["i am an example"]) >>> corpus_2 = Corpus(["i am too"]) >>> corpus_1.add_ngram_index(n=2) >>> corpus_2.add_ngram_index(n=2) >>> combined_corpus = corpus_1.concatenate(corpus_2) >>> combined_corpus.most_common() [('am', 2), ('i', 2), ('an', 1), ('example', 1), ('too', 1)] >>> combined_corpus.ngram_indexes[2].most_common() [('i am', 2), ('am an', 1), ('am too', 1), ('an example', 1)]
-
copy
()[source]¶ This creates a shallow copy of a
Corpus
object.It extends the contents of
Corpus
to also store data about the objects themselves.- Return type
Corpus
-
display_document
(doc_idx)[source]¶ Print an entire document, given its index.
- Parameters
doc_idx (
int
) – The index of the document- Return type
HTML
-
display_document_count
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Returns a bar chart (in altair) showing the queries with the largest number of documents.
Note
This method requires that you have
altair
installed. To install, typepip install text_data[display]
orpoetry add text_data -E display
.- Parameters
queries (
List
[str
]) – A list of queries (in the same form you use to search for things)query_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the query
-
display_document_frequency
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Displays a bar chart showing the percentages of documents with a given query.
Note
This method requires that you have
altair
installed. To install, typepip install text_data[display]
orpoetry add text_data -E display
.- Parameters
queries (
List
[str
]) – A list of queriesquery_tokenizer (
Callable
[[str
],List
[str
]]) – A tokenizer for each query
-
display_documents
(documents)[source]¶ Display a number of documents, at the specified indexes.
- Parameters
documents (
List
[int
]) – A list of document indexes.- Return type
HTML
-
display_occurrence_count
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Display a bar chart showing the number of times a query matches.
Note
This method requires that you have
altair
installed. To install, typepip install text_data[display]
orpoetry add text_data -E display
.- Parameters
queries (
List
[str
]) – A list of queriesquery_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the query
-
display_search_results
(search_query, query_tokenizer=<method 'split' of 'str' objects>, max_results=None, window_size=None)[source]¶ Shows the results of a ranked query.
This function runs a query and then renders the result in human-readable HTML. For each result, you will get a document ID and the count of the result.
In addition, all of the matching occurrences of phrases or words you searched for will be highlighted in bold. You can optionally decide how many results you want to return and how long you want each result to be (up to the length of the whole document).
- Parameters
search_query (
str
) – The query you’re searching forquery_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the querymax_results (
Optional
[int
]) – The maximum number of results. If None, returns all results.window_size (
Optional
[int
]) – The number of characters you want to return around the matching phrase. If None, returns the entire document.
- Return type
HTML
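A minimal call sketch (the method returns an HTML object, so there is no textual output to show here; in a Jupyter notebook, the rendered result would show the matching windows with the query terms in bold):
>>> from text_data import Corpus
>>> corpus = Corpus(["the cat sat on the mat", "a cat napped"])
>>> html = corpus.display_search_results("cat", max_results=1, window_size=15)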
-
ranked_search
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ This produces a list of search responses in ranked order.
More specifically, the documents are ranked in order of the sum of the TF-IDF scores for each word in the query (with the exception of words that are negated using a NOT operator).
To compute the TF-IDF scores, I simply have computed the dot products between the raw query counts and the TF-IDF scores of all the unique words in the query. This is roughly equivalent to the
ltn.lnn
normalization scheme described in Manning. (The catch is that I have normalized the term-frequencies in the document to the length of the document.)Each item in the resulting list is a list referring to a single item. The items inside each of those lists are of the same format you get from
search_occurrences()
. The first item in each list is either an item having the largest number of words in it or is the item that’s the nearest to another match within the document.- Parameters
query_string (str) – The query string
query_tokenizer (
Callable
[[str
],List
[str
]]) – Function for tokenizing the results.
- Return type
List
[List
[PositionResult
]]- Returns
A list of tuples, each in the same format as
search_occurrences()
.
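As a hedged sketch of how you might inspect the ranking: under the normalization described above, the document where the query term makes up a larger share of the text should come first, so you would expect something like:
>>> from text_data import Corpus
>>> corpus = Corpus(["the cat is here", "a cat and another cat", "no cats here"])
>>> [matches[0].doc_id for matches in corpus.ranked_search("cat")]
[1, 0]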
-
search_document_count
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the total number of documents matching a query.
By entering a search, you can get the total number of documents that match the query.
Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_document_count("cow") 2 >>> corpus.search_document_count("grass") 1 >>> corpus.search_document_count("the") 2
- Parameters
query_string (
str
) – The query you’re searching forquery_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the query
- Return type
int
-
search_document_freq
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the percentage of documents that match a query.
Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_document_freq("cow") 1.0 >>> corpus.search_document_freq("grass") 0.5 >>> corpus.search_document_freq("the grass") 0.5 >>> corpus.search_document_freq("the OR nonsense") 1.0
- Parameters
query_string (
str
) – The query you’re searching forquery_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the query
- Return type
float
-
search_documents
(query, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Search documents from a query.
In order to figure out the intricacies of writing queries, you should view
text_data.query.Query
. In general, standard boolean (AND, OR, NOT) searches work perfectly reasonably. You should generally not need to setquery_tokenizer
to anything other than the default (string split).This produces a set of unique documents, where each document is the index of the document. To view the documents by their ranked importance (ranked largely using TF-IDF), use
ranked_search()
.Example
>>> corpus = Corpus(["this is an example", "here is another"]) >>> assert corpus.search_documents("is") == {0, 1} >>> assert corpus.search_documents("example") == {0}
- Parameters
query (
str
) – A string boolean query (as defined intext_data.query.Query
)query_tokenizer (
Callable
[[str
],List
[str
]]) – A function to tokenize the words in your query. This allows you to optionally search for words in your index that include spaces (since it defaults to string.split).
- Return type
Set
[int
]
-
search_occurrence_count
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the total number of occurrences you have for the given query.
This just gets the number of items in
search_occurrences()
. As a result, searching for occurrences where two separate words occur will find the total number of places where either word occurs within the set of documents where both words appear.Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_occurrence_count("the") 3 >>> corpus.search_occurrence_count("the cow") 5 >>> corpus.search_occurrence_count("'the cow'") 2
- Parameters
query_string (
str
) – The query you’re searching forquery_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the query
- Return type
int
-
search_occurrences
(query, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Search for matching positions within a search.
This allows you to figure out all of the occurrences matching your query. In addition, this is used internally to display search results.
Each matching position comes in the form of a tuple where the first field
doc_id
refers to the position of the document, the second fieldfirst_idx
refers to the starting index of the occurrence (among the tokenized documents),last_idx
refers to the last index of the occurrence,raw_start
refers to the starting index of the occurrence from within the raw, non-tokenized documents.raw_end
refers to the index after the last character of the matching result within the non-tokenized documents. There is not really a reason behind this decision.Example
>>> corpus = Corpus(["this is fun"]) >>> result = list(corpus.search_occurrences("'this is'"))[0] >>> result PositionResult(doc_id=0, first_idx=0, last_idx=1, raw_start=0, raw_end=7) >>> corpus.documents[result.doc_id][result.raw_start:result.raw_end] 'this is' >>> corpus.tokenized_documents[result.doc_id][result.first_idx:result.last_idx+1] ['this', 'is']
- Parameters
query (
str
) – The string query. Seetext_data.query.Query
for details.query_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizing function for the query. Seetext_data.query.Query
orsearch_documents()
for details.
- Return type
Set
[PositionResult
]
-
slice
(indexes)[source]¶ This creates a
Corpus
object only including the documents listed.This overrides the method in
text_data.index.WordIndex()
, which does the same thing (but without making changes to the underlying document set). This also creates slices of any of the n-gram indexes you have created.Note
This also changes the indexes for the new corpus so they all go from 0 to
len(indexes)
.- Parameters
indexes (
Set
[int
]) – A set of document indexes you want to have in the new index.
Example
>>> corpus = Corpus(["example document", "another example", "yet another"]) >>> corpus.add_ngram_index(n=2) >>> sliced_corpus = corpus.slice({1}) >>> len(sliced_corpus) 1 >>> sliced_corpus.most_common() [('another', 1), ('example', 1)] >>> sliced_corpus.ngram_indexes[2].most_common() [('another example', 1)]
- Return type
Corpus
-
split_off
(indexes)[source]¶ This operates like
split_off()
.But it additionally maintains the state of the
Corpus
data, similar to howslice()
works.Example
>>> corpus = Corpus(["i am an example", "so am i"]) >>> sliced_data = corpus.split_off({0}) >>> corpus.documents ['so am i'] >>> sliced_data.documents ['i am an example'] >>> corpus.most_common() [('am', 1), ('i', 1), ('so', 1)]
- Return type
Corpus
-
to_index
()[source]¶ Converts a
Corpus
object into aWordIndex
object.Corpus
objects are convenient because they allow you to search across documents, in addition to computing statistics about them. But sometimes, you don’t need that, and the added convenience comes with extra memory requirements.- Return type
-
-
class
text_data.
WordIndex
(tokenized_documents, indexed_locations=None)[source]¶ Bases:
object
An inverted, positional index containing the words in a corpus.
This is designed to allow people to be able to quickly compute statistics about the language used across a corpus. The class offers a couple of broad strategies for understanding the ways in which words are used across documents.
Manipulating Indexes
These functions are designed to allow you to create new indexes based on ones you already have. They operate kind of like slices and filter functions in
pandas
, where your goal is to be able to create new data structures that you can analyze independently from ones you’ve already created. Most of them can also be used with method chaining. However, some of these functions remove positional information from the index, so be careful.copy()
creates an identical copy of aWordIndex
object.slice()
,slice_many()
, andsplit_off()
all take sets of document indexes and create new indexes with only those documents.add_documents()
allows you to add new documents into an existingWordIndex
object.concatenate()
similarly combinesWordIndex
objects into a singleWordIndex
.flatten()
takes aWordIndex
and returns an identical index that only has one document.skip_words()
takes a set of words and returns aWordIndex
that does not have those words.reset_index()
changes the document indexes.
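A short sketch of the method chaining described above, using only the methods just listed (the output follows from the documented behavior of each call):
>>> from text_data import WordIndex
>>> index = WordIndex([["a", "cat"], ["a", "dog"], ["a", "bird"]])
>>> index.slice({0, 1}).skip_words({"a"}).most_common()
[('cat', 1), ('dog', 1)]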
Corpus Information
A number of functions are designed to allow you to look up information about the corpus. For instance, you can collect a sorted list or a set of all the unique words in the corpus. Or you can get a list of the most commonly appearing elements:
vocab
andvocab_list
both return the unique words or phrases appearing in the index.vocab_size
gets the number of unique words in the index.num_words
gets the total number of words in the index.doc_lengths
gets a dictionary mapping documents to the number of tokens, or words, they contain.
Word Statistics
These allow you to gather statistics about single words or about word, document pairs. For instance, you can see how many words there are in the corpus, how many unique words there are, or how often a particular word appears in a document.
The statistics generally fit into four categories. The first category computes statistics about how often a specific word appears in the corpus as a whole. The second category computes statistics about how often a specific word appears in a specific document. The third and fourth categories echo those first two categories but perform the statistics efficiently across the corpus as a whole, creating 1-dimensional numpy arrays in the case of the word-corpus statistics and 2-dimensional numpy arrays in the case of the word-document statistics. Functions in these latter two categories all end in
_vector
and_matrix
respectively.Here’s how those statistics map to one another:
(Table omitted from this rendering: it pairs each word-corpus statistic with its word-document, vector, and matrix counterparts, for instance word_count, term_count, word_count_vector, and count_matrix; it also lists __contains__ alongside the other per-word lookups documented below.)
In the case of the vector and matrix calculations, the arrays represent the unique words of the vocabulary, presented in sorted order. As a result, you can safely run element-wise calculations over the matrices.
In addition to the term vector and term-document matrix functions, there is
get_top_words()
, which is designed to allow you to find the highest or lowest scores and their associated words along any term vector or term-document matrix you please.Note
For the most part, you will not want to instantiate
WordIndex
directly. Instead, you will likely useCorpus
, which subclassesWordIndex
.That’s because
Corpus
offers utilities for searching through documents. In addition, with the help of tools fromtext_data.tokenize
, instantiatingCorpus
objects is a bit simpler than instantiatingWordIndex
objects directly.I particularly recommend that you do not instantiate the
indexed_locations
directly (i.e. outside ofCorpus
). The only way you can do anything withindexed_locations
from outside ofCorpus
is by using an internal attribute and hacking through poorly documented Rust code.- Parameters
tokenized_documents (
List
[List
[str
]]) – A list of documents where each document is a list of words.indexed_locations (
Optional
[List
[Tuple
[int
,int
]]]) – A list of documents where each document contains a list of the start and end positions of the words in tokenized_documents
.
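For instance, instantiating the index directly from pre-tokenized documents (a sketch; the values follow from the properties documented below):
>>> from text_data import WordIndex
>>> index = WordIndex([["a", "cat"], ["a", "dog"]])
>>> index.vocab_size
3
>>> index.doc_lengths == {0: 2, 1: 2}
True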
-
add_documents
(tokenized_documents, indexed_locations=None)[source]¶ This function updates the index with new documents.
It operates similarly to
text_data.index.Corpus.update()
, taking new documents and mutating the existing one.Example
>>> tokenized_words = ["im just a simple document".split()] >>> index = WordIndex(tokenized_words) >>> len(index) 1 >>> index.num_words 5 >>> index.add_documents(["now im an entire corpus".split()]) >>> len(index) 2 >>> index.num_words 10
-
concatenate
(other, ignore_index=True)[source]¶ Creates a
WordIndex
object with the documents of both this object and the other.See
text_data.multi_corpus.concatenate()
for more details.- Parameters
ignore_index (
bool
) – If set toTrue
, which is the default, the document indexes will be re-indexed starting from 0.- Raises
ValueError – If
ignore_index
is set toFalse
and some of the indexes overlap.- Return type
WordIndex
-
count_matrix
()[source]¶ Returns a matrix showing the number of times each word appeared in each document.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.count_matrix().tolist() == [[0., 1.], [1., 2.], [0., 1.], [0., 1.]] True
- Return type
array
-
doc_contains
(word, document)[source]¶ States whether the given document contains the word.
Example
>>> corpus = Corpus(["words", "more words"]) >>> corpus.doc_contains("more", 0) False >>> corpus.doc_contains("more", 1) True
- Parameters
word (
str
) – The word you’re looking up.document (
int
) – The index of the document.
- Raises
ValueError – If the document you’re looking up doesn’t exist.
- Return type
bool
-
doc_count_vector
()[source]¶ Returns the total number of documents each word appears in.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.doc_count_vector() array([1., 2.])
- Return type
array
-
doc_freq_vector
()[source]¶ Returns the proportion of documents each word appears in.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.doc_freq_vector() array([0.5, 1. ])
- Return type
array
-
property
doc_lengths
¶ Returns a dictionary mapping the document indices to their lengths.
Example
>>> corpus = Corpus(["a cat and a dog", "a cat", ""]) >>> assert corpus.doc_lengths == {0: 5, 1: 2, 2: 0}
- Return type
Dict
[int
,int
]
-
docs_with_word
(word)[source]¶ Returns a list of all the documents containing a word.
Example
>>> corpus = Corpus(["example document", "another document"]) >>> assert corpus.docs_with_word("document") == {0, 1} >>> assert corpus.docs_with_word("another") == {1}
- Parameters
word (
str
) – The word you’re looking up.- Return type
Set
[int
]
-
document_count
(word)[source]¶ Returns the total number of documents a word appears in.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.document_count("example") 2 >>> corpus.document_count("another") 1
- Parameters
word (
str
) – The word you’re looking up.- Return type
int
-
document_frequency
(word)[source]¶ Returns the percentage of documents that contain a word.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.document_frequency("example") 1.0 >>> corpus.document_frequency("another") 0.5
- Parameters
word (
str
) – The word you’re looking up.- Return type
float
-
flatten
()[source]¶ Flattens a multi-document index into a single-document corpus.
This creates a new
WordIndex
object stripped of any positional information that has a single document in it. However, the list of words and their indexes remain.Example
>>> corpus = Corpus(["i am a document", "so am i"]) >>> len(corpus) 2 >>> flattened = corpus.flatten() >>> len(flattened) 1 >>> assert corpus.most_common() == flattened.most_common()
- Return type
WordIndex
-
frequency_matrix
()[source]¶ Returns a matrix showing the frequency of each word appearing in each document.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.frequency_matrix().tolist() == [[0.0, 0.2], [1.0, 0.4], [0.0, 0.2], [0.0, 0.2]] True
- Return type
array
-
get_top_words
(term_matrix, top_n=None, reverse=True)[source]¶ Get the top values along a term matrix.
Given a matrix where each row represents a word in your vocabulary, this returns a numpy matrix of those top values, along with an array of their respective words.
You can choose the number of results you want to get by setting
top_n
to some positive value, or you can leave it be and return all of the results in sorted order. Additionally, by settingreverse
to False (instead of its default ofTrue
), you can return the scores from smallest to largest.- Parameters
term_matrix (
array
) – a matrix of floats where each row represents a wordtop_n (
Optional
[int
]) – The number of values you want to return. If None, returns all values.reverse (
bool
) – If true (the default), returns the N values with the highest scores. If false, returns the N values with the lowest scores.
- Return type
Tuple
[array
,array
]- Returns
A tuple of 2-dimensional numpy arrays, where the first item is an array of the top-scoring words and the second item is an array of the top scores themselves. Both arrays are of the same size, that is
min(self.vocab_size, top_n)
by the number of columns in the term matrix.- Raises
ValueError – If
top_n
is less than 1, if there are not the same number of rows in the matrix as there are unique words in the index, or if the numpy array doesn’t have 1 or 2 dimensions.
Example
The first thing you need to do in order to use this function is create a 1- or 2-dimensional term matrix, where the number of rows corresponds to the number of unique words in the corpus. Any of the functions within
WordIndex
that ends in_matrix(**kwargs)
(for 2-dimensional arrays) or_vector(**kwargs)
(for 1-dimensional arrays) will do the trick here. I’ll show an example with both a word count vector and a word count matrix:>>> corpus = Corpus(["The cat is near the birds", "The birds are distressed"]) >>> corpus.get_top_words(corpus.word_count_vector(), top_n=2) (array(['the', 'birds'], dtype='<U10'), array([3., 2.])) >>> corpus.get_top_words(corpus.count_matrix(), top_n=1) (array([['the', 'the']], dtype='<U10'), array([[2., 1.]]))
Similarly, you can return the scores from lowest to highest by setting
reverse=False
. (This is not the default.):>>> corpus.get_top_words(-1. * corpus.word_count_vector(), top_n=2, reverse=False) (array(['the', 'birds'], dtype='<U10'), array([-3., -2.]))
-
idf
(word)[source]¶ Returns the inverse document frequency.
If the number of documents in your
WordIndex
index
is \(N\) and the number of documents containing the word (see document_count())
is \(df\), the inverse document frequency is \(\frac{N}{df}\).Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.idf("example") 1.0 >>> corpus.idf("another") 2.0
- Parameters
word (
str
) – The word you’re looking for.- Return type
float
-
idf_vector
()[source]¶ Returns the inverse document frequency vector.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.idf_vector() array([2., 1.])
- Return type
array
-
max_word_count
()[source]¶ Returns the most common word and the number of times it appeared in the corpus.
Returns
None
if there are no words in the corpus.Example
>>> corpus = Corpus([]) >>> corpus.max_word_count() is None True >>> corpus.update(["a bird a plane superman"]) >>> corpus.max_word_count() ('a', 2)
- Return type
Optional
[Tuple
[str
,int
]]
-
most_common
(num_words=None)[source]¶ Returns the most common items.
This is nearly identical to
collections.Counter.most_common
. However, unlike collections.Counter.most_common, the values that are returned appear in alphabetical order.Example
>>> corpus = Corpus(["i walked to the zoo", "i bought a zoo"]) >>> corpus.most_common() [('i', 2), ('zoo', 2), ('a', 1), ('bought', 1), ('the', 1), ('to', 1), ('walked', 1)] >>> corpus.most_common(2) [('i', 2), ('zoo', 2)]
- Parameters
num_words (
Optional
[int
]) – The number of words you return. If you enter None or you enter a number larger than the total number of words, it returns all of the words, in sorted order from most common to least common.- Return type
List
[Tuple
[str
,int
]]
-
property
num_words
¶ Returns the total number of words in the corpus (not just unique).
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.num_words 5
- Return type
int
-
odds_document
(word, document, sublinear=False)[source]¶ Returns the odds of finding a word in a document.
This is the equivalent of
odds_word()
. But insteasd of calculating items at the word-corpus level, the calculations are performed at the word-document level.Example
>>> corpus = Corpus(["this is a document", "document two"]) >>> corpus.odds_document("document", 1) 1.0 >>> corpus.odds_document("document", 1, sublinear=True) 0.0
- Parameters
word (
str
) – The word you’re looking updocument (
int
) – The index of the documentsublinear (
bool
) – IfTrue
, returns the log-odds of finding the word in the document.
- Raises
ValueError – If the document doesn’t exist.
- Return type
float
-
odds_matrix
(sublinear=False, add_k=None)[source]¶ Returns the odds of finding a word in a document for every possible word-document pair.
Because not all words are likely to appear in all of the documents, this implementation adds
1
to all of the numerators before taking the frequencies. So\(O(w) = \frac{c_{i} + 1}{N + \vert V \vert}\)
where \(\vert V \vert\) is the total number of unique words in each document, \(N\) is the total number of total words in each document, and \(c_i\) is the count of a word in a document.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.odds_matrix() array([[0.33333333, 1. ], [1. , 0.33333333], [1. , 1. ]]) >>> corpus.odds_matrix(sublinear=True) array([[-1.5849625, 0. ], [ 0. , -1.5849625], [ 0. , 0. ]])
- Parameters
sublinear (
bool
) – IfTrue
, computes the log-odds.add_k (
Optional
[float
]) – This addsk
to each of the non-zero elements in the matrix. Since \(\log{1} = 0\), this prevents 50 percent probabilities from appearing to be the same as elements that don’t exist.
- Return type
array
-
odds_vector
(sublinear=False)[source]¶ Returns a vector of the odds of each word appearing at random.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.odds_vector() array([0.2, 1. , 0.2, 0.2]) >>> corpus.odds_vector(sublinear=True) array([-2.32192809, 0. , -2.32192809, -2.32192809])
- Parameters
sublinear (
bool
) – If true, returns the log odds.- Return type
array
-
odds_word
(word, sublinear=False)[source]¶ Returns the odds of seeing a word at random.
In statistics, the odds of something happening are the probability of it happening, versus the probability of it not happening, that is \(\frac{p}{1 - p}\). The “log odds” of something happening — the result of using
self.log_odds_word
— is similarly equivalent to \(log_{2}{\frac{p}{1 - p}}\).(The probability in this case is simply the word frequency.)
Example
>>> corpus = Corpus(["i like odds ratios"]) >>> np.isclose(corpus.odds_word("odds"), 1. / 3.) True >>> np.isclose(corpus.odds_word("odds", sublinear=True), np.log2(1./3.)) True
- Parameters
word (
str
) – The word you’re looking up.sublinear (
bool
) – If true, returns the log odds.
- Return type
float
-
one_hot_matrix
()[source]¶ Returns a matrix showing whether each given word appeared in each document.
For these matrices, all cells contain a floating point value of either a 1., if the word is in that document, or a 0. if the word is not in the document.
These are sometimes referred to as ‘one-hot encoding matrices’ in machine learning.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> np.array_equal( ... corpus.one_hot_matrix(), ... np.array([[0., 1.], [1., 1.], [0., 1.], [0., 1.]]) ... ) True
- Return type
array
-
reset_index
(start_idx=None)[source]¶ An in-place operation that resets the document indexes for this corpus.
When you reset the index, all of the documents change their values, starting at
start_idx
(and incrementing from there). For the most part, you will not need to do this, since most of the library does not give you the option to change the document indexes. However, it may be useful when you’re usingslice()
orsplit_off()
.- Parameters
start_idx (
Optional
[int
]) – The first (lowest) document index you want to set. Values must be positive. Defaults to 0.
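A sketch of what that looks like in practice (the new keys are inferred from the description above):
>>> from text_data import WordIndex
>>> index = WordIndex([["a"], ["b"]])
>>> index.reset_index(start_idx=10)
>>> sorted(index.doc_lengths)
[10, 11]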
-
skip_words
(words)[source]¶ Creates a
WordIndex
without any of the skipped words.This enables you to create an index that does not contain rare words, for example. The index will not have any positions associated with them, so be careful when implementing it on a
text_data.index.Corpus
object.Example
>>> skip_words = {"document"} >>> corpus = Corpus(["example document", "document"]) >>> "document" in corpus True >>> without_document = corpus.skip_words(skip_words) >>> "document" in without_document False
- Return type
WordIndex
-
slice
(indexes)[source]¶ Returns an index that just contains documents from the set of words.
- Parameters
indexes (
Set
[int
]) – A set of index values for the documents.
Example
>>> index = WordIndex([["example"], ["document"], ["another"], ["example"]]) >>> sliced_idx = index.slice({0, 2}) >>> len(sliced_idx) 2 >>> sliced_idx.most_common() [('another', 1), ('example', 1)]
- Return type
WordIndex
-
slice_many
(indexes_list)[source]¶ This operates like
slice()
but creates multipleWordIndex
objects.Example
>>> corpus = Corpus(["example document", "another example", "yet another"]) >>> first, second, third = corpus.slice_many([{0}, {1}, {2}]) >>> first.documents ['example document'] >>> second.documents ['another example'] >>> third.documents ['yet another']
- Parameters
indexes_list (
List
[Set
[int
]]) – A list of sets of indexes. Seetext_data.index.WordIndex.slice()
for details.- Return type
List
[WordIndex
]
-
split_off
(indexes)[source]¶ Returns an index with just a set of documents, while removing them from the index.
- Parameters
indexes (
Set
[int
]) – A set of index values for the documents.
Note
This removes the documents (and their words) from the original index in place. So make sure you want to do that before using this function.
Example
>>> index = WordIndex([["example"], ["document"], ["another"], ["example"]]) >>> split_idx = index.split_off({0, 2}) >>> len(split_idx) 2 >>> len(index) 2 >>> split_idx.most_common() [('another', 1), ('example', 1)] >>> index.most_common() [('document', 1), ('example', 1)]
- Return type
WordIndex
-
term_count
(word, document)[source]¶ Returns the total number of times a word appeared in a document.
Assuming the document exists, returns 0 if the word does not appear in the document.
Example
>>> corpus = Corpus(["i am just thinking random thoughts", "am i"]) >>> corpus.term_count("random", 0) 1 >>> corpus.term_count("random", 1) 0
- Parameters
word (
str
) – The word you’re looking up.document (
int
) – The index of the document.
- Raises
ValueError – If you selected a document that doesn't exist.
- Return type
int
-
term_frequency
(word, document)[source]¶ Returns the proportion of words in document
document
that areword
.Example
>>> corpus = Corpus(["just coming up with words", "more words"]) >>> np.isclose(corpus.term_frequency("words", 1), 0.5) True >>> np.isclose(corpus.term_frequency("words", 0), 0.2) True
- Parameters
word (
str
) – The word you’re looking updocument (
int
) – The index of the document
- Raises
ValueError – If the document you’re looking up doesn’t exist
- Return type
float
-
tfidf_matrix
(norm='l2', use_idf=True, smooth_idf=False, sublinear_tf=True, add_k=1)[source]¶ This creates a term-document TF-IDF matrix from the index.
In natural language processing, TF-IDF is a mechanism for finding out which words are distinct across documents. It’s used particularly widely in information retrieval, where your goal is to rank documents that you know match a query by how relevant you think they’ll be.
The basic intuition goes like this: If a word appears particularly frequently in a document, it’s probably more relevant to that document than if the word occurred more rarely. But, some words are simply common: If document X uses the word ‘the’ more often than the word ‘idiomatic,’ that really tells you more about the words ‘the’ and ‘idiomatic’ than it does about the document.
TF-IDF tries to balance these two competing interests by taking the ‘term frequency,’ or how often a word appears in the document, and normalizing it by the ‘document frequency,’ or the proportion of documents that contain the word. This has the effect of reducing the weights of common words (and even setting the weights of some very common words to 0 in some implementations).
It should be noted that there are a number of different implementations of TF-IDF. Within information retrieval, TF-IDF is part of the ‘SMART Information Retrieval System’. Although the exact equations can vary considerably, they typically follow the same approach: First, they find some value to represent the frequency of each word in the document. Often (but not always), this is just the raw number of times in which a word appeared in the document. Then, they normalize that based on the document frequency. And finally, they normalize those values based on the length of the document, so that long documents are not weighted more favorably (or less favorably) than shorter documents.
The approach that I have taken to this is shamelessly cribbed from scikit’s TfidfTransformer. Specifically, I’ve allowed for some customization of the specific formula for TF-IDF while not including methods that require access to the raw documents, which would be computationally expensive to perform. This allows for the following options:
You can set the term frequency to either take the raw count of the word in the document (\(c_{t,d}\)) or by using
sublinear_tf=True
and taking \(1 + \log_{2}{c_{t,d}}\)You can skip taking the inverse document frequency \(df^{-1}\) altogether by setting
use_idf=False
or you can smooth the inverse document frequency by settingsmooth_idf=True
. This adds one to the numerator and the denominator. (Note: Because this method is only run on a vocabulary of words that are in the corpus, there can’t be any divide by zero errors, but this allows you to replicate scikit’sTfidfTransformer
.)You can add some number to the logged inverse document frequency by setting
add_k
to something other than 1. This is the only difference between this implementation and scikit's, as scikit automatically sets k
at 1.Finally, you can choose how to normalize the document lengths. By default, this takes the L-2 norm, or \(\sqrt{\sum{w_{i,k}^{2}}}\), where \(w_{i,k}\) is the weight you get from multiplying the term frequency by the inverse document frequency. But you can also set the norm to
'l1'
to get the L1-norm, or \(\sum{\vert w_{i,k} \vert}\). Or you can set it toNone
to avoid doing any document-length normalization at all.
Examples
To get a sense of the different options, let’s start by creating a pure count matrix with this method. To do that, we’ll set
norm=None
so we’re not normalizing by the length of the document,use_idf=False
so we’re not doing anything with the document frequency, andsublinear_tf=False
so we’re not taking the logged counts:>>> corpus = Corpus(["a cat", "a"]) >>> tfidf_count_matrix = corpus.tfidf_matrix(norm=None, use_idf=False, sublinear_tf=False) >>> assert np.array_equal(tfidf_count_matrix, corpus.count_matrix())
In this particular case, setting
sublinear_tf
toTrue
will produce the same result since all of the counts are 1 or 0 and \(\log{1} + 1 = 1\):>>> assert np.array_equal(corpus.tfidf_matrix(norm=None, use_idf=False), tfidf_count_matrix)
Now, we can incorporate the inverse document frequency. Because the word ‘a’ appears in both documents, its inverse document frequency in is 1; the inverse document frequency of ‘cat’ is 2, since ‘cat’ appears in half of the documents. We’re additionally taking the base-2 log of the inverse document frequency and adding 1 to the final result. So we get:
>>> idf_add_1 = corpus.tfidf_matrix(norm=None, sublinear_tf=False, smooth_idf=False) >>> assert idf_add_1.tolist() == [[1., 1.], [2.,0.]]
Or we can add nothing to the logged values:
>>> idf = corpus.tfidf_matrix(norm=None, sublinear_tf=False, smooth_idf=False, add_k=0) >>> assert idf.tolist() == [[0.0, 0.0], [1.0, 0.0]]
The L-1 norm normalizes the results by the sum of the absolute values of their weights. In the case of the count matrix, this is equivalent to creating the frequency matrix:
>>> tfidf_freq_mat = corpus.tfidf_matrix(norm="l1", use_idf=False, sublinear_tf=False) >>> assert np.array_equal(tfidf_freq_mat, corpus.frequency_matrix())
- Parameters
norm (
Optional
[str
]) – Set to ‘l2’ for the L2 norm (square root of the sums of the square weights), ‘l1’ for the l1 norm (the summed absolute value, or None for no normalization).use_idf (
bool
) – If you set this to False, the weights will only include the term frequency (adjusted however you like)smooth_idf (
bool
) – Adds a constant to the numerator and the denominator.sublinear_tf (
bool
) – Computes the term frequency in log space.add_k (
int
) – This adds k to every value in the IDF. scikit adds 1 to all documents, but this allows for more variable computing (e.g. adding 0 if you want to remove words appearing in every document)
- Return type
array
-
property
vocab
¶ Returns all of the unique words in the index.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab == {"a", "cat", "and", "dog"} True
- Return type
Set
[str
]
-
property
vocab_list
¶ Returns a sorted list of the words appearing in the index.
This is primarily intended for use in matrix or vector functions, where the order of the words matters.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab_list ['a', 'and', 'cat', 'dog']
- Return type
List
[str
]
-
property
vocab_size
¶ Returns the total number of unique words in the corpus.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab_size 4
- Return type
int
-
word_count
(word)[source]¶ Returns the total number of times the word appeared.
Defaults to 0 if the word never appeared.
Example
>>> corpus = Corpus(["this is a document", "a bird and a plane"]) >>> corpus.word_count("document") 1 >>> corpus.word_count("a") 3 >>> corpus.word_count("malarkey") 0
- Parameters
word (
str
) – The string word (or phrase).- Return type
int
-
word_count_vector
()[source]¶ Returns the total number of times each word appeared in the corpus.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.word_count_vector() array([1., 3., 1., 1.])
- Return type
array
-
word_counter
(word)[source]¶ Maps the documents containing a word to the number of times the word appeared.
Examples
>>> corpus = Corpus(["a bird", "a bird and a plane", "two birds"]) >>> corpus.word_counter("a") == {0: 1, 1: 2} True
- Parameters
word (
str
) – The word you’re looking up- Return type
Dict
[int
,int
]- Returns
- A dictionary mapping the document index of the word to the number of times
it appeared in that document.
-
word_freq_vector
()[source]¶ Returns the frequency in which each word appears over the corpus.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.word_freq_vector() array([0.16666667, 0.5 , 0.16666667, 0.16666667])
- Return type
array
-
word_frequency
(word)[source]¶ Returns the frequency in which the word appeared in the corpus.
Example
>>> corpus = Corpus(["this is fun", "or is it"]) >>> np.isclose(corpus.word_frequency("fun"), 1. / 6.) True >>> np.isclose(corpus.word_frequency("is"), 2. / 6.) True
- Parameters
word (
str
) – The string word or phrase.- Return type
float