text_data package¶
Subpackages¶
Submodules¶
text_data.display module¶
Renders data visualizations on text_data.index.WordIndex
objects.
The graphics in this module are designed to work across different metrics. You just have to pass them 1- or 2-dimensional numpy arrays.
This enables you to take the outputs from any functions inside of
text_data.index.WordIndex
and visualize them.
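For instance, a minimal sketch (the corpus here is invented; the functions are the ones documented below) might pull a term vector out of a Corpus and hand the top scores to a renderer:
>>> from text_data.index import Corpus
>>> from text_data.display import display_score_table
>>> corpus = Corpus(["a few documents", "a few more documents"])
>>> words, scores = corpus.get_top_words(corpus.word_count_vector(), top_n=3)
>>> table = display_score_table(words, scores, table_name="Most common words")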
-
text_data.display.
display_score_table
(words, scores, table_name='Top Scores')[source]¶ Returns the top (or bottom) scores as a table.
It requires a 1-dimensional numpy array of the scores and the words, much as you would receive from text_data.index.WordIndex.get_top_words(). For a 2-dimensional equivalent, use display_score_tables().
- Parameters
words (array) – A 1-dimensional numpy array of words.
scores (array) – A 1-dimensional numpy array of corresponding scores.
table_name (str) – The name to give your table.
- Raises
ValueError – If you did not use a 1-dimensional array, or if the two arrays don’t have identical shapes.
- Return type
str
-
text_data.display.
display_score_tables
(words, scores, table_names=None)[source]¶ Renders two score tables.
This is the 2-dimensional equivalent of display_score_table(); see that function for details.
- Parameters
words (array) – A 2-dimensional matrix of words.
scores (array) – A 2-dimensional matrix of scores.
table_names (Optional[List[str]]) – A list of names for your corresponding tables.
- Raises
ValueError – If words and scores aren’t both 2-dimensional arrays of the same shape, or if table_names isn’t of the same length as the number of documents.
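As a hedged sketch (invented corpus, imports assumed), the per-document output of text_data.index.WordIndex.get_top_words() on any term matrix is a natural input:
>>> corpus = Corpus(["this is a document", "this is another document"])
>>> words, scores = corpus.get_top_words(corpus.count_matrix(), top_n=2)
>>> tables = display_score_tables(words, scores, table_names=["Doc 0", "Doc 1"])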
-
text_data.display.
frequency_map
(index, word_vector, x_label='Word Frequency', y_label='Score')[source]¶ A scatter plot mapping scores over a corpus to their underlying word frequencies.
I cribbed this idea from Monroe et al 2008, a great paper that uses it to show distributional problems in metrics that are trying to compare two things.
The basic idea is that by creating a scatter plot mapping the frequencies of words to scores, you can both figure out which scores are disproportionately high or low and identify bias in whether your metric is excessively favoring common or rare words.
In order to render this graphic, your word vector has to conform to the number of words in your index. If you feel the need to remove words to make the graphic manageable to look at, consider using
text_data.index.WordIndex.skip_words()
.
- Parameters
index (WordIndex) – A text_data.index.WordIndex object. This is used to get the overall frequencies.
word_vector (array) – A 1-dimensional numpy array with floating point scores.
x_label (str) – The name of the x label for your graphic.
y_label (str) – The name of the y label for your graphic.
- Raises
ValueError – If the word_vector doesn’t have 1 dimension or if the vector isn’t the same length as your vocabulary.
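For instance, a sketch along these lines (invented corpus, imports assumed) plots each word’s log-odds against its frequency; because the vector comes from the same index, its length matches the vocabulary:
>>> corpus = Corpus(["the cat sat", "the dog sat down"])
>>> chart = frequency_map(corpus, corpus.odds_vector(sublinear=True), y_label="Log odds")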
-
text_data.display.
heatmap
(distance_matrix, left_indexes=None, right_indexes=None, left_name='Left', right_name='Right', metric_name='Similarity')[source]¶ Displays a heatmap of scores across a 2-dimensional matrix.
The purpose of this is to visually gauge which documents are closest to each other given two sets of documents. (If you only have one set of documents, the left and right can be the same.) The visual rendering here is inspired by tensorflow’s Universal Sentence Encoder documentation. But, while you can use a universal sentence encoder to create the heatmap, you can also easily use any of the metrics in scikit’s pairwise_distances function. Or, indeed, any other 2-dimensional matrix of floats will do the trick.
Note that left_name and right_name must be different. To account for this, this function automatically adds a suffix to both names if they are the same.
- Parameters
distance_matrix (array) – A distance matrix of size M x N, where M is the number of documents on the left side and N is the number of documents on the right side.
left_indexes (Optional[List[Any]]) – Labels for the left side (the Y axis).
right_indexes (Optional[List[Any]]) – Labels for the right side (the X axis).
left_name (str) – The Y axis label.
right_name (str) – The X axis label.
- Raises
ValueError – If the size of the indexes doesn’t match the shape of the matrix, or if the distance matrix does not have 2 dimensions.
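One hedged way to build the distance matrix is with scikit-learn's pairwise distances over this library's TF-IDF matrix (scikit-learn is an assumption here; any M x N matrix of floats works):
>>> from sklearn.metrics import pairwise_distances
>>> corpus = Corpus(["the cat sat", "a dog sat", "the cat meowed"])
>>> distances = pairwise_distances(corpus.tfidf_matrix().T, metric="cosine")  # transpose so rows are documents
>>> chart = heatmap(distances, left_indexes=[0, 1, 2], right_indexes=[0, 1, 2], left_name="Documents", right_name="Documents (again)")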
-
text_data.display.
histogram
(values, x_label='Score', y_label='Number of Documents', x_scale='linear', y_scale='linear', max_bins=100)[source]¶ Displays a histogram of values.
This can be really useful for debugging the lengths of documents.
- Parameters
values (array) – A numpy array of quantitative values.
x_label (str) – A label for the x-axis.
y_label (str) – A label for the y-axis.
x_scale (str) – A continuous scale type, defined by altair.
y_scale (str) – A continuous scale type, defined by altair.
max_bins (int) – The maximum number of histogram bins.
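For example, a quick sketch for eyeballing document lengths (invented corpus, imports assumed):
>>> import numpy as np
>>> corpus = Corpus(["a short doc", "a slightly longer document here", "tiny"])
>>> chart = histogram(np.array(list(corpus.doc_lengths.values())), x_label="Document length")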
-
text_data.display.
render_bar_chart
(labels, vector_data, x_label='Score', y_label='Word')[source]¶ Renders a bar chart given a 1-dimensional numpy array.
- Parameters
vector_data (array) – A 1-dimensional numpy array of floating point scores.
labels (array) – A 1-dimensional numpy array of labels for the bar chart (e.g. words).
x_label (str) – The label for your x-axis (the score).
y_label (str) – The label for the y-axis (the words).
- Raises
ValueError – If the numpy arrays have more than 1 dimension.
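A hedged sketch tying this to get_top_words() (invented corpus, imports assumed):
>>> corpus = Corpus(["text data is fun", "text data is useful"])
>>> words, scores = corpus.get_top_words(corpus.word_count_vector(), top_n=4)
>>> chart = render_bar_chart(words, scores, x_label="Count")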
-
text_data.display.
render_multi_bar_chart
(labels, matrix_scores, document_names, y_label='Score')[source]¶ This renders a bar chart, grouped by document, showing word-document statistics.
It’s essentially the 2-dimensional matrix equivalent of
render_bar_chart()
.
- Parameters
labels (array) – A 2-dimensional numpy array of words, like those passed from text_data.index.WordIndex.get_top_words().
matrix_scores (array) – A 2-dimensional numpy array of scores, like those passed from text_data.index.WordIndex.get_top_words().
document_names (Optional[List[str]]) – A list of names for the documents. If None, this will display numbers incrementing from 0.
y_label (str) – The name for the y label (where the scores go).
- Raises
ValueError – If your labels or your axes aren’t 2 dimensional or aren’t of the same size.
text_data.index module¶
This module handles the indexing of text_data.
Its two classes — WordIndex
and Corpus
— form the central part
of this library.
text_data.index.WordIndex
indexes lists of documents — which themselves form
lists of words or phrases — and offers utilities for performing
statistical calculations on your data.
Using the index, you can find out how many times a given word appeared in a
document or do more complicated things, like finding the TF-IDF values
for every single word across all of the documents in a corpus. In addition
to offering a bunch of different ways to compute statistics, WordIndex
also offers capabilities for creating new WordIndex
objects — something
that can be very helpful if you’re trying to figure out what
makes a set of documents different from some other documents.
The text_data.index.Corpus
, meanwhile, is a wrapper over WordIndex
that offers tools for searching
through sets of documents. In addition, it offers tools for visually seeing the results of search queries.
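As a brief, hedged sketch of how the pieces fit together (the documents are invented; every method used is documented below):
>>> from text_data.index import Corpus
>>> corpus = Corpus(["the cat sat on the mat", "the dog sat on the log"])
>>> corpus.search_documents("cat")
{0}
>>> corpus.word_count("the")
4
>>> words, scores = corpus.get_top_words(corpus.tfidf_matrix(), top_n=2)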
-
class
text_data.index.
Corpus
(documents, tokenizer=functools.partial(<function postprocess_positions>, [<method 'lower' of 'str' objects>, ]functools.partial(<function tokenize_regex_positions>, '\\\\w+', inverse_match=False)), sep=None, prefix=None, suffix=None)[source]¶ Bases:
text_data.index.WordIndex
This is probably going to be your main entrypoint into text_data.
The corpus holds the raw text, the index, and the tokenized text of whatever you’re trying to analyze. Its primary role is to extend the functionality of WordIndex to support searching. This means that you can use the Corpus to search for arbitrarily long phrases using boolean search methods (AND, OR, NOT). In addition, it allows you to add indexes so you can calculate statistics on phrases. By using add_ngram_index(), you can figure out the frequency or TF-IDF values of multi-word phrases while still being able to search through your normal index.
Initializing Data
To instantiate the corpus, you need to include a list of documents, where each document is a string of text, and a tokenizer. There is a default tokenizer, which simply lowercases words and splits documents on r"\w+". For most tasks, this will be insufficient. But text_data.tokenize offers convenient ways that should make building the vast majority of tokenizers easy.
The Corpus can be instantiated using __init__ or by using chunks(), which yields a generator of smaller Corpus objects (mini-indexes). This allows you to technically perform calculations in-memory on larger databases.
You can also initialize a Corpus object by using the slice(), copy(), split_off(), or concatenate() methods. These methods work identically to their equivalent methods in text_data.index.WordIndex while updating extra data that the corpus has, updating n-gram indexes, and automatically re-indexing the corpus.
Updating Data
There are two methods for updating or adding data to the Corpus. update() allows you to add new documents to the corpus. add_ngram_index() allows you to add multi-word indexes.
Searching
There are a few methods devoted to searching. search_documents() allows you to find all of the individual documents matching a query. search_occurrences() shows all of the individual occurrences that matched your query. ranked_search() finds all of the individual occurrences and sorts them according to a variant of their TF-IDF score.
Statistics
Three methods allow you to get statistics about a search. search_document_count() allows you to find the total number of documents matching your query. search_document_freq() shows the proportion of documents matching your query. And search_occurrence_count() finds the total number of matches you have for your query.
Display
There are a number of functions designed to help you visually see the results of your query. display_document() and display_documents() render your documents in HTML. display_document_count(), display_document_frequency(), and display_occurrence_count() all render bar charts showing the number of query results you got. And display_search_results() shows the results of your search.
-
documents
¶ A list of all the raw, non-tokenized documents in the corpus.
-
tokenizer
A function that converts a string (one of the documents from documents) into a list of words and a list of the character-level positions where the words are located in the raw text. See
text_data.tokenize
for details.
-
tokenized_documents
¶ A list of the tokenized documents (each a list of words)
-
ngram_indexes
¶ A list of
WordIndex
objects for multi-word (n-gram) indexes. Seeadd_ngram_index()
for details.
-
ngram_sep
¶ A separator in between words. See
add_ngram_index()
for details.
-
ngram_prefix
¶ A prefix to go before any n-gram phrases. See
add_ngram_index()
for details.
-
ngram_suffix
¶ A suffix to go after any n-gram phrases. See
add_ngram_index()
for details.
- Parameters
documents (List[str]) – A list of the raw, un-tokenized texts.
tokenizer (Callable[[str], Tuple[List[str], List[Tuple[int, int]]]]) – A function to tokenize the documents. See text_data.tokenize for details.
sep (Optional[str]) – The separator you want to use for computing n-grams. See add_ngram_index() for details.
prefix (Optional[str]) – The prefix you want to use for n-grams. See add_ngram_index() for details.
suffix (Optional[str]) – The suffix you want to use for n-grams. See add_ngram_index() for details.
-
add_documents
(tokenized_documents, indexed_locations=None)[source]¶ This overrides the
add_documents()
method.
Because Corpus objects can have n-gram indices, simply running add_documents would cause the n-gram indices to go out of sync with the overall corpus. In order to prevent that, this function raises an error if you try to run it.
- Raises
NotImplementedError – Warns you to use text_data.index.Corpus.update() instead.
-
add_ngram_index
(n=1, default=True, sep=None, prefix=None, suffix=None)[source]¶ Adds an n-gram index to the corpus.
This creates a WordIndex object that you can access by typing self.ngram_indexes[n].
There are times when you might want to compute TF-IDF scores, word frequency scores, or similar scores over a multi-word index. For instance, you might want to know how frequently someone said ‘United States’ in a speech, without caring how often they used the word ‘united’ or ‘states’.
This function helps you do that. It automatically splits up your documents into an overlapping set of n-length phrases. Internally, this takes each of your tokenized documents, merges them into lists of n-length phrases, and joins each of those lists by a space. However, you can customize this behavior. If you set prefix, each of the n-grams will be prefixed by that string; if you set suffix, each of the n-grams will end with that string. And if you set sep, each of the words in the n-gram will be separated by the separator.
Example
Say you have a simple four word corpus. If you use the default settings, here’s what your n-grams will look like:
>>> corpus = Corpus(["text data is fun"]) >>> corpus.add_ngram_index(n=2) >>> corpus.ngram_indexes[2].vocab_list ['data is', 'is fun', 'text data']
By altering sep, prefix, or suffix, you can alter that behavior. But be careful to set default to False if you want to change the behavior from something you set up in __init__. If you don’t, this will use whatever settings you instantiated the class with.
>>> corpus.add_ngram_index(n=2, sep="</w><w>", prefix="<w>", suffix="</w>", default=False) >>> corpus.ngram_indexes[2].vocab_list ['<w>data</w><w>is</w>', '<w>is</w><w>fun</w>', '<w>text</w><w>data</w>']
- Parameters
n (int) – The number of n-grams (defaults to unigrams).
default (bool) – If true, will keep the values stored in init (including defaults).
sep (Optional[str]) – The separator in between words (if storing n-grams).
prefix (Optional[str]) – The prefix before the first word of each n-gram.
suffix (Optional[str]) – The suffix after the last word of each n-gram.
-
classmethod
chunks
(documents, tokenizer=functools.partial(<function postprocess_positions>, [<method 'lower' of 'str' objects>, ]functools.partial(<function tokenize_regex_positions>, '\\\\w+', inverse_match=False)), sep=None, prefix=None, suffix=None, chunksize=1000000)[source]¶ Iterates through documents, yielding a
Corpus with chunksize documents.
This is designed to allow you to technically use Corpus on large document sets. However, you should note that searching for documents will only work within the context of the current chunk. The same is true for any frequency metrics. As such, you should probably limit metrics to raw counts or aggregations you’ve derived from raw counts.
Example
>>> for docs in Corpus.chunks(["chunk one", "chunk two"], chunksize=1): ... print(len(docs)) 1 1
- Parameters
documents (Iterator[str]) – A list of raw text items (not tokenized).
tokenizer (Callable[[str], Tuple[List[str], List[Tuple[int, int]]]]) – A function to tokenize the documents.
sep (Optional[str]) – The separator you want to use for computing n-grams.
prefix (Optional[str]) – The prefix for n-grams.
suffix (Optional[str]) – The suffix for n-grams.
chunksize (int) – The number of documents in each chunk.
- Return type
Generator[Corpus, None, None]
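One hedged way to keep to raw counts while chunking (documents and chunk size invented):
>>> from collections import Counter
>>> total_counts = Counter()
>>> for chunk in Corpus.chunks(["one doc", "two docs", "three docs"], chunksize=2):
...     total_counts.update(dict(chunk.most_common()))
>>> total_counts["docs"]
2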
-
concatenate
(other)[source]¶ This combines two
Corpus objects into one, much like text_data.index.WordIndex.concatenate().
However, the new Corpus has data from this corpus, including n-gram data. Because of this, the two Corpus objects must have the same keys for their n-gram dictionaries.
Example
>>> corpus_1 = Corpus(["i am an example"]) >>> corpus_2 = Corpus(["i am too"]) >>> corpus_1.add_ngram_index(n=2) >>> corpus_2.add_ngram_index(n=2) >>> combined_corpus = corpus_1.concatenate(corpus_2) >>> combined_corpus.most_common() [('am', 2), ('i', 2), ('an', 1), ('example', 1), ('too', 1)] >>> combined_corpus.ngram_indexes[2].most_common() [('i am', 2), ('am an', 1), ('am too', 1), ('an example', 1)]
-
copy
()[source]¶ This creates a shallow copy of a
Corpus
object.
It extends the contents of Corpus to also store data about the objects themselves.
- Return type
Corpus
-
display_document
(doc_idx)[source]¶ Print an entire document, given its index.
- Parameters
doc_idx (int) – The index of the document.
- Return type
HTML
-
display_document_count
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Returns a bar chart (in altair) showing the queries with the largest number of documents.
Note
This method requires that you have altair installed. To install, type pip install text_data[display] or poetry add text_data -E display.
- Parameters
queries (List[str]) – A list of queries (in the same form you use to search for things).
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
-
display_document_frequency
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Displays a bar chart showing the percentages of documents with a given query.
Note
This method requires that you have altair installed. To install, type pip install text_data[display] or poetry add text_data -E display.
- Parameters
queries (List[str]) – A list of queries.
query_tokenizer (Callable[[str], List[str]]) – A tokenizer for each query.
-
display_documents
(documents)[source]¶ Display a number of documents, at the specified indexes.
- Parameters
documents (List[int]) – A list of document indexes.
- Return type
HTML
-
display_occurrence_count
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Display a bar chart showing the number of times a query matches.
Note
This method requires that you have altair installed. To install, type pip install text_data[display] or poetry add text_data -E display.
- Parameters
queries (List[str]) – A list of queries.
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
-
display_search_results
(search_query, query_tokenizer=<method 'split' of 'str' objects>, max_results=None, window_size=None)[source]¶ Shows the results of a ranked query.
This function runs a query and then renders the result in human-readable HTML. For each result, you will get a document ID and the count of the result.
In addition, all of the matching occurrences of phrases or words you searched for will be highlighted in bold. You can optionally decide how many results you want to return and how long you want each result to be (up to the length of the whole document).
- Parameters
search_query (str) – The query you’re searching for.
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
max_results (Optional[int]) – The maximum number of results. If None, returns all results.
window_size (Optional[int]) – The number of characters you want to return around the matching phrase. If None, returns the entire document.
- Return type
HTML
-
ranked_search
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ This produces a list of search responses in ranked order.
More specifically, the documents are ranked in order of the sum of the TF-IDF scores for each word in the query (with the exception of words that are negated using a NOT operator).
To compute the TF-IDF scores, I simply have computed the dot products between the raw query counts and the TF-IDF scores of all the unique words in the query. This is roughly equivalent to the
ltn.lnn
normalization scheme described in Manning. (The catch is that I have normalized the term frequencies in the document to the length of the document.)
Each item in the resulting list is a list referring to a single document. The items inside each of those lists are of the same format you get from search_occurrences(). The first item in each list is either the item having the largest number of words in it or the item that’s the nearest to another match within the document.
- Parameters
query_string (str) – The query string.
query_tokenizer (Callable[[str], List[str]]) – Function for tokenizing the query.
- Return type
List[List[PositionResult]]
- Returns
A list of tuples, each in the same format as
search_occurrences()
.
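A small sketch (invented corpus); each inner list groups the matches for one document, ranked best-first:
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"])
>>> results = corpus.ranked_search("cow grass")
>>> best_match = results[0][0]  # a PositionResult for the top-ranked document
>>> snippet = corpus.documents[best_match.doc_id][best_match.raw_start:best_match.raw_end]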
-
search_document_count
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the total number of documents matching a query.
By entering a search, you can get the total number of documents that match the query.
Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_document_count("cow") 2 >>> corpus.search_document_count("grass") 1 >>> corpus.search_document_count("the") 2
- Parameters
query_string (str) – The query you’re searching for.
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
- Return type
int
-
search_document_freq
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the percentage of documents that match a query.
Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_document_freq("cow") 1.0 >>> corpus.search_document_freq("grass") 0.5 >>> corpus.search_document_freq("the grass") 0.5 >>> corpus.search_document_freq("the OR nonsense") 1.0
- Parameters
query_string (str) – The query you’re searching for.
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
- Return type
float
-
search_documents
(query, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Search documents from a query.
In order to figure out the intricacies of writing queries, you should view text_data.query.Query. In general, standard boolean (AND, OR, NOT) searches work perfectly reasonably. You should generally not need to set query_tokenizer to anything other than the default (string split).
This produces a set of unique documents, where each item is the index of a matching document. To view the documents by their ranked importance (ranked largely using TF-IDF), use
ranked_search()
.Example
>>> corpus = Corpus(["this is an example", "here is another"]) >>> assert corpus.search_documents("is") == {0, 1} >>> assert corpus.search_documents("example") == {0}
- Parameters
query (str) – A string boolean query (as defined in text_data.query.Query).
query_tokenizer (Callable[[str], List[str]]) – A function to tokenize the words in your query. This allows you to optionally search for words in your index that include spaces (since it defaults to string.split).
- Return type
Set
[int
]
-
search_occurrence_count
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the total number of occurrences you have for the given query.
This just gets the number of items in
search_occurrences()
. As a result, searching for occurrences where two separate words occur will find the total number of places where either word occurs within the set of documents where both words appear.Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_occurrence_count("the") 3 >>> corpus.search_occurrence_count("the cow") 5 >>> corpus.search_occurrence_count("'the cow'") 2
- Parameters
query_string (str) – The query you’re searching for.
query_tokenizer (Callable[[str], List[str]]) – The tokenizer for the query.
- Return type
int
-
search_occurrences
(query, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Search for matching positions within a search.
This allows you to figure out all of the occurrences matching your query. In addition, this is used internally to display search results.
Each matching position comes in the form of a tuple where the first field doc_id refers to the position of the document, the second field first_idx refers to the starting index of the occurrence (among the tokenized documents), last_idx refers to the last index of the occurrence, raw_start refers to the starting index of the occurrence within the raw, non-tokenized documents, and raw_end refers to the index after the last character of the matching result within the non-tokenized documents. There is not really a reason behind this decision.
Example
>>> corpus = Corpus(["this is fun"]) >>> result = list(corpus.search_occurrences("'this is'"))[0] >>> result PositionResult(doc_id=0, first_idx=0, last_idx=1, raw_start=0, raw_end=7) >>> corpus.documents[result.doc_id][result.raw_start:result.raw_end] 'this is' >>> corpus.tokenized_documents[result.doc_id][result.first_idx:result.last_idx+1] ['this', 'is']
- Parameters
query (str) – The string query. See text_data.query.Query for details.
query_tokenizer (Callable[[str], List[str]]) – The tokenizing function for the query. See text_data.query.Query or search_documents() for details.
- Return type
Set
[PositionResult
]
-
slice
(indexes)[source]¶ This creates a
Corpus object only including the documents listed.
This overrides the method in text_data.index.WordIndex, which does the same thing (but without making changes to the underlying document set). This also creates slices of any of the n-gram indexes you have created.
Note
This also changes the indexes for the new corpus so they all go from 0 to len(indexes).
- Parameters
indexes (Set[int]) – A set of document indexes you want to have in the new index.
Example
>>> corpus = Corpus(["example document", "another example", "yet another"]) >>> corpus.add_ngram_index(n=2) >>> sliced_corpus = corpus.slice({1}) >>> len(sliced_corpus) 1 >>> sliced_corpus.most_common() [('another', 1), ('example', 1)] >>> sliced_corpus.ngram_indexes[2].most_common() [('another example', 1)]
- Return type
Corpus
-
split_off
(indexes)[source]¶ This operates like
text_data.index.WordIndex.split_off().
But it additionally maintains the state of the Corpus data, similar to how slice() works.
Example
>>> corpus = Corpus(["i am an example", "so am i"]) >>> sliced_data = corpus.split_off({0}) >>> corpus.documents ['so am i'] >>> sliced_data.documents ['i am an example'] >>> corpus.most_common() [('am', 1), ('i', 1), ('so', 1)]
- Return type
Corpus
-
to_index
()[source]¶ Converts a
Corpus
object into a WordIndex object.
Corpus objects are convenient because they allow you to search across documents, in addition to computing statistics about them. But sometimes, you don’t need that, and the added convenience comes with extra memory requirements.
- Return type
WordIndex
-
-
class
text_data.index.
PositionResult
(doc_id, first_idx, last_idx, raw_start, raw_end)¶ Bases:
tuple
This represents the position of a word or phrase within a document.
See
text_data.index.Corpus.search_occurrences()
for more details and an example.
- Parameters
doc_id (int) – The index of the document within the index
first_idx (int) – The index of the first word within the tokenized document at
corpus.tokenized_documents[doc_id]
.last_idx (int) – The index of the last word within the tokenized document at
corpus.tokenized_documents[doc_id]
.raw_start (Optional[int]) – The starting character-level index within the raw string document at
corpus.documents[doc_id]
.raw_end (Optional[int]) – The index after the ending character-level index within the raw string document at
corpus.documents[doc_id]
.
-
doc_id
¶ Alias for field number 0
-
first_idx
¶ Alias for field number 1
-
last_idx
¶ Alias for field number 2
-
raw_end
¶ Alias for field number 4
-
raw_start
¶ Alias for field number 3
-
class
text_data.index.
WordIndex
(tokenized_documents, indexed_locations=None)[source]¶ Bases:
object
An inverted, positional index containing the words in a corpus.
This is designed to allow people to be able to quickly compute statistics about the language used across a corpus. The class offers a couple of broad strategies for understanding the ways in which words are used across documents.
Manipulating Indexes
These functions are designed to allow you to create new indexes based on ones you already have. They operate kind of like slices and filter functions in pandas, where your goal is to be able to create new data structures that you can analyze independently from ones you’ve already created. Most of them can also be used with method chaining. However, some of these functions remove positional information from the index, so be careful.
copy() creates an identical copy of a WordIndex object.
slice(), slice_many(), and split_off() all take sets of document indexes and create new indexes with only those documents.
add_documents() allows you to add new documents into an existing WordIndex object.
concatenate() similarly combines WordIndex objects into a single WordIndex.
flatten() takes a WordIndex and returns an identical index that only has one document.
skip_words() takes a set of words and returns a WordIndex that does not have those words.
reset_index() changes the document indexes.
Corpus Information
A number of functions are designed to allow you to look up information about the corpus. For instance, you can collect a sorted list or a set of all the unique words in the corpus. Or you can get a list of the most commonly appearing elements:
vocab and vocab_list both return the unique words or phrases appearing in the index.
vocab_size gets the number of unique words in the index.
num_words gets the total number of words in the index.
doc_lengths gets a dictionary mapping documents to the number of tokens, or words, they contain.
Word Statistics
These allow you to gather statistics about single words or about word, document pairs. For instance, you can see how many words there are in the corpus, how many unique words there are, or how often a particular word appears in a document.
The statistics generally fit into four categories. The first category computes statistics about how often a specific word appears in the corpus as a whole. The second category computes statistics about how often a specific word appears in a specific document. The third and fourth categories echo those first two categories but perform the statistics efficiently across the corpus as a whole, creating 1-dimensional numpy arrays in the case of the word-corpus statistics and 2-dimensional numpy arrays in the case of the word-document statistics. Functions in these latter two categories all end in
_vector
and _matrix respectively.
Here’s how those statistics map to one another:
[Table not fully recoverable from extraction. Its columns were Word-Corpus, Word-Document, Vector, and Matrix: each word-corpus statistic (e.g. word_count, word_frequency, __contains__, odds_word) maps to a word-document equivalent (e.g. term_count), a vector equivalent (e.g. word_count_vector), and a matrix equivalent (e.g. count_matrix).]
In the case of the vector and matrix calculations, the arrays represent the unique words of the vocabulary, presented in sorted order. As a result, you can safely run element-wise calculations over the matrices.
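For instance (a small sketch), the order of the entries in any _vector result lines up with vocab_list:
>>> corpus = Corpus(["a cat and a dog"])
>>> corpus.vocab_list
['a', 'and', 'cat', 'dog']
>>> corpus.word_count_vector()
array([2., 1., 1., 1.])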
In addition to the term vector and term-document matrix functions, there is
get_top_words()
, which is designed to allow you to find the highest or lowest scores and their associated words along any term vector or term-document matrix you please.Note
For the most part, you will not want to instantiate WordIndex directly. Instead, you will likely use Corpus, which subclasses WordIndex.
That’s because Corpus offers utilities for searching through documents. In addition, with the help of tools from text_data.tokenize, instantiating Corpus objects is a bit simpler than instantiating WordIndex objects directly.
I particularly recommend that you do not instantiate indexed_locations directly (i.e. outside of Corpus). The only way you can do anything with indexed_locations from outside of Corpus is by using an internal attribute and hacking through poorly documented Rust code.
tokenized_documents (List[List[str]]) – A list of documents where each document is a list of words.
indexed_locations (Optional[List[Tuple[int, int]]]) – A list of documents where each document contains a list of the start and end positions of the words in tokenized_documents.
-
add_documents
(tokenized_documents, indexed_locations=None)[source]¶ This function updates the index with new documents.
It operates similarly to
text_data.index.Corpus.update()
, taking new documents and mutating the existing one.Example
>>> tokenized_words = ["im just a simple document".split()] >>> index = WordIndex(tokenized_words) >>> len(index) 1 >>> index.num_words 5 >>> index.add_documents(["now im an entire corpus".split()]) >>> len(index) 2 >>> index.num_words 10
-
concatenate
(other, ignore_index=True)[source]¶ Creates a
WordIndex
object with the documents of both this object and the other.See
text_data.multi_corpus.concatenate()
for more details.- Parameters
ignore_index (bool) – If set to True, which is the default, the document indexes will be re-indexed starting from 0.
- Raises
ValueError – If ignore_index is set to False and some of the indexes overlap.
- Return type
WordIndex
-
count_matrix
()[source]¶ Returns a matrix showing the number of times each word appeared in each document.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.count_matrix().tolist() == [[0., 1.], [1., 2.], [0., 1.], [0., 1.]] True
- Return type
array
-
doc_contains
(word, document)[source]¶ States whether the given document contains the word.
Example
>>> corpus = Corpus(["words", "more words"]) >>> corpus.doc_contains("more", 0) False >>> corpus.doc_contains("more", 1) True
- Parameters
word (str) – The word you’re looking up.
document (int) – The index of the document.
- Raises
ValueError – If the document you’re looking up doesn’t exist.
- Return type
bool
-
doc_count_vector
()[source]¶ Returns the total number of documents each word appears in.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.doc_count_vector() array([1., 2.])
- Return type
array
-
doc_freq_vector
()[source]¶ Returns the proportion of documents each word appears in.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.doc_freq_vector() array([0.5, 1. ])
- Return type
array
-
property
doc_lengths
¶ Returns a dictionary mapping the document indices to their lengths.
Example
>>> corpus = Corpus(["a cat and a dog", "a cat", ""]) >>> assert corpus.doc_lengths == {0: 5, 1: 2, 2: 0}
- Return type
Dict
[int
,int
]
-
docs_with_word
(word)[source]¶ Returns a list of all the documents containing a word.
Example
>>> corpus = Corpus(["example document", "another document"]) >>> assert corpus.docs_with_word("document") == {0, 1} >>> assert corpus.docs_with_word("another") == {1}
- Parameters
word (
str
) – The word you’re looking up.- Return type
Set
[int
]
-
document_count
(word)[source]¶ Returns the total number of documents a word appears in.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.document_count("example") 2 >>> corpus.document_count("another") 1
- Parameters
word (
str
) – The word you’re looking up.- Return type
int
-
document_frequency
(word)[source]¶ Returns the percentage of documents that contain a word.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.document_frequency("example") 1.0 >>> corpus.document_frequency("another") 0.5
- Parameters
word (
str
) – The word you’re looking up.- Return type
float
-
flatten
()[source]¶ Flattens a multi-document index into a single-document corpus.
This creates a new
WordIndex
object stripped of any positional information that has a single document in it. However, the list of words and their indexes remain.Example
>>> corpus = Corpus(["i am a document", "so am i"]) >>> len(corpus) 2 >>> flattened = corpus.flatten() >>> len(flattened) 1 >>> assert corpus.most_common() == flattened.most_common()
- Return type
WordIndex
-
frequency_matrix
()[source]¶ Returns a matrix showing the frequency of each word appearing in each document.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.frequency_matrix().tolist() == [[0.0, 0.2], [1.0, 0.4], [0.0, 0.2], [0.0, 0.2]] True
- Return type
array
-
get_top_words
(term_matrix, top_n=None, reverse=True)[source]¶ Get the top values along a term matrix.
Given a matrix where each row represents a word in your vocabulary, this returns a numpy matrix of those top values, along with an array of their respective words.
You can choose the number of results you want to get by setting
top_n
to some positive value, or you can leave it be and return all of the results in sorted order. Additionally, by setting reverse to False (instead of its default of True), you can return the scores from smallest to largest.
- Parameters
term_matrix (array) – A matrix of floats where each row represents a word.
top_n (Optional[int]) – The number of values you want to return. If None, returns all values.
reverse (bool) – If true (the default), returns the N values with the highest scores. If false, returns the N values with the lowest scores.
- Return type
Tuple
[array
,array
]- Returns
A tuple of 2-dimensional numpy arrays, where the first item is an array of the top-scoring words and the second item is an array of the top scores themselves. Both arrays are of the same size, that is
min(self.vocab_size, top_n)
by the number of columns in the term matrix.- Raises
ValueError – If
top_n
is less than 1, if there are not the same number of rows in the matrix as there are unique words in the index, or if the numpy array doesn’t have 1 or 2 dimensions.
Example
The first thing you need to do in order to use this function is create a 1- or 2-dimensional term matrix, where the number of rows corresponds to the number of unique words in the corpus. Any of the functions within
WordIndex that end in _matrix(**kwargs) (for 2-dimensional arrays) or _vector(**kwargs) (for 1-dimensional arrays) will do the trick here. I’ll show an example with both a word count vector and a word count matrix:
>>> corpus = Corpus(["The cat is near the birds", "The birds are distressed"]) >>> corpus.get_top_words(corpus.word_count_vector(), top_n=2) (array(['the', 'birds'], dtype='<U10'), array([3., 2.])) >>> corpus.get_top_words(corpus.count_matrix(), top_n=1) (array([['the', 'the']], dtype='<U10'), array([[2., 1.]]))
Similarly, you can return the scores from lowest to highest by setting
reverse=False
. (This is not the default.):>>> corpus.get_top_words(-1. * corpus.word_count_vector(), top_n=2, reverse=False) (array(['the', 'birds'], dtype='<U10'), array([-3., -2.]))
-
idf
(word)[source]¶ Returns the inverse document frequency.
If the number of documents in your
WordIndex
index
is \(N\) and the document frequency fromdocument_frequency()
is \(df\), the inverse document frequency is \(\frac{N}{df}\).Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.idf("example") 1.0 >>> corpus.idf("another") 2.0
- Parameters
word (
str
) – The word you’re looking for.- Return type
float
-
idf_vector
()[source]¶ Returns the inverse document frequency vector.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.idf_vector() array([2., 1.])
- Return type
array
-
max_word_count
()[source]¶ Returns the most common word and the number of times it appeared in the corpus.
Returns
None
if there are no words in the corpus.Example
>>> corpus = Corpus([]) >>> corpus.max_word_count() is None True >>> corpus.update(["a bird a plane superman"]) >>> corpus.max_word_count() ('a', 2)
- Return type
Optional
[Tuple
[str
,int
]]
-
most_common
(num_words=None)[source]¶ Returns the most common items.
This is nearly identical to
collections.Counter.most_common
. However, unlike collections.Counter.most_common, values with the same count are returned in alphabetical order.
>>> corpus = Corpus(["i walked to the zoo", "i bought a zoo"]) >>> corpus.most_common() [('i', 2), ('zoo', 2), ('a', 1), ('bought', 1), ('the', 1), ('to', 1), ('walked', 1)] >>> corpus.most_common(2) [('i', 2), ('zoo', 2)]
- Parameters
num_words (
Optional
[int
]) – The number of words you return. If you enter None or you enter a number larger than the total number of words, it returns all of the words, in sorted order from most common to least common.- Return type
List
[Tuple
[str
,int
]]
-
property
num_words
¶ Returns the total number of words in the corpus (not just unique).
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.num_words 5
- Return type
int
-
odds_document
(word, document, sublinear=False)[source]¶ Returns the odds of finding a word in a document.
This is the equivalent of
odds_word()
. But instead of calculating items at the word-corpus level, the calculations are performed at the word-document level.
Example
>>> corpus = Corpus(["this is a document", "document two"]) >>> corpus.odds_document("document", 1) 1.0 >>> corpus.odds_document("document", 1, sublinear=True) 0.0
- Parameters
word (str) – The word you’re looking up.
document (int) – The index of the document.
sublinear (bool) – If True, returns the log-odds of finding the word in the document.
- Raises
ValueError – If the document doesn’t exist.
- Return type
float
-
odds_matrix
(sublinear=False, add_k=None)[source]¶ Returns the odds of finding a word in a document for every possible word-document pair.
Because not all words are likely to appear in all of the documents, this implementation adds
1
to all of the numerators before taking the frequencies. So\(O(w) = \frac{c_{i} + 1}{N + \vert V \vert}\)
where \(\vert V \vert\) is the total number of unique words in each document, \(N\) is the total number of total words in each document, and \(c_i\) is the count of a word in a document.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.odds_matrix() array([[0.33333333, 1. ], [1. , 0.33333333], [1. , 1. ]]) >>> corpus.odds_matrix(sublinear=True) array([[-1.5849625, 0. ], [ 0. , -1.5849625], [ 0. , 0. ]])
- Parameters
sublinear (bool) – If True, computes the log-odds.
add_k (Optional[float]) – This adds k to each of the non-zero elements in the matrix. Since \(\log{1} = 0\), this prevents 50 percent probabilities from appearing to be the same as elements that don’t exist.
- Return type
array
-
odds_vector
(sublinear=False)[source]¶ Returns a vector of the odds of each word appearing at random.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.odds_vector() array([0.2, 1. , 0.2, 0.2]) >>> corpus.odds_vector(sublinear=True) array([-2.32192809, 0. , -2.32192809, -2.32192809])
- Parameters
sublinear (
bool
) – If true, returns the log odds.- Return type
array
-
odds_word
(word, sublinear=False)[source]¶ Returns the odds of seeing a word at random.
In statistics, the odds of something happening are the probability of it happening, versus the probability of it not happening, that is \(\frac{p}{1 - p}\). The “log odds” of something happening — the result of using
self.log_odds_word
— is similarly equivalent to \(log_{2}{\frac{p}{1 - p}}\).(The probability in this case is simply the word frequency.)
Example
>>> corpus = Corpus(["i like odds ratios"]) >>> np.isclose(corpus.odds_word("odds"), 1. / 3.) True >>> np.isclose(corpus.odds_word("odds", sublinear=True), np.log2(1./3.)) True
- Parameters
word (str) – The word you’re looking up.
sublinear (bool) – If true, returns the log odds.
- Return type
float
-
one_hot_matrix
()[source]¶ Returns a matrix showing whether each given word appeared in each document.
For these matrices, all cells contain a floating point value of either a 1., if the word is in that document, or a 0. if the word is not in the document.
These are sometimes referred to as ‘one-hot encoding matrices’ in machine learning.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> np.array_equal( ... corpus.one_hot_matrix(), ... np.array([[0., 1.], [1., 1.], [0., 1.], [0., 1.]]) ... ) True
- Return type
array
-
reset_index
(start_idx=None)[source]¶ An in-place operation that resets the document indexes for this corpus.
When you reset the index, all of the documents change their values, starting at
start_idx
(and incrementing from there). For the most part, you will not need to do this, since most of the library does not give you the option to change the document indexes. However, it may be useful when you’re using slice() or split_off().
- Parameters
start_idx (
Optional
[int
]) – The first (lowest) document index you want to set. Values must be positive. Defaults to 0.
-
skip_words
(words)[source]¶ Creates a
WordIndex
without any of the skipped words.This enables you to create an index that does not contain rare words, for example. The index will not have any positions associated with them, so be careful when implementing it on a
text_data.index.Corpus
object.Example
>>> skip_words = {"document"} >>> corpus = Corpus(["example document", "document"]) >>> "document" in corpus True >>> without_document = corpus.skip_words(skip_words) >>> "document" in without_document False
- Return type
-
slice
(indexes)[source]¶ Returns an index that just contains documents from the set of words.
- Parameters
indexes (
Set
[int
]) – A set of index values for the documents.
Example
>>> index = WordIndex([["example"], ["document"], ["another"], ["example"]]) >>> sliced_idx = index.slice({0, 2}) >>> len(sliced_idx) 2 >>> sliced_idx.most_common() [('another', 1), ('example', 1)]
- Return type
-
slice_many
(indexes_list)[source]¶ This operates like
slice()
but creates multipleWordIndex
objects.Example
>>> corpus = Corpus(["example document", "another example", "yet another"]) >>> first, second, third = corpus.slice_many([{0}, {1}, {2}]) >>> first.documents ['example document'] >>> second.documents ['another example'] >>> third.documents ['yet another']
- Parameters
indexes_list (
List
[Set
[int
]]) – A list of sets of indexes. Seetext_data.index.WordIndex.slice()
for details.- Return type
List
[WordIndex
]
-
split_off
(indexes)[source]¶ Returns an index with just a set of documents, while removing them from the index.
- Parameters
indexes (
Set
[int
]) – A set of index values for the documents.
Note
This removes words from the index in place. So make sure you want to do that before using this function.
Example
>>> index = WordIndex([["example"], ["document"], ["another"], ["example"]]) >>> split_idx = index.split_off({0, 2}) >>> len(split_idx) 2 >>> len(index) 2 >>> split_idx.most_common() [('another', 1), ('example', 1)] >>> index.most_common() [('document', 1), ('example', 1)]
- Return type
WordIndex
-
term_count
(word, document)[source]¶ Returns the total number of times a word appeared in a document.
Assuming the document exists, returns 0 if the word does not appear in the document.
Example
>>> corpus = Corpus(["i am just thinking random thoughts", "am i"]) >>> corpus.term_count("random", 0) 1 >>> corpus.term_count("random", 1) 0
- Parameters
word (str) – The word you’re looking up.
document (int) – The index of the document.
- Raises
ValueError – If the document you selected doesn’t exist.
- Return type
int
-
term_frequency
(word, document)[source]¶ Returns the proportion of words in document
document
that areword
.Example
>>> corpus = Corpus(["just coming up with words", "more words"]) >>> np.isclose(corpus.term_frequency("words", 1), 0.5) True >>> np.isclose(corpus.term_frequency("words", 0), 0.2) True
- Parameters
word (str) – The word you’re looking up.
document (int) – The index of the document.
- Raises
ValueError – If the document you’re looking up doesn’t exist
- Return type
float
-
tfidf_matrix
(norm='l2', use_idf=True, smooth_idf=False, sublinear_tf=True, add_k=1)[source]¶ This creates a term-document TF-IDF matrix from the index.
In natural language processing, TF-IDF is a mechanism for finding out which words are distinct across documents. It’s used particularly widely in information retrieval, where your goal is to rank documents that you know match a query by how relevant you think they’ll be.
The basic intuition goes like this: If a word appears particularly frequently in a document, it’s probably more relevant to that document than if the word occurred more rarely. But, some words are simply common: If document X uses the word ‘the’ more often than the word ‘idiomatic,’ that really tells you more about the words ‘the’ and ‘idiomatic’ than it does about the document.
TF-IDF tries to balance these two competing interests by taking the ‘term frequency,’ or how often a word appears in the document, and normalizing it by the ‘document frequency,’ or the proportion of documents that contain the word. This has the effect of reducing the weights of common words (and even setting the weights of some very common words to 0 in some implementations).
It should be noted that there are a number of different implementations of TF-IDF. Within information retrieval, TF-IDF is part of the ‘SMART Information Retrieval System’. Although the exact equations can vary considerably, they typically follow the same approach: First, they find some value to represent the frequency of each word in the document. Often (but not always), this is just the raw number of times in which a word appeared in the document. Then, they normalize that based on the document frequency. And finally, they normalize those values based on the length of the document, so that long documents are not weighted more favorably (or less favorably) than shorter documents.
The approach that I have taken to this is shamelessly cribbed from scikit’s TfidfTransformer. Specifically, I’ve allowed for some customization of the specific formula for TF-IDF while not including methods that require access to the raw documents, which would be computationally expensive to perform. This allows for the following options:
You can set the term frequency to either take the raw count of the word in the document (\(c_{t,d}\)) or by using
sublinear_tf=True
and taking \(1 + \log_{2}{c_{t,d}}\)You can skip taking the inverse document frequency \(df^{-1}\) altogether by setting
use_idf=False
or you can smooth the inverse document frequency by settingsmooth_idf=True
. This adds one to the numerator and the denominator. (Note: Because this method is only run on a vocabulary of words that are in the corpus, there can’t be any divide by zero errors, but this allows you to replicate scikit’sTfidfTransformer
.)You can add some number to the logged inverse document frequency by setting
add_k
to something other than 1. This is the only difference between this implementation and scikit’s, as scikit automatically sets
at 1.Finally, you can choose how to normalize the document lengths. By default, this takes the L-2 norm, or \(\sqrt{\sum{w_{i,k}^{2}}}\), where \(w_{i,k}\) is the weight you get from multiplying the term frequency by the inverse document frequency. But you can also set the norm to
'l1'
to get the L1-norm, or \(\sum{\vert w_{i,k} \vert}\). Or you can set it toNone
to avoid doing any document-length normalization at all.
Examples
To get a sense of the different options, let’s start by creating a pure count matrix with this method. To do that, we’ll set
norm=None
so we’re not normalizing by the length of the document,use_idf=False
so we’re not doing anything with the document frequency, andsublinear_tf=False
so we’re not taking the logged counts:>>> corpus = Corpus(["a cat", "a"]) >>> tfidf_count_matrix = corpus.tfidf_matrix(norm=None, use_idf=False, sublinear_tf=False) >>> assert np.array_equal(tfidf_count_matrix, corpus.count_matrix())
In this particular case, setting
sublinear_tf
toTrue
will produce the same result since all of the counts are 1 or 0 and \(\log{1} + 1 = 1\):>>> assert np.array_equal(corpus.tfidf_matrix(norm=None, use_idf=False), tfidf_count_matrix)
Now, we can incorporate the inverse document frequency. Because the word ‘a’ appears in both documents, its inverse document frequency is 1; the inverse document frequency of ‘cat’ is 2, since ‘cat’ appears in half of the documents. We’re additionally taking the base-2 log of the inverse document frequency and adding 1 to the final result. So we get:
>>> idf_add_1 = corpus.tfidf_matrix(norm=None, sublinear_tf=False, smooth_idf=False) >>> assert idf_add_1.tolist() == [[1., 1.], [2.,0.]]
Or we can add nothing to the logged values:
>>> idf = corpus.tfidf_matrix(norm=None, sublinear_tf=False, smooth_idf=False, add_k=0) >>> assert idf.tolist() == [[0.0, 0.0], [1.0, 0.0]]
The L-1 norm normalizes the results by the sum of the absolute values of their weights. In the case of the count matrix, this is equivalent to creating the frequency matrix:
>>> tfidf_freq_mat = corpus.tfidf_matrix(norm="l1", use_idf=False, sublinear_tf=False) >>> assert np.array_equal(tfidf_freq_mat, corpus.frequency_matrix())
- Parameters
norm (
Optional
[str
]) – Set to ‘l2’ for the L2 norm (square root of the sums of the square weights), ‘l1’ for the l1 norm (the summed absolute value, or None for no normalization).use_idf (
bool
) – If you set this to False, the weights will only include the term frequency (adjusted however you like)smooth_idf (
bool
) – Adds a constant to the numerator and the denominator.sublinear_tf (
bool
) – Computes the term frequency in log space.add_k (
int
) – This adds k to every value in the IDF. scikit adds 1 to all documents, but this allows for more variable computing (e.g. adding 0 if you want to remove words appearing in every document)
- Return type
array
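One hedged follow-up: the resulting matrix plugs straight into get_top_words() to surface the most distinctive word in each document (corpus invented):
>>> corpus = Corpus(["a cat", "a"])
>>> words, scores = corpus.get_top_words(corpus.tfidf_matrix(), top_n=1)
>>> words.shape
(1, 2)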
-
property
vocab
¶ Returns all of the unique words in the index.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab == {"a", "cat", "and", "dog"} True
- Return type
Set
[str
]
-
property
vocab_list
¶ Returns a sorted list of the words appearing in the index.
This is primarily intended for use in matrix or vector functions, where the order of the words matters.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab_list ['a', 'and', 'cat', 'dog']
- Return type
List
[str
]
-
property
vocab_size
¶ Returns the total number of unique words in the corpus.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab_size 4
- Return type
int
-
word_count
(word)[source]¶ Returns the total number of times the word appeared.
Defaults to 0 if the word never appeared.
Example
>>> corpus = Corpus(["this is a document", "a bird and a plane"]) >>> corpus.word_count("document") 1 >>> corpus.word_count("a") 3 >>> corpus.word_count("malarkey") 0
- Parameters
word (
str
) – The string word (or phrase).- Return type
int
-
word_count_vector
()[source]¶ Returns the total number of times each word appeared in the corpus.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.word_count_vector() array([1., 3., 1., 1.])
- Return type
array
-
word_counter
(word)[source]¶ Maps the documents containing a word to the number of times the word appeared.
Examples
>>> corpus = Corpus(["a bird", "a bird and a plane", "two birds"]) >>> corpus.word_counter("a") == {0: 1, 1: 2} True
- Parameters
word (
str
) – The word you’re looking up- Return type
Dict
[int
,int
]- Returns
- A dictionary mapping the document index of the word to the number of times
it appeared in that document.
-
word_freq_vector
()[source]¶ Returns the frequency in which each word appears over the corpus.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.word_freq_vector() array([0.16666667, 0.5 , 0.16666667, 0.16666667])
- Return type
array
-
word_frequency
(word)[source]¶ Returns the frequency in which the word appeared in the corpus.
Example
>>> corpus = Corpus(["this is fun", "or is it"]) >>> np.isclose(corpus.word_frequency("fun"), 1. / 6.) True >>> np.isclose(corpus.word_frequency("is"), 2. / 6.) True
- Parameters
word (
str
) – The string word or phrase.- Return type
float
text_data.multi_corpus module¶
Tools and displays for handling multiple document sets.
These are primarily designed to provide features for merging sets of documents so you can easily compute statistics on them.
-
text_data.multi_corpus.
concatenate
(*indexes, ignore_index=True)[source]¶ Concatenates an arbitrary number of
text_data.index.WordIndex
objects.- Parameters
ignore_index (bool) – If set to True, which is the default, the resulting index has a reset index beginning at 0.
- Raises
ValueError – If ignore_index is set to False and there are overlapping document indexes.
Example
>>> corpus_1 = WordIndex([["example"], ["document"]]) >>> corpus_2 = WordIndex([["second"], ["document"]]) >>> corpus_3 = WordIndex([["third"], ["document"]]) >>> concatenate().most_common() [] >>> concatenate(corpus_1).most_common() [('document', 1), ('example', 1)] >>> concatenate(corpus_1, corpus_2).most_common() [('document', 2), ('example', 1), ('second', 1)] >>> concatenate(corpus_1, corpus_2, corpus_3).most_common() [('document', 3), ('example', 1), ('second', 1), ('third', 1)]
- Return type
WordIndex
-
text_data.multi_corpus.
flat_concat
(*indexes)[source]¶ This flattens a sequence of
text_data.index.WordIndex
objects and concatenates them.This does not preserve any information about
text_data.index.Corpus
objects.Example
>>> corpus_1 = WordIndex([["example"], ["document"]]) >>> corpus_2 = WordIndex([["another"], ["set"], ["of"], ["documents"]]) >>> len(corpus_1) 2 >>> len(corpus_2) 4 >>> len(concatenate(corpus_1, corpus_2)) 6 >>> len(flat_concat(corpus_1, corpus_2)) 2
- Parameters
indexes (
WordIndex
) – A sequence oftext_data.index.Corpus
ortext_data.index.WordIndex
objects.- Return type
WordIndex
text_data.query module¶
This builds and runs search queries for text_data.index.Corpus
.
For the most part, you won’t be using this directly. Instead, you’ll likely
be using text_data.index.Corpus
. However, viewing the __repr__
for the query you’re running can be helpful for debugging or validating
queries.
-
class
text_data.query.
Query
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Bases:
object
Represents a query. This is used internally by
text_data.index.Corpus
to handle searching.The basic formula for writing queries should be familiar; all of the queries are simple boolean phrases. But here are more complete specifications:
In order to search for places where two words appeared, you simply need to type the two words:
Query("i am")
Searches using this query will look for documents where the words “i” and “am” both appeared. To have them look for places where either word appeared, use an “OR” query:
Query("i OR am")
Alternatively, you can look for documents where one word occurred but the other didn’t using a NOT query:
Query("i NOT am")
To search for places where the phrase “i am” appeared, use quotes:
Query("'i am'")
You can use AND queries to limit the results of previous sets of queries. For instance:
Query("i OR am AND you")
will find places where “you” and either “I” or “am” appeared.
In order to search for the literal words ‘AND’, ‘OR’, or ‘NOT’, you must encapsulate them in quotes:
Query("'AND'")
Finally, you may customize the way your queries are parsed by passing a tokenizer. By default,
Query
identifies strings of text that it needs to split and usesstr.split
to split the strings. But you can change how to split the text, which can be helpful/necessary if the words you’re searching for have spaces in them. For instance, this will split the words you’re querying by spaces, unless the words are ‘united states’:>>> import re >>> us_phrase = re.compile(r"(united states|\S+)") >>> Query("he is from the united states", query_tokenizer=us_phrase.findall) <Query ([[QueryItem(words=['he', 'is', 'from', 'the', 'united states'], exact=False, modifier='OR')]])>
- Parameters
query_string (
str
) – The human-readable queryquery_tokenizer (
Callable
[[str
],List
[str
]]) – A function to tokenize phrases in the query (Defaults to string.split). Note: This specifically tokenizes individual phrases in the query. As a result, the function does not need to handle quotations.
-
class
text_data.query.
QueryItem
(words, exact, modifier)¶ Bases:
tuple
This represents a set of words you want to search for.
Each query item has attached to it a set of words, an indicator of whether the query terms form an exact phrase (i.e. whether the order matters), and the kind of boolean query (AND, OR, or NOT) being performed.
- Parameters
words (List[str]) – A list of words representing all of the words that will be searched for.
exact (bool) – Whether the search terms are part of an exact phrase match
modifier (str) – The boolean query (AND, OR, or NOT)
-
exact
¶ Alias for field number 1
-
modifier
¶ Alias for field number 2
-
words
¶ Alias for field number 0
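Since QueryItem is a namedtuple, you can build one directly and read its fields by name; this is just a sketch of the documented field layout (words, exact, modifier):
>>> from text_data.query import QueryItem
>>> item = QueryItem(words=["united", "states"], exact=True, modifier="OR")
>>> item.words
['united', 'states']
>>> item.exact
True
>>> item.modifier
'OR'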
text_data.tokenize module¶
This is a module for tokenizing data.
The primary motivation behind this module is that effectively
presenting search results revolves around knowing the positions
of the words prior to tokenization. In order to handle these raw
positions, the index that text_data.index.Corpus
uses stores the
original character-level positions of words.
This module offers a default tokenizer that you can use
for text_data.index.Corpus
. However, you’ll likely need to customize
them for most applications. That said, doing so should not be difficult.
One of the functions in this module, corpus_tokenizer()
,
is designed specifically to create tokenizers that can be used
directly by text_data.index.Corpus
. All you have to do
is create a regular expression that splits words from nonwords
and then create a series of postprocessing functions to clean the
text (including, optionally, removing tokens). If possible,
I would recommend taking this approach, since it allows you
to mostly ignore the picky preferences of the underlying API.
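For instance, a tokenizer built with corpus_tokenizer() (documented below) can be passed straight to a Corpus; the output shown here is what you would expect from a lowercasing, word-splitting tokenizer, and is an illustration rather than a verbatim doctest from the library:
>>> from text_data import Corpus
>>> from text_data.tokenize import corpus_tokenizer
>>> lowercase_words = corpus_tokenizer(r"\w+", [str.lower])
>>> corpus = Corpus(["A cat and a dog"], tokenizer=lowercase_words)
>>> corpus.most_common(1)
[('a', 2)]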
-
text_data.tokenize.
corpus_tokenizer
(regex_patten, postprocess_funcs, inverse_match=False)[source]¶ This is designed to make it easy to build a custom tokenizer for
text_data.index.Corpus
.It acts as a combination of
tokenize_regex_positions()
andpostprocess_positions()
, making it simple to create tokenizers fortext_data.index.Corpus
.In other words, if you pass the tokenizer a regular expression pattern, set
inverse_match
as you would fortokenize_regex_positions()
, and add a list of postprocessing functions as you would forpostprocess_positions()
, this tokenizer will return a function that you can use directly as an argument intext_data.index.Corpus
.Examples
Let’s say that we want to build a tokenizing function that splits on vowels or whitespace. We also want to lowercase all of the remaining words:
>>> split_vowels = corpus_tokenizer(r"[aeiou\s]+", [str.lower], inverse_match=True) >>> split_vowels("Them and you") (['th', 'm', 'nd', 'y'], [(0, 2), (3, 4), (6, 8), (9, 10)])
You can additionally use this function to remove stopwords, although I generally would recommend against it. The postprocessing functions optionally return a string or a
NoneType
, andNone
values simply don’t get tokenized:>>> skip_stopwords = corpus_tokenizer(r"\w+", [lambda x: x if x != "the" else None]) >>> skip_stopwords("I ran to the store") (['I', 'ran', 'to', 'store'], [(0, 1), (2, 5), (6, 8), (13, 18)])
- Return type
Callable
[[str
],Tuple
[List
[str
],List
[Tuple
[int
,int
]]]]
-
text_data.tokenize.
default_tokenizer
(document: str) → Tuple[List[str], List[Tuple[int, int]]]¶ This is the default tokenizer for
text_data.index.Corpus
.It simply splits on words (
"\w+"
) and lowercases words.
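Based on that documented behavior, you would expect output along these lines (the positions follow the same start and end character convention as tokenize_regex_positions()):
>>> from text_data.tokenize import default_tokenizer
>>> default_tokenizer("Hello, world")
(['hello', 'world'], [(0, 5), (7, 12)])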
-
text_data.tokenize.
postprocess_positions
(postprocess_funcs, tokenize_func, document)[source]¶ Runs postprocessing functions to produce final tokenized documents.
This function allows you to take
tokenize_regex_positions()
(or something that has a similar function signature) and run postprocessing on it. It requires that you also give it a document, which it will tokenize using the tokenizing function you give it.These postprocessing functions should take a string (i.e. one of the individual tokens), but they can return either a string or None. If they return None, the token will not appear in the final tokenized result.
- Parameters
postprocess_funcs (
List
[Callable
[[str
],Optional
[str
]]]) – A list of postprocessing functions (e.g.str.lower
)tokenize_func (
Callable
[[str
],Tuple
[List
[str
],List
[Tuple
[int
,int
]]]]) – A function that takes raw text and converts it into a list of strings and a list of character-level positions (e.g. the output oftext_data.tokenize.tokenize_regex_positions()
)document (
str
) – The (single) text you want to tokenize.tokenized_docs – The tokenized results (e.g. the output of
text_data.tokenize.tokenize_regex_positions()
)
- Return type
Tuple
[List
[str
],List
[Tuple
[int
,int
]]]
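As a rough sketch of how the pieces fit together (mirroring the way the default Corpus tokenizer is assembled), you might pair it with tokenize_regex_positions() like this; the exact output is inferred from the documented behavior:
>>> import functools
>>> from text_data.tokenize import postprocess_positions, tokenize_regex_positions
>>> split_words = functools.partial(tokenize_regex_positions, r"\w+")
>>> postprocess_positions([str.lower], split_words, "Hello World")
(['hello', 'world'], [(0, 5), (6, 11)])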
-
text_data.tokenize.
tokenize_regex_positions
(pattern, document_text, inverse_match=False)[source]¶ Finds all of the tokens matching a regular expression.
Returns the positions of those tokens along with the tokens themselves.
- Parameters
pattern (
str
) – A raw regular expression stringdocument_text (
str
) – The raw document textinverse_match (
bool
) – If true, tokenizes the text between matches.
- Return type
Tuple
[List
[str
],List
[Tuple
[int
,int
]]]- Returns
A tuple consisting of the list of words and a list of tuples, where each tuple represents the start and end character positions of the phrase.
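A small illustration of the expected output (note that no postprocessing happens here, so the original casing is preserved):
>>> from text_data.tokenize import tokenize_regex_positions
>>> tokenize_regex_positions(r"\w+", "The cat naps.")
(['The', 'cat', 'naps'], [(0, 3), (4, 7), (8, 12)])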
Module contents¶
Top-level package for Text Data.
-
class
text_data.
Corpus
(documents, tokenizer=functools.partial(<function postprocess_positions>, [<method 'lower' of 'str' objects>, ]functools.partial(<function tokenize_regex_positions>, '\\\\w+', inverse_match=False)), sep=None, prefix=None, suffix=None)[source]¶ Bases:
text_data.index.WordIndex
This is probably going to be your main entrypoint into
text_data
.The corpus holds the raw text, the index, and the tokenized text of whatever you’re trying to analyze. Its primary role is to extend the functionality of
WordIndex
to support searching. This means that you can use theCorpus
to search for arbitrarily long phrases using boolean search methods (AND, OR, NOT).In addition, it allows you to add indexes so you can calculate statistics on phrases. By using
add_ngram_index()
, you can figure out the frequency or TF-IDF values of multi-word phrases while still being able to search through your normal index.Initializing Data
To instantiate the corpus, you need to include a list of documents where each document is a string of text and a tokenizer. There is a default tokenizer, which simply lowercases words and splits documents on
r"\w+"
. For most tasks, this will be insufficient. Buttext_data.tokenize
offers convenient ways that should make building the vast majority of tokenizers easy.The
Corpus
can be instantiated using__init__
or by usingchunks()
, which yields a generator of Corpus objects, each holding a mini-index. This lets you work with document sets that are too large to comfortably fit in memory, one chunk at a time.You can also initialize a
Corpus
object by using theslice()
,copy()
,split_off()
, orconcatenate()
methods. These methods work identically to their equivalent methods intext_data.index.WordIndex
while updating extra data that the corpus has, updating n-gram indexes, and automatically re-indexing the corpus.Updating Data
There are two methods for updating or adding data to the
Corpus
.update()
allows you to add new documents to the corpus.add_ngram_index()
allows you to add multi-word indexes.Searching
There are a few methods devoted to searching.
search_documents()
allows you to find all of the individual documents matching a query.search_occurrences()
shows all of the individual occurrences that matched your query.ranked_search()
finds all of the individual occurrences and sorts them according to a variant of their TF-IDF score.Statistics
Three methods allow you to get statistics about a search.
search_document_count()
allows you to find the total number of documents matching your query.search_document_freq()
shows the proportion of documents matching your query. Andsearch_occurrence_count()
finds the total number of matches you have for your query.Display
There are a number of functions designed to help you visually see the results of your query.
display_document()
anddisplay_documents()
render your documents in HTML.display_document_count()
,display_document_frequency()
, anddisplay_occurrence_count()
all render bar charts showing the number of query results you got. Anddisplay_search_results()
shows the result of your search.-
documents
¶ A list of all the raw, non-tokenized documents in the corpus.
-
tokenizer
¶ A function that converts a list of strings (one of the documents from documents into a list of words and a list of the character-level positions where the words are located in the raw text). See
text_data.tokenize
for details.
-
tokenized_documents
¶ A list of the tokenized documents (each a list of words)
-
ngram_indexes
¶ A list of
WordIndex
objects for multi-word (n-gram) indexes. Seeadd_ngram_index()
for details.
-
ngram_sep
¶ A separator in between words. See
add_ngram_index()
for details.
-
ngram_prefix
¶ A prefix to go before any n-gram phrases. See
add_ngram_index()
for details.
-
ngram_suffix
¶ A suffix to go after any n-gram phrases. See
add_ngram_index()
for details.
- Parameters
documents (
List
[str
]) – A list of the raw, un-tokenized texts.tokenizer (
Callable
[[str
],Tuple
[List
[str
],List
[Tuple
[int
,int
]]]]) – A function to tokenize the documents. Seetext_data.tokenize
for details.sep (
Optional
[str
]) – The separator you want to use for computing n-grams. Seeadd_ngram_index()
for details.prefix (
Optional
[str
]) – The prefix you want to use for n-grams. Seeadd_ngram_index()
for details.suffix (
Optional
[str
]) – The suffix you want to use for n-grams. Seeadd_ngram_index()
for details.
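Putting the pieces described above together, a minimal end-to-end sketch (using only methods documented on this page; the outputs are what the documented behavior implies rather than copied doctests) might look like:
>>> from text_data import Corpus
>>> corpus = Corpus(["the cat sat", "the dog sat", "a bird flew"])
>>> assert corpus.search_documents("sat") == {0, 1}
>>> corpus.search_document_count("the OR a")
3
>>> corpus.most_common(2)
[('sat', 2), ('the', 2)]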
-
add_documents
(tokenized_documents, indexed_locations=None)[source]¶ This overrides the
add_documents()
method.Because
Corpus()
objects can have n-gram indices, simply runningadd_documents
would cause n-gram indices to go out of sync with the overall corpus. In order to prevent that, this function raises an error if you try to run it.- Raises
NotImplementedError – Warns you to use
text_data.index.Corpus.update()
instead.
-
add_ngram_index
(n=1, default=True, sep=None, prefix=None, suffix=None)[source]¶ Adds an n-gram index to the corpus.
This creates a
WordIndex
object that you can access by typingself.ngram_indexes[n]
.There are times when you might want to compute TF-IDF scores, word frequency scores or similar scores over a multi-word index. For instance, you might want to know how frequently someone said ‘United States’ in a speech, without caring how often they used the word ‘united’ or ‘states’.
This function helps you do that. It automatically splits up your documents into an overlapping set of
n
-length phrases.Internally, this takes each of your tokenized documents, merges them into lists of
n
-length phrases, and joins each of those lists by a space. However, you can customize this behavior. If you setprefix
, each of the n-grams will be prefixed by that string; if you setsuffix
, each of the n-grams will end with that string. And if you setsep
, each of the words in the n-gram will be separated by the separator.Example
Say you have a simple four word corpus. If you use the default settings, here’s what your n-grams will look like:
>>> corpus = Corpus(["text data is fun"]) >>> corpus.add_ngram_index(n=2) >>> corpus.ngram_indexes[2].vocab_list ['data is', 'is fun', 'text data']
By altering
sep
,prefix
, orsuffix
, you can alter that behavior. But, be careful to setdefault
toFalse
if you want to change the behavior from something you set up in__init__
. If you don’t, this will use whatever settings you instantiated the class with.>>> corpus.add_ngram_index(n=2, sep="</w><w>", prefix="<w>", suffix="</w>", default=False) >>> corpus.ngram_indexes[2].vocab_list ['<w>data</w><w>is</w>', '<w>is</w><w>fun</w>', '<w>text</w><w>data</w>']
- Parameters
n (
int
) – The number of n-grams (defaults to unigrams)default (
bool
) – If true, will keep the values stored in init (including defaults)sep (
Optional
[str
]) – The separator in between words (if storing n-grams)prefix (
Optional
[str
]) – The prefix before the first word of each n-gramsuffix (
Optional
[str
]) – The suffix after the last word of each n-gram
-
classmethod
chunks
(documents, tokenizer=functools.partial(<function postprocess_positions>, [<method 'lower' of 'str' objects>, ]functools.partial(<function tokenize_regex_positions>, '\\\\w+', inverse_match=False)), sep=None, prefix=None, suffix=None, chunksize=1000000)[source]¶ Iterates through documents, yielding a
Corpus
withchunksize
documents.This is designed to allow you to technically use
Corpus
on large document sets. However, you should note that searching for documents will only work within the context of the current chunk.The same is true for any frequency metrics. As such, you should probably limit metrics to raw counts or aggregations you’ve derived from raw counts.
Example
>>> for docs in Corpus.chunks(["chunk one", "chunk two"], chunksize=1): ... print(len(docs)) 1 1
- Parameters
documents (
Iterator
[str
]) – A list of raw text items (not tokenized)tokenizer (
Callable
[[str
],Tuple
[List
[str
],List
[Tuple
[int
,int
]]]]) – A function to tokenize the documentssep (
Optional
[str
]) – The separator you want to use for computing n-grams.prefix (
Optional
[str
]) – The prefix for n-grams.suffix (
Optional
[str
]) – The suffix for n-grams.chunksize (
int
) – The number of documents in each chunk.
- Return type
Generator
[~CorpusClass,None
,None
]
-
concatenate
(other)[source]¶ This combines two
Corpus
objects into one, much liketext_data.index.WordIndex.concatenate()
.However, the new
Corpus
has data from this corpus, including n-gram data. Because of this, the twoCorpus
objects must have the same keys for their n-gram dictionaries.Example
>>> corpus_1 = Corpus(["i am an example"]) >>> corpus_2 = Corpus(["i am too"]) >>> corpus_1.add_ngram_index(n=2) >>> corpus_2.add_ngram_index(n=2) >>> combined_corpus = corpus_1.concatenate(corpus_2) >>> combined_corpus.most_common() [('am', 2), ('i', 2), ('an', 1), ('example', 1), ('too', 1)] >>> combined_corpus.ngram_indexes[2].most_common() [('i am', 2), ('am an', 1), ('am too', 1), ('an example', 1)]
-
copy
()[source]¶ This creates a shallow copy of a
Corpus
object.It extends the contents of
Corpus
to also store data about the objects themselves.- Return type
Corpus
-
display_document
(doc_idx)[source]¶ Print an entire document, given its index.
- Parameters
doc_idx (
int
) – The index of the document- Return type
HTML
-
display_document_count
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Returns a bar chart (in altair) showing the queries with the largest number of documents.
Note
This method requires that you have
altair
installed. To install, typepip install text_data[display]
orpoetry add text_data -E display
.- Parameters
queries (
List
[str
]) – A list of queries (in the same form you use to search for things)query_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the query
-
display_document_frequency
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Displays a bar chart showing the percentages of documents with a given query.
Note
This method requires that you have
altair
installed. To install, typepip install text_data[display]
orpoetry add text_data -E display
.- Parameters
queries (
List
[str
]) – A list of queriesquery_tokenizer (
Callable
[[str
],List
[str
]]) – A tokenizer for each query
-
display_documents
(documents)[source]¶ Display a number of documents, at the specified indexes.
- Parameters
documents (
List
[int
]) – A list of document indexes.- Return type
HTML
-
display_occurrence_count
(queries, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Display a bar chart showing the number of times a query matches.
Note
This method requires that you have
altair
installed. To install, typepip install text_data[display]
orpoetry add text_data -E display
.- Parameters
queries (
List
[str
]) – A list of queriesquery_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the query
-
display_search_results
(search_query, query_tokenizer=<method 'split' of 'str' objects>, max_results=None, window_size=None)[source]¶ Shows the results of a ranked query.
This function runs a query and then renders the result in human-readable HTML. For each result, you will get a document ID and the count of the result.
In addition, all of the matching occurrences of phrases or words you searched for will be highlighted in bold. You can optionally decide how many results you want to return and how long you want each result to be (up to the length of the whole document).
- Parameters
search_query (
str
) – The query you’re searching forquery_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the querymax_results (
Optional
[int
]) – The maximum number of results. If None, returns all results.window_size (
Optional
[int
]) – The number of characters you want to return around the matching phrase. If None, returns the entire document.
- Return type
HTML
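A minimal call sketch (the method returns an HTML object, so there is no textual output to show here; in a Jupyter notebook, the rendered result would show the matching windows with the query terms in bold):
>>> from text_data import Corpus
>>> corpus = Corpus(["the cat sat on the mat", "a cat napped"])
>>> html = corpus.display_search_results("cat", max_results=1, window_size=15)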
-
ranked_search
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ This produces a list of search responses in ranked order.
More specifically, the documents are ranked in order of the sum of the TF-IDF scores for each word in the query (with the exception of words that are negated using a NOT operator).
To compute the TF-IDF scores, I simply have computed the dot products between the raw query counts and the TF-IDF scores of all the unique words in the query. This is roughly equivalent to the
ltn.lnn
normalization scheme described in Manning. (The catch is that I have normalized the term-frequencies in the document to the length of the document.)Each item in the resulting list is a list referring to a single item. The items inside each of those lists are of the same format you get from
search_occurrences()
. The first item in each list is either an item having the largest number of words in it or is the item that’s the nearest to another match within the document.- Parameters
query_string (str) – The query string
query_tokenizer (
Callable
[[str
],List
[str
]]) – Function for tokenizing the results.
- Return type
List
[List
[PositionResult
]]- Returns
A list of tuples, each in the same format as
search_occurrences()
.
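As a hedged sketch of how you might inspect the ranking: under the normalization described above, the document where the query term makes up a larger share of the text should come first, so you would expect something like:
>>> from text_data import Corpus
>>> corpus = Corpus(["the cat is here", "a cat and another cat", "no cats here"])
>>> [matches[0].doc_id for matches in corpus.ranked_search("cat")]
[1, 0]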
-
search_document_count
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the total number of documents matching a query.
By entering a search, you can get the total number of documents that match the query.
Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_document_count("cow") 2 >>> corpus.search_document_count("grass") 1 >>> corpus.search_document_count("the") 2
- Parameters
query_string (
str
) – The query you’re searching forquery_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the query
- Return type
int
-
search_document_freq
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the percentage of documents that match a query.
Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_document_freq("cow") 1.0 >>> corpus.search_document_freq("grass") 0.5 >>> corpus.search_document_freq("the grass") 0.5 >>> corpus.search_document_freq("the OR nonsense") 1.0
- Parameters
query_string (
str
) – The query you’re searching forquery_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the query
- Return type
float
-
search_documents
(query, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Search documents from a query.
In order to figure out the intricacies of writing queries, you should view
text_data.query.Query
. In general, standard boolean (AND, OR, NOT) searches work perfectly reasonably. You should generally not need to setquery_tokenizer
to anything other than the default (string split).This produces a set of unique documents, where each document is the index of the document. To view the documents by their ranked importance (ranked largely using TF-IDF), use
ranked_search()
.Example
>>> corpus = Corpus(["this is an example", "here is another"]) >>> assert corpus.search_documents("is") == {0, 1} >>> assert corpus.search_documents("example") == {0}
- Parameters
query (
str
) – A string boolean query (as defined intext_data.query.Query
)query_tokenizer (
Callable
[[str
],List
[str
]]) – A function to tokenize the words in your query. This allows you to optionally search for words in your index that include spaces (since it defaults to string.split).
- Return type
Set
[int
]
-
search_occurrence_count
(query_string, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Finds the total number of occurrences you have for the given query.
This just gets the number of items in
search_occurrences()
. As a result, searching for occurrences where two separate words occur will find the total number of places where either word occurs within the set of documents where both words appear.Example
>>> corpus = Corpus(["the cow was hungry", "the cow likes the grass"]) >>> corpus.search_occurrence_count("the") 3 >>> corpus.search_occurrence_count("the cow") 5 >>> corpus.search_occurrence_count("'the cow'") 2
- Parameters
query_string (
str
) – The query you’re searching forquery_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizer for the query
- Return type
int
-
search_occurrences
(query, query_tokenizer=<method 'split' of 'str' objects>)[source]¶ Search for matching positions within a search.
This allows you to figure out all of the occurrences matching your query. In addition, this is used internally to display search results.
Each matching position comes in the form of a tuple where the first field
doc_id
refers to the position of the document, the second fieldfirst_idx
refers to the starting index of the occurrence (among the tokenized documents),last_idx
refers to the last index of the occurrence,raw_start
refers to the starting index of the occurrence from within the raw, non-tokenized documents.raw_end
refers to the index after the last character of the matching result within the non-tokenized documents. There is not really a reason behind this decision.Example
>>> corpus = Corpus(["this is fun"]) >>> result = list(corpus.search_occurrences("'this is'"))[0] >>> result PositionResult(doc_id=0, first_idx=0, last_idx=1, raw_start=0, raw_end=7) >>> corpus.documents[result.doc_id][result.raw_start:result.raw_end] 'this is' >>> corpus.tokenized_documents[result.doc_id][result.first_idx:result.last_idx+1] ['this', 'is']
- Parameters
query (
str
) – The string query. Seetext_data.query.Query
for details.query_tokenizer (
Callable
[[str
],List
[str
]]) – The tokenizing function for the query. Seetext_data.query.Query
orsearch_documents()
for details.
- Return type
Set
[PositionResult
]
-
slice
(indexes)[source]¶ This creates a
Corpus
object only including the documents listed.This overrides the method in
text_data.index.WordIndex()
, which does the same thing (but without making changes to the underlying document set). This also creates slices of any of the n-gram indexes you have created.Note
This also changes the indexes for the new corpus so they all go from 0 to
len(indexes)
.- Parameters
indexes (
Set
[int
]) – A set of document indexes you want to have in the new index.
Example
>>> corpus = Corpus(["example document", "another example", "yet another"]) >>> corpus.add_ngram_index(n=2) >>> sliced_corpus = corpus.slice({1}) >>> len(sliced_corpus) 1 >>> sliced_corpus.most_common() [('another', 1), ('example', 1)] >>> sliced_corpus.ngram_indexes[2].most_common() [('another example', 1)]
- Return type
Corpus
-
split_off
(indexes)[source]¶ This operates like
split_off()
.But it additionally maintains the state of the
Corpus
data, similar to howslice()
works.Example
>>> corpus = Corpus(["i am an example", "so am i"]) >>> sliced_data = corpus.split_off({0}) >>> corpus.documents ['so am i'] >>> sliced_data.documents ['i am an example'] >>> corpus.most_common() [('am', 1), ('i', 1), ('so', 1)]
- Return type
Corpus
-
to_index
()[source]¶ Converts a
Corpus
object into aWordIndex
object.Corpus
objects are convenient because they allow you to search across documents, in addition to computing statistics about them. But sometimes, you don’t need that, and the added convenience comes with extra memory requirements.- Return type
-
-
class
text_data.
WordIndex
(tokenized_documents, indexed_locations=None)[source]¶ Bases:
object
An inverted, positional index containing the words in a corpus.
This is designed to allow people to be able to quickly compute statistics about the language used across a corpus. The class offers a couple of broad strategies for understanding the ways in which words are used across documents.
Manipulating Indexes
These functions are designed to allow you to create new indexes based on ones you already have. They operate kind of like slices and filter functions in
pandas
, where your goal is to be able to create new data structures that you can analyze independently from ones you’ve already created. Most of them can also be used with method chaining. However, some of these functions remove positional information from the index, so be careful.copy()
creates an identical copy of aWordIndex
object.slice()
,slice_many()
, andsplit_off()
all take sets of document indexes and create new indexes with only those documents.add_documents()
allows you to add new documents into an existingWordIndex
object.concatenate()
similarly combinesWordIndex
objects into a singleWordIndex
.flatten()
takes aWordIndex
and returns an identical index that only has one document.skip_words()
takes a set of words and returns aWordIndex
that does not have those words.reset_index()
changes the document indexes.
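A short sketch of the method chaining described above, using only the methods just listed (the output follows from the documented behavior of each call):
>>> from text_data import WordIndex
>>> index = WordIndex([["a", "cat"], ["a", "dog"], ["a", "bird"]])
>>> index.slice({0, 1}).skip_words({"a"}).most_common()
[('cat', 1), ('dog', 1)]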
Corpus Information
A number of functions are designed to allow you to look up information about the corpus. For instance, you can collect a sorted list or a set of all the unique words in the corpus. Or you can get a list of the most commonly appearing elements:
vocab
andvocab_list
both return the unique words or phrases appearing in the index.vocab_size
gets the number of unique words in the index.num_words
gets the total number of words in the index.doc_lengths
gets a dictionary mapping documents to the number of tokens, or words, they contain.
Word Statistics
These allow you to gather statistics about single words or about word, document pairs. For instance, you can see how many words there are in the corpus, how many unique words there are, or how often a particular word appears in a document.
The statistics generally fit into four categories. The first category computes statistics about how often a specific word appears in the corpus as a whole. The second category computes statistics about how often a specific word appears in a specific document. The third and fourth categories echo those first two categories but perform the statistics efficiently across the corpus as a whole, creating 1-dimensional numpy arrays in the case of the word-corpus statistics and 2-dimensional numpy arrays in the case of the word-document statistics. Functions in these latter two categories all end in
_vector
and_matrix
respectively.Here’s how those statistics map to one another:
(Table omitted from this rendering: it pairs each word-corpus statistic with its word-document, vector, and matrix counterparts, for instance word_count, term_count, word_count_vector, and count_matrix; it also lists __contains__ alongside the other per-word lookups documented below.)
In the case of the vector and matrix calculations, the arrays represent the unique words of the vocabulary, presented in sorted order. As a result, you can safely run element-wise calculations over the matrices.
In addition to the term vector and term-document matrix functions, there is
get_top_words()
, which is designed to allow you to find the highest or lowest scores and their associated words along any term vector or term-document matrix you please.Note
For the most part, you will not want to instantiate
WordIndex
directly. Instead, you will likely useCorpus
, which subclassesWordIndex
.That’s because
Corpus
offers utilities for searching through documents. In addition, with the help of tools fromtext_data.tokenize
, instantiatingCorpus
objects is a bit simpler than instantiatingWordIndex
objects directly.I particularly recommend that you do not instantiate the
indexed_locations
directly (i.e. outside ofCorpus
). The only way you can do anything withindexed_locations
from outside ofCorpus
is by using an internal attribute and hacking through poorly documented Rust code.- Parameters
tokenized_documents (
List
[List
[str
]]) – A list of documents where each document is a list of words.indexed_locations (
Optional
[List
[Tuple
[int
,int
]]]) – A list of documents where each document contains a list of the start and end positions of the words in tokenized_documents
.
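For instance, instantiating the index directly from pre-tokenized documents (a sketch; the values follow from the properties documented below):
>>> from text_data import WordIndex
>>> index = WordIndex([["a", "cat"], ["a", "dog"]])
>>> index.vocab_size
3
>>> index.doc_lengths == {0: 2, 1: 2}
True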
-
add_documents
(tokenized_documents, indexed_locations=None)[source]¶ This function updates the index with new documents.
It operates similarly to
text_data.index.Corpus.update()
, taking new documents and mutating the existing one.Example
>>> tokenized_words = ["im just a simple document".split()] >>> index = WordIndex(tokenized_words) >>> len(index) 1 >>> index.num_words 5 >>> index.add_documents(["now im an entire corpus".split()]) >>> len(index) 2 >>> index.num_words 10
-
concatenate
(other, ignore_index=True)[source]¶ Creates a
WordIndex
object with the documents of both this object and the other.See
text_data.multi_corpus.concatenate()
for more details.- Parameters
ignore_index (
bool
) – If set toTrue
, which is the default, the document indexes will be re-indexed starting from 0.- Raises
ValueError – If
ignore_index
is set toFalse
and some of the indexes overlap.- Return type
WordIndex
-
count_matrix
()[source]¶ Returns a matrix showing the number of times each word appeared in each document.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.count_matrix().tolist() == [[0., 1.], [1., 2.], [0., 1.], [0., 1.]] True
- Return type
array
-
doc_contains
(word, document)[source]¶ States whether the given document contains the word.
Example
>>> corpus = Corpus(["words", "more words"]) >>> corpus.doc_contains("more", 0) False >>> corpus.doc_contains("more", 1) True
- Parameters
word (
str
) – The word you’re looking up.document (
int
) – The index of the document.
- Raises
ValueError – If the document you’re looking up doesn’t exist.
- Return type
bool
-
doc_count_vector
()[source]¶ Returns the total number of documents each word appears in.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.doc_count_vector() array([1., 2.])
- Return type
array
-
doc_freq_vector
()[source]¶ Returns the proportion of documents each word appears in.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.doc_freq_vector() array([0.5, 1. ])
- Return type
array
-
property
doc_lengths
¶ Returns a dictionary mapping the document indices to their lengths.
Example
>>> corpus = Corpus(["a cat and a dog", "a cat", ""]) >>> assert corpus.doc_lengths == {0: 5, 1: 2, 2: 0}
- Return type
Dict
[int
,int
]
-
docs_with_word
(word)[source]¶ Returns a list of all the documents containing a word.
Example
>>> corpus = Corpus(["example document", "another document"]) >>> assert corpus.docs_with_word("document") == {0, 1} >>> assert corpus.docs_with_word("another") == {1}
- Parameters
word (
str
) – The word you’re looking up.- Return type
Set
[int
]
-
document_count
(word)[source]¶ Returns the total number of documents a word appears in.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.document_count("example") 2 >>> corpus.document_count("another") 1
- Parameters
word (
str
) – The word you’re looking up.- Return type
int
-
document_frequency
(word)[source]¶ Returns the percentage of documents that contain a word.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.document_frequency("example") 1.0 >>> corpus.document_frequency("another") 0.5
- Parameters
word (
str
) – The word you’re looking up.- Return type
float
-
flatten
()[source]¶ Flattens a multi-document index into a single-document corpus.
This creates a new
WordIndex
object stripped of any positional information that has a single document in it. However, the list of words and their indexes remain.Example
>>> corpus = Corpus(["i am a document", "so am i"]) >>> len(corpus) 2 >>> flattened = corpus.flatten() >>> len(flattened) 1 >>> assert corpus.most_common() == flattened.most_common()
- Return type
WordIndex
-
frequency_matrix
()[source]¶ Returns a matrix showing the frequency of each word appearing in each document.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.frequency_matrix().tolist() == [[0.0, 0.2], [1.0, 0.4], [0.0, 0.2], [0.0, 0.2]] True
- Return type
array
-
get_top_words
(term_matrix, top_n=None, reverse=True)[source]¶ Get the top values along a term matrix.
Given a matrix where each row represents a word in your vocabulary, this returns a numpy matrix of those top values, along with an array of their respective words.
You can choose the number of results you want to get by setting
top_n
to some positive value, or you can leave it be and return all of the results in sorted order. Additionally, by settingreverse
to False (instead of its default ofTrue
), you can return the scores from smallest to largest.- Parameters
term_matrix (
array
) – a matrix of floats where each row represents a wordtop_n (
Optional
[int
]) – The number of values you want to return. If None, returns all values.reverse (
bool
) – If true (the default), returns the N values with the highest scores. If false, returns the N values with the lowest scores.
- Return type
Tuple
[array
,array
]- Returns
A tuple of 2-dimensional numpy arrays, where the first item is an array of the top-scoring words and the second item is an array of the top scores themselves. Both arrays are of the same size, that is
min(self.vocab_size, top_n)
by the number of columns in the term matrix.- Raises
ValueError – If
top_n
is less than 1, if there are not the same number of rows in the matrix as there are unique words in the index, or if the numpy array doesn’t have 1 or 2 dimensions.
Example
The first thing you need to do in order to use this function is create a 1- or 2-dimensional term matrix, where the number of rows corresponds to the number of unique words in the corpus. Any of the functions within
WordIndex
that ends in_matrix(**kwargs)
(for 2-dimensional arrays) or_vector(**kwargs)
(for 1-dimensional arrays) will do the trick here. I’ll show an example with both a word count vector and a word count matrix:>>> corpus = Corpus(["The cat is near the birds", "The birds are distressed"]) >>> corpus.get_top_words(corpus.word_count_vector(), top_n=2) (array(['the', 'birds'], dtype='<U10'), array([3., 2.])) >>> corpus.get_top_words(corpus.count_matrix(), top_n=1) (array([['the', 'the']], dtype='<U10'), array([[2., 1.]]))
Similarly, you can return the scores from lowest to highest by setting
reverse=False
. (This is not the default.):>>> corpus.get_top_words(-1. * corpus.word_count_vector(), top_n=2, reverse=False) (array(['the', 'birds'], dtype='<U10'), array([-3., -2.]))
-
idf
(word)[source]¶ Returns the inverse document frequency.
If the number of documents in your
WordIndex
index
is \(N\) and the number of documents containing the word (see document_count())
is \(df\), the inverse document frequency is \(\frac{N}{df}\).Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.idf("example") 1.0 >>> corpus.idf("another") 2.0
- Parameters
word (
str
) – The word you’re looking for.- Return type
float
-
idf_vector
()[source]¶ Returns the inverse document frequency vector.
Example
>>> corpus = Corpus(["example", "another example"]) >>> corpus.idf_vector() array([2., 1.])
- Return type
array
-
max_word_count
()[source]¶ Returns the most common word and the number of times it appeared in the corpus.
Returns
None
if there are no words in the corpus.Example
>>> corpus = Corpus([]) >>> corpus.max_word_count() is None True >>> corpus.update(["a bird a plane superman"]) >>> corpus.max_word_count() ('a', 2)
- Return type
Optional
[Tuple
[str
,int
]]
-
most_common
(num_words=None)[source]¶ Returns the most common items.
This is nearly identical to
collections.Counter.most_common
. However, unlike collections.Counter.most_common, the values that are returned appear in alphabetical order.Example
>>> corpus = Corpus(["i walked to the zoo", "i bought a zoo"]) >>> corpus.most_common() [('i', 2), ('zoo', 2), ('a', 1), ('bought', 1), ('the', 1), ('to', 1), ('walked', 1)] >>> corpus.most_common(2) [('i', 2), ('zoo', 2)]
- Parameters
num_words (
Optional
[int
]) – The number of words you return. If you enter None or you enter a number larger than the total number of words, it returns all of the words, in sorted order from most common to least common.- Return type
List
[Tuple
[str
,int
]]
-
property
num_words
¶ Returns the total number of words in the corpus (not just unique).
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.num_words 5
- Return type
int
-
odds_document
(word, document, sublinear=False)[source]¶ Returns the odds of finding a word in a document.
This is the equivalent of
odds_word()
. But insteasd of calculating items at the word-corpus level, the calculations are performed at the word-document level.Example
>>> corpus = Corpus(["this is a document", "document two"]) >>> corpus.odds_document("document", 1) 1.0 >>> corpus.odds_document("document", 1, sublinear=True) 0.0
- Parameters
word (
str
) – The word you’re looking updocument (
int
) – The index of the documentsublinear (
bool
) – IfTrue
, returns the log-odds of finding the word in the document.
- Raises
ValueError – If the document doesn’t exist.
- Return type
float
-
odds_matrix
(sublinear=False, add_k=None)[source]¶ Returns the odds of finding a word in a document for every possible word-document pair.
Because not all words are likely to appear in all of the documents, this implementation adds
1
to all of the numerators before taking the frequencies. So\(O(w) = \frac{c_{i} + 1}{N + \vert V \vert}\)
where \(\vert V \vert\) is the total number of unique words in each document, \(N\) is the total number of total words in each document, and \(c_i\) is the count of a word in a document.
Example
>>> corpus = Corpus(["example document", "another example"]) >>> corpus.odds_matrix() array([[0.33333333, 1. ], [1. , 0.33333333], [1. , 1. ]]) >>> corpus.odds_matrix(sublinear=True) array([[-1.5849625, 0. ], [ 0. , -1.5849625], [ 0. , 0. ]])
- Parameters
sublinear (
bool
) – IfTrue
, computes the log-odds.add_k (
Optional
[float
]) – This addsk
to each of the non-zero elements in the matrix. Since \(\log{1} = 0\), this prevents 50 percent probabilities from appearing to be the same as elements that don’t exist.
- Return type
array
-
odds_vector
(sublinear=False)[source]¶ Returns a vector of the odds of each word appearing at random.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.odds_vector() array([0.2, 1. , 0.2, 0.2]) >>> corpus.odds_vector(sublinear=True) array([-2.32192809, 0. , -2.32192809, -2.32192809])
- Parameters
sublinear (
bool
) – If true, returns the log odds.- Return type
array
-
odds_word
(word, sublinear=False)[source]¶ Returns the odds of seeing a word at random.
In statistics, the odds of something happening are the probability of it happening, versus the probability of it not happening, that is \(\frac{p}{1 - p}\). The “log odds” of something happening — the result of using
self.log_odds_word
— is similarly equivalent to \(log_{2}{\frac{p}{1 - p}}\).(The probability in this case is simply the word frequency.)
Example
>>> corpus = Corpus(["i like odds ratios"]) >>> np.isclose(corpus.odds_word("odds"), 1. / 3.) True >>> np.isclose(corpus.odds_word("odds", sublinear=True), np.log2(1./3.)) True
- Parameters
word (
str
) – The word you’re looking up.sublinear (
bool
) – If true, returns the log odds.
- Return type
float
-
one_hot_matrix
()[source]¶ Returns a matrix showing whether each given word appeared in each document.
For these matrices, all cells contain a floating point value of either a 1., if the word is in that document, or a 0. if the word is not in the document.
These are sometimes referred to as ‘one-hot encoding matrices’ in machine learning.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> np.array_equal( ... corpus.one_hot_matrix(), ... np.array([[0., 1.], [1., 1.], [0., 1.], [0., 1.]]) ... ) True
- Return type
array
-
reset_index
(start_idx=None)[source]¶ An in-place operation that resets the document indexes for this corpus.
When you reset the index, all of the documents change their values, starting at
start_idx
(and incrementing from there). For the most part, you will not need to do this, since most of the library does not give you the option to change the document indexes. However, it may be useful when you’re usingslice()
orsplit_off()
.- Parameters
start_idx (
Optional
[int
]) – The first (lowest) document index you want to set. Values must be positive. Defaults to 0.
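A sketch of what that looks like in practice (the new keys are inferred from the description above):
>>> from text_data import WordIndex
>>> index = WordIndex([["a"], ["b"]])
>>> index.reset_index(start_idx=10)
>>> sorted(index.doc_lengths)
[10, 11]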
-
skip_words
(words)[source]¶ Creates a
WordIndex
without any of the skipped words.This enables you to create an index that does not contain rare words, for example. The index will not have any positions associated with them, so be careful when implementing it on a
text_data.index.Corpus
object.Example
>>> skip_words = {"document"} >>> corpus = Corpus(["example document", "document"]) >>> "document" in corpus True >>> without_document = corpus.skip_words(skip_words) >>> "document" in without_document False
- Return type
WordIndex
-
slice
(indexes)[source]¶ Returns an index that just contains documents from the set of words.
- Parameters
indexes (
Set
[int
]) – A set of index values for the documents.
Example
>>> index = WordIndex([["example"], ["document"], ["another"], ["example"]]) >>> sliced_idx = index.slice({0, 2}) >>> len(sliced_idx) 2 >>> sliced_idx.most_common() [('another', 1), ('example', 1)]
- Return type
WordIndex
-
slice_many
(indexes_list)[source]¶ This operates like
slice()
but creates multipleWordIndex
objects.Example
>>> corpus = Corpus(["example document", "another example", "yet another"]) >>> first, second, third = corpus.slice_many([{0}, {1}, {2}]) >>> first.documents ['example document'] >>> second.documents ['another example'] >>> third.documents ['yet another']
- Parameters
indexes_list (
List
[Set
[int
]]) – A list of sets of indexes. Seetext_data.index.WordIndex.slice()
for details.- Return type
List
[WordIndex
]
-
split_off
(indexes)[source]¶ Returns an index with just a set of documents, while removing them from the index.
- Parameters
indexes (
Set
[int
]) – A set of index values for the documents.
Note
This removes the documents (and their words) from the original index in place. So make sure you want to do that before using this function.
Example
>>> index = WordIndex([["example"], ["document"], ["another"], ["example"]]) >>> split_idx = index.split_off({0, 2}) >>> len(split_idx) 2 >>> len(index) 2 >>> split_idx.most_common() [('another', 1), ('example', 1)] >>> index.most_common() [('document', 1), ('example', 1)]
- Return type
WordIndex
-
term_count
(word, document)[source]¶ Returns the total number of times a word appeared in a document.
Assuming the document exists, returns 0 if the word does not appear in the document.
Example
>>> corpus = Corpus(["i am just thinking random thoughts", "am i"]) >>> corpus.term_count("random", 0) 1 >>> corpus.term_count("random", 1) 0
- Parameters
word (
str
) – The word you’re looking up.document (
int
) – The index of the document.
- Raises
ValueError – If you selected a document that doesn't exist.
- Return type
int
-
term_frequency
(word, document)[source]¶ Returns the proportion of words in document
document
that areword
.Example
>>> corpus = Corpus(["just coming up with words", "more words"]) >>> np.isclose(corpus.term_frequency("words", 1), 0.5) True >>> np.isclose(corpus.term_frequency("words", 0), 0.2) True
- Parameters
word (
str
) – The word you’re looking updocument (
int
) – The index of the document
- Raises
ValueError – If the document you’re looking up doesn’t exist
- Return type
float
-
tfidf_matrix
(norm='l2', use_idf=True, smooth_idf=False, sublinear_tf=True, add_k=1)[source]¶ This creates a term-document TF-IDF matrix from the index.
In natural language processing, TF-IDF is a mechanism for finding out which words are distinct across documents. It’s used particularly widely in information retrieval, where your goal is to rank documents that you know match a query by how relevant you think they’ll be.
The basic intuition goes like this: If a word appears particularly frequently in a document, it’s probably more relevant to that document than if the word occurred more rarely. But, some words are simply common: If document X uses the word ‘the’ more often than the word ‘idiomatic,’ that really tells you more about the words ‘the’ and ‘idiomatic’ than it does about the document.
TF-IDF tries to balance these two competing interests by taking the ‘term frequency,’ or how often a word appears in the document, and normalizing it by the ‘document frequency,’ or the proportion of documents that contain the word. This has the effect of reducing the weights of common words (and even setting the weights of some very common words to 0 in some implementations).
It should be noted that there are a number of different implementations of TF-IDF. Within information retrieval, TF-IDF is part of the ‘SMART Information Retrieval System’. Although the exact equations can vary considerably, they typically follow the same approach: First, they find some value to represent the frequency of each word in the document. Often (but not always), this is just the raw number of times in which a word appeared in the document. Then, they normalize that based on the document frequency. And finally, they normalize those values based on the length of the document, so that long documents are not weighted more favorably (or less favorably) than shorter documents.
The approach that I have taken to this is shamelessly cribbed from scikit’s TfidfTransformer. Specifically, I’ve allowed for some customization of the specific formula for TF-IDF while not including methods that require access to the raw documents, which would be computationally expensive to perform. This allows for the following options:
You can set the term frequency to either take the raw count of the word in the document (\(c_{t,d}\)) or by using
sublinear_tf=True
and taking \(1 + \log_{2}{c_{t,d}}\)You can skip taking the inverse document frequency \(df^{-1}\) altogether by setting
use_idf=False
or you can smooth the inverse document frequency by settingsmooth_idf=True
. This adds one to the numerator and the denominator. (Note: Because this method is only run on a vocabulary of words that are in the corpus, there can’t be any divide by zero errors, but this allows you to replicate scikit’sTfidfTransformer
.)You can add some number to the logged inverse document frequency by setting
add_k
to something other than 1. This is the only difference between this implementation and scikit's, as scikit automatically sets k
at 1.Finally, you can choose how to normalize the document lengths. By default, this takes the L-2 norm, or \(\sqrt{\sum{w_{i,k}^{2}}}\), where \(w_{i,k}\) is the weight you get from multiplying the term frequency by the inverse document frequency. But you can also set the norm to
'l1'
to get the L1-norm, or \(\sum{\vert w_{i,k} \vert}\). Or you can set it toNone
to avoid doing any document-length normalization at all.
Examples
To get a sense of the different options, let’s start by creating a pure count matrix with this method. To do that, we’ll set
norm=None
so we’re not normalizing by the length of the document,use_idf=False
so we’re not doing anything with the document frequency, andsublinear_tf=False
so we’re not taking the logged counts:>>> corpus = Corpus(["a cat", "a"]) >>> tfidf_count_matrix = corpus.tfidf_matrix(norm=None, use_idf=False, sublinear_tf=False) >>> assert np.array_equal(tfidf_count_matrix, corpus.count_matrix())
In this particular case, setting
sublinear_tf
toTrue
will produce the same result since all of the counts are 1 or 0 and \(\log{1} + 1 = 1\):>>> assert np.array_equal(corpus.tfidf_matrix(norm=None, use_idf=False), tfidf_count_matrix)
Now, we can incorporate the inverse document frequency. Because the word ‘a’ appears in both documents, its inverse document frequency in is 1; the inverse document frequency of ‘cat’ is 2, since ‘cat’ appears in half of the documents. We’re additionally taking the base-2 log of the inverse document frequency and adding 1 to the final result. So we get:
>>> idf_add_1 = corpus.tfidf_matrix(norm=None, sublinear_tf=False, smooth_idf=False) >>> assert idf_add_1.tolist() == [[1., 1.], [2.,0.]]
Or we can add nothing to the logged values:
>>> idf = corpus.tfidf_matrix(norm=None, sublinear_tf=False, smooth_idf=False, add_k=0) >>> assert idf.tolist() == [[0.0, 0.0], [1.0, 0.0]]
The L-1 norm normalizes the results by the sum of the absolute values of their weights. In the case of the count matrix, this is equivalent to creating the frequency matrix:
>>> tfidf_freq_mat = corpus.tfidf_matrix(norm="l1", use_idf=False, sublinear_tf=False) >>> assert np.array_equal(tfidf_freq_mat, corpus.frequency_matrix())
- Parameters
norm (
Optional
[str
]) – Set to ‘l2’ for the L2 norm (square root of the sums of the square weights), ‘l1’ for the l1 norm (the summed absolute value, or None for no normalization).use_idf (
bool
) – If you set this to False, the weights will only include the term frequency (adjusted however you like)smooth_idf (
bool
) – Adds a constant to the numerator and the denominator.sublinear_tf (
bool
) – Computes the term frequency in log space.add_k (
int
) – This adds k to every value in the IDF. scikit adds 1 to all documents, but this allows for more variable computing (e.g. adding 0 if you want to remove words appearing in every document)
- Return type
array
-
property
vocab
¶ Returns all of the unique words in the index.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab == {"a", "cat", "and", "dog"} True
- Return type
Set
[str
]
-
property
vocab_list
¶ Returns a sorted list of the words appearing in the index.
This is primarily intended for use in matrix or vector functions, where the order of the words matters.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab_list ['a', 'and', 'cat', 'dog']
- Return type
List
[str
]
-
property
vocab_size
¶ Returns the total number of unique words in the corpus.
Example
>>> corpus = Corpus(["a cat and a dog"]) >>> corpus.vocab_size 4
- Return type
int
-
word_count
(word)[source]¶ Returns the total number of times the word appeared.
Defaults to 0 if the word never appeared.
Example
>>> corpus = Corpus(["this is a document", "a bird and a plane"]) >>> corpus.word_count("document") 1 >>> corpus.word_count("a") 3 >>> corpus.word_count("malarkey") 0
- Parameters
word (
str
) – The string word (or phrase).- Return type
int
-
word_count_vector
()[source]¶ Returns the total number of times each word appeared in the corpus.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.word_count_vector() array([1., 3., 1., 1.])
- Return type
array
-
word_counter
(word)[source]¶ Maps the documents containing a word to the number of times the word appeared.
Examples
>>> corpus = Corpus(["a bird", "a bird and a plane", "two birds"]) >>> corpus.word_counter("a") == {0: 1, 1: 2} True
- Parameters
word (
str
) – The word you’re looking up- Return type
Dict
[int
,int
]- Returns
- A dictionary mapping the document index of the word to the number of times
it appeared in that document.
-
word_freq_vector
()[source]¶ Returns the frequency in which each word appears over the corpus.
Example
>>> corpus = Corpus(["example", "this example is another example"]) >>> corpus.word_freq_vector() array([0.16666667, 0.5 , 0.16666667, 0.16666667])
- Return type
array
-
word_frequency
(word)[source]¶ Returns the frequency in which the word appeared in the corpus.
Example
>>> corpus = Corpus(["this is fun", "or is it"]) >>> np.isclose(corpus.word_frequency("fun"), 1. / 6.) True >>> np.isclose(corpus.word_frequency("is"), 2. / 6.) True
- Parameters
word (
str
) – The string word or phrase.- Return type
float