This module implements the concept of Dictionary – a mapping between words and their internal ids.
Dictionaries can be created from a corpus and can later be pruned according to document frequency (removing (un)common words via the filterExtremes() method), saved to and loaded from disk (via the save() and load() methods), etc.
Dictionary encapsulates mappings between words, their normalized forms and ids of those normalized forms.
The main function is doc2bow, which converts a collection of words to its bag-of-words representation, optionally also updating the dictionary mappings with new words and their ids.
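
For orientation, here is a hedged end-to-end sketch of this workflow. The import path, the positional normalizeWord argument and the filterExtremes parameters are assumptions based on the description above, not a verified API reference:

    from gensim.corpora.dictionary import Dictionary   # import path assumed

    documents = ["human computer interaction".split(),
                 "graph of computer trees".split()]

    # build the dictionary, normalizing words by lowercasing
    dictionary = Dictionary.fromDocuments(documents, lambda w: w.lower())

    # convert a new document into the (tokenId, tokenCount) representation
    bow = dictionary.doc2bow("human computer survey".split(), lambda w: w.lower())

    # prune (un)common tokens, then persist to disk and reload
    dictionary.filterExtremes()                 # parameter names/defaults assumed
    dictionary.save('/tmp/example.dict')
    dictionary = Dictionary.load('/tmp/example.dict')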
Convert a document (a list of words) into the bag-of-words format = list of (tokenId, tokenCount) 2-tuples.
normalizeWord must be a function that accepts one utf-8 encoded string and returns another. Possible choices are the identity function, lowercasing, etc.
If allowUpdate is set, also update the dictionary in the process: create ids for new words. At the same time, update document frequencies: for each word appearing in this document, increase its document frequency (self.docFreq) by one.
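
For instance, a hedged sketch of incremental updating; the positional normalizeWord argument and the allowUpdate keyword follow the description above rather than a verified signature:

    from gensim.corpora.dictionary import Dictionary   # import path assumed

    dictionary = Dictionary()
    for doc in (["mama", "mele", "maso"], ["ema", "ma", "mama"]):
        # allowUpdate=True assigns ids to previously unseen words and
        # increases docFreq once per distinct word in the document
        bow = dictionary.doc2bow(doc, lambda w: w, allowUpdate=True)
        print(bow)   # a list of (tokenId, tokenCount) 2-tuples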
At the same time, rebuild the dictionary, shrinking the resulting gaps in tokenIds (lowering len(self) and freeing up memory in the process).
Note that the same token may have a different tokenId before and after the call to this function!
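
To illustrate why ids may change, a hedged sketch continuing the dictionary built in the examples above; the noBelow/noAbove parameter names and the token2id attribute are assumptions:

    oldId = dictionary.token2id.get('computer')         # token2id: assumed attribute name
    dictionary.filterExtremes(noBelow=2, noAbove=0.9)   # parameter names assumed
    newId = dictionary.token2id.get('computer')         # may differ from oldId, or be None if pruned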
Build a dictionary from a collection of documents. Each document is a list of words (i.e., tokenized strings).
The normalizeWord function is used to convert each word to its utf-8 encoded canonical form (identity, lowercasing, stemming, ...); use whichever normalization suits you.
>>> print Dictionary.fromDocuments(["máma mele maso".split(), "ema má mama".split()], utils.deaccent)
Dictionary(5 unique tokens covering 6 surface forms)
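
Conceptually, this amounts to repeated doc2bow updates on an initially empty dictionary; a rough sketch, not necessarily the actual implementation:

    from gensim.corpora.dictionary import Dictionary   # import path assumed

    def fromDocuments_sketch(documents, normalizeWord):
        # hypothetical stand-in for Dictionary.fromDocuments
        result = Dictionary()
        for document in documents:
            # the returned bag-of-words is discarded; what matters is the
            # side effect of allowUpdate=True, which grows the dictionary
            result.doc2bow(document, normalizeWord, allowUpdate=True)
        return result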
Assign new tokenIds to all tokens.
This is done to make tokenIds more compact when there are gaps in the tokenId series, e.g. after some tokens have been removed via filterTokens(). Calling this method removes those gaps.
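
A toy sketch of the id-compaction idea, operating on plain dicts; the token2id and docFreq names are illustrative stand-ins, not necessarily the real internal attributes:

    def compactify_sketch(token2id, docFreq):
        # map the old, possibly gappy ids onto consecutive new ids 0, 1, 2, ...
        idmap = {oldId: newId for newId, oldId in enumerate(sorted(token2id.values()))}
        newToken2id = {token: idmap[oldId] for token, oldId in token2id.items()}
        newDocFreq = {idmap[oldId]: freq for oldId, freq in docFreq.items()}
        return newToken2id, newDocFreq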