utils

This module contains various general utility functions.

class gensim.utils.SaveLoad

Objects which inherit from this class have save/load functions, which un/pickle them to disk.

This uses cPickle for de/serializing, so objects must not contains unpicklable attributes, such as lambda functions etc.

classmethod load(fname)
Load a previously saved object from file (also see save).
save(fname)
Save the object to file via pickling (also see load).
gensim.utils.deaccent(text)

Remove accentuation from the given string.

Input text is either a unicode string or utf8 encoded bytestring. Return input string with accents removed, as unicode.

>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
u'Sef chomutovskych komunistu dostal postou bily prasek'
gensim.utils.dictFromCorpus(corpus)

Scan corpus for all word ids that appear in it, then contruct and return a mapping which maps each wordId -> str(wordId).

This function is used whenever words need to be displayed (as opposed to just their ids) but no wordId->word mapping was provided. The resulting mapping only covers words actually used in the corpus, up to the highest wordId found.

gensim.utils.tokenize(text, lowercase=False, deacc=False, errors='strict', toLower=False, lower=False)

Iteratively yield tokens as unicode strings, optionally also lowercasing them and removing accent marks.

Input text may be either unicode or utf8-encoded byte string.

The tokens on output are maximal contiguous sequences of alphabetic characters (no digits!).

>>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc = True))
[u'Nic', u'nemuze', u'letet', u'rychlosti', u'vyssi', u'nez', u'tisic', u'kilometru', u'za', u'sekundu']

Previous topic

interfaces

Next topic

matutils