ldamodel

This module encapsulates functionality for the Latent Dirichlet Allocation algorithm.

It allows both model estimation from a training corpus and inference on new, unseen documents.

The implementation is based on Blei et al., Latent Dirichlet Allocation, 2003, and on Blei’s LDA-C software in particular. This means it uses variational EM inference rather than Gibbs sampling to estimate model parameters.

class gensim.models.ldamodel.LdaModel(corpus, id2word=None, numTopics=200, alpha=None, initMode='random')

Objects of this class allow building and maintaining a model of Latent Dirichlet Allocation.

The code is based on Blei’s C implementation, see http://www.cs.princeton.edu/~blei/lda-c/ .

This Python code uses numpy heavily, and is about 4-5x slower than the original C version. The upside is that it is much more straightforward and concise, using vector operations à la MATLAB, and is easily pluggable/extensible.

The constructor estimates model parameters based on a training corpus:

>>> lda = LdaModel(corpus, numTopics=10)

You can then infer topic distributions on new, unseen documents:

>>> doc_lda = lda[doc_bow]

Model persistence is achieved via its load/save methods.
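For example (a minimal sketch; the filename is an arbitrary choice):

>>> lda.save('/tmp/model.lda')
>>> lda = LdaModel.load('/tmp/model.lda')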

Initialize the model based on corpus.

id2word is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.

numTopics is the number of requested topics.

alpha is either None (in which case it is estimated during training) or a number in the open interval (0.0, 1.0).
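For example, to train with a fixed prior instead of estimating alpha (a minimal sketch; the value 0.1 is an arbitrary choice):

>>> lda = LdaModel(corpus, numTopics=10, alpha=0.1)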

apply(corpus)
Apply the transformation to a whole corpus (as opposed to a single document) and return the result as another corpus.
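For example (a minimal sketch; corpus is any iterable of bag-of-words documents):

>>> corpus_lda = lda.apply(corpus)
>>> for doc_lda in corpus_lda:
...     print(doc_lda)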
computeLikelihood(doc, phi, gamma)
Compute the document likelihood, given all model parameters.
countsFromCorpus(corpus, numInitDocs=1)
Initialize the model word counts from the corpus. Each topic will be initialized from numInitDocs random documents.
docEStep(doc)
Find optimizing parameters for phi and gamma, and update sufficient statistics.
getTopicsMatrix()

Transform the topic-word distribution via a TF-IDF-like score and return it, instead of the raw self.logProbW word-topic probabilities.

The transformation is a sort of TF-IDF score: a word gets a higher score if it is probable in this topic (the TF part) and a lower score if it is probable across the whole corpus (the IDF part).

The exact formula is taken from Blei & Lafferty, “Topic Models”, 2009.

The returned matrix is of the same shape as logProbW.
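A rough sketch of that score, assuming probs is the (numTopics x numTerms) matrix of per-topic word probabilities (i.e. numpy.exp of the logProbW matrix):

import numpy

def topicTermScores(probs):
    # geometric mean of each word's probability across all topics (the "IDF" part)
    geomMean = numpy.exp(numpy.log(probs).mean(axis=0))
    # words probable in this topic but improbable overall score highest
    return probs * numpy.log(probs / geomMean)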

infer(corpus)

Perform inference on a corpus of documents.

This means that a standard inference step is taken for each document from the corpus, and the results are saved into the file corpus.fname.lda_inferred.

The output format of this file is one document per line:

doc_likelihood[TAB]topic1:prob ... topicK:prob[TAB]word1:topic ... wordN:topic

Topics are sorted by probability, words are in the same order as in the input.
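A minimal sketch of reading one line of that file back, under the format above (parseInferredLine is a hypothetical helper, not part of this module; it assumes topic ids are integers):

def parseInferredLine(line):
    likelihood, topicPart, wordPart = line.rstrip('\n').split('\t')
    # "topicId:prob" pairs, sorted by probability
    topics = [(int(t), float(p)) for t, p in (pair.split(':') for pair in topicPart.split())]
    # "word:topicId" assignments, in the same order as the input document
    words = [(w, int(t)) for w, t in (pair.split(':') for pair in wordPart.split())]
    return float(likelihood), topics, words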

inference(doc)

Perform inference on a single document.

Return a 3-tuple of (likelihood of this document, word-topic distribution phi, expected word counts gamma (~topic distribution)).

A document is simply a bag-of-words collection which supports len() and iteration over (wordIndex, wordCount) 2-tuples.

The model itself is not affected in any way (this function is read-only, a.k.a. const).
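For example (a minimal sketch; doc_bow is any bag-of-words document as described above):

>>> likelihood, phi, gamma = lda.inference(doc_bow)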

initialize(corpus, initMode='random')

Run LDA parameter estimation from a training corpus, using the EM algorithm.

After the model has been initialized, you can infer topic distribution over other, different corpora, using this estimated model.

initMode can be either ‘random’, for a fast random initialization of the model parameters, or ‘seeded’, for an initialization based on a handful of real documents. The ‘seeded’ mode requires a sweep over the entire corpus, and is thus much slower.
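For example, initMode is typically passed through the constructor (a minimal sketch; 'seeded' trades speed for a potentially better starting point):

>>> lda = LdaModel(corpus, numTopics=10, initMode='seeded')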

classmethod load(fname)
Load a previously saved object from file (also see save).
mle(estimateAlpha)

Maximum likelihood estimate.

This maximizes the lower bound on the log likelihood with respect to the alpha and beta parameters.

optAlpha(MAX_ALPHA_ITER=1000, NEWTON_THRESH=1e-05)
Estimate new Dirichlet priors (actually just one scalar shared across all topics).
printTopics(numWords=10)

Print the top numWords words for each topic, along with the log of their probability.

Uses getTopicsMatrix() method to determine the ‘top words’.
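For example (a minimal sketch):

>>> lda.printTopics(numWords=5)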

save(fname)
Save the object to file via pickling (also see load).
