
Latent Dirichlet allocation using collapsed Gibbs sampling

class lda.lda.LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10)

Latent Dirichlet allocation using collapsed Gibbs sampling


n_topics : int

Number of topics

n_iter : int, default 2000

Number of sampling iterations

alpha : float, default 0.1

Dirichlet parameter for distribution over topics

eta : float, default 0.01

Dirichlet parameter for distribution over words

random_state : int or RandomState, optional

The generator used for the initial topics.


Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022.

Griffiths, Thomas L., and Mark Steyvers. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences 101 (2004): 5228–5235. doi:10.1073/pnas.0307752101.

Wallach, Hanna, David Mimno, and Andrew McCallum. “Rethinking LDA: Why Priors Matter.” In Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, 1973–1981, 2009.


>>> import numpy
>>> X = numpy.array([[1, 1], [2, 1], [3, 1], [4, 1], [5, 8], [6, 1]])
>>> import lda
>>> model = lda.LDA(n_topics=2, random_state=0, n_iter=100)
>>> model.fit(X) 
>>> model.components_
array([[ 0.85714286,  0.14285714],
       [ 0.45      ,  0.55      ]])
>>> model.loglikelihood() 


components_ (array, shape = [n_topics, n_features]) Point estimate of the topic-word distributions (Phi in the literature).
topic_word_ : Alias for components_.
nzw_ (array, shape = [n_topics, n_features]) Matrix of counts recording topic-word assignments in the final iteration.
ndz_ (array, shape = [n_samples, n_topics]) Matrix of counts recording document-topic assignments in the final iteration.
doc_topic_ (array, shape = [n_samples, n_topics]) Point estimate of the document-topic distributions (Theta in the literature).
nz_ (array, shape = [n_topics]) Array of topic assignment counts in the final iteration.
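
The attributes above pair count matrices with point estimates. One common way to turn such counts into point estimates is to add the Dirichlet pseudo-counts and normalize each row. The sketch below assumes that estimator; the source does not state the exact formula the library uses, and `point_estimates` is a hypothetical helper name, not part of the package.

```python
import numpy as np


def point_estimates(nzw, ndz, alpha=0.1, eta=0.01):
    """Hypothetical helper: smoothed, row-normalized counts.

    nzw : topic-word counts, shape (n_topics, n_features)
    ndz : document-topic counts, shape (n_samples, n_topics)
    """
    nzw = np.asarray(nzw, dtype=float)
    ndz = np.asarray(ndz, dtype=float)

    # Topic-word estimate (Phi): add eta to every count, normalize per topic.
    topic_word = nzw + eta
    topic_word /= topic_word.sum(axis=1, keepdims=True)

    # Document-topic estimate (Theta): add alpha, normalize per document.
    doc_topic = ndz + alpha
    doc_topic /= doc_topic.sum(axis=1, keepdims=True)

    return topic_word, doc_topic


topic_word, doc_topic = point_estimates([[3, 1], [0, 4]], [[2, 1], [2, 3]])
```

Each row of both results sums to 1, and the shapes match those listed for components_ and doc_topic_.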
fit(X, y=None)

Fit the model with X.


X : array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features. Sparse matrix allowed.


self : object

Returns the instance itself.

fit_transform(X, y=None)

Apply dimensionality reduction on X


X : array-like, shape (n_samples, n_features)

New data, where n_samples is the number of samples and n_features is the number of features. Sparse matrix allowed.


doc_topic : array-like, shape (n_samples, n_topics)

Point estimate of the document-topic distributions


loglikelihood()

Calculate complete log likelihood, log p(w,z)

Formula used is log p(w,z) = log p(w|z) + log p(z)
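
For illustration, the decomposition log p(w,z) = log p(w|z) + log p(z) can be evaluated from the count matrices alone. The sketch below uses the standard collapsed form with symmetric Dirichlet priors (as in Griffiths and Steyvers); it is an assumption that the library computes the same expression, and `complete_loglikelihood` is a hypothetical name.

```python
import math

import numpy as np

# Element-wise log-gamma built on the standard library.
_lgamma = np.vectorize(math.lgamma)


def complete_loglikelihood(nzw, ndz, alpha, eta):
    """Hypothetical helper: log p(w, z) from topic-word and doc-topic counts."""
    nzw = np.asarray(nzw, dtype=float)
    ndz = np.asarray(ndz, dtype=float)
    n_topics, vocab_size = nzw.shape
    n_docs = ndz.shape[0]
    nz = nzw.sum(axis=1)  # tokens assigned to each topic
    nd = ndz.sum(axis=1)  # tokens in each document

    # log p(w | z): one Dirichlet-multinomial term per topic.
    ll = n_topics * (math.lgamma(vocab_size * eta) - vocab_size * math.lgamma(eta))
    ll += _lgamma(nzw + eta).sum() - _lgamma(nz + vocab_size * eta).sum()

    # log p(z): one Dirichlet-multinomial term per document.
    ll += n_docs * (math.lgamma(n_topics * alpha) - n_topics * math.lgamma(alpha))
    ll += _lgamma(ndz + alpha).sum() - _lgamma(nd + n_topics * alpha).sum()
    return float(ll)
```

As a sanity check, with one topic, two words seen once each, and eta = 0.5, p(w|z) = 0.5 * 0.5 / (1 * 2) = 0.125 and p(z) = 1, so the function returns log(0.125).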

transform(X, max_iter=20, tol=1e-16)

Transform the data X according to previously fitted model


X : array-like, shape (n_samples, n_features)

New data, where n_samples is the number of samples and n_features is the number of features.

max_iter : int, optional

Maximum number of iterations in iterated-pseudocount estimation.

tol : double, optional

Tolerance value used in stopping condition.


doc_topic : array-like, shape (n_samples, n_topics)

Point estimate of the document-topic distributions
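
To make the roles of max_iter and tol concrete, here is a minimal sketch of one form of iterated pseudo-count estimation for a single document. It is an illustration under stated assumptions, not the library's code: `infer_doc_topic` is a hypothetical name, the topic-word matrix is assumed strictly positive, and the document is given as a list of word indices (one entry per token).

```python
import numpy as np


def infer_doc_topic(word_ids, topic_word, alpha=0.1, max_iter=20, tol=1e-16):
    """Hypothetical helper: estimate one document's topic distribution."""
    word_ids = np.asarray(word_ids)
    topic_word = np.asarray(topic_word, dtype=float)
    n_topics = topic_word.shape[0]

    # One row of topic responsibilities per token, initially all zero.
    resp = np.zeros((len(word_ids), n_topics))
    for _ in range(max_iter):
        # p(word | topic) for every token, shape (n_tokens, n_topics).
        new = topic_word[:, word_ids].T.copy()
        # Weight by pseudo-counts from all other tokens plus the prior.
        new *= resp.sum(axis=0) - resp + alpha
        new /= new.sum(axis=1, keepdims=True)
        # Stop once the responsibilities no longer change by more than tol.
        delta = np.abs(new - resp).sum()
        resp = new
        if delta < tol:
            break

    theta = resp.sum(axis=0)
    return theta / theta.sum()


theta = infer_doc_topic([0, 0, 0, 0], [[0.9, 0.1], [0.1, 0.9]])
```

Here every token is word 0, which topic 0 favors, so nearly all of the mass in theta lands on topic 0.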


lda.utils.dtm2ldac(dtm, offset=0)

Convert a document-term matrix into an LDA-C formatted file

Parameters: dtm : array of shape (N, V)
Returns: doclines : iterable of LDA-C lines suitable for writing to file


If a format similar to SVMLight is desired, an offset of 1 may be used.
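
As a reminder of what an LDA-C line looks like, each document is written as the number of distinct terms followed by `term_index:count` pairs. The following sketch reproduces that layout with NumPy; `dtm_to_ldac_lines` is a hypothetical stand-in written for this example, and adding the offset to the term index is an assumption about how the parameter is applied.

```python
import numpy as np


def dtm_to_ldac_lines(dtm, offset=0):
    """Hypothetical helper: yield one LDA-C line per row of a dense dtm."""
    for row in np.asarray(dtm):
        term_ids = np.flatnonzero(row)  # indices of terms that occur
        fields = [str(len(term_ids))]   # leading count of distinct terms
        fields += [f"{int(t) + offset}:{int(row[t])}" for t in term_ids]
        yield " ".join(fields)


lines = list(dtm_to_ldac_lines([[1, 0, 2], [0, 3, 0]]))
# lines == ["2 0:1 2:2", "1 1:3"]
```

With `offset=1` the same input gives `"2 1:1 3:2"` and `"1 2:3"`, that is, one-based term indices.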

lda.utils.ldac2dtm(stream, offset=0)

Convert an LDA-C formatted file to a document-term array


stream : file object

File yielding unicode strings in LDA-C format.


dtm : array of shape (N, V)


If a format similar to SVMLight is the source, an offset of 1 may be used.
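
The reverse direction can be sketched the same way. The parser below is a hypothetical illustration (`ldac_lines_to_dtm` is not a package function); it assumes each line has the form `n_unique term:count ...` and that the offset is subtracted from every term index.

```python
import numpy as np


def ldac_lines_to_dtm(lines, offset=0):
    """Hypothetical helper: parse LDA-C lines into a dense count array."""
    docs = []
    n_terms = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        n_unique = int(fields[0])
        pairs = [field.split(":") for field in fields[1:]]
        if len(pairs) != n_unique:
            raise ValueError("declared term count does not match the line")
        doc = [(int(term) - offset, int(count)) for term, count in pairs]
        docs.append(doc)
        if doc:
            n_terms = max(n_terms, max(term for term, _ in doc) + 1)

    # Fill a dense (N, V) array; V is inferred from the largest term index.
    dtm = np.zeros((len(docs), n_terms), dtype=int)
    for row, doc in enumerate(docs):
        for term, count in doc:
            dtm[row, term] = count
    return dtm


dtm = ldac_lines_to_dtm(["2 0:1 2:2", "1 1:3"])
# dtm.tolist() == [[1, 0, 2], [0, 3, 0]]
```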

lda.utils.lists_to_matrix(WS, DS)

Convert array of word (or topic) and document indices to doc-term array


(WS, DS) : tuple of two arrays

WS[k] contains the kth word in the corpus. DS[k] contains the document index for the kth word.


doc_word : array (D, V)

document-term array of counts
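
The conversion amounts to counting how often each (document, word) pair occurs. A minimal NumPy sketch follows, using the hypothetical name `lists_to_counts` and assuming the array dimensions are inferred from the largest indices present:

```python
import numpy as np


def lists_to_counts(WS, DS):
    """Hypothetical helper: build a (D, V) count array from index lists."""
    WS = np.asarray(WS)
    DS = np.asarray(DS)
    doc_word = np.zeros((DS.max() + 1, WS.max() + 1), dtype=int)
    # np.add.at accumulates correctly when a (doc, word) pair repeats.
    np.add.at(doc_word, (DS, WS), 1)
    return doc_word


doc_word = lists_to_counts(WS=[0, 2, 2, 1], DS=[0, 0, 0, 1])
# doc_word.tolist() == [[1, 0, 2], [0, 1, 0]]
```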


Convert a (sparse) matrix of counts into arrays of word and doc indices


doc_word : array or sparse matrix (D, V)

document-term matrix of counts


(WS, DS) : tuple of two arrays

WS[k] contains the kth word in the corpus. DS[k] contains the document index for the kth word.
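
Going from counts back to index lists means repeating each (document, word) pair as many times as it was counted. The sketch below handles dense arrays only and uses the hypothetical name `counts_to_lists`; the ordering of tokens within a document is an assumption (row-major order of the nonzero entries).

```python
import numpy as np


def counts_to_lists(doc_word):
    """Hypothetical helper: expand a dense (D, V) count array into (WS, DS)."""
    doc_word = np.asarray(doc_word)
    doc_idx, word_idx = np.nonzero(doc_word)  # positions with count > 0
    counts = doc_word[doc_idx, word_idx]
    WS = np.repeat(word_idx, counts)  # word index of every token
    DS = np.repeat(doc_idx, counts)   # document index of every token
    return WS, DS


WS, DS = counts_to_lists([[1, 0, 2], [0, 1, 0]])
# WS.tolist() == [0, 2, 2, 1]; DS.tolist() == [0, 0, 0, 1]
```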