API

lda.lda

Latent Dirichlet allocation using collapsed Gibbs sampling

class lda.lda.LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10)

Latent Dirichlet allocation using collapsed Gibbs sampling

Parameters:

n_topics : int

Number of topics

n_iter : int, default 2000

Number of sampling iterations

alpha : float, default 0.1

Dirichlet parameter for distribution over topics

eta : float, default 0.01

Dirichlet parameter for distribution over words

random_state : int or RandomState, optional

The generator used for the initial topics.

References

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022.

Griffiths, Thomas L., and Mark Steyvers. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences 101 (2004): 5228–5235. doi:10.1073/pnas.0307752101.

Wallach, Hanna, David Mimno, and Andrew McCallum. “Rethinking LDA: Why Priors Matter.” In Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, 1973–1981, 2009.

Examples

>>> import numpy
>>> X = numpy.array([[1, 1], [2, 1], [3, 1], [4, 1], [5, 8], [6, 1]])
>>> import lda
>>> model = lda.LDA(n_topics=2, random_state=0, n_iter=100)
>>> model.fit(X) 
LDA(alpha=...
>>> model.components_
array([[ 0.85714286,  0.14285714],
       [ 0.45      ,  0.55      ]])
>>> model.loglikelihood() 
-40.395...

Attributes

components_ (array, shape = [n_topics, n_features]) Point estimate of the topic-word distributions (Phi in the literature)
topic_word_ Alias for components_
nzw_ (array, shape = [n_topics, n_features]) Matrix of counts recording topic-word assignments in final iteration.
ndz_ (array, shape = [n_samples, n_topics]) Matrix of counts recording document-topic assignments in final iteration.
doc_topic_ (array, shape = [n_samples, n_topics]) Point estimate of the document-topic distributions (Theta in the literature)
nz_ (array, shape = [n_topics]) Array of topic assignment counts in final iteration.
fit(X, y=None)

Fit the model with X.

Parameters:

X: array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features. A sparse matrix is allowed.

Returns:

self : object

Returns the instance itself.

fit_transform(X, y=None)

Apply dimensionality reduction to X.

Parameters:

X : array-like, shape (n_samples, n_features)

New data, where n_samples is the number of samples and n_features is the number of features. A sparse matrix is allowed.

Returns:

doc_topic : array-like, shape (n_samples, n_topics)

Point estimate of the document-topic distributions

loglikelihood()

Calculate the complete log likelihood, log p(w,z).

Formula used is log p(w,z) = log p(w|z) + log p(z)
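
The decomposition above can be sketched in pure NumPy, following the Dirichlet-multinomial form used by Griffiths and Steyvers (2004). `log_p_wz_sketch` is an illustrative stand-in for the method, not part of the library's API; it takes the `nzw_` and `ndz_` count matrices described under Attributes:

```python
import numpy as np
from math import lgamma

def log_p_wz_sketch(nzw, ndz, alpha, eta):
    """Complete log likelihood log p(w,z) = log p(w|z) + log p(z),
    in the Dirichlet-multinomial form of Griffiths & Steyvers (2004)."""
    lgamma_v = np.vectorize(lgamma)
    n_topics, vocab = nzw.shape
    n_docs = ndz.shape[0]
    nz = nzw.sum(axis=1)   # tokens assigned to each topic
    nd = ndz.sum(axis=1)   # tokens in each document

    # log p(w|z): one Dirichlet-multinomial normalizing term per topic
    ll_w = n_topics * (lgamma(vocab * eta) - vocab * lgamma(eta))
    ll_w += lgamma_v(nzw + eta).sum() - lgamma_v(nz + vocab * eta).sum()

    # log p(z): one Dirichlet-multinomial normalizing term per document
    ll_z = n_docs * (lgamma(n_topics * alpha) - n_topics * lgamma(alpha))
    ll_z += lgamma_v(ndz + alpha).sum() - lgamma_v(nd + n_topics * alpha).sum()
    return ll_w + ll_z
```

The result is always negative, since it is the log probability of a discrete configuration.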

transform(X, max_iter=20, tol=1e-16)

Transform the data X according to the previously fitted model.

Parameters:

X : array-like, shape (n_samples, n_features)

New data, where n_samples is the number of samples and n_features is the number of features.

max_iter : int, optional

Maximum number of iterations in iterated-pseudocount estimation.

tol : double, optional

Tolerance value used in stopping condition.

Returns:

doc_topic : array-like, shape (n_samples, n_topics)

Point estimate of the document-topic distributions
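
The iterated-pseudocount idea behind transform can be sketched for a single document: hold the fitted topic-word distributions fixed and alternate between assigning topic responsibilities to tokens and re-estimating the document-topic vector. This is an illustrative re-implementation under that assumption, not the library's own code, and `transform_sketch` is a hypothetical name:

```python
import numpy as np

def transform_sketch(components, doc, max_iter=20, tol=1e-16):
    """Estimate a document-topic vector for one document (a count vector
    of length n_features), given fixed topic-word distributions."""
    n_topics = components.shape[0]
    # expand the count vector into one word index per token
    words = np.repeat(np.arange(len(doc)), doc.astype(int))
    theta = np.full(n_topics, 1.0 / n_topics)
    for _ in range(max_iter):
        # responsibility of each topic for each token: p(w|z) * p(z|d)
        pzs = components[:, words].T * theta          # (n_tokens, n_topics)
        pzs /= pzs.sum(axis=1, keepdims=True)
        theta_new = pzs.sum(axis=0)
        theta_new /= theta_new.sum()
        if np.abs(theta_new - theta).sum() < tol:     # stopping condition
            theta = theta_new
            break
        theta = theta_new
    return theta
```

With a topic that favors word 0, a document dominated by word 0 ends up with most of its mass on that topic.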

lda.utils

lda.utils.check_random_state(seed)

Turn seed into a numpy RandomState instance.

lda.utils.dtm2ldac(dtm, offset=0)

Convert a document-term matrix into lines in LDA-C format

Parameters:

dtm : array of shape (N, V)

Document-term matrix of counts.

Returns:

doclines : iterable of str

Lines in LDA-C format suitable for writing to a file.

Notes

If a format similar to SVMLight is desired, an offset of 1 may be used.
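
The conversion logic can be sketched independently of the library. `dtm_to_ldac_lines` is a hypothetical stand-in for lda.utils.dtm2ldac, producing the "N term:count ..." line shape of Blei's lda-c; the library's exact output may differ in detail:

```python
def dtm_to_ldac_lines(dtm, offset=0):
    """Each output line: 'N term:count ...' where N is the number of
    unique terms in the document and counts of zero are dropped."""
    lines = []
    for row in dtm:
        pairs = ["{}:{}".format(term + offset, int(count))
                 for term, count in enumerate(row) if count > 0]
        lines.append("{} {}".format(len(pairs), " ".join(pairs)))
    return lines
```

With `offset=1` the term indices start at 1, which matches the SVMLight convention mentioned above.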

lda.utils.ldac2dtm(stream, offset=0)

Convert an LDA-C formatted file to a document-term array

Parameters:

stream : file object

File yielding unicode strings in LDA-C format.

Returns:

dtm : array of shape (N, V)

Notes

If a format similar to SVMLight is the source, an offset of 1 may be used.
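
The inverse parse can likewise be sketched. `ldac_lines_to_dtm` is a hypothetical stand-in for lda.utils.ldac2dtm, operating on a list of strings rather than a file stream:

```python
import numpy as np

def ldac_lines_to_dtm(lines, offset=0):
    """Parse 'N term:count ...' lines back into a dense
    document-term count array."""
    rows = []
    for line in lines:
        row = {}
        for pair in line.split()[1:]:   # skip the leading unique-term count
            term, count = pair.split(":")
            row[int(term) - offset] = int(count)
        rows.append(row)
    n_terms = max(max(r) for r in rows if r) + 1
    dtm = np.zeros((len(rows), n_terms), dtype=int)
    for i, row in enumerate(rows):
        for term, count in row.items():
            dtm[i, term] = count
    return dtm
```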

lda.utils.lists_to_matrix(WS, DS)

Convert arrays of word (or topic) and document indices to a document-term array

Parameters:

(WS, DS) : tuple of two arrays

WS[k] contains the kth word in the corpus; DS[k] contains the document index for the kth word.

Returns:

doc_word : array (D, V)

document-term array of counts

lda.utils.matrix_to_lists(doc_word)

Convert a (sparse) matrix of counts into arrays of word and doc indices

Parameters:

doc_word : array or sparse matrix (D, V)

document-term matrix of counts

Returns:

(WS, DS) : tuple of two arrays

WS[k] contains the kth word in the corpus; DS[k] contains the document index for the kth word.
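
The two conversions are inverses of each other, which can be sketched and checked with a round trip. The function names here are hypothetical stand-ins for lda.utils.matrix_to_lists and lda.utils.lists_to_matrix:

```python
import numpy as np

def matrix_to_lists_sketch(doc_word):
    """Expand a (D, V) count matrix into parallel token arrays
    WS (word index per token) and DS (document index per token)."""
    doc_word = np.asarray(doc_word)
    ii, jj = np.nonzero(doc_word)
    counts = doc_word[ii, jj].astype(int)
    DS = np.repeat(ii, counts)
    WS = np.repeat(jj, counts)
    return WS, DS

def lists_to_matrix_sketch(WS, DS):
    """Rebuild the (D, V) count matrix from the token arrays."""
    D, V = DS.max() + 1, WS.max() + 1
    doc_word = np.zeros((D, V), dtype=int)
    for w, d in zip(WS, DS):
        doc_word[d, w] += 1
    return doc_word
```

The total number of tokens equals the sum of the count matrix, and converting back recovers the original matrix.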