API

lda.lda

Latent Dirichlet allocation using collapsed Gibbs sampling

class lda.lda.LDA(n_topics, n_iter=2000, alpha=0.1, eta=0.01, random_state=None, refresh=10)

Latent Dirichlet allocation using collapsed Gibbs sampling
Parameters:
    n_topics : int
        Number of topics
    n_iter : int, default 2000
        Number of sampling iterations
    alpha : float, default 0.1
        Dirichlet parameter for distribution over topics
    eta : float, default 0.01
        Dirichlet parameter for distribution over words
    random_state : int or RandomState, optional
        The generator used for the initial topics.
References
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022.
Griffiths, Thomas L., and Mark Steyvers. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences 101 (2004): 5228–5235. doi:10.1073/pnas.0307752101.
Wallach, Hanna, David Mimno, and Andrew McCallum. “Rethinking LDA: Why Priors Matter.” In Advances in Neural Information Processing Systems 22, edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, 1973–1981, 2009.
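A collapsed Gibbs sampler resamples one topic assignment at a time from its conditional given all other assignments (Griffiths and Steyvers 2004). The core per-token update can be sketched as follows; this is an illustrative NumPy re-implementation whose arguments mirror the nzw_, ndz_, and nz_ count attributes listed below, not the package's compiled sampler:

```python
import numpy as np

def gibbs_step(d, w, z, nzw, ndz, nz, alpha, eta, rng):
    """Resample the topic of one token: word w in document d, current topic z."""
    n_topics, vocab_size = nzw.shape
    # Remove the token's current assignment from the count matrices.
    nzw[z, w] -= 1
    ndz[d, z] -= 1
    nz[z] -= 1
    # Conditional p(z=k | rest), up to a constant:
    # (n_kw + eta) / (n_k + V*eta) * (n_dk + alpha)
    p = (nzw[:, w] + eta) / (nz + vocab_size * eta) * (ndz[d] + alpha)
    new_z = rng.choice(n_topics, p=p / p.sum())
    # Add the token back under its freshly sampled topic.
    nzw[new_z, w] += 1
    ndz[d, new_z] += 1
    nz[new_z] += 1
    return new_z
```

One full iteration of the sampler applies this step to every token in the corpus; the count matrices after the final iteration are what the class exposes.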
Examples

>>> import numpy
>>> X = numpy.array([[1, 1], [2, 1], [3, 1], [4, 1], [5, 8], [6, 1]])
>>> import lda
>>> model = lda.LDA(n_topics=2, random_state=0, n_iter=100)
>>> model.fit(X)
LDA(alpha=...
>>> model.components_
array([[ 0.85714286,  0.14285714],
       [ 0.45      ,  0.55      ]])
>>> model.loglikelihood()
-40.395...
Attributes

    components_ : array, shape = [n_topics, n_features]
        Point estimate of the topic-word distributions (Phi in literature)
    topic_word_ :
        Alias for components_
    nzw_ : array, shape = [n_topics, n_features]
        Matrix of counts recording topic-word assignments in final iteration.
    ndz_ : array, shape = [n_samples, n_topics]
        Matrix of counts recording document-topic assignments in final iteration.
    doc_topic_ : array, shape = [n_samples, n_topics]
        Point estimate of the document-topic distributions (Theta in literature)
    nz_ : array, shape = [n_topics]
        Array of topic assignment counts in final iteration.
fit(X, y=None)

Fit the model with X.

Parameters:
    X : array-like, shape (n_samples, n_features)
        Training data, where n_samples is the number of samples and n_features is the number of features. Sparse matrix allowed.

Returns:
    self : object
        Returns the instance itself.
fit_transform(X, y=None)

Apply dimensionality reduction to X.

Parameters:
    X : array-like, shape (n_samples, n_features)
        New data, where n_samples is the number of samples and n_features is the number of features. Sparse matrix allowed.

Returns:
    doc_topic : array-like, shape (n_samples, n_topics)
        Point estimate of the document-topic distributions
loglikelihood()

Calculate the complete log likelihood, log p(w, z).

Formula used is log p(w, z) = log p(w|z) + log p(z)
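With symmetric Dirichlet priors, both terms have the closed forms given by Griffiths and Steyvers (2004): one Dirichlet-multinomial term per topic for log p(w|z) and one per document for log p(z). A stdlib-only sketch of that computation from the final count matrices (illustrative; the library's version is vectorized):

```python
from math import lgamma

def complete_loglikelihood(nzw, ndz, alpha, eta):
    """log p(w,z) = log p(w|z) + log p(z) under symmetric Dirichlet priors."""
    n_topics, vocab = len(nzw), len(nzw[0])
    n_docs = len(ndz)
    # log p(w|z): normalizing constants plus one term per topic.
    ll = n_topics * (lgamma(vocab * eta) - vocab * lgamma(eta))
    for k in range(n_topics):
        ll += sum(lgamma(c + eta) for c in nzw[k]) - lgamma(sum(nzw[k]) + vocab * eta)
    # log p(z): normalizing constants plus one term per document.
    ll += n_docs * (lgamma(n_topics * alpha) - n_topics * lgamma(alpha))
    for d in range(n_docs):
        ll += sum(lgamma(c + alpha) for c in ndz[d]) - lgamma(sum(ndz[d]) + n_topics * alpha)
    return ll
```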
transform(X, max_iter=20, tol=1e-16)

Transform the data X according to the previously fitted model.

Parameters:
    X : array-like, shape (n_samples, n_features)
        New data, where n_samples is the number of samples and n_features is the number of features.
    max_iter : int, optional
        Maximum number of iterations in iterated-pseudocount estimation.
    tol : double, optional
        Tolerance value used in stopping condition.

Returns:
    doc_topic : array-like, shape (n_samples, n_topics)
        Point estimate of the document-topic distributions
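For intuition, the fixed-point iteration can be sketched per document: compute each topic's responsibility for each token under the current estimate, then re-estimate the document-topic distribution from those responsibilities. This is my reading of iterated-pseudocount estimation, with `fold_in` and its `phi` argument (the fitted topic-word matrix) as illustrative names, not the library's exact code:

```python
import numpy as np

def fold_in(doc, phi, alpha=0.1, max_iter=20, tol=1e-16):
    """Estimate one document's topic distribution given fixed topic-word probs.

    doc : 1-D array of word counts, shape (n_features,)
    phi : topic-word matrix, shape (n_topics, n_features)
    """
    n_topics = phi.shape[0]
    words = np.repeat(np.arange(len(doc)), doc)   # word index of every token
    theta = np.full(n_topics, 1.0 / n_topics)     # uniform starting estimate
    for _ in range(max_iter):
        # Responsibility of each topic for each token under current theta.
        pzs = phi[:, words].T * theta
        pzs /= pzs.sum(axis=1, keepdims=True)
        # Re-estimate theta from the accumulated pseudo-counts.
        theta_new = pzs.sum(axis=0) + alpha
        theta_new /= theta_new.sum()
        if np.abs(theta_new - theta).sum() < tol:
            break
        theta = theta_new
    return theta
```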
lda.utils

lda.utils.check_random_state(seed)
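This helper carries no docstring. By the convention such helpers follow in NumPy-based libraries (e.g. scikit-learn), it normalizes seed into a numpy.random.RandomState. A sketch of that convention — an assumption about this function's behavior, not its actual source:

```python
import numpy as np

def check_random_state_sketch(seed):
    """Normalize seed into a numpy.random.RandomState (assumed behavior)."""
    if seed is None:
        return np.random.RandomState()      # fresh, OS-seeded generator
    if isinstance(seed, (int, np.integer)):
        return np.random.RandomState(seed)  # deterministic from an int seed
    if isinstance(seed, np.random.RandomState):
        return seed                         # already a generator: pass through
    raise ValueError("%r cannot be used to seed a RandomState" % seed)
```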
lda.utils.dtm2ldac(dtm, offset=0)

Convert a document-term matrix into an LDA-C formatted file.

Parameters:
    dtm : array of shape N,V

Returns:
    doclines : iterable of LDA-C lines suitable for writing to file

Notes

If a format similar to SVMLight is desired, an offset of 1 may be used.
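In LDA-C, each document is one line: the number of distinct terms, then space-separated term:count pairs. A minimal re-implementation sketch of the conversion (illustrative, not the library's code; offset shifting the term indices is my reading of the parameter):

```python
def dtm_to_ldac_lines(dtm, offset=0):
    """Yield one LDA-C line per document: 'M term:count ...' for M distinct terms."""
    for row in dtm:
        pairs = [(term + offset, count) for term, count in enumerate(row) if count > 0]
        yield "{} {}".format(len(pairs),
                             " ".join("{}:{}".format(t, c) for t, c in pairs))
```

For example, a row [2, 0, 1] becomes the line "2 0:2 2:1" with the default offset.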
lda.utils.ldac2dtm(stream, offset=0)

Convert an LDA-C formatted file to a document-term array.

Parameters:
    stream : file object
        File yielding unicode strings in LDA-C format.

Returns:
    dtm : array of shape N,V

Notes

If a format similar to SVMLight is the source, an offset of 1 may be used.
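The inverse direction parses each line's term:count pairs back into a dense count matrix. A self-contained sketch (illustrative, not the library's code; the vocabulary size is inferred from the largest term index seen):

```python
def ldac_lines_to_dtm(lines, offset=0):
    """Rebuild a dense document-term count matrix from LDA-C lines."""
    docs, vocab_size = [], 0
    for line in lines:
        fields = line.split()
        # fields[0] is the number of distinct terms; the rest are term:count pairs.
        pairs = [(int(t) - offset, int(c))
                 for t, c in (f.split(":") for f in fields[1:])]
        assert len(pairs) == int(fields[0]), "malformed LDA-C line"
        docs.append(pairs)
        vocab_size = max([vocab_size] + [t + 1 for t, _ in pairs])
    dtm = [[0] * vocab_size for _ in docs]
    for d, pairs in enumerate(docs):
        for t, c in pairs:
            dtm[d][t] = c
    return dtm
```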
lda.utils.lists_to_matrix(WS, DS)

Convert arrays of word (or topic) and document indices to a doc-term array.

Parameters:
    (WS, DS) : tuple of two arrays
        WS[k] contains the kth word in the corpus; DS[k] contains the document index for the kth word.

Returns:
    doc_word : array (D, V)
        document-term array of counts
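Accumulating the parallel index arrays into a count matrix can be sketched in a few lines of NumPy (an illustrative re-implementation, not the library's code):

```python
import numpy as np

def lists_to_dtm(WS, DS):
    """Accumulate word/doc index arrays into a dense doc-term count matrix."""
    n_docs, vocab = DS.max() + 1, WS.max() + 1
    dtm = np.zeros((n_docs, vocab), dtype=int)
    np.add.at(dtm, (DS, WS), 1)   # count every (doc, word) token occurrence
    return dtm
```

np.add.at is used rather than plain fancy-index assignment so that repeated (doc, word) pairs accumulate instead of overwriting.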
lda.utils.matrix_to_lists(doc_word)

Convert a (sparse) matrix of counts into arrays of word and doc indices.

Parameters:
    doc_word : array or sparse matrix (D, V)
        document-term matrix of counts

Returns:
    (WS, DS) : tuple of two arrays
        WS[k] contains the kth word in the corpus; DS[k] contains the document index for the kth word.
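The inverse expansion, from a count matrix to one entry per token, can be sketched with np.nonzero and np.repeat (an illustrative re-implementation for the dense case, not the library's code):

```python
import numpy as np

def dtm_to_lists(doc_word):
    """Expand a doc-term count matrix into parallel word and doc index arrays."""
    docs, words = np.nonzero(doc_word)      # positions of nonzero counts
    counts = doc_word[docs, words]
    WS = np.repeat(words, counts)           # WS[k]: word index of the kth token
    DS = np.repeat(docs, counts)            # DS[k]: document index of the kth token
    return WS, DS
```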