LdaModel train

train(self, frame, document_column_name, word_column_name, word_count_column_name, max_iterations=20, alpha=None, beta=1.10000002384, num_topics=10, random_seed=None)

[ALPHA] Creates a Latent Dirichlet Allocation model.

Parameters:


frame : Frame

Input frame data.

document_column_name : unicode

Column name for documents. Column should contain a str value.

word_column_name : unicode

Column name for words. Column should contain a str value.

word_count_column_name : unicode

Column name for word count. Column should contain an int32 or int64 value.

max_iterations : int32 (default=20)

The maximum number of iterations that the algorithm will execute. Valid values are positive integers. Default is 20.

alpha : list (default=None)

The hyperparameter for the document-specific distribution over topics, used mainly as a smoothing parameter in Bayesian inference. If set to a singleton list List(-1d), docConcentration is set automatically. If set to a singleton list List(t) where t != -1, t is replicated to a vector of length k (the number of topics, num_topics) during LDAOptimizer.initialize(). Otherwise, alpha must have length k. Currently the EM optimizer supports only symmetric distributions, so all values in the vector should be the same. Values should be greater than 1.0. Default value is -1.0, indicating automatic setting.

beta : float32 (default=1.10000002384)

The hyperparameter for the prior on each topic's distribution over words, used mainly as a smoothing parameter in Bayesian inference. A larger value implies that topics contain all words more uniformly; a smaller value implies that topics are concentrated on a small subset of words. Valid values are floats greater than or equal to 1. Default is 1.1 (shown in the signature as its float32 representation, 1.10000002384).

num_topics : int32 (default=10)

The number of topics to identify in the LDA model. Using fewer topics speeds up the computation, but the extracted topics might be more abstract or less specific; using more topics requires more computation but leads to more specific topics. Valid values are positive integers. Default is 10.

random_seed : int64 (default=None)

An optional random seed. The random seed is used to initialize the pseudorandom number generator used in the LDA model. Setting the random seed to the same value every time the model is trained allows LDA to generate the same topic distribution, provided the corpus and LDA parameters are unchanged.
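The rules stated for alpha can be sketched in plain Python. This is only an illustration of the documented behaviour, not the library's code: the helper name expand_alpha is hypothetical, and treating None the same as the singleton -1.0 is an assumption based on the signature default.

```python
def expand_alpha(alpha, num_topics):
    """Hypothetical helper that mirrors the documented alpha rules.

    Returns None when the value is left to automatic setting, otherwise a
    list of num_topics identical floats.
    """
    # Assumption: None behaves like the documented singleton -1.0.
    if alpha is None or list(alpha) == [-1.0]:
        return None  # automatic setting; the value is chosen by the library

    values = [float(a) for a in alpha]
    if len(values) == 1:
        # A singleton t is replicated to a vector of length k.
        values = values * num_topics
    if len(values) != num_topics:
        raise ValueError("alpha must have length 1 or num_topics")
    if len(set(values)) != 1:
        raise ValueError("only symmetric alpha is supported")
    if values[0] <= 1.0:
        raise ValueError("alpha values should be greater than 1.0")
    return values


print(expand_alpha([1.5], 3))
```

A caller could run such a check before training to catch an asymmetric or wrongly sized alpha early, rather than waiting for the training job to fail.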

Returns:

: dict

The data returned is composed of multiple components:

Frame : topics_given_doc

Conditional probabilities of topic given document.

Frame : word_given_topics

Conditional probabilities of word given topic.

Frame : topics_given_word

Conditional probabilities of topic given word.

str : report

The configuration and learning curve report for Latent Dirichlet Allocation as a multiple line str.
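As a rough illustration of how the word_given_topics probabilities might be read, the sketch below ranks words for one topic. The data layout (a plain dict mapping each word to a list of per-topic probabilities), the helper name top_words, and the toy values are all assumptions made for this example; the actual return value is a Frame, and its column layout is not described here.

```python
def top_words(word_given_topics, topic_index, n=3):
    """Return the n words with the highest P(word | topic) for one topic.

    word_given_topics is a hypothetical plain-Python stand-in:
    {word: [P(word | topic 0), P(word | topic 1), ...]}.
    """
    ranked = sorted(
        word_given_topics.items(),
        key=lambda item: item[1][topic_index],
        reverse=True,
    )
    return [word for word, _ in ranked[:n]]


# Toy values invented for illustration; each topic's column sums to 1.
toy = {
    "harry": [0.40, 0.10],
    "magic": [0.35, 0.05],
    "economy": [0.05, 0.50],
    "market": [0.20, 0.35],
}

print(top_words(toy, 0, n=2))
print(top_words(toy, 1, n=2))
```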

See the discussion about Latent Dirichlet Allocation at Wikipedia.
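The smoothing role described for beta can be illustrated with a textbook formula: the MAP estimate of a word distribution under a symmetric Dirichlet prior, (count + beta - 1) / (total + V * (beta - 1)), where V is the vocabulary size. This is a sketch under that assumption only; this page does not state how the library uses beta internally, and the function name and counts below are made up for the example.

```python
def smoothed_word_probs(counts, beta):
    """MAP estimate of word probabilities under a symmetric Dirichlet(beta) prior.

    counts maps each word to the number of times it was assigned to a topic.
    """
    vocab_size = len(counts)
    total = sum(counts.values()) + vocab_size * (beta - 1.0)
    if total <= 0:
        raise ValueError("no counts and no smoothing: estimate is undefined")
    return {word: (c + beta - 1.0) / total for word, c in counts.items()}


counts = {"a": 8, "b": 2, "c": 0}  # toy counts for one topic

for beta in (1.0, 1.1, 11.0):
    probs = smoothed_word_probs(counts, beta)
    print(beta, {w: round(p, 3) for w, p in probs.items()})
```

With beta equal to 1 the estimate reduces to the raw relative frequencies, so an unseen word gets probability 0. As beta grows, the probabilities move toward uniform, which matches the description of larger values spreading topics over all words.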

Examples

See here for examples.
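The input layout described under Parameters (one row per document and word, with a count) can be prepared from raw text along the following lines. This is a plain-Python sketch: the helper name, the whitespace tokenization, and the sample documents are assumptions for illustration, and loading the rows into a Frame is not shown.

```python
from collections import Counter


def to_word_count_rows(documents):
    """Turn {doc_id: text} into [doc_id, word, count] rows.

    The three fields correspond to the document, word, and word count
    columns that train() expects.
    """
    rows = []
    for doc_id, text in documents.items():
        # Naive tokenization: lowercase and split on whitespace.
        counts = Counter(text.lower().split())
        for word, count in sorted(counts.items()):
            rows.append([doc_id, word, count])
    return rows


docs = {"doc1": "the cat sat on the mat", "doc2": "the dog"}
rows = to_word_count_rows(docs)
for row in rows:
    print(row)
```

The document and word fields are str values and the count is an int, matching the column types listed for document_column_name, word_column_name, and word_count_column_name.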
