Models LdaModel¶
-
class
LdaModel
¶ Entity LdaModel
Attributes
last_read_date Read-only property - Last time this model’s data was accessed. name Set or get the name of the model object. status Read-only property - Current model life cycle status. Methods
__init__(self[, name, _info]) Creates Latent Dirichlet Allocation model predict(self, document) [ALPHA] Predict conditional probabilities of topics given document. publish(self) [ALPHA] Creates a tar file that will used as input to the scoring engine train(self, frame, document_column_name, word_column_name, word_count_column_name) [ALPHA] Creates Latent Dirichlet Allocation model
-
__init__
(self, name=None)¶ Creates Latent Dirichlet Allocation model
Parameters: name : unicode (default=None)
User supplied name.
Returns: : Model
A new instance of LdaModel
LDA is a commonly-used algorithm for topic modeling, but, more broadly, is considered a dimensionality reduction technique. For more detail see LDA.
Examples
Consider the following model trained and tested on the sample data set in frame ‘frame’.
Consider the following frame containing three columns.
>>> frame.inspect() [#] doc_id word_id word_count ====================================== [0] nytimes harry 3 [1] nytimes economy 35 [2] nytimes jobs 40 [3] nytimes magic 1 [4] nytimes realestate 15 [5] nytimes movies 6 [6] economist economy 50 [7] economist jobs 35 [8] economist realestate 20 [9] economist movies 1 >>> model = ta.LdaModel() [===Job Progress===] >>> train_output = model.train(frame, 'doc_id', 'word_id', 'word_count', max_iterations = 3, num_topics = 2) [===Job Progress===] >>> train_output {'topics_given_word': Frame <unnamed> row_count = 8 schema = [word_id:unicode, topic_probabilities:vector(2)] status = ACTIVE (last_read_date = 2015-10-23T11:07:46.556000-07:00), 'topics_given_doc': Frame <unnamed> row_count = 3 schema = [doc_id:unicode, topic_probabilities:vector(2)] status = ACTIVE (last_read_date = 2015-10-23T11:07:46.369000-07:00), 'report': u'======Graph Statistics======\nNumber of vertices: 11} (doc: 3, word: 8})\nNumber of edges: 16\n\n======LDA Configuration======\nnumTopics: 2\nalpha: 1.100000023841858\nbeta: 1.100000023841858\nmaxIterations: 3\n', 'word_given_topics': Frame <unnamed> row_count = 8 schema = [word_id:unicode, topic_probabilities:vector(2)] status = ACTIVE (last_read_date = 2015-10-23T11:07:46.465000-07:00)} >>> topics_given_doc = train_output['topics_given_doc'] [===Job Progress===] >>> topics_given_doc.inspect() [#] doc_id topic_probabilities =========================================================== [0] harrypotter [0.06417509902256538, 0.9358249009774346] [1] economist [0.8065841283073141, 0.19341587169268581] [2] nytimes [0.855316939742769, 0.14468306025723088] >>> topics_given_doc.column_names [u'doc_id', u'topic_probabilities'] >>> word_given_topics = train_output['word_given_topics'] [===Job Progress===] >>> word_given_topics.inspect() [#] word_id topic_probabilities ============================================================= [0] harry [0.005015572372943657, 0.2916109787103347] [1] realestate [0.167941871746252, 0.032187084858186256] [2] secrets [0.026543839878055035, 0.17103864163730945] [3] movies [0.03704750433384287, 0.003294403360133419] [4] magic [0.016497495727347045, 0.19676900962555072] [5] economy [0.3805836266747442, 0.10952481503975171] [6] chamber [0.0035944004256137523, 0.13168123398523954] [7] jobs [0.36277568884120137, 0.06389383278349432] >>> word_given_topics.column_names [u'word_id', u'topic_probabilities'] >>> topics_given_word = train_output['topics_given_word'] [===Job Progress===] >>> topics_given_word.inspect() [#] word_id topic_probabilities =========================================================== [0] harry [0.018375903962878668, 0.9816240960371213] [1] realestate [0.8663322126823493, 0.13366778731765067] [2] secrets [0.15694172611285945, 0.8430582738871405] [3] movies [0.9444179131148587, 0.055582086885141324] [4] magic [0.09026309091077593, 0.9097369090892241] [5] economy [0.8098866029287505, 0.19011339707124958] [6] chamber [0.0275551649439219, 0.9724448350560781] [7] jobs [0.8748608515169193, 0.12513914848308066] >>> topics_given_word.column_names [u'word_id', u'topic_probabilities'] >>> prediction = model.predict(['harry', 'secrets', 'magic', 'harry', 'chamber' 'test']) [===Job Progress===] >>> prediction {u'topics_given_doc': [0.3149285399451628, 0.48507146005483726], u'new_words_percentage': 20.0, u'new_words_count': 1} >>> prediction['topics_given_doc'] [0.3149285399451628, 0.48507146005483726] >>> prediction['new_words_percentage'] 20.0 >>> prediction['new_words_count'] 1 >>> prediction.has_key('topics_given_doc') True >>> prediction.has_key('new_words_percentage') True >>> prediction.has_key('new_words_count') True >>> model.publish() [===Job Progress===]