Models LdaModel


class LdaModel

Entity LdaModel

Attributes

last_read_date Read-only property - Last time this model’s data was accessed.
name Set or get the name of the model object.
status Read-only property - Current model life cycle status.

Methods

__init__(self[, name, _info]) Creates a Latent Dirichlet Allocation model.
predict(self, document) [ALPHA] Predicts conditional probabilities of topics given a document.
publish(self) [ALPHA] Creates a tar file that will be used as input to the scoring engine.
train(self, frame, document_column_name, word_column_name, word_count_column_name) [ALPHA] Creates a Latent Dirichlet Allocation model.

__init__(self, name=None)

Creates a Latent Dirichlet Allocation model.

Parameters:

name : unicode (default=None)

User supplied name.

Returns:

: Model

A new instance of LdaModel

Latent Dirichlet Allocation (LDA) is a commonly used algorithm for topic modeling; more broadly, it is considered a dimensionality reduction technique. For more detail, see LDA.
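
The train method listed above takes a frame plus the names of a document column, a word column, and a word-count column, which suggests one row per (document, word) pair. As a minimal sketch of how such rows might be prepared from tokenized text, the plain-Python snippet below counts words per document. The helper name to_count_rows and the sample tokens are hypothetical, and loading the rows into a frame is not shown because that step depends on the library.

```python
from collections import Counter


def to_count_rows(docs):
    """Return sorted (doc_id, word, count) rows from {doc_id: [tokens]}.

    Hypothetical helper for illustration; it is not part of LdaModel.
    """
    rows = []
    for doc_id, tokens in docs.items():
        # Counter tallies how often each word occurs in this document.
        for word, count in Counter(tokens).items():
            rows.append((doc_id, word, count))
    return sorted(rows)


# Made-up tokens, unrelated to the counts in the example frame.
sample_docs = {
    "doc_a": ["economy", "jobs", "economy"],
    "doc_b": ["magic", "harry", "magic", "magic"],
}

rows = to_count_rows(sample_docs)
for row in rows:
    print(row)
```

Each resulting tuple lines up with the three column names that train expects, in the order document, word, count.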

Examples

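Consider the following model, trained and tested on the sample data set in the frame ‘frame’. The frame contains three columns: a document ID, a word, and the number of times that word occurs in the document. The listing appears to show only the first ten rows, since the results further down also cover a third document, harrypotter.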

>>> frame.inspect()
[#]  doc_id     word_id     word_count
======================================
[0]  nytimes    harry                3
[1]  nytimes    economy             35
[2]  nytimes    jobs                40
[3]  nytimes    magic                1
[4]  nytimes    realestate          15
[5]  nytimes    movies               6
[6]  economist  economy             50
[7]  economist  jobs                35
[8]  economist  realestate          20
[9]  economist  movies               1
>>> model = ta.LdaModel()
[===Job Progress===]
>>> train_output = model.train(frame, 'doc_id', 'word_id', 'word_count', max_iterations = 3, num_topics = 2)
[===Job Progress===]
>>> train_output
{'topics_given_word': Frame  <unnamed>
row_count = 8
schema = [word_id:unicode, topic_probabilities:vector(2)]
status = ACTIVE  (last_read_date = 2015-10-23T11:07:46.556000-07:00), 'topics_given_doc': Frame  <unnamed>
row_count = 3
schema = [doc_id:unicode, topic_probabilities:vector(2)]
status = ACTIVE  (last_read_date = 2015-10-23T11:07:46.369000-07:00), 'report': u'======Graph Statistics======\nNumber of vertices: 11} (doc: 3, word: 8})\nNumber of edges: 16\n\n======LDA Configuration======\nnumTopics: 2\nalpha: 1.100000023841858\nbeta: 1.100000023841858\nmaxIterations: 3\n', 'word_given_topics': Frame  <unnamed>
row_count = 8
schema = [word_id:unicode, topic_probabilities:vector(2)]
status = ACTIVE  (last_read_date = 2015-10-23T11:07:46.465000-07:00)}
>>> topics_given_doc = train_output['topics_given_doc']
[===Job Progress===]
>>> topics_given_doc.inspect()
[#]  doc_id       topic_probabilities
===========================================================
[0]  harrypotter  [0.06417509902256538, 0.9358249009774346]
[1]  economist    [0.8065841283073141, 0.19341587169268581]
[2]  nytimes      [0.855316939742769, 0.14468306025723088]
>>> topics_given_doc.column_names
[u'doc_id', u'topic_probabilities']
>>> word_given_topics = train_output['word_given_topics']
[===Job Progress===]
>>> word_given_topics.inspect()
[#]  word_id     topic_probabilities
=============================================================
[0]  harry       [0.005015572372943657, 0.2916109787103347]
[1]  realestate  [0.167941871746252, 0.032187084858186256]
[2]  secrets     [0.026543839878055035, 0.17103864163730945]
[3]  movies      [0.03704750433384287, 0.003294403360133419]
[4]  magic       [0.016497495727347045, 0.19676900962555072]
[5]  economy     [0.3805836266747442, 0.10952481503975171]
[6]  chamber     [0.0035944004256137523, 0.13168123398523954]
[7]  jobs        [0.36277568884120137, 0.06389383278349432]
>>> word_given_topics.column_names
[u'word_id', u'topic_probabilities']
>>> topics_given_word = train_output['topics_given_word']
[===Job Progress===]
>>> topics_given_word.inspect()
[#]  word_id     topic_probabilities
===========================================================
[0]  harry       [0.018375903962878668, 0.9816240960371213]
[1]  realestate  [0.8663322126823493, 0.13366778731765067]
[2]  secrets     [0.15694172611285945, 0.8430582738871405]
[3]  movies      [0.9444179131148587, 0.055582086885141324]
[4]  magic       [0.09026309091077593, 0.9097369090892241]
[5]  economy     [0.8098866029287505, 0.19011339707124958]
[6]  chamber     [0.0275551649439219, 0.9724448350560781]
[7]  jobs        [0.8748608515169193, 0.12513914848308066]
>>> topics_given_word.column_names
[u'word_id', u'topic_probabilities']
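
The two word-level tables are easy to confuse. Judging by the numbers printed above, each topics_given_word row sums to 1 (a distribution over topics for one word), while each word_given_topics column sums to 1 (a distribution over words for one topic). The sketch below checks this in plain Python on values copied from the output; the variable names are illustrative and nothing here calls the library.

```python
# Values copied from the word_given_topics.inspect() output above.
word_given_topics = {
    "harry": [0.005015572372943657, 0.2916109787103347],
    "realestate": [0.167941871746252, 0.032187084858186256],
    "secrets": [0.026543839878055035, 0.17103864163730945],
    "movies": [0.03704750433384287, 0.003294403360133419],
    "magic": [0.016497495727347045, 0.19676900962555072],
    "economy": [0.3805836266747442, 0.10952481503975171],
    "chamber": [0.0035944004256137523, 0.13168123398523954],
    "jobs": [0.36277568884120137, 0.06389383278349432],
}

# Values copied from the topics_given_word.inspect() output above.
topics_given_word = {
    "harry": [0.018375903962878668, 0.9816240960371213],
    "realestate": [0.8663322126823493, 0.13366778731765067],
    "secrets": [0.15694172611285945, 0.8430582738871405],
    "movies": [0.9444179131148587, 0.055582086885141324],
    "magic": [0.09026309091077593, 0.9097369090892241],
    "economy": [0.8098866029287505, 0.19011339707124958],
    "chamber": [0.0275551649439219, 0.9724448350560781],
    "jobs": [0.8748608515169193, 0.12513914848308066],
}

# Column sums of word_given_topics: one total per topic.
column_sums = [sum(column) for column in zip(*word_given_topics.values())]

# Row sums of topics_given_word: one total per word.
row_sums = {word: sum(probs) for word, probs in topics_given_word.items()}

print(column_sums)
print(row_sums)
```

Under that reading, word_given_topics answers "which words characterize a topic", and topics_given_word answers "which topic does this word point to".
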
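>>> # 'chamber' 'test' has no comma between the literals, so Python joins them
>>> # into one unseen word, 'chambertest'; the document therefore has five words.
>>> prediction = model.predict(['harry', 'secrets', 'magic', 'harry', 'chamber' 'test'])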
[===Job Progress===]
>>> prediction
{u'topics_given_doc': [0.3149285399451628, 0.48507146005483726], u'new_words_percentage': 20.0, u'new_words_count': 1}
>>> prediction['topics_given_doc']
[0.3149285399451628, 0.48507146005483726]
>>> prediction['new_words_percentage']
20.0
>>> prediction['new_words_count']
1
>>> 'topics_given_doc' in prediction
True
>>> 'new_words_percentage' in prediction
True
>>> 'new_words_count' in prediction
True
>>> model.publish()
[===Job Progress===]
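
The prediction values above can be cross-checked by hand. The sketch below is an assumption about how the summary fields relate, not the library's implementation: it treats any word missing from the trained vocabulary as new. Because 'chamber' 'test' is written without a comma in the predict call, the document holds five words, the last being 'chambertest'.

```python
# Vocabulary copied from the word_given_topics output above.
vocabulary = {
    "harry", "realestate", "secrets", "movies",
    "magic", "economy", "chamber", "jobs",
}

# Same literal as the predict() call; adjacent strings are concatenated.
document = ['harry', 'secrets', 'magic', 'harry', 'chamber' 'test']

# Assumption: a word counts as "new" when it is absent from the vocabulary.
new_words = [word for word in document if word not in vocabulary]
new_words_count = len(new_words)
new_words_percentage = 100.0 * new_words_count / len(document)

# Values copied from the prediction output above.
topics_given_doc = [0.3149285399451628, 0.48507146005483726]
known_fraction = 1.0 - new_words_count / float(len(document))

print(new_words, new_words_count, new_words_percentage)
print(sum(topics_given_doc), known_fraction)
```

The two topic probabilities sum to roughly 0.8, which matches the share of known words, so the remaining mass appears to be attributed to the unseen word.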