Getting started
Installation
Requirements
Python 2.7 or Python 3.3+ is required. The following packages are required:
- numpy
- scipy
- scikit-learn
- futures (Python 2.7 only)
GSL is required for random number generation inside the Pólya-Gamma random variate generator. On Debian-based systems, GSL may be installed with the following command:
sudo apt-get install libgsl0-dev
(horizont looks for GSL headers in /usr/include and for GSL libraries in /usr/lib.)
Cython is needed if compiling from source.
Installation
On Debian-based systems the following command should be sufficient to install horizont:
pip install horizont
Quickstart
horizont.LDA implements latent Dirichlet allocation (LDA) using Gibbs sampling. The interface follows conventions in scikit-learn.
>>> import numpy as np
>>> from horizont import LDA
>>> X = np.array([[1,1], [2, 1], [3, 1], [4, 1], [5, 8], [6, 1]])
>>> model = LDA(n_topics=2, random_state=0, n_iter=100)
>>> doc_topic = model.fit_transform(X) # estimate of document-topic distributions
>>> model.components_ # estimate of topic-word distributions
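For readers who want to see what Gibbs sampling for LDA involves, the following is a minimal pure-Python sketch of a collapsed Gibbs sampler. It is not horizont's implementation: the function name `gibbs_lda`, the hyperparameter names `alpha` and `eta`, and their default values are assumptions chosen for illustration. It needs Python 3.6+ because it uses `random.choices`.

```python
import random


def gibbs_lda(dtm, n_topics, n_iter=100, alpha=0.1, eta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA on a dense document-term count matrix.

    alpha and eta are symmetric Dirichlet hyperparameters (illustrative
    values, not horizont's defaults). Returns (doc_topic, topic_word) as
    lists of rows, each row summing to 1.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(dtm), len(dtm[0])
    # Expand the counts into one (document, word) pair per token occurrence.
    tokens = [(d, w) for d, row in enumerate(dtm)
              for w, count in enumerate(row) for _ in range(count)]
    ndk = [[0] * n_topics for _ in range(n_docs)]   # document-topic counts
    nkw = [[0] * n_words for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                             # tokens assigned to each topic
    # Start from a random topic assignment for every token.
    z = [rng.randrange(n_topics) for _ in tokens]
    for (d, w), k in zip(tokens, z):
        ndk[d][k] += 1
        nkw[k][w] += 1
        nk[k] += 1
    topics = range(n_topics)
    for _ in range(n_iter):
        for i, (d, w) in enumerate(tokens):
            # Remove the token's current assignment from the counts.
            k = z[i]
            ndk[d][k] -= 1
            nkw[k][w] -= 1
            nk[k] -= 1
            # Unnormalised conditional probability of each topic for this token.
            weights = [(ndk[d][j] + alpha) * (nkw[j][w] + eta)
                       / (nk[j] + n_words * eta) for j in topics]
            # Draw a new topic and add the token back.
            k = rng.choices(topics, weights=weights)[0]
            z[i] = k
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
    # Smoothed point estimates from the final state of the sampler.
    doc_topic = [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
                 for row in ndk]
    topic_word = [[(c + eta) / (nk[k] + n_words * eta) for c in nkw[k]]
                  for k in topics]
    return doc_topic, topic_word


# Same toy matrix as in the quickstart, as a plain list of lists.
X = [[1, 1], [2, 1], [3, 1], [4, 1], [5, 8], [6, 1]]
doc_topic, topic_word = gibbs_lda(X, n_topics=2)
```

Here `doc_topic` plays the role of the value returned by `fit_transform` and `topic_word` the role of `components_`. The numbers will generally differ from horizont's, since the hyperparameters and random number stream are not the same.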
Example
The following demonstrates fitting a small corpus of newswire articles using horizont.LDA.
First download the following files into the working directory:
- ch.ldac (word frequencies stored in a format similar to LDA-C and SVMLight; indexes start at 1, as in SVMLight)
- ch.tokens (list of words, one per line)
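To make the file layout concrete, here is a small sketch of a parser for this kind of file. It assumes the usual LDA-C line layout, `N index:count index:count ...`, where `N` is the number of distinct terms in the document, combined with the 1-based indexes mentioned above. The function `parse_ldac` is hypothetical and is not part of horizont; the library's own loader is `horizont.utils.ldac2dtm`.

```python
def parse_ldac(lines, n_words=None):
    """Parse LDA-C-like lines with 1-based term indexes into dense count rows.

    Illustrative helper only; assumes each line looks like
    "N index:count index:count ...".
    """
    docs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        n_unique = int(fields[0])
        pairs = [field.split(':') for field in fields[1:]]
        if len(pairs) != n_unique:
            raise ValueError("expected {} terms, found {}".format(n_unique, len(pairs)))
        # Shift the 1-based file indexes to 0-based column indexes.
        docs.append({int(index) - 1: int(count) for index, count in pairs})
    if n_words is None:
        # Infer the vocabulary size from the largest index seen.
        seen = [index for doc in docs for index in doc]
        n_words = max(seen) + 1 if seen else 0
    return [[doc.get(j, 0) for j in range(n_words)] for doc in docs]


# Two documents over a four-word vocabulary.
rows = parse_ldac(["2 1:3 4:1", "1 2:2"])
# rows == [[3, 0, 0, 1], [0, 2, 0, 0]]
```

Passing the result to `np.array(rows)` gives a document-term matrix of the same general shape as the toy `X` in the quickstart.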
>>> import numpy as np
>>> import horizont
>>> dtm = horizont.utils.ldac2dtm(open('ch.ldac'))
>>> vocabulary = np.array([word.strip() for word in open('ch.tokens')])
>>> model = horizont.LDA(n_topics=20, random_state=0, n_iter=500)
>>> doc_topic = model.fit_transform(dtm) # estimate of document-topic distributions
Having fit the model, the ten highest-weighted words in each topic may be extracted with the following lines of code:
>>> for i, dist in enumerate(model.components_): # model.components_ is an estimate of topic-word distributions
...     top_words = vocabulary[dist.argsort()[::-1][0:10]] # indexes of the 10 largest weights, in descending order
...     print("Topic {}: {}".format(i, ', '.join(top_words)))
The above should produce the following:
Topic 0: france, french, church, south, african, national, africa, buddhist, catholic, paris
Topic 1: mother, teresa, order, heart, nuns, charity, calcutta, missionaries, sister, hospital
Topic 2: bishop, bernardin, east, peace, catholic, cardinal, prize, timor, belo, indonesia
Topic 3: visit, michael, romania, trip, king, last, poles, country, romanian, poland
Topic 4: war, world, years, political, former, three, during, minister, country, leader
Topic 5: died, life, president, clinton, church, service, funeral, white, family, house
Topic 6: government, party, against, state, president, political, last, group, minister, parliament
Topic 7: city, years, year, churches, quebec, million, percent, irish, set, opera
Topic 8: pope, vatican, surgery, rome, hospital, pontiff, paul, roman, mass, john
Topic 9: russia, russian, soviet, museum, art, moscow, lenin, stalin, church, century
Topic 10: police, miami, versace, cunanan, home, family, city, york, beach, gay
Topic 11: yeltsin, kremlin, president, operation, russian, heart, russia, surgery, chernomyrdin, doctors
Topic 12: elvis, music, fans, king, concert, first, presley, every, death, stage
Topic 13: british, million, churchill, sale, letters, london, papers, former, britain, estate
Topic 14: charles, prince, diana, royal, king, queen, parker, bowles, camilla, marriage
Topic 15: harriman, u.s, paris, ambassador, churchill, france, clinton, pamela, british, american
Topic 16: against, city, bardot, salonika, cultural, animal, byzantine, second, works, off
Topic 17: film, simpson, wright, star, life, show, people, festival, hollywood, catholic
Topic 18: germany, german, nazi, letter, jews, scientology, israel, hitler, kohl, israeli
Topic 19: church, year, first, people, told, years, say, time, n't, later
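The top-word extraction used above can be tried without fitting a model. The sketch below applies the same `argsort()[::-1]` idiom to a made-up two-topic matrix. The vocabulary, the weights, and the helper name `top_words` are invented for illustration and do not come from the corpus.

```python
import numpy as np

# Hypothetical five-word vocabulary and topic-word weights (each row sums to 1).
vocabulary = np.array(['pope', 'church', 'music', 'elvis', 'vatican'])
components = np.array([[0.40, 0.25, 0.02, 0.03, 0.30],
                       [0.04, 0.06, 0.45, 0.40, 0.05]])


def top_words(dist, vocabulary, n=3):
    """Return the n words with the largest weights, highest first."""
    # argsort sorts ascending, so reverse it before taking the first n indexes.
    return vocabulary[np.argsort(dist)[::-1][:n]]


for i, dist in enumerate(components):
    print("Topic {}: {}".format(i, ', '.join(top_words(dist, vocabulary))))
# Topic 0: pope, vatican, church
# Topic 1: music, elvis, church
```

A word can rank highly in more than one topic, as "church" does here and in several topics of the output above.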