InfoR package

Submodules

InfoR.LanguageModels module

class InfoR.LanguageModels.LanguageModel(directory)[source]

Implements language models for information retrieval. Each document in the corpus is treated as its own language model, and documents are scored by the probability that their model generates the query.

document_logScore(document, query)[source]

Compute the log probability of the query coming from the given document.

Arguments:

String document : A textual document.
String query : The search query.

Returns:

A float logScore: the log probability of the query given the document.
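
As an illustration of the scoring idea, the sketch below computes a query log probability under a unigram document model. It is not taken from the InfoR source: the function name `query_log_likelihood`, the whitespace tokenization, the `vocab_size` parameter, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


def query_log_likelihood(document, query, vocab_size):
    """Log P(query | document) under a unigram model (hypothetical helper).

    Assumption: add-one (Laplace) smoothing; InfoR may smooth differently.
    """
    doc_words = document.lower().split()
    counts = Counter(doc_words)
    total = len(doc_words)

    log_score = 0.0
    for word in query.lower().split():
        # Smoothing keeps a query word that is absent from the document
        # from producing log(0).
        log_score += math.log((counts[word] + 1) / (total + vocab_size))
    return log_score
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow for long queries, which is the usual reason for working with log scores.
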
logScoreDict(query)[source]

Compute the log probability of the query for all the documents.

Arguments:
String query : The search query.
Returns:
A dictionary of all the documents in the corpus with their corresponding logScores.
search(query, n_docs)[source]

Returns the documents that are most relevant to the query. Documents are ranked by decreasing log probability of the query coming from the document.

Arguments:
String query : Search query.
Integer n_docs : Number of matching documents to retrieve.
Returns:
A list of length n_docs containing the documents most relevant to the search query. The list is sorted in descending order of log probability.
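
The ranking step can be sketched as a sort over a score dictionary such as the one logScoreDict is described as returning. The helper name `top_documents` and the example file names are hypothetical; this is an illustration, not the InfoR implementation.

```python
def top_documents(log_scores, n_docs):
    """Return the n_docs document names with the highest log score.

    log_scores: dict mapping a document name to its log probability.
    """
    # Log probabilities are negative, so "highest" means closest to zero.
    ranked = sorted(log_scores, key=log_scores.get, reverse=True)
    return ranked[:n_docs]


scores = {"a.txt": -3.2, "b.txt": -1.1, "c.txt": -7.5}
best = top_documents(scores, 2)
```
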
vocabalury()[source]

All the words in the corpus.

Returns:
A list of all the words in the corpus.
wordDict()[source]

Compute the frequency of occurrence of each word in the corpus.

Returns:
A dictionary mapping each word in the corpus to its frequency of occurrence across the whole corpus.
word_freq(wordlist)[source]

Build a dictionary of words with their frequencies of occurrence in the document.

Arguments:
List wordlist : A list of all the words in a document.
Returns:
A dictionary containing all the words in the document with their frequencies.
words(document)[source]

All the words in a document.

Arguments:
String document : A textual document.
Returns:
A list containing all the words in the document.
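
The helper methods words, word_freq and wordDict describe a tokenize-then-count pipeline. A minimal sketch of that pipeline follows; the function names and the tokenization rule (lowercasing, splitting on non-alphanumeric characters) are assumptions, since the reference above does not say how InfoR splits text.

```python
import re
from collections import Counter


def tokenize(document):
    """Split a textual document into lowercase words (assumed rule)."""
    return re.findall(r"[a-z0-9]+", document.lower())


def word_frequencies(wordlist):
    """Map each word in one document's word list to its count."""
    return dict(Counter(wordlist))


def corpus_frequencies(documents):
    """Map each word to its total count across all documents."""
    totals = Counter()
    for document in documents:
        totals.update(tokenize(document))
    return dict(totals)
```

The keys of the corpus-level dictionary double as the vocabulary, so a separate pass over the corpus is not needed to list all the words.
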

InfoR.ProbabilisticModels module

InfoR.VectorSpaceModels module

class InfoR.VectorSpaceModels.VSM(directory)[source]

Implements a vector space search engine. Each document is represented by a vector in a high-dimensional vector space with one dimension for each unique word in the corpus. The components of each vector are either the raw term frequencies or the tf-idf scores of the corresponding terms.
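
The reference does not state which tf-idf weighting VSM uses. The sketch below shows one common variant, raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf_vectors` and the sparse dictionary representation are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Build one sparse tf-idf vector (a dict) per tokenized document.

    Assumption: weight = term count * log(N / document frequency).
    """
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized_docs:
        term_counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors
```

Under this variant a term that occurs in every document gets weight zero, so it no longer influences the similarity between documents.
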

Latent Semantic Analysis (LSA) of the term-document matrix is performed by Singular Value Decomposition (SVD). A is the term-document matrix, where each row corresponds to a term and each column to a document. The entry a_ij contains the tf-idf score of term i in document j. The SVD maps each document from term space to the (lower dimensional) concept space.
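
The mapping to concept space can be written out as follows, for m terms, n documents and k retained components (the n_comp argument of search). The symbols other than A and a_ij are introduced here for illustration, and the rule for projecting the query is one common convention, not something the reference confirms for InfoR.

```latex
% Rank-k truncated SVD of the m x n term-document matrix
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% Document j (column a_j of A, e_j the j-th unit vector) in concept space
\hat{d}_j = U_k^{\top} a_j = \Sigma_k V_k^{\top} e_j

% Query vector q (in term space) projected the same way
\hat{q} = U_k^{\top} q
```

Because documents and queries are projected with the same matrix, cosine similarity can be computed between k-dimensional vectors instead of m-dimensional ones.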

help()[source]

Description of the class and the methods.

search(q, n_docs, tf_idf=False, LSA=False, n_comp=None)[source]

Returns the documents that are most relevant to the query. Documents are ranked by decreasing cosine similarity with the query.

Arguments:
String q : Search query.
Integer n_docs : Number of matching documents to retrieve.
Boolean tf_idf : If True, the vector components are tf-idf scores instead of raw term frequencies.
Boolean LSA : If True, the vectors are mapped to a low-dimensional concept space.
Integer n_comp : Number of components for the LSA, i.e. the dimension of the concept space.
Returns:
A list of length n_docs containing the documents most relevant to the search query. The list is sorted in descending order of cosine similarity.
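
Ranking by cosine similarity can be sketched with sparse term-frequency vectors stored as dictionaries. The names `cosine_similarity` and `rank_by_cosine`, the whitespace tokenization, and the `documents` mapping from name to text are all assumptions; the sketch covers only the default case (raw frequencies, no LSA).

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_by_cosine(documents, query, n_docs):
    """Return the names of the n_docs documents closest to the query."""
    query_vec = Counter(query.lower().split())
    scores = {
        name: cosine_similarity(Counter(text.lower().split()), query_vec)
        for name, text in documents.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n_docs]
```

Cosine similarity normalizes by vector length, so a long document is not favored merely because it repeats more words overall.
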

Module contents