InfoR package¶

Subpackages¶

InfoR.Examples package

Submodules¶

InfoR.LanguageModels module¶

class InfoR.LanguageModels.LanguageModel(directory)[source]¶

Implements language models for information retrieval. Each document in the corpus is treated as a language model, and documents are scored by the probability that the query was generated by that model.

document_logScore(document, query)[source]¶

Compute the log probability of the query coming from the given document.

Arguments:

String document : A textual document. String query : The search query.

Returns:

A float, logScore: the log probability of the query given the document.
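The scoring idea can be illustrated with a minimal sketch of a unigram query-likelihood score. This is not the InfoR source: the function name `query_log_score`, the `vocabulary_size` parameter, and the use of add-one (Laplace) smoothing are all assumptions made for illustration, since the documentation does not state how unseen words are handled.

```python
import math
from collections import Counter


def query_log_score(document_words, query_words, vocabulary_size):
    """Log probability of the query under a unigram model of one document.

    Hypothetical helper: add-one smoothing is an assumption of this sketch,
    not something the InfoR documentation specifies.
    """
    counts = Counter(document_words)
    total = len(document_words)
    log_score = 0.0
    for word in query_words:
        # Smoothed estimate of P(word | document); never zero.
        probability = (counts[word] + 1) / (total + vocabulary_size)
        log_score += math.log(probability)
    return log_score
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow for long queries, which is presumably why the method returns a log score.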

logScoreDict(query)[source]¶

Compute the log probability of the query for all the documents.

Arguments:: String query: The search query
Returns:: A dictionary of all the documents in the corpus with their corresponding logScores.

search(query, n_docs)[source]¶

Returns the documents most relevant to the query. Documents are ranked by decreasing log probability of the query coming from the document.

Arguments:: String query : Search query. Integer n_docs : Number of matching documents to retrieve.
Returns:: A list of length n_docs containing the documents most relevant to the search query. The list is sorted in descending order of log probability.
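The ranking step can be sketched as a sort over a score dictionary shaped like the one logScoreDict is described as returning. The helper name `top_documents` and the use of file names as dictionary keys are assumptions for illustration. The documentation does not say whether documents are identified by name or by content.

```python
def top_documents(log_scores, n_docs):
    """Return the n_docs keys with the highest log scores, best first.

    Hypothetical helper: `log_scores` maps a document identifier to its
    log score.
    """
    ranked = sorted(log_scores, key=log_scores.get, reverse=True)
    return ranked[:n_docs]


# Example scores are made up for illustration.
scores = {"a.txt": -3.2, "b.txt": -1.1, "c.txt": -7.5}
best = top_documents(scores, 2)
```

Because log probabilities are negative, the values closest to zero rank first.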

vocabalury()[source]¶

All the words in the corpus.

Returns:: A list of all the words in the corpus.

wordDict()[source]¶

Compute the frequency of occurrence of each word in the corpus.

Returns:: A dictionary mapping each word in the corpus to its frequency of occurrence in the whole corpus.

word_freq(wordlist)[source]¶

Build a dictionary of words with their frequencies of occurrence in the document.

Arguments:: List wordlist : A list of all the words in a document.
Returns:: A dictionary mapping each word in the document to its frequency.
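A frequency dictionary of this kind can be built in a few lines. The sketch below is an illustration under the assumption that "frequency" means a raw count. The function name `word_frequencies` is hypothetical and is not the InfoR implementation.

```python
def word_frequencies(wordlist):
    """Map each word in wordlist to the number of times it occurs."""
    frequencies = {}
    for word in wordlist:
        # Start at 0 for unseen words, then count this occurrence.
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies


counts = word_frequencies(["the", "cat", "the"])
```

The standard library's `collections.Counter` produces the same mapping and could replace the explicit loop.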

words(document)[source]¶

All the words in a document.

Arguments:: String document : A textual document.
Returns:: A list containing all the words in the document.
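How a document is split into words is not specified here. One common approach, shown as an assumption rather than as InfoR's actual behaviour, is to lowercase the text and keep runs of letters and digits. The function name `tokenize` is hypothetical.

```python
import re


def tokenize(document):
    """Split a text into lowercase alphanumeric tokens.

    Hypothetical helper: lowercasing and dropping punctuation are
    assumptions of this sketch.
    """
    return re.findall(r"[a-z0-9]+", document.lower())


tokens = tokenize("The cat, the hat.")
```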

InfoR.ProbabilisticModels module¶

InfoR.VectorSpaceModels module¶

class InfoR.VectorSpaceModels.VSM(directory)[source]¶

Implements a vector space search engine. Each document is represented by a vector in a high-dimensional vector space with one dimension for each unique word in the corpus. The entries of the vector are the term frequencies or tf-idf scores of the corresponding terms.

Latent Semantic Analysis (LSA) of the term-document matrix is performed by Singular Value Decomposition (SVD). A is the term-document matrix, where each row corresponds to a term and each column to a document. The entry a_ij of the matrix contains the tf-idf score of term i in document j. The SVD maps each document from term space to the (lower dimensional) concept space.
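The mapping into concept space can be sketched with NumPy's SVD. This is a minimal illustration, not the InfoR implementation: the function name `lsa_project` is hypothetical, and the matrix is assumed to hold term i in row i and document j in column j, matching the a_ij indexing above.

```python
import numpy as np


def lsa_project(term_document, n_comp):
    """Project documents from term space into an n_comp-dimensional concept space.

    Hypothetical helper. term_document has shape (n_terms, n_documents).
    Returns the projection matrix and the projected document vectors.
    """
    # Thin SVD: term_document = U @ diag(s) @ Vt, singular values descending.
    U, s, Vt = np.linalg.svd(term_document, full_matrices=False)
    # The first n_comp left singular vectors span the concept space.
    projection = U[:, :n_comp].T                    # (n_comp, n_terms)
    document_vectors = projection @ term_document   # (n_comp, n_documents)
    return projection, document_vectors


# Toy matrix with 4 terms and 3 documents; values are made up.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
projection, docs = lsa_project(A, 2)
```

A query vector in term space can be mapped into the same concept space with `projection @ query_vector`, so that it can be compared with the projected documents.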

help()[source]¶: Description of the class and the methods.

search(q, n_docs, tf_idf=False, LSA=False, n_comp=None)[source]¶

Returns the documents most relevant to the query. Documents are ranked by decreasing cosine similarity with the query.

Arguments:: String q : Search query. Integer n_docs : Number of matching documents to retrieve. Boolean tf_idf : If True, the vector features are tf-idf scores. Boolean LSA : If True, the vectors are mapped to a low-dimensional concept space. Integer n_comp : Number of components for the LSA, i.e. the dimension of the concept space.
Returns:: A list of length n_docs containing the documents most relevant to the search query. The list is sorted in descending order of cosine similarity.
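Cosine-similarity ranking can be sketched over sparse vectors stored as dictionaries. The names `cosine_similarity` and `rank_by_cosine`, the document identifiers, and the raw term counts used as weights are assumptions for illustration. With tf_idf=True the weights would be tf-idf scores, but the ranking logic would be the same.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vector, document_vectors, n_docs):
    """Return the n_docs document identifiers most similar to the query."""
    scores = {name: cosine_similarity(query_vector, vector)
              for name, vector in document_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_docs]


# Term-count vectors for three made-up documents.
documents = {
    "d1": {"cat": 2, "hat": 1},
    "d2": {"dog": 3},
    "d3": {"cat": 1, "dog": 1},
}
results = rank_by_cosine({"cat": 1}, documents, 2)
```

Cosine similarity normalises by vector length, so a long document does not outrank a short one merely by containing more words.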
