InfoR package
Submodules
InfoR.LanguageModels module
-
class InfoR.LanguageModels.LanguageModel(directory)
Implements language models for information retrieval.
Each document in the corpus is treated as a language model, and
we compute the probability that the query was generated by
that model.
-
document_logScore(document, query)
Compute the log probability of the query coming from the given document.
Arguments:
String document : A textual document.
String query : The search query.
Returns:
A float, logScore.
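As an illustration of the kind of computation described above, here is a minimal sketch of a unigram query log-likelihood. It is not taken from the InfoR source: the function name query_log_score, the whitespace tokenization, and the add-alpha smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


def query_log_score(document, query, alpha=1.0):
    """Log probability of `query` under a unigram model of `document`.

    Hypothetical helper, not part of InfoR. Add-alpha smoothing is an
    assumption, used so that unseen query words do not produce log(0).
    """
    doc_words = document.lower().split()
    query_words = query.lower().split()
    counts = Counter(doc_words)
    # Assumed vocabulary: the words of the document plus those of the query.
    vocab_size = len(set(doc_words) | set(query_words))
    total = len(doc_words)

    log_score = 0.0
    for word in query_words:
        # Smoothed probability of the word under the document model.
        p = (counts[word] + alpha) / (total + alpha * vocab_size)
        log_score += math.log(p)
    return log_score
```

With this sketch, a query word that occurs in the document scores higher than one that does not, and longer queries accumulate more negative log scores.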
-
logScoreDict(query)
Compute the log probability of the query for all the documents.
- Arguments:
- String query: The search query
- Returns:
- A dictionary of all the documents in the corpus with their corresponding logScores.
-
search(query, n_docs)
Returns the documents that are most relevant to the query.
Ranking is done by decreasing log probability of the query coming from the document.
- Arguments:
- String query : Search query
Integer n_docs : Number of matching documents to retrieve.
- Returns:
- A list of length n_docs containing documents most relevant to the search query.
The list is sorted in descending order of relevance.
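The ranking step can be sketched as follows, assuming a dictionary of per-document log scores like the one logScoreDict is described as returning. The helper name top_n and the use of file names as keys are assumptions for illustration only.

```python
def top_n(log_scores, n_docs):
    """Return the n_docs keys with the highest log scores, best first.

    Hypothetical helper, not part of InfoR. `log_scores` is assumed to
    map a document identifier to its query log probability.
    """
    ranked = sorted(log_scores, key=log_scores.get, reverse=True)
    return ranked[:n_docs]
```

If n_docs exceeds the number of documents, this sketch simply returns all of them.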
-
vocabalury()
All the words in the corpus.
- Returns:
- A list of all the words in the corpus.
-
wordDict()
Compute the frequencies of occurrence of the words in the corpus.
- Returns:
- A dictionary containing all the words in the corpus with the frequencies
of their occurrence in the whole corpus.
-
word_freq(wordlist)
Build a dictionary of words with the frequencies of their occurrence in the document.
- Arguments:
- List wordlist : A list of all the words in a document.
- Returns:
- A dictionary containing all the words in the document with their frequencies.
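A word-frequency dictionary of this kind can be built with the standard library. The sketch below is only an illustration and may differ from what InfoR does internally; the name word_frequencies is hypothetical.

```python
from collections import Counter


def word_frequencies(wordlist):
    """Map each word in `wordlist` to the number of times it occurs.

    Hypothetical helper, not part of InfoR.
    """
    return dict(Counter(wordlist))
```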
-
words(document)
All the words in a document.
- Arguments:
- String document : A textual document.
- Returns:
- A list containing all the words in the document.
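The documentation does not say how a document is split into words. One plausible approach, shown here purely as an assumption, is to lowercase the text and keep runs of letters and digits; the name tokenize is hypothetical.

```python
import re


def tokenize(document):
    """Split a textual document into lowercase alphanumeric tokens.

    Hypothetical helper, not part of InfoR. The tokenization rule
    (lowercasing, dropping punctuation) is an assumption.
    """
    return re.findall(r"[a-z0-9]+", document.lower())
```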
InfoR.ProbabilisticModels module
InfoR.VectorSpaceModels module
-
class InfoR.VectorSpaceModels.VSM(directory)
Implements a vector space search engine. Each document is represented by a vector in a high-dimensional
vector space with one dimension for each unique word in the corpus.
The entries of the vector are the term frequencies or tf-idf scores.
Latent Semantic Analysis (LSA) of the term-document matrix is performed by Singular Value Decomposition (SVD).
A is the term-document matrix, where each row corresponds to a term and each column to a document.
Each entry a_ij of the matrix contains the tf-idf score of term i in document j.
The SVD maps each document from term space to the (lower-dimensional) concept space.
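The mapping to concept space can be written out as follows. This is the standard textbook formulation of LSA, taking a_ij as the score of term i in document j so that the columns of A are documents; whether InfoR scales the projected vectors by the singular values in exactly this way is an assumption. Here m is the number of terms, n the number of documents, and k the number of retained components (the role played by n_comp).

```latex
A = U \Sigma V^{\top}, \qquad A \in \mathbb{R}^{m \times n}

A_k = U_k \Sigma_k V_k^{\top}, \qquad k \le \operatorname{rank}(A)

\hat{d}_j = \Sigma_k^{-1} U_k^{\top} a_j
\quad \text{(document } j \text{, the } j\text{-th column of } A\text{)}

\hat{q} = \Sigma_k^{-1} U_k^{\top} q
\quad \text{(query vector in term space)}
```

Under this formulation, \hat{d}_j is the j-th column of V_k^{\top}, and the query and documents can be compared in the k-dimensional concept space.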
-
help()
Description of the class and the methods.
-
search(q, n_docs, tf_idf=False, LSA=False, n_comp=None)
Returns the documents that are most relevant to the query.
Ranking is done by decreasing cosine similarity with the query.
- Arguments:
- String q : Search query
Integer n_docs : Number of matching documents to retrieve.
Boolean tf_idf : If True, the vector features will have tf-idf scores.
Boolean LSA : If True, the vectors will be mapped to a low-dimensional concept space.
Integer n_comp : Number of components for the LSA, dimension of the concept space.
- Returns:
- A list of length n_docs containing documents most relevant to the search query.
The list is sorted in descending order of relevance.
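To make the ranking concrete, here is a small self-contained sketch of tf-idf weighting followed by cosine-similarity ranking, without the LSA step. It is not InfoR's implementation: the function names, the dictionary-of-texts input, the whitespace tokenization, and the idf formula log(N / df) are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return ({name: {term: weight}}, idf) for a {name: text} corpus.

    Hypothetical helper. Assumed weighting: raw term count times
    log(N / document frequency).
    """
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n / df[term]) for term in df}
    vectors = {
        name: {term: tf * idf[term] for term, tf in c.items()}
        for name, c in counts.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query, n_docs):
    """Return the n_docs document names most similar to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q_counts = Counter(query.lower().split())
    # Query terms absent from the corpus get weight 0.
    q_vec = {term: tf * idf.get(term, 0.0) for term, tf in q_counts.items()}
    scores = {name: cosine(q_vec, vec) for name, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_docs]
```

In this sketch, documents that share no weighted term with the query get a similarity of 0 and fall to the end of the ranking.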
Module contents