TextGraphics.src package

Submodules

TextGraphics.src.graph module

class TextGraphics.src.graph.TextGraph(directory)

Graphical representation of a corpus of text files.

Two types of graphs can be created:

SentenceGraph - Nodes are the sentences in the corpus, and an edge is defined between two sentences based on their similarity.
KeywordGraph - Nodes are the keywords in the corpus, and an edge is defined between two words based on their co-occurrences in the documents.
docFreq()

Find the documents that contain each word in the corpus.

Returns:
A dictionary mapping each word in the corpus to a list of all the documents containing that word.
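
The mapping described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name doc_freq, the dict-based corpus, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict

def doc_freq(corpus):
    """Map each word to the list of document names that contain it.

    `corpus` is a hypothetical {document name: document text} dict;
    the real class reads its documents from a directory instead.
    """
    index = defaultdict(list)
    for name, text in corpus.items():
        # set() so a document is listed once per word, however often it occurs
        for word in set(text.lower().split()):
            index[word].append(name)
    return dict(index)

corpus = {"a.txt": "graphs of text", "b.txt": "text as graphs of words"}
index = doc_freq(corpus)
print(sorted(index["text"]))
```

Here the word "text" maps to both documents, while "words" maps only to b.txt.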
keywordGraph(cooccuranceThreshold=1)

Build the keyword graph.

Arguments:
cooccuranceThreshold : If the number of co-occurrences of two keywords is above cooccuranceThreshold, there is an edge between the nodes representing those keywords. Default value is 1.
Returns:
A networkx graph with keywords as nodes, with an edge between two nodes if their co-occurrence count is greater than cooccuranceThreshold.
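
The thresholding rule can be illustrated with a small sketch. It assumes that a co-occurrence means two keywords appearing in the same document, and that "above" means strictly greater than the threshold; neither detail is confirmed by the documentation. The name keyword_edges and the list-of-word-lists input are hypothetical, and the result is a plain dict rather than a networkx graph so the example needs only the standard library.

```python
from collections import Counter
from itertools import combinations

def keyword_edges(docs, threshold=1):
    """Return {(word_a, word_b): count} for keyword pairs that appear
    together in more than `threshold` documents (assumed definition)."""
    pair_counts = Counter()
    for words in docs:
        # Count each unordered pair at most once per document.
        for pair in combinations(sorted(set(words)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n > threshold}

docs = [["graph", "node", "edge"], ["graph", "edge"], ["graph", "text"]]
edges = keyword_edges(docs, threshold=1)
print(edges)
```

Only the pair ("edge", "graph") co-occurs in more than one document, so it is the only edge kept. The keys of the result could be handed to a networkx graph with add_edges_from.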
sentenceGraph(similarityThreshold=0.2, stemming=False)

Build the sentence graph.

Arguments:
similarityThreshold : If the similarity of two sentences is above similarityThreshold, there is an edge between the nodes representing those sentences. Default value is 0.2.
stemming : If True, words are stemmed with the Porter stemmer. Stemming requires the nltk package.
Returns:
A networkx graph with sentences as nodes, with an edge between two nodes if their similarity value is greater than similarityThreshold.
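
A rough sketch of how such a thresholded graph might be assembled is shown below. The helper names sentence_edges and jaccard are hypothetical, and Jaccard word overlap is used only as a simple stand-in for the library's similarity measure; the edges are returned as index pairs rather than as a networkx graph.

```python
from itertools import combinations

def jaccard(s1, s2):
    """Word-overlap similarity; a stand-in, not the library's measure."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def sentence_edges(sentences, similarity, threshold=0.2):
    """Return index pairs (i, j) whose similarity exceeds `threshold`."""
    return [
        (i, j)
        for (i, s1), (j, s2) in combinations(enumerate(sentences), 2)
        if similarity(s1, s2) > threshold
    ]

sentences = ["the cat sat", "the cat ran", "dogs bark loudly"]
edges = sentence_edges(sentences, jaccard)
print(edges)
```

The first two sentences share two of four distinct words (similarity 0.5), so they are linked; the third shares no words with either and stays isolated.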
sentenceIntersection(s1, s2, stemming=False)

Compute the similarity of two sentences, defined as the cosine similarity of the vectors representing the sentences.

Arguments:
s1 : sentence 1
s2 : sentence 2
stemming : If True, words are stemmed with the Porter stemmer. Stemming requires the nltk package.
Returns:
A float giving the similarity of the two sentences.
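
For reference, cosine similarity over word-count vectors can be sketched as below. This is an illustration under assumptions, not the library's code: the function name cosine_similarity is hypothetical, tokenization is a plain whitespace split, and no stemming is applied.

```python
import math
from collections import Counter

def cosine_similarity(s1, s2):
    """Cosine of the angle between the word-count vectors of two sentences."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    # Only words present in both sentences contribute to the dot product.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

score = cosine_similarity("the cat sat", "the cat ran")
print(round(score, 3))
```

Because cosine similarity is unchanged when a vector is scaled, raw counts and normalized frequencies give the same score.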
vocabalury()

Compute all the unique words in the corpus.

Returns:
A set of all the unique words in the whole corpus of documents.
wordDocs(d)

Arguments:
d : a document
Returns:
A dictionary mapping each word in d to d.
wordFrequency(sentence, stemming=False)

Compute the normalized frequency of occurrence of each word in the sentence.

Arguments:
sentence : A sentence
stemming : If True, words are stemmed with the Porter stemmer. Stemming requires the nltk package.
Returns:
A dictionary mapping each word to its normalized frequency of occurrence in the sentence.
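
One plausible reading of "normalized frequency" is each word's count divided by the total number of words in the sentence. The sketch below follows that assumption, which the documentation does not confirm; the name word_frequency is hypothetical and stemming is omitted.

```python
from collections import Counter

def word_frequency(sentence):
    """Return {word: count / total words}, assuming normalization
    by sentence length (an assumption, not the documented rule)."""
    words = sentence.lower().split()
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}

freqs = word_frequency("the cat and the dog")
print(freqs["the"])
```

Under this normalization the frequencies of a non-empty sentence sum to 1.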
words(d)

Compute all the words in a document.

Arguments:
d : a document
Returns:
A list of all the words in the document.

Module contents
