TextGraphics.src package

Submodules

class TextGraphics.src.graph.TextGraph(directory)

Graphical representation of a corpus of text files.

Two types of graphs can be created:

SentenceGraph - Nodes are the sentences in the corpus, and an edge is defined between two sentences based on their similarity.

KeywordGraph - Nodes are the keywords in the corpus, and an edge is defined between two keywords based on their co-occurrences in the documents.

docFreq()

Compute the documents containing each word.

Returns:: A dictionary that maps every word in the corpus to a list of all the documents containing that word.
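
As a rough illustration of the mapping described above, the following standalone sketch builds a word-to-documents dictionary from an in-memory corpus. It does not use TextGraph: the function name `doc_freq`, the dict-of-strings input, and the lowercase `\w+` tokenization are all assumptions made for the example.

```python
import re
from collections import defaultdict

def doc_freq(docs):
    """Map each word to the sorted list of document names containing it.

    `docs` is a hypothetical {document name: text} dictionary; the real
    class reads its corpus from a directory instead.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        # Assumed tokenization: lowercase runs of word characters.
        for word in set(re.findall(r"\w+", text.lower())):
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}

docs = {
    "a.txt": "Graphs model text.",
    "b.txt": "Text becomes graphs and nodes.",
}
index = doc_freq(docs)
```

Here `index["graphs"]` lists both files, while `index["nodes"]` lists only `b.txt`.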

keywordGraph(cooccuranceThreshold=1)

Build the keyword graph.

Arguments::
cooccuranceThreshold : If the number of co-occurrences of two keywords is above cooccuranceThreshold, there is an edge between the nodes representing those keywords. Default value is 1.

Returns:: A networkx graph with keywords as nodes; there is an edge between two nodes if the number of co-occurrences of their keywords is greater than cooccuranceThreshold.
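
The thresholding rule can be sketched without the library as follows. This is only an approximation under stated assumptions: every word is treated as a keyword, a co-occurrence is counted once per document containing both words, and the hypothetical helper `keyword_edges` returns a plain dictionary of edges rather than a networkx graph.

```python
import re
from collections import Counter
from itertools import combinations

def keyword_edges(docs, cooccurance_threshold=1):
    """Return {(word1, word2): count} for pairs co-occurring in more than
    `cooccurance_threshold` documents (assumed definition of co-occurrence)."""
    counts = Counter()
    for text in docs:
        # Sort so that each unordered pair always appears in the same order.
        words = sorted(set(re.findall(r"\w+", text.lower())))
        counts.update(combinations(words, 2))
    return {pair: n for pair, n in counts.items() if n > cooccurance_threshold}

docs = ["graph node edge", "graph node", "graph text"]
edges = keyword_edges(docs)
```

Only the pair ("graph", "node") appears together in more than one document, so it is the only edge kept at the default threshold. The dictionary keys could then be handed to `networkx.Graph.add_edges_from` to obtain a graph object.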

sentenceGraph(similarityThreshold=0.2, stemming=False)

Build the sentence graph.

Arguments::
similarityThreshold : If the similarity of two sentences is above similarityThreshold, there is an edge between the nodes representing those sentences. Default value is 0.2.
stemming : If True, words are stemmed with the Porter stemmer. Stemming requires the nltk package.

Returns:: A networkx graph with sentences as nodes; there is an edge between two nodes if their similarity value is greater than similarityThreshold.
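
A minimal sketch of the edge rule, assuming the similarity measure is supplied as a function. The names `sentence_edges` and `word_overlap` are hypothetical, and `word_overlap` is a Jaccard word-overlap stand-in used only to keep the example short; it is not the cosine similarity that sentenceIntersection computes. Stemming is omitted.

```python
from itertools import combinations

def sentence_edges(sentences, similarity, similarity_threshold=0.2):
    """Return (i, j, score) for every sentence pair scoring above the threshold."""
    edges = []
    for (i, s1), (j, s2) in combinations(enumerate(sentences), 2):
        score = similarity(s1, s2)
        if score > similarity_threshold:
            edges.append((i, j, score))
    return edges

def word_overlap(s1, s2):
    """Stand-in similarity: shared words divided by all distinct words (Jaccard)."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

sentences = ["the cat sat", "the cat ran", "dogs bark loudly"]
edges = sentence_edges(sentences, word_overlap)
```

The first two sentences share two of four distinct words (score 0.5), so they are the only connected pair. Triples in this form match what `networkx.Graph.add_weighted_edges_from` accepts, should a graph object be needed.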

sentenceIntersection(s1, s2, stemming=False)

Compute the similarity of two sentences, defined as the cosine similarity of the vectors representing them.

Arguments::
s1 : Sentence 1.
s2 : Sentence 2.
stemming : If True, words are stemmed with the Porter stemmer. Stemming requires the nltk package.

Returns:: A float giving the similarity of the two sentences.
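
For reference, cosine similarity over word-count vectors can be sketched as below. The function name `cosine_similarity` and the tokenization are assumptions, stemming is left out, and raw counts are used as vector components; because cosine similarity is unchanged when a vector is scaled by a positive constant, using normalized frequencies instead would give the same value.

```python
import math
import re
from collections import Counter

def cosine_similarity(s1, s2):
    """Cosine of the angle between the word-count vectors of two sentences."""
    v1 = Counter(re.findall(r"\w+", s1.lower()))
    v2 = Counter(re.findall(r"\w+", s2.lower()))
    # Only words present in both sentences contribute to the dot product.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

same = cosine_similarity("the cat sat", "the cat sat")
partial = cosine_similarity("the cat", "the dog")
disjoint = cosine_similarity("red apples", "blue skies")
```

Identical sentences score close to 1, sentences with no words in common score 0, and the partially overlapping pair falls in between.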

vocabalury()

Compute all the unique words in the corpus.

wordDocs(d)

Compute all the words in a document.

wordFrequency(sentence, stemming=False)

Compute the normalized frequency of occurrence of words in the sentence.

Arguments::
sentence : A sentence.
stemming : If True, words are stemmed with the Porter stemmer. Stemming requires the nltk package.

Returns:: A dictionary that maps each word to its normalized frequency of occurrence in the sentence.
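
One plausible reading of "normalized frequency" for wordFrequency is each word's count divided by the total number of words in the sentence. The sketch below follows that reading; the normalization choice, the function name `word_frequency`, and the tokenization are assumptions, and stemming is omitted.

```python
import re
from collections import Counter

def word_frequency(sentence):
    """Map each word to its count divided by the sentence's total word count."""
    words = re.findall(r"\w+", sentence.lower())
    counts = Counter(words)
    total = len(words)
    # An empty sentence has no words, so return an empty mapping.
    return {w: c / total for w, c in counts.items()} if total else {}

freqs = word_frequency("the cat saw the dog")
```

With this normalization the values sum to 1, and "the" (two of five words) gets a frequency of 0.4.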