dautil.nlp

Natural language processing utilities.

class dautil.nlp.WebCorpus(dir)

A corpus for text downloaded from the Web.

get_text(name)

Gets the text for a file in the corpus.

Parameters:	name – The name of the file.
Returns:	The text of the file.

get_texts()

Gets all the texts of the corpus.

Returns:	The texts as a list.

store_text(name, txt, url, title, author)

Stores text in the corpus directory. Also updates a CSV file to avoid downloading the same file again.

Parameters:	name – The name of the file. txt – The text of the file. url – The URL of the original document. title – The title of the original document. author – The author of the original document.
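
The storage scheme described above can be illustrated with a small stand-in. This is a minimal sketch, not the dautil implementation: the class name MiniWebCorpus, the index file name index.csv, the CSV column order, and the known_urls helper are all assumptions made for illustration.

```python
import csv
import os
import tempfile


class MiniWebCorpus:
    """Hypothetical stand-in for a directory-backed text corpus."""

    def __init__(self, dir):
        self.dir = dir
        # Index of stored documents; the file name is an assumption.
        self.index = os.path.join(dir, "index.csv")
        os.makedirs(dir, exist_ok=True)

    def store_text(self, name, txt, url, title, author):
        # Write the text itself, then record its metadata in the CSV index.
        with open(os.path.join(self.dir, name), "w", encoding="utf-8") as f:
            f.write(txt)
        with open(self.index, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([name, url, title, author])

    def known_urls(self):
        # URLs already in the index, so callers can skip repeat downloads.
        if not os.path.exists(self.index):
            return set()
        with open(self.index, newline="", encoding="utf-8") as f:
            return {row[1] for row in csv.reader(f) if row}

    def get_text(self, name):
        with open(os.path.join(self.dir, name), encoding="utf-8") as f:
            return f.read()

    def get_texts(self):
        # Every file except the index counts as a corpus document.
        names = sorted(n for n in os.listdir(self.dir) if n != "index.csv")
        return [self.get_text(n) for n in names]


corpus = MiniWebCorpus(tempfile.mkdtemp())
corpus.store_text("a.txt", "first text", "http://example.com/a", "A", "Ann")
corpus.store_text("b.txt", "second text", "http://example.com/b", "B", "Bob")
```

Checking a URL against known_urls() before fetching is one way the CSV index could prevent repeat downloads; how dautil itself consults the file is not documented here.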

dautil.nlp.calc_tfidf(corpus, sw='english', ngram_range=(2, 3))

Calculates TF-IDF for a list of text strings and sums it up by term.

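Parameters:	corpus – A list of text strings. sw – The stop words, given as a list or as the default 'english'. ngram_range – A (min, max) tuple of n-gram sizes to consider; defaults to (2, 3).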
Returns:	A pandas DataFrame with columns ‘term’ and ‘tfidf’.
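
To make "sums it up by term" concrete, here is a simplified standard-library sketch. It is not what calc_tfidf computes internally: the names ngrams and summed_tfidf are hypothetical, the weighting is the plain textbook tf * ln(N / df), stop words are not removed, and the result is a list of dicts rather than a pandas DataFrame.

```python
import math
from collections import Counter


def ngrams(text, ngram_range=(2, 3)):
    # Word n-grams for every size from min to max, inclusive.
    words = text.lower().split()
    lo, hi = ngram_range
    return [" ".join(words[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(words) - n + 1)]


def summed_tfidf(corpus, ngram_range=(2, 3)):
    # Term counts per document.
    counts = [Counter(ngrams(doc, ngram_range)) for doc in corpus]
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(corpus)
    totals = Counter()
    for c in counts:
        for term, tf in c.items():
            # Textbook weighting; real libraries often smooth and
            # normalize differently.
            totals[term] += tf * math.log(n_docs / doc_freq[term])
    # Rows mirror the documented 'term' and 'tfidf' columns.
    return [{"term": t, "tfidf": s} for t, s in totals.most_common()]


docs = ["the quick brown fox", "the quick red fox", "a lazy brown dog"]
rows = summed_tfidf(docs, ngram_range=(2, 2))
scores = {row["term"]: row["tfidf"] for row in rows}
```

In this sketch a bigram that appears in only one document, such as "quick brown", ends up with a higher summed score than "the quick", which appears in two.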

dautil.nlp.has_digits(str)

Checks whether a string has digits in it.

Parameters:	str – A string.
Returns:	True if the string has digits else False.
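
A check of this kind might look like the sketch below. The function name contains_digit is hypothetical, and treating any character accepted by str.isdigit() as a digit is an assumption; the library may use a different test.

```python
def contains_digit(s):
    # True as soon as one character passes str.isdigit().
    return any(ch.isdigit() for ch in s)


with_digits = contains_digit("abc123")
without_digits = contains_digit("abc")
```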

dautil.nlp.has_duplicates(str)

Checks whether a string has repeating words in it.

Parameters:	str – A string.
Returns:	True if the string has repeating words else False.
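
One plausible reading of "repeating words" is that some whitespace-separated word occurs more than once. The sketch below follows that reading; the name contains_repeated_word is hypothetical, and the case-sensitive comparison is an assumption rather than documented behavior.

```python
def contains_repeated_word(s):
    # Split on whitespace; a set drops repeats, so a shorter set
    # means at least one word occurred more than once.
    words = s.split()
    return len(set(words)) < len(words)


repeated = contains_repeated_word("the cat saw the dog")
unique = contains_repeated_word("the cat saw a dog")
```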

dautil.nlp.lower_all(alist)

Lowercases all words/strings in a list.

Parameters:	alist – A list of words/strings.
Returns:	The lowercased words.
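
The behavior amounts to applying str.lower() to each element. A minimal equivalent, with the hypothetical name lowercase_all and assuming the result is a new list:

```python
def lowercase_all(items):
    # Build a new list; the input list is left unchanged.
    return [item.lower() for item in items]


lowered = lowercase_all(["Hello", "WORLD", "mixed Case"])
```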

dautil.nlp.select_terms(df, method='q3', select_func=None)

Selects terms based on TF-IDF.

Parameters:	df – A pandas DataFrame as produced by the calc_tfidf function. method – The selection method; the default, 'q3', uses the third quartile as the cutoff. select_func – An optional selection function.
Returns:	A set containing the selected terms.
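
The third-quartile cutoff can be sketched with the standard library. This is an illustration under assumptions, not the dautil code: select_above_q3 is a hypothetical name, rows are plain dicts instead of a DataFrame, the quartile uses linear interpolation, terms must score strictly above the cutoff, and select_func is left out because its contract is not documented here.

```python
import statistics


def select_above_q3(rows):
    scores = [row["tfidf"] for row in rows]
    # method="inclusive" interpolates linearly between data points;
    # index 2 of the three cut points is the third quartile.
    q3 = statistics.quantiles(scores, n=4, method="inclusive")[2]
    # Assumption: keep only terms strictly above the cutoff.
    return {row["term"] for row in rows if row["tfidf"] > q3}


rows = [{"term": t, "tfidf": s}
        for t, s in zip("abcde", [1.0, 2.0, 3.0, 4.0, 5.0])]
selected = select_above_q3(rows)
```

With scores 1.0 through 5.0 the third quartile is 4.0, so only the top-scoring term passes the strict comparison.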