dautil.nlp

Natural language processing utilities

class dautil.nlp.WebCorpus(dir)

A corpus for text downloaded from the Web.

get_text(name)

Gets the text for a file in the corpus.

Parameters:
  • name – The name of the file.
Returns:

The text of the file.

get_texts()

Gets all the texts of the corpus.

Returns: The texts as a list.

init_url_csv()

Initializes a CSV file containing the URLs of downloaded texts.

store_text(name, txt, url, title, author)

Stores text in the corpus directory. Also updates a CSV file to avoid downloading the same file again.

Parameters:
  • name – The name of the file.
  • txt – The text of the file.
  • url – The URL of the original document.
  • title – The title of the original document.
  • author – The author of the original document.
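
The storage pattern described above (text files in a corpus directory plus a CSV of already downloaded URLs) can be sketched as follows. This is a minimal stand-in under stated assumptions, not dautil's implementation: the class name SketchCorpus, the file name urls.csv, the CSV columns, and the helper known_urls are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


class SketchCorpus:
    """Hypothetical stand-in for WebCorpus; not dautil's implementation."""

    def __init__(self, dir):
        self.dir = Path(dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        # Assumed file name for the URL index.
        self.url_csv = self.dir / "urls.csv"
        self.init_url_csv()

    def init_url_csv(self):
        # Write a header row only if the CSV does not exist yet.
        if not self.url_csv.exists():
            with open(self.url_csv, "w", newline="", encoding="utf-8") as f:
                csv.writer(f).writerow(["name", "url", "title", "author"])

    def known_urls(self):
        # Hypothetical helper: URLs that were already stored.
        with open(self.url_csv, newline="", encoding="utf-8") as f:
            return {row["url"] for row in csv.DictReader(f)}

    def store_text(self, name, txt, url, title, author):
        # Save the text, then record its origin in the CSV.
        (self.dir / name).write_text(txt, encoding="utf-8")
        with open(self.url_csv, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([name, url, title, author])

    def get_text(self, name):
        return (self.dir / name).read_text(encoding="utf-8")

    def get_texts(self):
        # Every file in the directory except the CSV index.
        return [p.read_text(encoding="utf-8")
                for p in sorted(self.dir.iterdir()) if p.suffix != ".csv"]


tmp_dir = tempfile.mkdtemp()
corpus = SketchCorpus(tmp_dir)
url = "http://example.com/a"
# Skip the download step when the URL is already recorded.
if url not in corpus.known_urls():
    corpus.store_text("a.txt", "Some downloaded text.", url,
                      "A title", "An author")
```

Checking the CSV before storing is what lets a caller avoid downloading the same document twice.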
dautil.nlp.calc_tfidf(corpus, sw='english', ngram_range=(2, 3))

Calculates TF-IDF for a list of text strings and sums it up by term.

Parameters:
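  • corpus – A list of text strings.
  • sw – The stop words to ignore, given as a list or as the default string 'english'.
  • ngram_range – The (minimum, maximum) n-gram sizes to extract; defaults to (2, 3).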
Returns:

A pandas DataFrame with columns ‘term’ and ‘tfidf’.
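
To illustrate what "sums it up by term" means, here is a simplified pure-Python sketch. It is not how calc_tfidf is implemented: the function names are hypothetical, the weighting is an unnormalized count times ln(N / df) + 1, tokenization is a plain whitespace split, and the result is a dict rather than a pandas DataFrame.

```python
import math
from collections import Counter


def ngrams(tokens, ngram_range=(2, 3)):
    """Return all word n-grams whose size lies in ngram_range (inclusive)."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_by_term(corpus, stop_words=(), ngram_range=(2, 3)):
    """Sum a simplified TF-IDF weight per term over all documents."""
    stop = set(stop_words)
    docs = [ngrams([w for w in text.lower().split() if w not in stop],
                   ngram_range)
            for text in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            totals[term] += count * idf
    return dict(totals)


totals = tfidf_by_term(["the cat sat", "the cat ran"])
```

Terms that occur in every document, such as "the cat" here, get the lowest per-document weight, while terms unique to one document are weighted higher.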

dautil.nlp.has_digits(str)

Checks whether a string has digits in it.

Parameters: str – A string.
Returns: True if the string has digits, else False.
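
A check of this kind can be sketched in one line. The name has_digits_sketch is hypothetical, and the use of str.isdigit is an assumption about what counts as a digit, not a statement about dautil's code.

```python
def has_digits_sketch(text):
    # True as soon as any character is a digit (assumed definition).
    return any(ch.isdigit() for ch in text)


with_digits = has_digits_sketch("bigram 42")
without_digits = has_digits_sketch("bigram")
```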
dautil.nlp.has_duplicates(str)

Checks whether a string has repeating words in it.

Parameters: str – A string.
Returns: True if the string has repeating words, else False.
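
One plausible reading of "repeating words" is that some whitespace-separated word occurs more than once anywhere in the string. The sketch below assumes that reading; the name has_duplicates_sketch is hypothetical and the actual function may define repetition differently.

```python
def has_duplicates_sketch(text):
    # Assumption: words are whitespace-separated and compared exactly.
    words = text.split()
    return len(set(words)) < len(words)


repeated = has_duplicates_sketch("new york new")
unique = has_duplicates_sketch("new york city")
```

A check like this is useful for discarding n-grams such as "new york new" that arise from overlapping phrases.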
dautil.nlp.lower_all(alist)

Lowercases all words/strings in a list.

Parameters: alist – A list of words/strings.
Returns: The lowercased words.
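
The behavior described above amounts to applying str.lower to each element. A minimal sketch, with the hypothetical name lower_all_sketch and the assumption that a new list is returned:

```python
def lower_all_sketch(alist):
    # Assumption: returns a new list and leaves the input unchanged.
    return [item.lower() for item in alist]


original = ["NLTK", "Pandas", "tf-idf"]
lowered = lower_all_sketch(original)
```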
dautil.nlp.select_terms(df, method='q3', select_func=None)

Select terms based on TF-IDF.

Parameters:
  • df – A pandas DataFrame as produced by the calc_tfidf function.
  • method – The selection method; by default, the third quartile is used as the cutoff.
  • select_func – An optional selection function.
Returns:

A set containing the selected terms.
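
The third-quartile cutoff can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the library's code: it takes a plain dict of term scores instead of a DataFrame, keeps terms strictly above the cutoff, and treats select_func as a callable that receives the scores and returns the chosen terms. The name select_terms_sketch is hypothetical.

```python
import statistics


def select_terms_sketch(scores, select_func=None):
    """Select terms whose score lies above the third quartile."""
    if select_func is not None:
        # Assumed contract: the callable returns an iterable of terms.
        return set(select_func(scores))
    # quantiles(..., n=4) returns the three quartile cut points.
    q3 = statistics.quantiles(scores.values(), n=4, method="inclusive")[2]
    return {term for term, score in scores.items() if score > q3}


scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}
selected = select_terms_sketch(scores)
# A custom rule: keep every term scoring at least 3.
custom = select_terms_sketch(
    scores, select_func=lambda s: [t for t, v in s.items() if v >= 3.0])
```

Using a quartile rather than a fixed threshold keeps the selection proportional to the score distribution of each corpus.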
