Natural language processing utilities
A corpus for text downloaded from the Web.
Gets the text for a file in the corpus.
Parameters: | |
---|---|
Returns: | The text of the file. |
Gets all the texts of the corpus.
Returns: | The texts as a list. |
---|---|
Initializes a CSV file containing the URLs of downloaded texts.
Stores a text in the corpus directory and records it in the CSV file so that the same file is not downloaded again.
Parameters: | |
---|---|
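The corpus behaviour described above (a directory of text files plus a CSV index of already downloaded URLs) could be sketched roughly as follows. This is only an illustration: the class name `WebCorpusSketch`, its method names, the `index.csv` file name, and the hash-based file naming are all assumptions, not the library's actual API.

```python
import csv
import hashlib
from pathlib import Path


class WebCorpusSketch:
    """Hypothetical stand-in for the corpus class; every name here is an assumption."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.index_path = self.directory / "index.csv"
        # Initialize the CSV of downloaded URLs if it does not exist yet.
        if not self.index_path.exists():
            with self.index_path.open("w", newline="", encoding="utf-8") as f:
                csv.writer(f).writerow(["url", "filename"])

    def _index(self):
        # Map each recorded URL to the file that holds its text.
        with self.index_path.open(newline="", encoding="utf-8") as f:
            return {row["url"]: row["filename"] for row in csv.DictReader(f)}

    def store_text(self, url, text):
        # Skip URLs that are already recorded, so nothing is stored twice.
        if url in self._index():
            return False
        filename = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".txt"
        (self.directory / filename).write_text(text, encoding="utf-8")
        with self.index_path.open("a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([url, filename])
        return True

    def get_text(self, filename):
        # Read one file of the corpus.
        return (self.directory / filename).read_text(encoding="utf-8")

    def get_texts(self):
        # Read every file listed in the index, in index order.
        return [self.get_text(name) for name in self._index().values()]
```

In this sketch a caller would check the return value of `store_text` to see whether the URL was new; the real class may signal this differently.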
Calculates TF-IDF for a list of text strings and sums the scores by term.
Parameters: | |
---|---|
Returns: | A pandas DataFrame with columns ‘term’ and ‘tfidf’. |
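To make the "TF-IDF summed by term" idea concrete, here is a minimal standard-library sketch. It assumes a simple regex tokenizer, term frequency as count divided by document length, and an unsmoothed `log(N / df)` inverse document frequency; the documented function may use a different weighting or tokenizer, and the name `tfidf_by_term` is hypothetical. The sketch returns plain rows rather than a DataFrame; passing them to `pandas.DataFrame(rows)` should give a frame with the documented ‘term’ and ‘tfidf’ columns.

```python
import math
import re
from collections import Counter


def tfidf_by_term(texts):
    """Return rows of {'term', 'tfidf'}, summing each term's TF-IDF over all texts."""
    # Assumed tokenization: lowercase runs of word characters.
    docs = [re.findall(r"\w+", text.lower()) for text in texts]
    n_docs = len(docs)
    # Document frequency: the number of texts that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[term])
            totals[term] += tf * idf
    rows = [{"term": term, "tfidf": score} for term, score in totals.items()]
    # Highest-scoring terms first.
    return sorted(rows, key=lambda row: row["tfidf"], reverse=True)
```

With this unsmoothed weighting, a term that occurs in every text scores zero, which is one reason real implementations often add smoothing.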
Checks whether a string contains digits.
Parameters: | str – A string. |
---|---|
Returns: | True if the string contains digits, otherwise False. |
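A check like this is typically a one-liner. The sketch below uses the hypothetical name `has_digits` and calls its argument `text` rather than `str` to avoid shadowing the built-in type; the real function may differ in how it treats non-ASCII digits.

```python
def has_digits(text):
    """Return True if any character in text is a digit."""
    # str.isdigit() also accepts non-ASCII digit characters.
    return any(ch.isdigit() for ch in text)
```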
Checks whether a string contains repeating words.
Parameters: | str – A string. |
---|---|
Returns: | True if the string has repeating words, otherwise False. |
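The description does not say whether "repeating" means adjacent duplicates or any word that occurs more than once. The sketch below assumes the latter, splits on whitespace, and ignores case; the name `has_repeating_words` and all three of those choices are assumptions.

```python
def has_repeating_words(text):
    """Return True if any whitespace-separated word occurs more than once."""
    # Assumption: comparison is case-insensitive and position-independent.
    words = text.lower().split()
    return len(set(words)) < len(words)
```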
Lowercases all words/strings in a list.
Parameters: | alist – A list of words/strings. |
---|---|
Returns: | The lowercased words. |
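As a small illustration, lowercasing a list can be done with a list comprehension. The function name `lowercase_all` is hypothetical; only the parameter name `alist` comes from the description above.

```python
def lowercase_all(alist):
    """Return a new list with every string in alist lowercased."""
    return [item.lower() for item in alist]
```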
Selects terms based on TF-IDF.
Parameters: | |
---|---|
Returns: | A set containing the selected terms. |
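The selection criteria are not listed here, so the following is only one plausible reading: keep the highest-scoring terms above a minimum score. The function name `select_terms` and the parameters `top_n` and `min_tfidf` are assumptions. The input is a list of `{'term', 'tfidf'}` rows; for a pandas DataFrame with those columns, `df.to_dict("records")` produces the same shape.

```python
def select_terms(rows, top_n=10, min_tfidf=0.0):
    """Return the set of up to top_n terms whose score exceeds min_tfidf."""
    # Drop low-scoring terms first, then rank what remains.
    kept = [row for row in rows if row["tfidf"] > min_tfidf]
    kept.sort(key=lambda row: row["tfidf"], reverse=True)
    return {row["term"] for row in kept[:top_n]}
```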