twistml.filtering package

Subpackages

twistml.filtering.ldig package

Submodules

twistml.filtering.category module

Filter tweets into categories.

A category consists of a name and a list of keywords, e.g.::

    exmpl_cat = {'name': 'feeling', 'keywords': ['feel', 'makes me']}

Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT

class twistml.filtering.category.Category(name, keywords)

Combines multiple keywords into a named category.

A category has a name and a list of keywords, both of which have to be passed to the constructor. The Category class provides functions to save to and load from a .json file and defines equality / inequality between categories.

static load_categories_from_json(filename): Loads a list of categories from a file in JSON format.

static save_categories_to_json(categories, filename): Saves a list of categories to a file in JSON format.
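
The behaviour described above can be illustrated with a small stand-in class. This is a sketch only: CategorySketch is a hypothetical name, and the JSON layout (a list of objects with "name" and "keywords" keys) is an assumption, not necessarily the layout twistml writes.

```python
import json
import os
import tempfile


class CategorySketch(object):
    """Hypothetical stand-in for twistml's Category; not the library's code."""

    def __init__(self, name, keywords):
        self.name = name
        self.keywords = list(keywords)

    def __eq__(self, other):
        # Two categories are equal when both name and keywords match.
        return (isinstance(other, CategorySketch)
                and self.name == other.name
                and self.keywords == other.keywords)

    def __ne__(self, other):
        return not self.__eq__(other)

    @staticmethod
    def save_categories_to_json(categories, filename):
        # Assumed layout: a JSON list of {"name": ..., "keywords": [...]}.
        data = [{"name": c.name, "keywords": c.keywords} for c in categories]
        with open(filename, "w") as handle:
            json.dump(data, handle)

    @staticmethod
    def load_categories_from_json(filename):
        with open(filename) as handle:
            return [CategorySketch(d["name"], d["keywords"])
                    for d in json.load(handle)]


# Round trip: save a list of categories, then load it back.
cats = [CategorySketch("feeling", ["feel", "makes me"])]
path = os.path.join(tempfile.mkdtemp(), "categories.json")
CategorySketch.save_categories_to_json(cats, path)
loaded = CategorySketch.load_categories_from_json(path)
```

Because equality compares name and keywords, the loaded list compares equal to the one that was saved.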

twistml.filtering.category.filter_tweets_by_category(tweets, outdir, filename, categories='default', fields=['text'], logger=None)

Filters a list of tweets into categories, saves the results to files and returns a list of those files.

For each category in categories, this function keeps only those tweets that contain at least one of the category's keywords in at least one of the specified fields, and saves them as filename in a subdirectory of outdir named after the category.

tweets : list[dict[str, str]]: A list of tweets in dict form (like one might obtain from the rawtweets module).
outdir : str: The full path to the directory into which the result will be saved. Subdirectories for each Category will be created as needed.
filename : str: The filename the resulting list of tweets will be saved under.
categories : list[Category], optional: A list of categories. For each Category the tweets will be filtered by the keywords of that category and saved into a directory named after the category. (Default is 'default', which means a default list of categories will be generated.)
fields : list[str], optional: The fields of the tweets that will be checked against each Category's keywords. (Default is ['text'], which implies only the tweet's text body will be checked.)
logger : logging.Logger, optional: A logger object, used to display / log console output (default is None, which implies quiet execution).

outputfiles : list[str]: A list of all the filepaths that were written as a result.
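
The keyword-matching step can be sketched in isolation as follows. The helper names matches_category and split_by_category are hypothetical, categories are represented as plain dicts, and case-insensitive substring matching is an assumption; the source does not say how keywords are compared. Writing the per-category files is omitted here.

```python
def matches_category(tweet, keywords, fields=("text",)):
    """Return True if any keyword occurs in any of the given fields."""
    for field in fields:
        value = tweet.get(field) or ""
        # Assumption: case-insensitive substring matching.
        if any(kw.lower() in value.lower() for kw in keywords):
            return True
    return False


def split_by_category(tweets, categories, fields=("text",)):
    """Map each category name to the tweets that match its keywords."""
    return {
        cat["name"]: [t for t in tweets
                      if matches_category(t, cat["keywords"], fields)]
        for cat in categories
    }


tweets = [
    {"text": "This song makes me happy"},
    {"text": "Stock prices fell today"},
    {"text": "I feel great"},
]
categories = [{"name": "feeling", "keywords": ["feel", "makes me"]}]
result = split_by_category(tweets, categories)
```

Here the first and third tweets land in the "feeling" category, while the second is discarded because it contains neither keyword.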

twistml.filtering.language module

Uses ldig to perform language filtering of tweets.

ldig is an external module written by Nakatani Shuyo to perform language detection, especially on tweets. It is not extensively documented and is therefore difficult to adapt to special needs. This module acts as an interface to ldig that makes it possible to perform language detection on a list of tweets instead of a specially formatted text file.

Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT

twistml.filtering.language.filter_tweets_by_language(tweets, languages=['en'], field_to_filter='text', logger=None)

Filters a list of tweets by language. Returns the filtered list.

The tweets list contains a dict for each tweet. The ldig algorithm is applied to the field_to_filter of each tweet, and only the tweets found to be in one of the languages specified in languages are retained.

tweets : list[dict[str, str]]: A list of dicts containing twitter data (tweet text and some metadata).
languages : list[str], optional: A list of two-letter language tags (e.g. 'en', 'de', ...) specifying the language filters. (Default is ['en'], which implies only English tweets are retained.)
field_to_filter : str, optional: The (meta)data field in the tweets dict to be used for filtering. (Default is 'text', which implies the tweet body will be used for filtering.)
logger : logging.Logger, optional: A logger object, used to display / log console output (default is None, which implies quiet execution).

filtered_tweets : list[dict[str, str]]: A list of dicts containing the twitter data of the filtered tweets.
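
The filtering pattern itself, independent of ldig, might look like the sketch below. The detect argument is a hypothetical callable that maps a string to a two-letter language tag; in twistml that role is played by ldig. The toy_detect function exists only to make the example runnable and says nothing about how ldig works.

```python
def filter_by_language(tweets, detect, languages=("en",),
                       field_to_filter="text"):
    """Keep tweets whose detected language is in `languages`.

    `detect` is a hypothetical callable: str -> two-letter language tag.
    """
    allowed = set(languages)
    return [t for t in tweets
            if detect(t.get(field_to_filter, "")) in allowed]


def toy_detect(text):
    # Toy detector for demonstration only; NOT how ldig works.
    german_markers = {"der", "die", "das", "und", "ich"}
    words = set(text.lower().split())
    return "de" if words & german_markers else "en"


tweets = [
    {"text": "I love this weather"},
    {"text": "Ich mag das Wetter"},
]
english = filter_by_language(tweets, toy_detect)
german = filter_by_language(tweets, toy_detect, languages=["de"])
```

Passing the detector as an argument keeps the filtering logic testable without the external dependency.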

twistml.filtering.rawtweets module

Main module for all preprocessing steps in twistml.

Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT

twistml.filtering.rawtweets.filter_multiple_raw_json(filepaths, outdir, meta_fields=['text', 'created_at'], filter_words=[], filter_language=['en'], logger=None)

Filters raw twitter .json files and writes filtered files.

Reads twitter data from multiple raw twitter .json files, discarding all tweets that do not contain at least one of the words from filter_words and those that are not in one of the filter_language languages. Keeps only the meta_fields of the remaining tweets. Writes the filtered tweets to outdir, keeping the original filenames.

filepaths : list[str]: Full paths to the .json files to be read.
outdir : str: Full path to the directory the filtered tweets will be written to. Must end with '/'.
meta_fields : list[str], optional: Contains the names of the twitter data fields to be read. See the twitter documentation for a list of fields. (Default is ['text', 'created_at'], which keeps the tweet text and timestamp.)
filter_words : list[str], optional: A list of keywords. Only tweets containing at least one of the keywords are considered eligible for further processing (default is [], which implies all tweets are eligible).
filter_language : list[str], optional: Tweets that do not have at least one of the fields lang or user.lang set to one of the languages in the list are discarded. An empty list means no language filtering will be applied. (Default is ['en'], which keeps only English tweets.)
logger : logging.Logger, optional: A logger object, used to display / log console output (default is None, which implies quiet execution).
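
The trailing-slash requirement on outdir would follow if output paths were built by plain string concatenation. The source does not confirm this, so the sketch below rests on that assumption; both helper names are hypothetical, and posixpath is used so the result is the same on every platform.

```python
import posixpath


def output_path(filepath, outdir):
    """Build the output path for one input file, keeping its filename."""
    # Plain concatenation: only correct if outdir ends with '/'.
    return outdir + posixpath.basename(filepath)


def output_path_joined(filepath, outdir):
    """Variant that tolerates a missing trailing slash."""
    return posixpath.join(outdir, posixpath.basename(filepath))


good = output_path("/data/raw/2015-01-01.json", "/data/filtered/")
bad = output_path("/data/raw/2015-01-01.json", "/data/filtered")
safe = output_path_joined("/data/raw/2015-01-01.json", "/data/filtered")
```

With concatenation, omitting the slash silently fuses the directory name and the filename, so it is worth checking outdir before calling the function.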

twistml.filtering.rawtweets.filter_raw_json(filepath, meta_fields=['text', 'created_at'], filter_words=[], filter_language=['en'], logger=None)

Reads from a single raw twitter .json. Returns a list of dicts.

Reads twitter data from filepath, discarding all tweets that do not contain at least one of the words from filter_words and those that are not in one of the filter_language languages. Keeps only the meta_fields of the remaining tweets. Returns a list containing a dict[str, str] for each tweet. Each dict holds the fields specified in meta_fields.

filepath : str: Full path to the .json file to be read.
meta_fields : list[str], optional: Contains the names of the twitter data fields to be read. See the twitter documentation for a list of fields. (Default is ['text', 'created_at'], which keeps the tweet text and timestamp.)
filter_words : list[str], optional: A list of keywords. Only tweets containing at least one of the keywords are considered eligible for further processing (default is [], which implies all tweets are eligible).
filter_language : list[str], optional: Tweets that do not have at least one of the fields lang or user.lang set to one of the languages in the list are discarded. An empty list means no language filtering will be applied. (Default is ['en'], which keeps only English tweets.)
logger : logging.Logger, optional: A logger object, used to display / log console output (default is None, which implies quiet execution).

list[dict[str, str]]: A list of dicts (one per tweet). Each dict has meta_fields as keys and the corresponding field content as values.
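
The filtering rules described above can be sketched as a standalone function. This is an illustrative re-implementation, not twistml's code: the name filter_raw_json_sketch is hypothetical, and it assumes the raw file holds one JSON tweet object per line and that keywords are matched case-insensitively against the text field.

```python
import json
import os
import tempfile


def filter_raw_json_sketch(filepath, meta_fields=("text", "created_at"),
                           filter_words=(), filter_language=("en",)):
    """Illustrative re-implementation; not twistml's actual code."""
    kept = []
    with open(filepath) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            tweet = json.loads(line)  # assumption: one tweet per line
            text = (tweet.get("text") or "").lower()
            # An empty filter_words list lets every tweet through.
            if filter_words and not any(w.lower() in text
                                        for w in filter_words):
                continue
            # Keep the tweet if either lang or user.lang is in the list.
            langs = {tweet.get("lang"), tweet.get("user", {}).get("lang")}
            if filter_language and not langs & set(filter_language):
                continue
            # Retain only the requested fields.
            kept.append({f: tweet.get(f) for f in meta_fields})
    return kept


# Write a small sample file, then filter it.
raw = [
    {"text": "Markets feel calm", "created_at": "Mon", "lang": "en", "id": 1},
    {"text": "Der Markt ist ruhig", "created_at": "Tue", "lang": "de",
     "user": {"lang": "de"}, "id": 2},
    {"text": "Nothing to see", "created_at": "Wed", "lang": "en", "id": 3},
]
path = os.path.join(tempfile.mkdtemp(), "raw.json")
with open(path, "w") as handle:
    for tweet in raw:
        handle.write(json.dumps(tweet) + "\n")

result = filter_raw_json_sketch(path, filter_words=["feel", "markt"])
```

Only the first tweet survives: the second matches a keyword but fails the language check, and the third contains no keyword.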

Module contents

Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT