twistml.features package¶

Submodules¶

twistml.features.combine module¶

This module contains functions to combine data from multiple files into useful data structures (lists, dicts, ...) for further transformation into feature vectors.


Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT

twistml.features.combine.combine_sentiments(filepaths, analyzer='pattern', timeformat='%a %b %d %H:%M:%S +0000 %Y')¶

Combines the sentiments from each input file. Returns a dict.

filepaths : list[str]: A list of filepaths to .json-files containing tweets in the format common to twistml. Each tweet must contain at least a ‘created_at’ field and the text to be analyzed.

analyzer : str: Identifier for the TextBlob analyzer to use. Currently two analyzers are supported: ‘pattern’, which uses the lexical PatternAnalyzer based on the Pattern package, and ‘naivebayes’, which uses the NaiveBayesAnalyzer based on the nltk package.

daily_sents : dict[datetime, ndarray]: The keys are datestamps and the values are x by 2 ndarrays of sentiment scores, where x is the number of tweets for that datestamp. The two scores per tweet are the polarity and subjectivity for the PatternAnalyzer, and the probabilities of a positive and of a negative classification result for the NaiveBayesAnalyzer.
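The shape of the returned dict can be illustrated with a minimal sketch. This is not twistml's implementation: `toy_scores` is a hypothetical stand-in for a TextBlob analyzer, the tweets are given inline instead of being read from files, the ‘text’ key is assumed, and a datestamp is assumed to be the ‘created_at’ time truncated to midnight. Plain lists stand in for the x by 2 ndarrays.

```python
from collections import defaultdict
from datetime import datetime

TIMEFORMAT = '%a %b %d %H:%M:%S +0000 %Y'  # default from the signature above


def toy_scores(text):
    """Hypothetical stand-in for a TextBlob analyzer: returns two floats."""
    positive = text.count('good')
    negative = text.count('bad')
    total = positive + negative
    polarity = (positive - negative) / total if total else 0.0
    subjectivity = 1.0 if total else 0.0
    return [polarity, subjectivity]


def group_sentiments(tweets, timeformat=TIMEFORMAT, scorer=toy_scores):
    """Map each datestamp to a list of two-element score rows, one per tweet."""
    daily = defaultdict(list)
    for tweet in tweets:
        created = datetime.strptime(tweet['created_at'], timeformat)
        # Assumption: a datestamp is the timestamp truncated to midnight.
        day = created.replace(hour=0, minute=0, second=0, microsecond=0)
        daily[day].append(scorer(tweet['text']))
    return dict(daily)


tweets = [
    {'created_at': 'Mon Sep 28 14:03:12 +0000 2015', 'text': 'good news'},
    {'created_at': 'Mon Sep 28 18:45:00 +0000 2015', 'text': 'bad day'},
    {'created_at': 'Tue Sep 29 09:00:00 +0000 2015', 'text': 'nothing to report'},
]
daily_sents = group_sentiments(tweets)
```

Each value has one row per tweet of that day and two columns, which matches the x by 2 layout described for the return value.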

twistml.features.combine.combine_tweets(filepaths, meta_fields=['text'], timeformat='%a %b %d %H:%M:%S +0000 %Y')¶

filepaths : list[str]: A list of filepaths to .json-files containing tweets in the format common to twistml. Each tweet must contain at least a ‘created_at’ field and the field specified in meta_fields.
meta_fields : list[str]: A list of meta_fields (keys in the tweet dictionaries). These are the fields, whose content will be combined in the result. (Default is [‘text’], which implies the tweet-texts will be combined.)

daily_texts : dict[datetime,str]: The keys are datestamps and the values are the concatenated contents of the meta_fields of all tweets in filepaths for that datestamp.

This function uses cStringIO to perform the many string concatenations necessary, as this has the best runtime to process-size tradeoff as detailed on waymoot.org.
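A rough sketch of the buffered concatenation, under stated assumptions: it uses `io.StringIO` (the Python 3 counterpart of cStringIO), takes already-loaded tweet dicts rather than filepaths, separates contents with a single space, and truncates ‘created_at’ to midnight to form the datestamp. The function name `combine_fields` is hypothetical.

```python
import io
from collections import defaultdict
from datetime import datetime

TIMEFORMAT = '%a %b %d %H:%M:%S +0000 %Y'  # default from the signature above


def combine_fields(tweets, meta_fields=('text',), timeformat=TIMEFORMAT):
    """Concatenate the chosen fields of all tweets, grouped by datestamp."""
    buffers = defaultdict(io.StringIO)  # one in-memory text buffer per day
    for tweet in tweets:
        created = datetime.strptime(tweet['created_at'], timeformat)
        day = created.replace(hour=0, minute=0, second=0, microsecond=0)
        for field in meta_fields:
            # Assumption: field contents are separated by a single space.
            buffers[day].write(tweet[field] + ' ')
    return {day: buf.getvalue().strip() for day, buf in buffers.items()}


tweets = [
    {'created_at': 'Mon Sep 28 09:30:00 +0000 2015', 'text': 'markets open higher'},
    {'created_at': 'Mon Sep 28 16:00:00 +0000 2015', 'text': 'markets close lower'},
    {'created_at': 'Tue Sep 29 09:30:00 +0000 2015', 'text': 'quiet start'},
]
daily_texts = combine_fields(tweets)
```

Writing into a buffer avoids building a new string object for every appended tweet, which is the motivation given above for the buffered approach.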

twistml.features.combine.stack_features(feature_dicts, sparse=False, sparse_format='csr')¶

Stacks multiple feature dicts horizontally

Stacking multiple feature dictionaries horizontally is useful for combining the features of multiple categories. This is recommended for sentiment features, as these are only 6-dimensional per category, but not for the very high-dimensional bag-of-words or character n-gram features.

feature_dicts : list[dict[datetime : array_like]]: A list of feature dictionaries as generated by the different FeatureTransformers. The array_likes can be either numpy ndarrays or scipy sparse matrices, but must be of the same type for all dicts in the list.
sparse : bool, optional: If the feature matrices in the dictionaries are sparse, setting this to True will enable stacking by using scipy.sparse.hstack in place of numpy.hstack. (Default is False, which implies numpy.hstack will be used.)
sparse_format : str, optional: Only used if sparse=True. The format of the stacked sparse matrices. (Default is ‘csr’, which implies compressed sparse row format.)

stacked : dict[datetime : array_like]: A feature dictionary mapping each timestamp to the horizontally stacked arrays / sparse matrices.

ValueError: If the feature_dicts do not all have the same keys.

CountVectorTransformer, Doc2VecTransformer and SentimentTransformer: For details on the feature dictionaries.
scipy.sparse.hstack(): For details on the sparse_format
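The stacking behaviour, including the ValueError for mismatched keys, can be sketched in pure Python. This is only an illustration: plain lists stand in for one-dimensional ndarrays (with numpy, `numpy.hstack` would replace the list concatenation), and the name `stack_feature_dicts` is hypothetical.

```python
from datetime import datetime


def stack_feature_dicts(feature_dicts):
    """Concatenate the vectors stored under each shared key, dict by dict."""
    keys = set(feature_dicts[0])
    if any(set(d) != keys for d in feature_dicts[1:]):
        raise ValueError('All feature dicts must have the same keys.')
    stacked = {}
    for key in sorted(keys):
        row = []
        for d in feature_dicts:
            row.extend(d[key])  # append this dict's features to the row
        stacked[key] = row
    return stacked


d1, d2 = datetime(2015, 9, 28), datetime(2015, 9, 29)
sentiment_a = {d1: [0.5, 0.1], d2: [0.0, 0.3]}   # features for one category
sentiment_b = {d1: [-0.2, 0.9], d2: [0.4, 0.4]}  # features for another category
stacked = stack_feature_dicts([sentiment_a, sentiment_b])
```

The feature order within each stacked vector follows the order of the dicts in the input list.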

twistml.features.countvector_transformer module¶

CountVectorTransformer uses sklearn.feature_extraction.text.CountVectorizer to generate count vector features (Bag of Words, n-grams, ...).

Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT

class twistml.features.countvector_transformer.CountVectorTransformer(use_tfidf=True, **kwargs)¶

Bases: twistml.features.feature_transformer.FeatureTransformer

Transforms json files into count vector features

The CountVectorTransformer makes it easy to transform .json files containing twitter-data (like the ones generated by twistml’s filtering and / or preprocessing steps) into count vector features (e.g. bag of words or n-grams).


For possible arguments to CountVectorizer, see the scikit-learn documentation.

::

    import twistml as tml

    filepaths = tml.find_files('c:/data/')
    cvg = tml.features.CountVectorTransformer(min_df=2,
                                              analyzer='word')
    features = cvg.transform(filepaths)

transform(filepaths)¶

Transforms twitter data in files into a dict mapping datestamps to count vectors.

The tweets contained in the files specified in filepaths are combined (their content for each date is concatenated) and the resulting concatenated texts are turned into one count vector per day.

filepaths : list[str]: A list of files that contain tweets in the typical format (dict[str, str]) as generated by the filtering and / or preprocessing functions in twistml.

daily_counts : dict[datetime, csr-matrix]: A dict mapping datestamps to count vectors in scipy.sparse.csr_matrix format.
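The idea of one count vector per day can be sketched without scikit-learn. The snippet below is a simplified stand-in for CountVectorizer, not the transformer's actual code: it assumes lowercase whitespace tokenization and a sorted vocabulary, returns plain lists instead of csr matrices, and the name `daily_count_vectors` is hypothetical.

```python
from collections import Counter
from datetime import datetime


def daily_count_vectors(daily_texts):
    """Return (vocabulary, {day: count vector}) for whitespace-separated tokens."""
    tokenized = {day: text.lower().split() for day, text in daily_texts.items()}
    # Shared vocabulary over all days, in sorted order, fixes the column layout.
    vocabulary = sorted({tok for toks in tokenized.values() for tok in toks})
    vectors = {}
    for day, toks in tokenized.items():
        counts = Counter(toks)
        vectors[day] = [counts[term] for term in vocabulary]
    return vocabulary, vectors


daily_texts = {
    datetime(2015, 9, 28): 'markets up markets steady',
    datetime(2015, 9, 29): 'markets down',
}
vocabulary, daily_counts = daily_count_vectors(daily_texts)
```

Every day's vector has the same length because the vocabulary is built across all days before counting.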

twistml.features.doc2vec_transformer module¶

Doc2VecTransformer generates Doc2Vec-based feature vectors.

Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT

class twistml.features.doc2vec_transformer.Doc2VecTransformer(iterations=None, **kwargs)¶

Bases: twistml.features.feature_transformer.FeatureTransformer

Transforms json files into Doc2Vec features

The Doc2VecTransformer makes it easy to transform .json files containing twitter-data (like the ones generated by twistml’s filtering and / or preprocessing steps) into Doc2Vec features.


::

    import twistml as tml

    filepaths = tml.find_files('c:/data/')
    d2v = tml.features.Doc2VecTransformer(???)
    features = d2v.transform(filepaths)

transform(filepaths)¶

Transforms twitter data in files into a dict mapping datestamps to Doc2Vec vectors.

filepaths : list[str]: A list of files that contain tweets in the typical format (dict[str, str]) as generated by the filtering and / or preprocessing functions in twistml.

daily_docvecs : dict[datetime, ndarray]: A dict mapping datestamps to Doc2Vec vectors in numpy ndarray format.

class twistml.features.doc2vec_transformer.LabeledLineSentence(combined_tweets)¶: Bases: object

twistml.features.feature_transformer module¶

Base class for feature generation. Inherit from this to implement custom feature transformers for twistml.

Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT

class twistml.features.feature_transformer.FeatureTransformer¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

FeatureTransformer is a base class for Transformers that generate complex features from tweets.

The intention is to have many different Transformers that inherit from FeatureTransformer and that can be interchangeably used in sklearn Pipelines. Child classes have to implement the transform() method required by sklearn Transformers.

fit() is inherited as an empty method (FeatureTransformers should not usually need to fit to the data), and fit_transform() is automatically inherited from TransformerMixin.

BaseEstimator provides get_params and set_params, which make the transformers copyable, so they can be used across multiple jobs, for example in a GridSearchCV.

References: Stackoverflow, ZacStewart, scikit-learn.

See CountVectorTransformer for a possible implementation.

fit(X, y=None, **fit_params)¶
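The contract described for FeatureTransformer can be sketched without scikit-learn installed. Everything here is illustrative: `StandInTransformerBase` is a hypothetical stand-in for the real base class (which derives from sklearn's BaseEstimator and TransformerMixin), `fit()` is assumed to return self as is the sklearn convention, and `TweetLengthTransformer` is an invented child class that takes already-loaded texts rather than filepaths.

```python
class StandInTransformerBase:
    """Stand-in for FeatureTransformer: a no-op fit() plus fit_transform()."""

    def fit(self, X, y=None, **fit_params):
        # Nothing is learned from the data; returning self allows chaining.
        return self

    def fit_transform(self, X, y=None, **fit_params):
        # Same order of calls that sklearn's TransformerMixin uses.
        return self.fit(X, y, **fit_params).transform(X)


class TweetLengthTransformer(StandInTransformerBase):
    """Hypothetical child class: maps each key to the mean text length."""

    def transform(self, daily_tweets):
        return {
            day: [sum(len(text) for text in texts) / len(texts)]
            for day, texts in daily_tweets.items()
        }


features = TweetLengthTransformer().fit_transform(
    {'2015-09-28': ['abc', 'abcde']}
)
```

Only `transform()` has to be written in the child class; `fit()` and `fit_transform()` come from the base.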

twistml.features.sentiment_transformer module¶

SentimentTransformer uses TextBlob.sentiment to generate sentiment-based feature vectors (either lexical or naive bayes).

Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT

class twistml.features.sentiment_transformer.SentimentTransformer(analyzer='pattern', timeformat='%a %b %d %H:%M:%S +0000 %Y')¶

Bases: twistml.features.feature_transformer.FeatureTransformer

Transforms json files into sentiment features

The SentimentTransformer makes it easy to transform .json files containing twitter-data (like the ones generated by twistml’s filtering and / or preprocessing steps) into sentiment vector features.


::

    import twistml as tml

    filepaths = tml.find_files('c:/data/')
    snt = tml.features.SentimentTransformer(analyzer='pattern')
    features = snt.transform(filepaths)

transform(filepaths)¶

Transforms twitter data in files into a dict mapping datestamps to sentiment vectors.

filepaths : list[str]: A list of files that contain tweets in the typical format (dict[str, str]) as generated by the filtering and / or preprocessing functions in twistml.

daily_sents : dict[datetime, ndarray]: A dict mapping datestamps to sentiment vectors in numpy ndarray format.

twistml.features.window module¶

Contains the Window class and related functions.


Suppose we have our features X = {x_1, ..., x_n} and our targets Y = {y_1, ..., y_n}. Using a Window of size 4 with offset 0, we can generate alternative features Z = {z_4, ..., z_n}, where

z_i = some_function(x_i, x_{i-1}, x_{i-2}, x_{i-3}), with i = 4, ..., n.

For a Window of size 3 with offset 0:

z_i = some_function(x_i, x_{i-1}, x_{i-2}), with i = 3, ..., n.

For a Window of size 3 with offset -1:

z_i = some_function(x_{i-1}, x_{i-2}, x_{i-3}), with i = 4, ..., n.

Each windowed feature z_i is then paired with the target y_i.
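The index arithmetic behind these examples can be written down as a short sketch. The helper names are hypothetical and the positions are 1-based to match the notation above; this only shows which inputs feed the windowed feature at position i, not how twistml's Window class is implemented.

```python
def window_indices(i, size, offset=0):
    """1-based indices of the inputs that feed the windowed feature at position i."""
    newest = i + offset  # a negative offset shifts the window into the past
    return [newest - k for k in range(size)]


def first_valid_position(size, offset=0):
    """Smallest i for which every index in the window is at least 1."""
    return size - offset


size4 = window_indices(4, size=4, offset=0)
size3 = window_indices(4, size=3, offset=0)
size3_lag = window_indices(4, size=3, offset=-1)
```

For position 4, a window of size 3 with offset -1 covers positions 3, 2 and 1, so the value at position 4 itself is left out of its own windowed feature.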

Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT

class twistml.features.window.Window(size, offset)¶

twistml.features.window.get_windowed(features, targets, window, window_function=window_element_sum)¶

Generate a new, windowed feature vector for each target.

features : dict[datetime, array_like]: A dictionary with datetimes for keys and arrays for values.
targets : dict[datetime, float or class_id]: A dictionary with datetimes for keys and either floats for values (for regression tasks) or class ids (for classification tasks).
window : Window: An instance of the Window class.
window_function : callable, optional: The function that will be used to combine the features within a window. The function needs to take a list of array_likes as argument (actually a list of whatever type the values of the features dict are) and return a single array_like (the new “windowed” feature vector). (Default is window_element_sum, which simply calculates the element-wise sum of all arrays within a window.)

X : list[array_like]: A list of the new windowed features. The exact type is determined by the window_function.
y : list[float or class_id]: A list of corresponding target values.
dates : list[datetime]: A list of the corresponding timestamps.
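How the three return values line up can be sketched as follows. This is an assumption-laden illustration, not the library's code: `Window` is modelled as a namedtuple, windows are counted over the sorted available dates rather than calendar days, dates without a complete window or without a target are skipped, and lists stand in for arrays. The names `windowed` and `element_sum` are hypothetical.

```python
from collections import namedtuple
from datetime import datetime, timedelta

Window = namedtuple('Window', ['size', 'offset'])  # stand-in for the Window class


def element_sum(vectors):
    """Element-wise sum of equally long lists."""
    return [sum(values) for values in zip(*vectors)]


def windowed(features, targets, window, window_function=element_sum):
    """Build one windowed feature vector per target that has a full window."""
    dates_sorted = sorted(features)
    X, y, dates = [], [], []
    for pos, date in enumerate(dates_sorted):
        newest = pos + window.offset
        oldest = newest - window.size + 1
        if oldest < 0 or newest >= len(dates_sorted) or date not in targets:
            continue  # incomplete window or no target for this date
        span = [features[d] for d in dates_sorted[oldest:newest + 1]]
        X.append(window_function(span))
        y.append(targets[date])
        dates.append(date)
    return X, y, dates


base = datetime(2015, 9, 28)
days = [base + timedelta(days=k) for k in range(4)]
features = {days[0]: [1, 0], days[1]: [2, 1], days[2]: [3, 0], days[3]: [4, 2]}
targets = {days[0]: 0.1, days[1]: 0.2, days[2]: 0.3, days[3]: 0.4}

X, y, dates = windowed(features, targets, Window(size=2, offset=0))
X_lag, y_lag, dates_lag = windowed(features, targets, Window(size=2, offset=-1))
```

With offset -1 each target is paired only with features from earlier dates, which is the setup one would want when predicting a value from the preceding days.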

twistml.features.window.window_element_avg(window_features)¶: Calculates an element-wise average of multiple np.arrays.

twistml.features.window.window_element_sum(window_features)¶: Calculates an element-wise sum of multiple np.arrays.

twistml.features.window.window_stack(window_features)¶: Horizontally stacks multiple np.arrays or sparse matrices.
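The three window functions can be compared side by side with a small pure-Python sketch. Lists stand in for np.arrays, the function names are hypothetical, and it is assumed that window_element_avg computes a mean, as its name suggests.

```python
def element_sum(window_features):
    """Element-wise sum: the output has the length of a single vector."""
    return [sum(values) for values in zip(*window_features)]


def element_avg(window_features):
    """Element-wise mean: the sum divided by the number of vectors."""
    n = len(window_features)
    return [sum(values) / n for values in zip(*window_features)]


def stack(window_features):
    """Horizontal stacking: the output has the combined length of all vectors."""
    return [value for vector in window_features for value in vector]


window_features = [[1, 2], [3, 4], [5, 6]]
summed = element_sum(window_features)
averaged = element_avg(window_features)
stacked = stack(window_features)
```

Sum and average keep the dimensionality of a single day's features, while stacking multiplies it by the window size.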

Module contents¶


Author:	Matthias Manhertz
Copyright:	Matthias Manhertz 2015
Licence:	MIT
