twistml.features package
Submodules
twistml.features.combine module
This module contains functions to combine data from multiple files into useful data structures (lists, dicts, ...) for further transformation into feature vectors.
Author: Matthias Manhertz
Licence: MIT
twistml.features.combine.combine_sentiments(filepaths, analyzer='pattern', timeformat='%a %b %d %H:%M:%S +0000 %Y')
Combines the sentiments from each input file. Returns a dict.
Parameters:
- filepaths : list[str]
  A list of filepaths to .json files containing tweets in the format common to twistml. Each tweet must contain at least a ‘created_at’ field and the text to analyze (presumably the ‘text’ field, since this function takes no meta_fields argument).
- analyzer : str
  Identifier for the TextBlob analyzer to use. Currently two analyzers are supported:
  - ‘pattern’, which uses the lexical PatternAnalyzer based on the Pattern package
  - ‘naivebayes’, which uses the NaiveBayesAnalyzer based on the nltk package
Returns:
- daily_sents : dict[datetime, ndarray]
  The keys are datestamps and the values are x by 2 ndarrays of sentiment scores, where x is the number of tweets for that datestamp. The two scores per tweet are the polarity and subjectivity for the PatternAnalyzer, and the probabilities of a positive and a negative classification (p_pos and p_neg) for the NaiveBayesAnalyzer.
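The default timeformat in the signature matches the ‘created_at’ timestamps of the classic Twitter API. The following is a minimal sketch of how such a timestamp could be turned into a datestamp key. It assumes that parsing is done with datetime.strptime and that the key is the timestamp truncated to midnight; neither is confirmed by this documentation, and the sample timestamp is made up.

```python
from datetime import datetime

# Default value of the timeformat parameter, copied from the signature.
TIMEFORMAT = '%a %b %d %H:%M:%S +0000 %Y'

# Hypothetical 'created_at' value of a tweet.
created_at = 'Wed Aug 27 13:08:45 +0000 2008'

timestamp = datetime.strptime(created_at, TIMEFORMAT)

# Assumption: the dict key is the timestamp truncated to the start of the day.
datestamp = timestamp.replace(hour=0, minute=0, second=0, microsecond=0)
print(datestamp)
```

Note that %a and %b are locale-dependent, so the sketch expects an English (or C) locale.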
twistml.features.combine.combine_tweets(filepaths, meta_fields=['text'], timeformat='%a %b %d %H:%M:%S +0000 %Y')
Combines the contents of the given meta_fields from each input file. Returns a dict.
Parameters:
- filepaths : list[str]
  A list of filepaths to .json files containing tweets in the format common to twistml. Each tweet must contain at least a ‘created_at’ field and the fields specified in meta_fields.
- meta_fields : list[str]
  A list of meta fields (keys in the tweet dictionaries). These are the fields whose content will be combined in the result. (Default is [‘text’], which implies the tweet texts will be combined.)
Returns:
- daily_texts : dict[datetime, str]
  The keys are datestamps and the values are the concatenated contents of the meta_fields of all tweets in filepaths for that datestamp.
Notes:
This function uses cStringIO to perform the many string concatenations necessary, as this offers the best tradeoff between runtime and process size, as detailed on waymoot.org.
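The grouping and concatenation described above can be sketched as follows. This is an illustration under stated assumptions, not the twistml implementation: combine_texts_sketch is a hypothetical name, the tweets are passed as already-loaded dicts instead of file paths, io.StringIO stands in for the Python 2 cStringIO module, and the single-space separator is a guess.

```python
import io
from collections import defaultdict
from datetime import datetime

TIMEFORMAT = '%a %b %d %H:%M:%S +0000 %Y'


def combine_texts_sketch(tweets, meta_fields=('text',), timeformat=TIMEFORMAT):
    """Concatenate the meta_fields of all tweets per day (illustrative only)."""
    buffers = defaultdict(io.StringIO)  # one string buffer per datestamp
    for tweet in tweets:
        created = datetime.strptime(tweet['created_at'], timeformat)
        day = created.replace(hour=0, minute=0, second=0, microsecond=0)
        for field in meta_fields:
            # Assumption: field contents are separated by a single space.
            buffers[day].write(tweet[field] + ' ')
    return {day: buf.getvalue().strip() for day, buf in buffers.items()}


# Made-up sample tweets spanning two days.
tweets = [
    {'created_at': 'Wed Aug 27 13:08:45 +0000 2008', 'text': 'first tweet'},
    {'created_at': 'Wed Aug 27 18:30:00 +0000 2008', 'text': 'second tweet'},
    {'created_at': 'Thu Aug 28 09:00:00 +0000 2008', 'text': 'third tweet'},
]
daily_texts = combine_texts_sketch(tweets)
for day in sorted(daily_texts):
    print(day.date(), '->', daily_texts[day])
```

Writing into a buffer avoids building a new string object for every appended tweet, which is the motivation the note above gives for cStringIO.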
twistml.features.combine.stack_features(feature_dicts, sparse=False, sparse_format='csr')
Stacks multiple feature dicts horizontally.
Stacking multiple feature dictionaries horizontally is useful for combining the features of multiple categories. This is recommended for sentiment features, as these are only 6-dimensional per category, but not for the very high-dimensional bag-of-words or character n-gram features.
Parameters:
- feature_dicts : list[dict[datetime : array_like]]
  A list of feature dictionaries as generated by the different FeatureTransformers. The array_likes can be either numpy ndarrays or scipy sparse matrices, but must be of the same type for all dicts in the list.
- sparse : bool, optional
  If the feature matrices in the dictionaries are sparse, setting this to True will enable stacking by using scipy.sparse.hstack in place of numpy.hstack. (Default is False, which implies numpy.hstack will be used.)
- sparse_format : str, optional
  Only used if sparse=True. The format of the stacked sparse matrices. (Default is ‘csr’, which implies compressed sparse row format.)
Returns:
- stacked : dict[datetime : array_like]
  A feature dictionary mapping each timestamp to the horizontally stacked arrays / sparse matrices.
Raises:
- ValueError
  If the feature_dicts do not all have the same keys.
See also:
- CountVectorTransformer, Doc2VecTransformer and SentimentTransformer
  For details on the feature dictionaries.
- scipy.sparse.hstack()
  For details on the sparse_format.
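A minimal sketch of the dense case, assuming numpy is installed. stack_dense_sketch is a hypothetical stand-in written for this illustration, not the twistml function, and the feature values are invented. For sparse matrices, scipy.sparse.hstack(blocks, format='csr') would take the place of numpy.hstack.

```python
from datetime import datetime

import numpy as np


def stack_dense_sketch(feature_dicts):
    """Horizontally stack the arrays stored under each shared key."""
    keys = set(feature_dicts[0])
    if any(set(d) != keys for d in feature_dicts[1:]):
        raise ValueError('All feature dicts must have the same keys.')
    return {key: np.hstack([d[key] for d in feature_dicts]) for key in keys}


day = datetime(2008, 8, 27)
# Two invented 3-dimensional feature vectors for the same day.
features_a = {day: np.array([0.1, 0.2, 0.3])}
features_b = {day: np.array([0.4, 0.5, 0.6])}

stacked = stack_dense_sketch([features_a, features_b])
print(stacked[day].shape)  # (6,)
```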
twistml.features.countvector_transformer module
CountVectorTransformer uses sklearn.feature_extraction.text.CountVectorizer to generate count vector features (bag of words, n-grams, ...).
Author: Matthias Manhertz
Licence: MIT
class twistml.features.countvector_transformer.CountVectorTransformer(use_tfidf=True, **kwargs)
Bases: twistml.features.feature_transformer.FeatureTransformer
Transforms json files into count vector features.
The CountVectorTransformer makes it easy to transform .json files containing twitter-data (like the ones generated by twistml’s filtering and / or preprocessing steps) into count vector features (e.g. bag of words or n-grams).
For possible arguments to CountVectorizer, see the scikit-learn documentation.
Example::

    import twistml as tml

    filepaths = tml.find_files('c:/data/')
    cvg = tml.features.CountVectorTransformer(min_df=2, analyzer='word')
    features = cvg.transform(filepaths)
transform(filepaths)
Transforms Twitter data in files into a dict mapping datestamps to count vectors.
The tweets contained in the files specified in filepaths are combined (their content for each date is concatenated) and the resulting concatenated texts are turned into one count vector per day.
Parameters:
- filepaths : list(str)
  A list of files that contain tweets in the typical format (dict[str, str]) as generated by the filtering and / or preprocessing functions in twistml.
Returns:
- daily_counts : dict[datetime, csr-matrix]
  A dict mapping datestamps to count vectors in scipy.sparse.csr_matrix format.
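The step from concatenated daily texts to one count vector per day can be sketched with scikit-learn directly. This assumes scikit-learn is installed; the daily texts are invented, and whether twistml fits the vectorizer on all days at once, as done here, is an assumption.

```python
from datetime import datetime

from sklearn.feature_extraction.text import CountVectorizer

# Invented daily texts, as combine_tweets might produce them.
daily_texts = {
    datetime(2008, 8, 27): 'markets rally as tech stocks rally',
    datetime(2008, 8, 28): 'tech stocks fall',
}

days = sorted(daily_texts)
vectorizer = CountVectorizer(analyzer='word')
# One row per day, one column per vocabulary term.
counts = vectorizer.fit_transform([daily_texts[day] for day in days])

# Map each datestamp to its sparse row of term counts.
daily_counts = {day: counts[i] for i, day in enumerate(days)}

rally_column = vectorizer.vocabulary_['rally']
print(counts[0, rally_column])  # 2
```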
twistml.features.doc2vec_transformer module
Doc2VecTransformer generates Doc2Vec feature vectors from tweets.
Author: Matthias Manhertz
Licence: MIT
class twistml.features.doc2vec_transformer.Doc2VecTransformer(iterations=None, **kwargs)
Bases: twistml.features.feature_transformer.FeatureTransformer
Transforms json files into Doc2Vec features.
The Doc2VecTransformer makes it easy to transform .json files containing twitter-data (like the ones generated by twistml’s filtering and / or preprocessing steps) into Doc2Vec features.
Example::

    import twistml as tml

    filepaths = tml.find_files('c:/data/')
    d2v = tml.features.Doc2VecTransformer(???)
    features = d2v.transform(filepaths)
transform(filepaths)
Transforms Twitter data in files into a dict mapping datestamps to Doc2Vec vectors.
Parameters:
- filepaths : list(str)
  A list of files that contain tweets in the typical format (dict[str, str]) as generated by the filtering and / or preprocessing functions in twistml.
Returns:
- daily_docvecs : dict[datetime, ndarray]
  A dict mapping datestamps to Doc2Vec vectors in numpy ndarray format.
class twistml.features.doc2vec_transformer.LabeledLineSentence(combined_tweets)
Bases: object
twistml.features.feature_transformer module
Base class for feature generation. Inherit from this to implement custom feature transformers for twistml.
Author: Matthias Manhertz
Licence: MIT
class twistml.features.feature_transformer.FeatureTransformer
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
FeatureTransformer is a base class for Transformers that generate complex features from tweets.
The intention is to have many different Transformers that inherit from FeatureTransformer and that can be interchangeably used in sklearn Pipelines. Child classes have to implement the transform() method required by sklearn Transformers.
fit() is inherited as an empty method (FeatureTransformers should not usually need to fit to the data), and fit_transform() is automatically inherited from TransformerMixin.
BaseEstimator provides get_params and set_params, which make the transformers copyable, so they can be used across multiple jobs, for example in a GridSearchCV.
References: Stack Overflow, Zac Stewart, scikit-learn documentation.
See CountVectorTransformer for a possible implementation.
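The pattern described above can be sketched without twistml by inheriting from the scikit-learn base classes directly. LengthTransformerSketch and its scale parameter are hypothetical and exist only for this illustration; a real child class would inherit from FeatureTransformer and take file paths as input. The mixin is listed first, as recent scikit-learn versions recommend.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class LengthTransformerSketch(TransformerMixin, BaseEstimator):
    """Hypothetical transformer mirroring the FeatureTransformer pattern."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn from the data; return self so calls can be chained.
        return self

    def transform(self, X):
        # Map each text to a one-element feature vector: its scaled length.
        return [[len(text) * self.scale] for text in X]


transformer = LengthTransformerSketch(scale=2.0)
features = transformer.fit_transform(['abc', 'hello'])  # from TransformerMixin
print(features)                  # [[6.0], [10.0]]
print(transformer.get_params())  # {'scale': 2.0}, from BaseEstimator
```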
fit(X, y=None, **fit_params)
twistml.features.sentiment_transformer module
SentimentTransformer uses TextBlob.sentiment to generate sentiment-based feature vectors (either lexical or naive Bayes).
Author: Matthias Manhertz
Licence: MIT
class twistml.features.sentiment_transformer.SentimentTransformer(analyzer='pattern', timeformat='%a %b %d %H:%M:%S +0000 %Y')
Bases: twistml.features.feature_transformer.FeatureTransformer
Transforms json files into sentiment features.
The SentimentTransformer makes it easy to transform .json files containing Twitter data (like the ones generated by twistml’s filtering and / or preprocessing steps) into sentiment vector features.
Example::

    import twistml as tml

    filepaths = tml.find_files('c:/data/')
    snt = tml.features.SentimentTransformer(analyzer='pattern')
    features = snt.transform(filepaths)
transform(filepaths)
Transforms Twitter data in files into a dict mapping datestamps to sentiment vectors.
Parameters:
- filepaths : list(str)
  A list of files that contain tweets in the typical format (dict[str, str]) as generated by the filtering and / or preprocessing functions in twistml.
Returns:
- daily_sents : dict[datetime, ndarray]
  A dict mapping datestamps to sentiment vectors in numpy ndarray format.
twistml.features.window module
Contains the Window class and related functions.
Suppose we have our features X = {x[1], ..., x[n]} and our targets Y = {y[1], ..., y[n]}. Using a Window of size 4 with offset 0, we can generate alternative features Z = {z[4], ..., z[n]}, where
z[i] = some_function(x[i], x[i-1], x[i-2], x[i-3]), with i = 4, ..., n.
For a Window of size 3 with offset 0:
z[i] = some_function(x[i], x[i-1], x[i-2]), with i = 3, ..., n.
For a Window of size 3 with offset -1:
z[i] = some_function(x[i-1], x[i-2], x[i-3]), with i = 4, ..., n.
Author: Matthias Manhertz
Licence: MIT
class twistml.features.window.Window(size, offset)
twistml.features.window.get_windowed(features, targets, window, window_function=window_element_sum)
Generate a new, windowed feature vector for each target.
Parameters:
- features : dict[datetime, array_like]
  A dictionary with datetimes for keys and arrays for values.
- targets : dict[datetime, float or class_id]
  A dictionary with datetimes for keys and either floats (for regression tasks) or class ids (for classification tasks) for values.
- window : Window
  An instance of the Window class.
- window_function : callable, optional
  The function used to combine the features within a window. It must take a list of array_likes as its argument (more precisely, a list of whatever type the values of the features dict are) and return a single array_like (the new “windowed” feature vector). (Default is window_element_sum, which calculates the element-wise sum of all arrays within a window.)
Returns:
- X : list[array_like]
  A list of the new windowed features. The exact type is determined by the window_function.
- y : list[float or class_id]
  A list of corresponding target values.
- dates : list[datetime]
  A list of the corresponding timestamps.
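The windowing can be sketched in plain Python as follows. windowed_sketch and element_sum are hypothetical names for this illustration and not the twistml implementation; size and offset are passed directly instead of a Window instance, features are plain lists of floats, and window positions are counted over the sorted dates (which assumes there are no gaps in the dates).

```python
from datetime import datetime, timedelta


def element_sum(rows):
    """Element-wise sum over a list of equally long lists."""
    return [sum(column) for column in zip(*rows)]


def windowed_sketch(features, targets, size, offset=0,
                    window_function=element_sum):
    """Build one windowed feature per target date (illustrative only)."""
    dates = sorted(features)
    X, y, out_dates = [], [], []
    for i, date in enumerate(dates):
        end = i + offset        # position of the newest element in the window
        start = end - size + 1  # position of the oldest element in the window
        if start < 0 or end >= len(dates) or date not in targets:
            continue            # the window does not fit at this position
        window = [features[d] for d in dates[start:end + 1]]
        X.append(window_function(window))
        y.append(targets[date])
        out_dates.append(date)
    return X, y, out_dates


# Five invented consecutive days with 2-dimensional features.
first = datetime(2008, 8, 25)
days = [first + timedelta(days=k) for k in range(5)]
features = {d: [float(k + 1), 1.0] for k, d in enumerate(days)}
targets = {d: k for k, d in enumerate(days)}

X, y, dates = windowed_sketch(features, targets, size=3, offset=-1)
print(X)  # [[6.0, 3.0], [9.0, 3.0]]
print(y)  # [3, 4]
```

With size 3 and offset -1, the first usable target is the fourth day, because its window covers the three preceding days.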
twistml.features.window.window_element_avg(window_features)
Calculates an element-wise average of multiple np.arrays.
twistml.features.window.window_element_sum(window_features)
Calculates an element-wise sum of multiple np.arrays.
twistml.features.window.window_stack(window_features)
Horizontally stacks multiple np.arrays or sparse matrices.
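For dense arrays, the three window functions presumably correspond to the following numpy operations. This is an assumption based on their one-line descriptions, not a copy of the twistml code, and the input arrays are invented.

```python
import numpy as np

# Invented features of three consecutive days inside one window.
window_features = [
    np.array([1.0, 2.0]),
    np.array([3.0, 4.0]),
    np.array([5.0, 6.0]),
]

element_sum = np.sum(window_features, axis=0)   # element-wise sum
element_avg = np.mean(window_features, axis=0)  # element-wise average
stacked = np.hstack(window_features)            # concatenation

print(element_sum)    # [ 9. 12.]
print(element_avg)    # [3. 4.]
print(stacked.shape)  # (6,)
```

Sum and average keep the dimensionality of a single day, whereas stacking multiplies it by the window size.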
Module contents
Author: Matthias Manhertz
Licence: MIT