yellowbrick.text package¶

Submodules¶

yellowbrick.text.base module¶

Base classes for text feature visualizers and text feature selection tools.

class yellowbrick.text.base.TextVisualizer(ax=None, **kwargs)[source]¶

Bases: yellowbrick.base.Visualizer, sklearn.base.TransformerMixin

Base class for text feature visualization to investigate documents individually or as a full corpus.

TextVisualizers are used after a text corpus has been transformed in some way (e.g. normalized through stemming or lemmatization, via stopwords removal, or through vectorization). Thus a TextVisualizer is itself a transformer and can be used in a Scikit-Learn Pipeline to perform automatic visual analysis during build.

Accepts as input a DataFrame or NumPy array.

fit(X, y=None, **fit_params)[source]¶

This method performs preliminary computations in order to set up the figure, compute statistics, or perform other analyses. It can also call drawing methods in order to set up various non-instance-related figure elements.

Parameters:

X : ndarray or DataFrame of shape n x m

A matrix of n instances with m features

y : ndarray or Series of length n

An array or series of target or class values

fit_params: dict

keyword arguments for parameter fitting.

Returns:

self : instance

Returns the instance of the transformer/visualizer

fit_transform_poof(X, y=None, **kwargs)[source]¶

Fit to data, transform it, then visualize it.

Fits the text visualizer to X and y with optional parameters by passing in all of kwargs, then calls poof with the same kwargs. This method must return the result of the transform method.

Parameters:

X : ndarray or DataFrame of shape n x m

A matrix of n instances with m features

y : ndarray or Series of length n

An array or series of target or class values

kwargs : dict

Pass generic arguments to the drawing method

Returns:

X : numpy array

This method must return a numpy array with the same shape as X.

transform(X)[source]¶

Primarily a pass-through to ensure that the text visualizer will work in a pipeline setting. This method can also call drawing methods in order to ensure that the visualization is constructed.

Returns:

X : numpy array

This method must return a numpy array with the same shape as X.
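
The contract described above for fit, transform, and fit_transform_poof can be sketched with a small stand-in class. This is a minimal illustration only: PassThroughVisualizer and its attributes are hypothetical names, not Yellowbrick code, and the drawing step is replaced by a flag.

```python
class PassThroughVisualizer:
    """Hypothetical stand-in for a TextVisualizer subclass (not Yellowbrick code)."""

    def __init__(self):
        self.n_documents_ = None
        self.shown_ = False

    def fit(self, X, y=None, **fit_params):
        # Compute whatever statistics the figure needs, then return self
        # so that calls can be chained.
        self.n_documents_ = len(X)
        return self

    def transform(self, X):
        # Pass the data through unchanged so later pipeline steps still work.
        return X

    def poof(self, **kwargs):
        # A real visualizer would finalize and render the figure here.
        self.shown_ = True

    def fit_transform_poof(self, X, y=None, **kwargs):
        # Fit, transform, show the figure, and return the transformed data.
        Xt = self.fit(X, y, **kwargs).transform(X)
        self.poof(**kwargs)
        return Xt


docs = [[1, 0, 2], [0, 3, 1]]  # toy vectorized corpus: 2 documents, 3 terms
viz = PassThroughVisualizer()
result = viz.fit_transform_poof(docs)
```

Because transform hands back its input untouched, a visualizer written this way can sit in the middle of a pipeline without changing what the downstream steps receive.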

yellowbrick.text.tsne module¶

Implements TSNE visualizations of documents in 2D space.

class yellowbrick.text.tsne.TSNEVisualizer(ax=None, decompose='svd', decompose_by=50, classes=None, colors=None, colormap=None, **kwargs)[source]¶

Bases: yellowbrick.text.base.TextVisualizer

Display a projection of a vectorized corpus in two dimensions using TSNE, a nonlinear dimensionality reduction method that is particularly well suited to embedding in two or three dimensions for visualization as a scatter plot. TSNE is widely used in text analysis to show clusters or groups of documents or utterances and their relative proximities.

TSNE will return a scatter plot of the vectorized corpus, such that each point represents a document or utterance. The distance between two points in the visual space is embedded using the probability distribution of pairwise similarities in the higher dimensionality; thus TSNE shows clusters of similar documents and the relationships between groups of documents as a scatter plot.

TSNE can be used with either clustering or classification. When target values are passed as y in fit(), points are colored by class, and the classes argument supplies the names shown in the legend. For example, passing cluster.labels_ as y gives every point in the same cluster the same color. This extends the neighbor embedding with more information about similarity, and can allow better interpretation of both clusters and classes.

For more, see https://lvdmaaten.github.io/tsne/

Parameters:

ax : matplotlib axes

The axes to plot the figure on.

decompose : string or None

A preliminary decomposition is often used prior to TSNE to make the projection faster. Specify “svd” for sparse data or “pca” for dense data. If decompose is None, the original data set will be used.

decompose_by : int

Specify the number of components for the preliminary decomposition (50 by default); the more components, the slower TSNE will be.

classes : list of strings

The names of the classes in the target, used to create a legend.

colors : list or tuple of colors

Specify the colors for each individual class

colormap : string or matplotlib cmap

Sequential colormap for continuous target

kwargs : dict

Pass any additional keyword arguments to the TSNE transformer.

draw(points, target=None, **kwargs)[source]¶

Called from the fit method, this method draws the TSNE scatter plot from a set of decomposed points in 2 dimensions. This method also accepts a third dimension, target, which is used to specify the colors of each of the points. If the target is not specified, then the points are plotted as a single cloud to show similar documents.
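
One way to picture the role of target in draw is to group the projected points by label before plotting, so that each group can be given its own color. The helper below is a hypothetical sketch of that grouping step, not the library's implementation; the point coordinates and labels are made up for illustration.

```python
from collections import defaultdict


def group_points_by_target(points, target=None):
    """Group 2D points by label; with no target, return a single group."""
    if target is None:
        # No labels: every point belongs to one undifferentiated cloud.
        return {None: list(points)}
    if len(points) != len(target):
        raise ValueError("points and target must have the same length")
    groups = defaultdict(list)
    for point, label in zip(points, target):
        groups[label].append(point)
    return dict(groups)


points = [(0.1, 0.2), (0.3, 0.1), (2.0, 2.1), (2.2, 1.9)]
labels = ["sports", "sports", "politics", "politics"]

by_class = group_points_by_target(points, labels)  # one group per class
single_cloud = group_points_by_target(points)      # one group for everything
```

Each group could then be passed to a separate scatter call with its own color, which is also what makes a per-class legend possible.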

finalize(**kwargs)[source]¶

Finalize the drawing by adding a title and legend, and removing the axes objects that do not convey information about TSNE.

fit(X, y=None, **kwargs)[source]¶

The fit method is the primary drawing input for the TSNE projection since the visualization requires both X and an optional y value. The fit method expects an array of numeric vectors, so text documents must be vectorized before passing them to this method.

Parameters:

X : ndarray or DataFrame of shape n x m

A matrix of n instances with m features representing the corpus of vectorized documents to visualize with TSNE.

y : ndarray or Series of length n

An optional array or series of target or class values for instances. If this is specified, then the points will be colored according to their class. Often cluster labels are passed in to color the documents in cluster space, so this method is used both for classification and clustering methods.

kwargs : dict

Pass generic arguments to the drawing method

Returns:

self : instance

Returns the instance of the transformer/visualizer

make_transformer(decompose='svd', decompose_by=50, tsne_kwargs={})[source]¶

Creates an internal transformer pipeline to project the data set into 2D space using TSNE, applying a pre-decomposition technique ahead of embedding if necessary. This method will reset the transformer on the class, and can be used to explore different decompositions.

Parameters:

decompose : string or None

A preliminary decomposition is often used prior to TSNE to make the projection faster. Specify “svd” for sparse data or “pca” for dense data. If decompose is None, the original data set will be used.

decompose_by : int

Specify the number of components for the preliminary decomposition (50 by default); the more components, the slower TSNE will be.

tsne_kwargs : dict

Keyword arguments passed through to the TSNE transformer.

Returns:

transformer : Pipeline

Pipelined transformer for TSNE projections
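
The kind of pipeline described above can be sketched directly with scikit-learn. This is an approximation under stated assumptions, not the library's source: build_projection is a hypothetical helper, the step names are arbitrary, and the random matrix stands in for a vectorized corpus.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline


def build_projection(decompose="svd", decompose_by=50, **tsne_kwargs):
    """Hypothetical helper: optional pre-decomposition followed by 2D TSNE."""
    decompositions = {"svd": TruncatedSVD, "pca": PCA}
    steps = []
    if decompose is not None:
        if decompose not in decompositions:
            raise ValueError("decompose must be 'svd', 'pca', or None")
        # Reduce to decompose_by components first so TSNE has less work to do.
        reducer = decompositions[decompose](n_components=decompose_by)
        steps.append((decompose, reducer))
    steps.append(("tsne", TSNE(n_components=2, **tsne_kwargs)))
    return Pipeline(steps)


# Toy stand-in for a vectorized corpus: 40 documents, 60 features.
rng = np.random.default_rng(0)
X = rng.random((40, 60))

# perplexity must be smaller than the number of samples, so keep it low here.
projection = build_projection("svd", decompose_by=10, perplexity=5, random_state=0)
points = projection.fit_transform(X)
```

TruncatedSVD accepts sparse input, which is why "svd" suits the sparse matrices that text vectorizers typically produce, while PCA is the usual choice for dense data.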

yellowbrick.text.tsne.tsne(X, y=None, ax=None, decompose='svd', decompose_by=50, classes=None, colors=None, colormap=None, **kwargs)[source]¶

Display a projection of a vectorized corpus in two dimensions using TSNE, a nonlinear dimensionality reduction method that is particularly well suited to embedding in two or three dimensions for visualization as a scatter plot. TSNE is widely used in text analysis to show clusters or groups of documents or utterances and their relative proximities.

Parameters:

X : ndarray or DataFrame of shape n x m

A matrix of n instances with m features representing the corpus of vectorized documents to visualize with TSNE.

y : ndarray or Series of length n

An optional array or series of target or class values for instances. If this is specified, then the points will be colored according to their class. Often cluster labels are passed in to color the documents in cluster space, so this method is used both for classification and clustering methods.

ax : matplotlib axes

The axes to plot the figure on.

decompose : string or None

A preliminary decomposition is often used prior to TSNE to make the projection faster. Specify “svd” for sparse data or “pca” for dense data. If decompose is None, the original data set will be used.

decompose_by : int

Specify the number of components for the preliminary decomposition (50 by default); the more components, the slower TSNE will be.

classes : list of strings

The names of the classes in the target, used to create a legend.

colors : list or tuple of colors

Specify the colors for each individual class

colormap : string or matplotlib cmap

Sequential colormap for continuous target

kwargs : dict

Pass any additional keyword arguments to the TSNE transformer.

Returns:

ax : matplotlib axes

Returns the axes that the TSNE scatter plot was drawn on.

yellowbrick.text.freqdist module¶

Implementations of frequency distributions for text visualization.

class yellowbrick.text.freqdist.FreqDistVisualizer(ax=None, color=None, N=50, **kwargs)[source]¶

Bases: yellowbrick.text.base.TextVisualizer

A frequency distribution tells us the frequency of each vocabulary item in the text. In general, it could count any kind of observable event. It is a distribution because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.

Parameters:

ax : matplotlib axes

The axes to plot the figure on.

color : list or tuple of colors

Specify color for bars

N: integer

Top N tokens to be plotted.

kwargs : dict

Pass any additional keyword arguments to the super class.

These parameters can be influenced later on in the visualization process, but can and should be set as early as possible.
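
As a concrete picture of what a frequency distribution is, the sketch below counts tokens with the standard library. It does not use Yellowbrick; the sentence and the value of N are made up for illustration.

```python
from collections import Counter

# Toy tokenized text; a real corpus would be preprocessed first.
tokens = "the cat sat on the mat the end".split()

# Map each vocabulary item to the number of times it occurs.
freq = Counter(tokens)

# The counts add up to the total number of tokens, which is what makes
# this a distribution of tokens over the vocabulary.
total_tokens = sum(freq.values())

# Keep only the N most frequent items, as the N parameter does for the plot.
N = 2
top_n = freq.most_common(N)
```

The N parameter plays the same role as the argument to most_common here: it limits the plot to the most frequent vocabulary items.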

draw(**kwargs)[source]¶

Called from the fit method, this method creates the canvas and draws the distribution plot on it.

Parameters:

kwargs : dict

Generic keyword arguments.

finalize(**kwargs)[source]¶

The finalize method executes any subclass-specific axes finalization steps. The user calls poof, which in turn calls finalize.

Parameters:

kwargs : dict

Generic keyword arguments.

fit(docs, features)[source]¶

The fit method is the primary drawing input for the frequency distribution visualization. It requires vectorized lists of documents and a list of features, which are the actual words from the original corpus (needed to label the x-axis ticks).

Parameters:

docs : ndarray or DataFrame of shape n x m

A matrix of n instances with m features representing the corpus of vectorized documents.

features : list

List of corpus vocabulary words

Text documents must be vectorized before passing to fit()
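
To make the relationship between docs and features concrete, the sketch below sums each column of a small document-term matrix, pairs the totals with the vocabulary, and keeps the most frequent terms. top_terms is a hypothetical helper written in plain Python, not part of Yellowbrick, and the counts are invented.

```python
def top_terms(docs, features, n=50):
    """Sum each column of a document-term matrix and return the n most frequent terms."""
    if any(len(row) != len(features) for row in docs):
        raise ValueError("each row must have one count per feature")
    # zip(*docs) iterates over columns, i.e. over vocabulary items.
    totals = [sum(column) for column in zip(*docs)]
    # Sort (term, count) pairs from most to least frequent.
    ranked = sorted(zip(features, totals), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


features = ["cat", "dog", "fish"]  # vocabulary, one entry per column
docs = [
    [2, 0, 1],  # document 1
    [1, 3, 0],  # document 2
    [0, 1, 0],  # document 3
]
top = top_terms(docs, features, n=2)
```

The features list is needed because the matrix itself only holds counts; without it there would be nothing to label the x-axis ticks with.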

freq_dist()[source]¶

Called from the fit method, this method gets all the words from the corpus and their corresponding frequency counts.


get_counts()[source]¶

Called from the fit method, this method sorts the words from the corpus with their corresponding frequency counts in reverse order.


yellowbrick.text.freqdist.freqdist(X, y=None, ax=None, color=None, N=50, **kwargs)[source]¶

Displays frequency distribution plot for text.

This helper function is a quick wrapper around FreqDistVisualizer for one-off analysis.

Parameters:

X: ndarray or DataFrame of shape n x m

A matrix of n instances with m features. In the case of text, X is a list of lists of already preprocessed words.

y: ndarray or Series of length n

An array or series of target or class values

ax: matplotlib axes

The axes to plot the figure on.

color: string

Specify color for barchart

N: integer

Top N tokens to be plotted.

kwargs: dict

Keyword arguments passed to the super class.

Returns:

ax: matplotlib axes

Returns the axes that the plot was drawn on.