class PrincipalComponentsModel

Entity PrincipalComponentsModel

Attributes

last_read_date    Read-only property - Last time this model’s data was accessed.
name              Set or get the name of the model object.
status            Read-only property - Current model life cycle status.

Methods

__init__(self[, name, _info]) Create a ‘new’ instance of a Principal Components model.
predict(self, frame[, mean_centered, t_squared_index, observation_columns, ...]) [ALPHA] Predict using principal components model.
publish(self) [BETA] Creates a tar file that will be used as input to the scoring engine.
train(self, frame, observation_columns[, mean_centered, k]) Build principal components model.

__init__(self, name=None)

Create a ‘new’ instance of a Principal Components model.

Parameters:

name : unicode (default=None)

User supplied name.

Returns:

: Model

A new instance of PrincipalComponentsModel

Principal component analysis [R43] is a statistical procedure that converts possibly correlated features into linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This implementation computes the principal components by Singular Value Decomposition [R44] of the data and gives the user the option to mean center the data first.

The typical workflow is to initialize the model, train it by specifying the observation columns of the frame and the number of components, and then use it to predict principal components. The implementation is built on the MLlib Singular Value Decomposition [R45], with two additional features: 1) mean centering of the data during train and predict, and 2) computation of the t-squared index during prediction.

Footnotes

[R43] https://en.wikipedia.org/wiki/Principal_component_analysis
[R44] https://en.wikipedia.org/wiki/Singular_value_decomposition
[R45] https://spark.apache.org/docs/1.5.0/mllib-dimensionality-reduction.html
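The computation described above can be approximated outside the library with a minimal NumPy sketch. This is not the library's implementation: the helper name `pca_by_svd` is hypothetical, and the t-squared formula used here (the sum of squared scores, each divided by its singular value) is an assumption inferred from the sample output in the Examples section rather than taken from the library source.

```python
import numpy as np


def pca_by_svd(data, c, mean_centered=True):
    """Hypothetical helper: PCA scores and a t-squared index via SVD."""
    x = np.asarray(data, dtype=float)
    column_means = x.mean(axis=0)
    if mean_centered:
        x = x - column_means
    # Thin SVD: x = u @ diag(s) @ vt; the rows of vt are right singular vectors.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    # Project each row onto the first c right singular vectors.
    scores = x @ vt[:c].T
    # Assumed t-squared definition: squared scores scaled by singular values.
    t_squared = np.sum((scores / s[:c]) ** 2, axis=1)
    return column_means, s, scores, t_squared


# The six-column sample frame used in the Examples section.
rows = [
    [2.6, 1.7, 0.3, 1.5, 0.8, 0.7],
    [3.3, 1.8, 0.4, 0.7, 0.9, 0.8],
    [3.5, 1.7, 0.3, 1.7, 0.6, 0.4],
    [3.7, 1.0, 0.5, 1.2, 0.6, 0.3],
    [1.5, 1.2, 0.5, 1.4, 0.6, 0.4],
]
column_means, s, scores, t_squared = pca_by_svd(rows, c=3)
```

Singular vectors are only determined up to sign, so individual score columns from a sketch like this may come out negated relative to the library's `p_1`, `p_2`, ... columns; the t-squared index is unaffected by such sign flips.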

Examples

Consider the following model, trained and tested on the sample data set in frame ‘frame’, which contains six columns.

>>> frame.inspect()
[#]  1    2    3    4    5    6
=================================
[0]  2.6  1.7  0.3  1.5  0.8  0.7
[1]  3.3  1.8  0.4  0.7  0.9  0.8
[2]  3.5  1.7  0.3  1.7  0.6  0.4
[3]  3.7  1.0  0.5  1.2  0.6  0.3
[4]  1.5  1.2  0.5  1.4  0.6  0.4
>>> model = ta.PrincipalComponentsModel()
[===Job Progress===]
>>> train_output = model.train(frame, ['1','2','3','4','5','6'], mean_centered=True, k=6)
[===Job Progress===]
>>> train_output
{u'k': 6, u'column_means': [2.92, 1.48, 0.4, 1.3, 0.7, 0.52], u'observation_columns': [u'1', u'2', u'3', u'4', u'5', u'6'], u'mean_centered': True, u'right_singular_vectors': [[-0.9906468642089332, 0.11801374544146297, 0.025647010353320242, 0.048525096275535286, -0.03252674285233843, 0.02492194235385788], [-0.07735139793384983, -0.6023104604841424, 0.6064054412059493, -0.4961696216881456, -0.12443126544906798, -0.042940400527513106], [0.028850639537397756, 0.07268697636708575, -0.2446393640059097, -0.17103491337994586, -0.9368360903028429, 0.16468461966702994], [0.10576208410025369, 0.5480329468552815, 0.75230590898727, 0.2866144016081251, -0.20032699877119212, 0.015210798298156058], [-0.024072151446194606, -0.30472267167437633, -0.01125936644585159, 0.48934541040601953, -0.24758962014033054, -0.7782533654748628], [-0.0061729539518418355, -0.47414707747028795, 0.07533458226215438, 0.6329307498105832, -0.06607181431092408, 0.6037419362435869]], u'singular_values': [1.8048170096632419, 0.8835344148403882, 0.7367461843294286, 0.15234027471064404, 5.90167578565564e-09, 4.478916578455115e-09]}
>>> train_output['column_means']
[2.92, 1.48, 0.4, 1.3, 0.7, 0.52]
>>> predicted_frame = model.predict(frame, mean_centered=True, t_squared_index=True, observation_columns=['1','2','3','4','5','6'], c=3)
[===Job Progress===]
>>> predicted_frame.inspect()
[#]  1    2    3    4    5    6    p_1              p_2
===================================================================
[0]  2.6  1.7  0.3  1.5  0.8  0.7   0.314738695012  -0.183753549226
[1]  3.3  1.8  0.4  0.7  0.9  0.8  -0.471198363594  -0.670419608227
[2]  3.5  1.7  0.3  1.7  0.6  0.4  -0.549024749481   0.235254068619
[3]  3.7  1.0  0.5  1.2  0.6  0.3  -0.739501762517   0.468409769639
[4]  1.5  1.2  0.5  1.4  0.6  0.4    1.44498618058   0.150509319195

[#]  p_3              t_squared_index
=====================================
[0]   0.312561560113   0.253649649849
[1]  -0.228746130528   0.740327252782
[2]   0.465756549839   0.563086507007
[3]  -0.386212142456   0.723748467549
[4]  -0.163359836968   0.719188122813
>>> model.publish()
[===Job Progress===]
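When choosing the number of components `c` to pass to predict, it can help to look at how much of the total variance each component carries. For mean-centered data, the squared singular values are proportional to the variance captured by each component. The sketch below works on the `singular_values` list copied from the `train_output` shown above; the helper name `explained_variance_ratio` is hypothetical and is not part of the library.

```python
from itertools import accumulate

# Singular values copied from the train_output in the example session.
singular_values = [
    1.8048170096632419,
    0.8835344148403882,
    0.7367461843294286,
    0.15234027471064404,
    5.90167578565564e-09,
    4.478916578455115e-09,
]


def explained_variance_ratio(values):
    """Hypothetical helper: share of total variance per component."""
    squared = [v * v for v in values]
    total = sum(squared)
    return [sq / total for sq in squared]


ratios = explained_variance_ratio(singular_values)
# Running total: variance retained when keeping the first 1, 2, ... components.
cumulative = list(accumulate(ratios))
```

Here the last two singular values are effectively zero, which is expected: five mean-centered rows span at most four dimensions.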