PrincipalComponentsModel __init__


__init__(self, name=None)

Create a new instance of a Principal Components model.

Parameters:

name : unicode (default=None)

User supplied name.

Returns:

: Model

A new instance of PrincipalComponentsModel

Principal component analysis [R46] is a statistical procedure that converts a set of possibly correlated features into linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This implementation computes the principal components via Singular Value Decomposition [R47] of the data, with an option to mean center the data first. A Principal Components model is first initialized, then trained by specifying the observation columns of the frame and the number of components, and finally used to predict principal components for new data. The MLlib Singular Value Decomposition [R48] implementation is used internally, with two additional features: 1) mean centering the data during train and predict, and 2) computing the t-squared index during prediction.

Footnotes

[R46]https://en.wikipedia.org/wiki/Principal_component_analysis
[R47]https://en.wikipedia.org/wiki/Singular_value_decomposition
[R48]https://spark.apache.org/docs/1.5.0/mllib-dimensionality-reduction.html
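Conceptually, training amounts to a mean-centered SVD. The following sketch reproduces the computation with NumPy (rather than the MLlib implementation the model actually uses); note that the signs of the singular vectors are arbitrary, so they may differ from the model's output by a factor of -1.

```python
import numpy as np

# The sample frame from the example below, as a matrix (columns 1..6).
X = np.array([
    [2.6, 1.7, 0.3, 1.5, 0.8, 0.7],
    [3.3, 1.8, 0.4, 0.7, 0.9, 0.8],
    [3.5, 1.7, 0.3, 1.7, 0.6, 0.4],
    [3.7, 1.0, 0.5, 1.2, 0.6, 0.3],
    [1.5, 1.2, 0.5, 1.4, 0.6, 0.4],
])

# Mean centering: subtract each column's mean. These means are what
# train() reports as 'column_means'.
column_means = X.mean(axis=0)
centered = X - column_means

# SVD of the centered data. The rows of Vt are the right singular
# vectors; 'singular_values' corresponds to train()'s 'singular_values'.
U, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)

# Projecting the centered data onto the first c right singular vectors
# yields the principal-component scores (the p_* columns of predict()).
c = 3
scores = centered.dot(Vt[:c].T)
```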

Examples

Consider the following model, trained and tested on a sample frame ‘frame’ containing six columns.

>>> frame.inspect()
[#]  1    2    3    4    5    6
=================================
[0]  2.6  1.7  0.3  1.5  0.8  0.7
[1]  3.3  1.8  0.4  0.7  0.9  0.8
[2]  3.5  1.7  0.3  1.7  0.6  0.4
[3]  3.7  1.0  0.5  1.2  0.6  0.3
[4]  1.5  1.2  0.5  1.4  0.6  0.4
>>> model = ta.PrincipalComponentsModel()
[===Job Progress===]
>>> train_output = model.train(frame, ['1','2','3','4','5','6'], mean_centered=True, k=6)
[===Job Progress===]
>>> train_output
{u'k': 6, u'column_means': [2.92, 1.48, 0.4, 1.3, 0.7, 0.52], u'observation_columns': [u'1', u'2', u'3', u'4', u'5', u'6'], u'mean_centered': True, u'right_singular_vectors': [[-0.9906468642089332, 0.11801374544146297, 0.025647010353320242, 0.048525096275535286, -0.03252674285233843, 0.02492194235385788], [-0.07735139793384983, -0.6023104604841424, 0.6064054412059493, -0.4961696216881456, -0.12443126544906798, -0.042940400527513106], [0.028850639537397756, 0.07268697636708575, -0.2446393640059097, -0.17103491337994586, -0.9368360903028429, 0.16468461966702994], [0.10576208410025369, 0.5480329468552815, 0.75230590898727, 0.2866144016081251, -0.20032699877119212, 0.015210798298156058], [-0.024072151446194606, -0.30472267167437633, -0.01125936644585159, 0.48934541040601953, -0.24758962014033054, -0.7782533654748628], [-0.0061729539518418355, -0.47414707747028795, 0.07533458226215438, 0.6329307498105832, -0.06607181431092408, 0.6037419362435869]], u'singular_values': [1.8048170096632419, 0.8835344148403882, 0.7367461843294286, 0.15234027471064404, 5.90167578565564e-09, 4.478916578455115e-09]}
>>> train_output['column_means']
[2.92, 1.48, 0.4, 1.3, 0.7, 0.52]
>>> predicted_frame = model.predict(frame, mean_centered=True, t_squared_index=True, observation_columns=['1','2','3','4','5','6'], c=3)
[===Job Progress===]
>>> predicted_frame.inspect()
[#]  1    2    3    4    5    6    p_1              p_2
===================================================================
[0]  2.6  1.7  0.3  1.5  0.8  0.7   0.314738695012  -0.183753549226
[1]  3.3  1.8  0.4  0.7  0.9  0.8  -0.471198363594  -0.670419608227
[2]  3.5  1.7  0.3  1.7  0.6  0.4  -0.549024749481   0.235254068619
[3]  3.7  1.0  0.5  1.2  0.6  0.3  -0.739501762517   0.468409769639
[4]  1.5  1.2  0.5  1.4  0.6  0.4    1.44498618058   0.150509319195

[#]  p_3              t_squared_index
=====================================
[0]   0.312561560113   0.253649649849
[1]  -0.228746130528   0.740327252782
[2]   0.465756549839   0.563086507007
[3]  -0.386212142456   0.723748467549
[4]  -0.163359836968   0.719188122813
>>> model.publish()
[===Job Progress===]
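The t_squared_index column can be reconstructed from the values reported above: for each row it is the sum, over the c retained components, of the squared score divided by the squared singular value (a Hotelling's T²-style statistic). A sketch, using the scores and singular values from this example:

```python
import numpy as np

# Principal-component scores (columns p_1..p_3) and the first three
# singular values, taken from the example output above.
scores = np.array([
    [ 0.314738695012, -0.183753549226,  0.312561560113],
    [-0.471198363594, -0.670419608227, -0.228746130528],
    [-0.549024749481,  0.235254068619,  0.465756549839],
    [-0.739501762517,  0.468409769639, -0.386212142456],
    [ 1.44498618058,   0.150509319195, -0.163359836968],
])
singular_values = np.array([1.8048170096632419,
                            0.8835344148403882,
                            0.7367461843294286])

# t-squared index per row: sum_i score_i**2 / sigma_i**2 over the
# c = 3 retained components.
t_squared = (scores ** 2 / singular_values ** 2).sum(axis=1)
```

On this data the result matches the t_squared_index column of the predicted frame.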