PrincipalComponentsModel __init__

__init__(self, name=None)

Create a 'new' instance of a Principal Components model.
Parameters:
    name : unicode (default=None)
        User supplied name.

Returns:
    Model
        A new instance of PrincipalComponentsModel.
Principal component analysis [R46] is a statistical procedure that converts a set of possibly correlated features into linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This implementation computes the principal components via Singular Value Decomposition [R47] of the data, with an option to mean-center the data first. The Principal Components model is initialized, trained by specifying the observation columns of the frame and the number of components, and then used to predict principal components. The MLlib Singular Value Decomposition [R48] implementation is used, with additional features to 1) mean-center the data during train and predict and 2) compute the t-squared index during prediction.
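The train/predict flow described above can be sketched in plain NumPy. This is an illustrative standalone sketch, not the model's implementation: the real model uses MLlib's distributed SVD, and the helper names `pca_train` and `pca_project` are hypothetical.

```python
import numpy as np

def pca_train(X, k, mean_centered=True):
    """Optionally mean-center, run a thin SVD, keep the top k components."""
    X = np.asarray(X, dtype=float)
    column_means = X.mean(axis=0)
    centered = X - column_means if mean_centered else X
    # centered = U @ diag(s) @ Vt; rows of Vt are the right singular vectors
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    return column_means, Vt[:k], s[:k]

def pca_project(X, column_means, Vt, mean_centered=True):
    """Project observations onto the retained principal components."""
    X = np.asarray(X, dtype=float)
    if mean_centered:
        X = X - column_means
    return X @ Vt.T
```

On the six-column frame used in the Examples section below, this sketch recovers the same column means and leading singular values that train() reports (individual components may differ by a sign flip, which is inherent to SVD).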
Footnotes

[R46] https://en.wikipedia.org/wiki/Principal_component_analysis
[R47] https://en.wikipedia.org/wiki/Singular_value_decomposition
[R48] https://spark.apache.org/docs/1.5.0/mllib-dimensionality-reduction.html

Examples
Consider the following model, trained and tested on the sample frame 'frame' below, which contains six columns.
>>> frame.inspect()
[#]  1    2    3    4    5    6
=================================
[0]  2.6  1.7  0.3  1.5  0.8  0.7
[1]  3.3  1.8  0.4  0.7  0.9  0.8
[2]  3.5  1.7  0.3  1.7  0.6  0.4
[3]  3.7  1.0  0.5  1.2  0.6  0.3
[4]  1.5  1.2  0.5  1.4  0.6  0.4

>>> model = ta.PrincipalComponentsModel()
[===Job Progress===]

>>> train_output = model.train(frame, ['1','2','3','4','5','6'], mean_centered=True, k=6)
[===Job Progress===]

>>> train_output
{u'k': 6,
 u'column_means': [2.92, 1.48, 0.4, 1.3, 0.7, 0.52],
 u'observation_columns': [u'1', u'2', u'3', u'4', u'5', u'6'],
 u'mean_centered': True,
 u'right_singular_vectors': [[-0.9906468642089332, 0.11801374544146297, 0.025647010353320242, 0.048525096275535286, -0.03252674285233843, 0.02492194235385788],
                             [-0.07735139793384983, -0.6023104604841424, 0.6064054412059493, -0.4961696216881456, -0.12443126544906798, -0.042940400527513106],
                             [0.028850639537397756, 0.07268697636708575, -0.2446393640059097, -0.17103491337994586, -0.9368360903028429, 0.16468461966702994],
                             [0.10576208410025369, 0.5480329468552815, 0.75230590898727, 0.2866144016081251, -0.20032699877119212, 0.015210798298156058],
                             [-0.024072151446194606, -0.30472267167437633, -0.01125936644585159, 0.48934541040601953, -0.24758962014033054, -0.7782533654748628],
                             [-0.0061729539518418355, -0.47414707747028795, 0.07533458226215438, 0.6329307498105832, -0.06607181431092408, 0.6037419362435869]],
 u'singular_values': [1.8048170096632419, 0.8835344148403882, 0.7367461843294286, 0.15234027471064404, 5.90167578565564e-09, 4.478916578455115e-09]}

>>> train_output['column_means']
[2.92, 1.48, 0.4, 1.3, 0.7, 0.52]

>>> predicted_frame = model.predict(frame, mean_centered=True, t_squared_index=True, observation_columns=['1','2','3','4','5','6'], c=3)
[===Job Progress===]

>>> predicted_frame.inspect()
[#]  1    2    3    4    5    6    p_1              p_2
===================================================================
[0]  2.6  1.7  0.3  1.5  0.8  0.7   0.314738695012  -0.183753549226
[1]  3.3  1.8  0.4  0.7  0.9  0.8  -0.471198363594  -0.670419608227
[2]  3.5  1.7  0.3  1.7  0.6  0.4  -0.549024749481   0.235254068619
[3]  3.7  1.0  0.5  1.2  0.6  0.3  -0.739501762517   0.468409769639
[4]  1.5  1.2  0.5  1.4  0.6  0.4   1.44498618058    0.150509319195

[#]  p_3              t_squared_index
=====================================
[0]   0.312561560113  0.253649649849
[1]  -0.228746130528  0.740327252782
[2]   0.465756549839  0.563086507007
[3]  -0.386212142456  0.723748467549
[4]  -0.163359836968  0.719188122813

>>> model.publish()
[===Job Progress===]
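The t_squared_index column above is consistent with Hotelling's T-squared statistic computed from the principal-component scores y_j and the singular values s_j: for each row, t^2 = sum_j (y_j / s_j)^2. A minimal sketch (the helper name is illustrative, not part of the API):

```python
import numpy as np

def t_squared_index(scores, singular_values):
    """Hotelling's t-squared index: sum of squared scores scaled by
    the corresponding squared singular values."""
    scores = np.asarray(scores, dtype=float)
    s = np.asarray(singular_values, dtype=float)
    return np.sum((scores / s) ** 2, axis=-1)
```

For row [0], the scores (0.314738695012, -0.183753549226, 0.312561560113) together with the first three singular values give approximately 0.2536, matching the reported value.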