Models PrincipalComponentsModel

class PrincipalComponentsModel

Entity PrincipalComponentsModel
Attributes

last_read_date    Read-only property - Last time this model’s data was accessed.
name              Set or get the name of the model object.
status            Read-only property - Current model life cycle status.

Methods

__init__(self[, name, _info])
    Create a ‘new’ instance of a Principal Components model.
predict(self, frame[, mean_centered, t_squared_index, observation_columns, ...])
    [ALPHA] Predict using principal components model.
publish(self)
    [BETA] Creates a tar file that will be used as input to the scoring engine.
train(self, frame, observation_columns[, mean_centered, k])
    Build principal components model.

__init__(self, name=None)

Create a ‘new’ instance of a Principal Components model.

Parameters:
    name : unicode (default=None)
        User supplied name.

Returns:
    Model
        A new instance of PrincipalComponentsModel.

Principal component analysis [R43] is a statistical procedure that converts possibly correlated features into linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This implementation computes the principal components by Singular Value Decomposition [R44] of the data, with an option to mean center the data first. The typical workflow has three steps: initialize the model, train it by specifying the observation columns of the frame and the number of components, then use it to predict principal components. The implementation builds on the MLlib Singular Value Decomposition [R45] and adds two features: 1) mean centering of the data during train and predict, and 2) computation of the t-squared index during prediction.
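The training step described above can be sketched outside the library with NumPy. This is a minimal illustration of PCA by SVD with optional mean centering, not the library's code: `train_pca_svd` is a hypothetical helper, and the sample matrix is the five-row frame used in the Examples section. On that data the printed means and singular values should agree with the train output shown there, up to floating-point differences (and the singular vectors up to sign).

```python
import numpy as np

# Sample frame from the Examples section (5 observations, 6 columns).
SAMPLE = np.array([
    [2.6, 1.7, 0.3, 1.5, 0.8, 0.7],
    [3.3, 1.8, 0.4, 0.7, 0.9, 0.8],
    [3.5, 1.7, 0.3, 1.7, 0.6, 0.4],
    [3.7, 1.0, 0.5, 1.2, 0.6, 0.3],
    [1.5, 1.2, 0.5, 1.4, 0.6, 0.4],
])


def train_pca_svd(data, k, mean_centered=True):
    """Hypothetical helper: PCA via SVD, keeping the first k components."""
    x = np.asarray(data, dtype=float)
    column_means = x.mean(axis=0)
    # Optionally subtract each column's mean before decomposing.
    centered = x - column_means if mean_centered else x
    # Thin SVD: centered = U @ diag(s) @ Vt, singular values in descending order.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    # Columns of V (rows of Vt) are the principal directions.
    right_singular_vectors = vt.T[:, :k]
    return column_means, singular_values[:k], right_singular_vectors


column_means, singular_values, right_singular_vectors = train_pca_svd(SAMPLE, k=3)
print(column_means)
print(singular_values)
```

Because the sample has only five rows, mean centering leaves at most four non-zero singular values, which is why the sketch keeps `k=3` rather than all six columns.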
Footnotes

[R43] https://en.wikipedia.org/wiki/Principal_component_analysis
[R44] https://en.wikipedia.org/wiki/Singular_value_decomposition
[R45] https://spark.apache.org/docs/1.5.0/mllib-dimensionality-reduction.html

Examples
Consider the following model trained and tested on the sample data set in frame ‘frame’, which contains six columns.
>>> frame.inspect()
[#]  1    2    3    4    5    6
=================================
[0]  2.6  1.7  0.3  1.5  0.8  0.7
[1]  3.3  1.8  0.4  0.7  0.9  0.8
[2]  3.5  1.7  0.3  1.7  0.6  0.4
[3]  3.7  1.0  0.5  1.2  0.6  0.3
[4]  1.5  1.2  0.5  1.4  0.6  0.4
>>> model = ta.PrincipalComponentsModel()
[===Job Progress===]
>>> train_output = model.train(frame, ['1','2','3','4','5','6'], mean_centered=True, k=6)
[===Job Progress===]
>>> train_output
{u'k': 6,
 u'column_means': [2.92, 1.48, 0.4, 1.3, 0.7, 0.52],
 u'observation_columns': [u'1', u'2', u'3', u'4', u'5', u'6'],
 u'mean_centered': True,
 u'right_singular_vectors': [
  [-0.9906468642089332, 0.11801374544146297, 0.025647010353320242, 0.048525096275535286, -0.03252674285233843, 0.02492194235385788],
  [-0.07735139793384983, -0.6023104604841424, 0.6064054412059493, -0.4961696216881456, -0.12443126544906798, -0.042940400527513106],
  [0.028850639537397756, 0.07268697636708575, -0.2446393640059097, -0.17103491337994586, -0.9368360903028429, 0.16468461966702994],
  [0.10576208410025369, 0.5480329468552815, 0.75230590898727, 0.2866144016081251, -0.20032699877119212, 0.015210798298156058],
  [-0.024072151446194606, -0.30472267167437633, -0.01125936644585159, 0.48934541040601953, -0.24758962014033054, -0.7782533654748628],
  [-0.0061729539518418355, -0.47414707747028795, 0.07533458226215438, 0.6329307498105832, -0.06607181431092408, 0.6037419362435869]],
 u'singular_values': [1.8048170096632419, 0.8835344148403882, 0.7367461843294286, 0.15234027471064404, 5.90167578565564e-09, 4.478916578455115e-09]}
>>> train_output['column_means']
[2.92, 1.48, 0.4, 1.3, 0.7, 0.52]
>>> predicted_frame = model.predict(frame, mean_centered=True, t_squared_index=True, observation_columns=['1','2','3','4','5','6'], c=3)
[===Job Progress===]
>>> predicted_frame.inspect()
[#]  1    2    3    4    5    6    p_1              p_2
===================================================================
[0]  2.6  1.7  0.3  1.5  0.8  0.7   0.314738695012  -0.183753549226
[1]  3.3  1.8  0.4  0.7  0.9  0.8  -0.471198363594  -0.670419608227
[2]  3.5  1.7  0.3  1.7  0.6  0.4  -0.549024749481   0.235254068619
[3]  3.7  1.0  0.5  1.2  0.6  0.3  -0.739501762517   0.468409769639
[4]  1.5  1.2  0.5  1.4  0.6  0.4    1.44498618058   0.150509319195

[#]  p_3              t_squared_index
=====================================
[0]   0.312561560113   0.253649649849
[1]  -0.228746130528   0.740327252782
[2]   0.465756549839   0.563086507007
[3]  -0.386212142456   0.723748467549
[4]  -0.163359836968   0.719188122813
>>> model.publish()
[===Job Progress===]
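
The link between the train output and the predicted columns can be checked by hand. The sketch below makes two assumptions, both inferred from the numbers in this example and not from the library's documentation: each score p_j is the mean-centered row multiplied by column j of right_singular_vectors, and the t-squared index is the sum of the squared scores, each divided by its singular value. `predict_row` is a hypothetical helper, and the constants are copied from the train output above.

```python
# Values copied from train_output in the example session.
COLUMN_MEANS = [2.92, 1.48, 0.4, 1.3, 0.7, 0.52]
RIGHT_SINGULAR_VECTORS = [
    [-0.9906468642089332, 0.11801374544146297, 0.025647010353320242, 0.048525096275535286, -0.03252674285233843, 0.02492194235385788],
    [-0.07735139793384983, -0.6023104604841424, 0.6064054412059493, -0.4961696216881456, -0.12443126544906798, -0.042940400527513106],
    [0.028850639537397756, 0.07268697636708575, -0.2446393640059097, -0.17103491337994586, -0.9368360903028429, 0.16468461966702994],
    [0.10576208410025369, 0.5480329468552815, 0.75230590898727, 0.2866144016081251, -0.20032699877119212, 0.015210798298156058],
    [-0.024072151446194606, -0.30472267167437633, -0.01125936644585159, 0.48934541040601953, -0.24758962014033054, -0.7782533654748628],
    [-0.0061729539518418355, -0.47414707747028795, 0.07533458226215438, 0.6329307498105832, -0.06607181431092408, 0.6037419362435869],
]
SINGULAR_VALUES = [1.8048170096632419, 0.8835344148403882, 0.7367461843294286]


def predict_row(row, c):
    """Hypothetical helper: scores and t-squared index for one observation."""
    # Mean center the observation with the column means from training.
    centered = [value - mean for value, mean in zip(row, COLUMN_MEANS)]
    # Row i of the matrix belongs to feature i; column j to component j.
    scores = [
        sum(centered[i] * RIGHT_SINGULAR_VECTORS[i][j] for i in range(len(centered)))
        for j in range(c)
    ]
    # Assumed definition: sum over components of (score / singular value)^2.
    t_squared = sum((scores[j] / SINGULAR_VALUES[j]) ** 2 for j in range(c))
    return scores, t_squared


scores, t_squared = predict_row([2.6, 1.7, 0.3, 1.5, 0.8, 0.7], c=3)
print(scores, t_squared)
```

For the first observation this reproduces the p_1, p_2, p_3 and t_squared_index values of row [0] in the predicted frame to within rounding.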