Models RandomForestClassifierModel
class RandomForestClassifierModel
Entity RandomForestClassifierModel
Attributes
last_read_date
    Read-only property - Last time this model’s data was accessed.
name
    Set or get the name of the model object.
status
    Read-only property - Current model life cycle status.
Methods
__init__(self[, name, _info])
    Create a ‘new’ instance of a Random Forest Classifier model.
predict(self, frame[, observation_columns])
    [ALPHA] Predict the labels for the data points.
publish(self)
    [BETA] Creates a tar file that will be used as input to the scoring engine.
test(self, frame, label_column[, observation_columns])
    [ALPHA] Predict test frame labels and return metrics.
train(self, frame, label_column, observation_columns[, num_classes, ...])
    [ALPHA] Build Random Forests Classifier model.
__init__(self, name=None)
Create a ‘new’ instance of a Random Forest Classifier model.
Parameters: name : unicode (default=None)
User supplied name.
Returns: Model
A new instance of RandomForestClassifierModel.
Random Forest [R49] is a supervised ensemble learning algorithm that can be used for binary and multi-class classification. The Random Forest Classifier model is initialized, trained on columns of a frame, used to predict the labels of observations in a frame, and tested by comparing the predicted labels against the true labels. This model runs the MLlib implementation of Random Forest [R50]. During training, the decision trees are trained in parallel. During prediction, each tree’s prediction is counted as a vote for one class, and the predicted label is the class that receives the most votes. During testing, the labels of the observations are predicted and compared against the true labels using the built-in binary and multi-class Classification Metrics.
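The voting step described above can be illustrated with a small standalone sketch. It is not the MLlib implementation: the function name `majority_vote`, its list input, and the tie-breaking rule (smallest label wins) are assumptions made here for illustration only.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label that receives the most votes.

    tree_predictions is a hypothetical list holding one predicted
    class label per tree. Ties are broken in favor of the smallest
    label, which is an arbitrary choice made for this sketch.
    """
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    counts = Counter(tree_predictions)          # label -> number of votes
    best = max(counts.values())                 # highest vote count
    return min(label for label, n in counts.items() if n == best)


# Five hypothetical trees vote on a single observation.
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # prints 1 (three votes against two)
```

With `num_trees=1`, as in the example on this page, the single tree's prediction is trivially the majority.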
footnotes
[R49] https://en.wikipedia.org/wiki/Random_forest
[R50] https://spark.apache.org/docs/1.5.0/mllib-ensembles.html#random-forests
Examples
Consider the following model trained and tested on the sample data set in frame ‘frame’.
The frame contains three columns.
>>> frame.inspect()
[#]  Class  Dim_1          Dim_2
=======================================
[0]      1  19.8446136104  2.2985856384
[1]      1  16.8973559126  2.6933495054
[2]      1   5.5548729596  2.7777687995
[3]      0  46.1810010826  3.1611961917
[4]      0  44.3117586448  3.3458963222
[5]      0  34.6334526911  3.6429838715
>>> model = ta.RandomForestClassifierModel()
[===Job Progress===]
>>> train_output = model.train(frame, 'Class', ['Dim_1', 'Dim_2'], num_classes=2, num_trees=1, impurity="entropy", max_depth=4, max_bins=100)
[===Job Progress===]
>>> train_output
{u'impurity': u'entropy', u'max_bins': 100, u'observation_columns': [u'Dim_1', u'Dim_2'], u'num_nodes': 3, u'max_depth': 4, u'seed': 157264076, u'num_trees': 1, u'label_column': u'Class', u'feature_subset_category': u'all', u'num_classes': 2}
>>> train_output['num_nodes']
3
>>> train_output['label_column']
u'Class'
>>> predicted_frame = model.predict(frame, ['Dim_1', 'Dim_2'])
[===Job Progress===]
>>> predicted_frame.inspect()
[#]  Class  Dim_1          Dim_2         predicted_class
========================================================
[0]      1  19.8446136104  2.2985856384                1
[1]      1  16.8973559126  2.6933495054                1
[2]      1   5.5548729596  2.7777687995                1
[3]      0  46.1810010826  3.1611961917                0
[4]      0  44.3117586448  3.3458963222                0
[5]      0  34.6334526911  3.6429838715                0
>>> test_metrics = model.test(frame, 'Class', ['Dim_1','Dim_2'])
[===Job Progress===]
>>> test_metrics
Precision: 1.0
Recall: 1.0
Accuracy: 1.0
FMeasure: 1.0
Confusion Matrix:
            Predicted_Pos  Predicted_Neg
Actual_Pos              3              0
Actual_Neg              0              3
>>> model.publish()
[===Job Progress===]
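The metrics reported by test can be cross-checked against the confusion matrix by hand. The sketch below uses the standard textbook definitions of precision, recall, accuracy, and F-measure (with beta = 1). It assumes that these match the library's binary Classification Metrics, and the helper name `binary_metrics` is hypothetical rather than part of the API.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute binary classification metrics from confusion-matrix counts.

    tp, fn: Actual_Pos row (Predicted_Pos, Predicted_Neg)
    fp, tn: Actual_Neg row (Predicted_Pos, Predicted_Neg)
    Standard definitions are assumed; a ratio with a zero denominator
    is reported as 0.0, a convention chosen for this sketch.
    """
    def ratio(num, den):
        return float(num) / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    accuracy = ratio(tp + tn, tp + fn + fp + tn)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Counts taken from the confusion matrix printed above.
metrics = binary_metrics(tp=3, fn=0, fp=0, tn=3)
print(metrics["precision"], metrics["recall"],
      metrics["accuracy"], metrics["f_measure"])  # prints 1.0 1.0 1.0 1.0
```

Because the model is tested on the same six rows it was trained on, the perfect scores show only that the tree fits the training data. They say nothing about performance on unseen observations.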