Models RandomForestClassifierModel


class RandomForestClassifierModel

Entity RandomForestClassifierModel

Attributes

last_read_date Read-only property - Last time this model’s data was accessed.
name Set or get the name of the model object.
status Read-only property - Current model life cycle status.

Methods

__init__(self[, name, _info]) Create a ‘new’ instance of a Random Forest Classifier model.
predict(self, frame[, observation_columns]) [ALPHA] Predict the labels for the data points.
publish(self) [BETA] Create a tar file that will be used as input to the scoring engine.
test(self, frame, label_column[, observation_columns]) [ALPHA] Predict test frame labels and return metrics.
train(self, frame, label_column, observation_columns[, num_classes, ...]) [ALPHA] Build a Random Forest Classifier model.
__init__(self, name=None)

Create a ‘new’ instance of a Random Forest Classifier model.

Parameters:

name : unicode (default=None)

User supplied name.

Returns:

: Model

A new instance of RandomForestClassifierModel

Random Forest [R49] is a supervised ensemble learning algorithm that can be used for binary and multi-class classification. The Random Forest Classifier model is initialized, trained on columns of a frame, used to predict the labels of observations in a frame, and tested by comparing the predicted labels against the true labels. This model runs the MLlib implementation of Random Forest [R50].

During training, the decision trees are trained in parallel. During prediction, each tree’s prediction is counted as a vote for one class, and the predicted label is the class that receives the most votes. During testing, the labels of the observations are predicted and compared against the true labels using the built-in binary and multi-class Classification Metrics.
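The voting step described above can be illustrated with a minimal sketch in plain Python. It is not the MLlib implementation and is not part of this model's API. The helper name majority_vote is hypothetical, and the tie-breaking behavior shown here is an assumption of the sketch rather than a documented property of the model.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label that receives the most votes.

    tree_predictions holds one predicted label per decision tree.
    Hypothetical helper for illustration only.
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    counts = Counter(tree_predictions)
    # most_common(1) returns a one-element list of (label, vote_count).
    # How ties are resolved is left to Counter here, which is an
    # assumption of this sketch, not documented model behavior.
    label, _ = counts.most_common(1)[0]
    return label


# Five trees vote: three for class 1, two for class 0.
print(majority_vote([1, 0, 1, 1, 0]))  # prints 1
```

With num_trees=1, as in the example further down, the single tree's prediction is trivially the winning vote.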

Footnotes

[R49] https://en.wikipedia.org/wiki/Random_forest
[R50] https://spark.apache.org/docs/1.5.0/mllib-ensembles.html#random-forests

Examples

Consider the following model, trained and tested on the sample data set in frame ‘frame’, which contains three columns.

>>> frame.inspect()
[#]  Class  Dim_1          Dim_2
=======================================
[0]      1  19.8446136104  2.2985856384
[1]      1  16.8973559126  2.6933495054
[2]      1   5.5548729596  2.7777687995
[3]      0  46.1810010826  3.1611961917
[4]      0  44.3117586448  3.3458963222
[5]      0  34.6334526911  3.6429838715
>>> model = ta.RandomForestClassifierModel()
[===Job Progress===]
>>> train_output = model.train(frame, 'Class', ['Dim_1', 'Dim_2'], num_classes=2, num_trees=1, impurity="entropy", max_depth=4, max_bins=100)
[===Job Progress===]
>>> train_output
{u'impurity': u'entropy', u'max_bins': 100, u'observation_columns': [u'Dim_1', u'Dim_2'], u'num_nodes': 3, u'max_depth': 4, u'seed': 157264076, u'num_trees': 1, u'label_column': u'Class', u'feature_subset_category': u'all', u'num_classes': 2}
>>> train_output['num_nodes']
3
>>> train_output['label_column']
u'Class'
>>> predicted_frame = model.predict(frame, ['Dim_1', 'Dim_2'])
[===Job Progress===]
>>> predicted_frame.inspect()
[#]  Class  Dim_1          Dim_2         predicted_class
========================================================
[0]      1  19.8446136104  2.2985856384                1
[1]      1  16.8973559126  2.6933495054                1
[2]      1   5.5548729596  2.7777687995                1
[3]      0  46.1810010826  3.1611961917                0
[4]      0  44.3117586448  3.3458963222                0
[5]      0  34.6334526911  3.6429838715                0
>>> test_metrics = model.test(frame, 'Class', ['Dim_1','Dim_2'])
[===Job Progress===]
>>> test_metrics
Precision: 1.0
Recall: 1.0
Accuracy: 1.0
FMeasure: 1.0
Confusion Matrix:
            Predicted_Pos  Predicted_Neg
Actual_Pos              3              0
Actual_Neg              0              3
>>> model.publish()
[===Job Progress===]
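
The metrics reported by test in the session above can be related to the confusion matrix with a short sketch in plain Python. The helper name binary_metrics is hypothetical and is not part of the model's API. The sketch assumes the standard definitions of precision, recall, and accuracy, and assumes that FMeasure is the F1 score (beta = 1).

```python
from __future__ import division  # true division on Python 2 as well


def binary_metrics(tp, fn, fp, tn):
    """Compute binary classification metrics from confusion matrix counts.

    tp: Actual_Pos / Predicted_Pos    fn: Actual_Pos / Predicted_Neg
    fp: Actual_Neg / Predicted_Pos    tn: Actual_Neg / Predicted_Neg
    Hypothetical helper for illustration only.
    """
    total = tp + fn + fp + tn
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean of precision and recall (assumed beta = 1).
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Counts taken from the confusion matrix in the example session.
metrics = binary_metrics(tp=3, fn=0, fp=0, tn=3)
print(metrics["precision"], metrics["recall"],
      metrics["accuracy"], metrics["f_measure"])
```

With three true positives, three true negatives, and no misclassifications, every ratio evaluates to 1.0, which agrees with the values shown for test_metrics.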