KMeansModel


class KMeansModel

Attributes

last_read_date Read-only property - Last time this model’s data was accessed.
name Set or get the name of the model object.
status Read-only property - Current model life cycle status.

Methods

__init__(self[, name, _info]) Create a ‘new’ instance of a k-means model.
predict(self, frame[, observation_columns]) [BETA] Predict the cluster assignments for the data points.
publish(self) [BETA] Create a tar file to be used as input to the scoring engine.
train(self, frame, observation_columns, column_scalings[, k, max_iterations, ...]) [BETA] Train a k-means model on a frame.
__init__(self, name=None)

Create a ‘new’ instance of a k-means model.

Parameters:

name : unicode (default=None)

Name for the model.

Returns:

: Model

A new instance of KMeansModel
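
For example, assuming the toolkit client has been imported as ta (as in the examples below), a model can be created with or without an explicit name (the name shown here is chosen purely for illustration):

>>> model = ta.KMeansModel()                        # name defaults to None
>>> named_model = ta.KMeansModel(name="my_kmeans")  # hypothetical name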

k-means [R11] is an unsupervised algorithm that partitions the data into 'k' clusters. Each observation belongs to exactly one cluster: the cluster with the nearest mean. The k-means model is initialized, trained on columns of a frame, and then used to predict cluster assignments for a frame. This model runs the MLlib implementation of k-means [R12] with enhanced features: it computes the number of elements in each cluster during training, and during prediction it computes the distance of each observation from its own cluster center as well as from every other cluster center.
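
As an illustration of what predict reports (not the toolkit's own implementation), the following standalone NumPy sketch assigns an observation to the nearest of a set of hypothetical cluster centers and computes its distance from every center; the centers, the assign function, and the observation value are all made up for this example:

import numpy as np

# Hypothetical cluster centers, e.g. from a k=3 training run on one observation column.
centers = np.array([[1.2], [5.5], [8.0]])

def assign(observation):
    """Return the distance from every center and the index of the nearest one."""
    distances = np.linalg.norm(centers - observation, axis=1)  # Euclidean distance to each center
    return distances, int(np.argmin(distances))                # nearest center = predicted cluster

distances, cluster = assign(np.array([2.0]))
print(distances)  # one distance per cluster, as in the distance_from_cluster_* columns
print(cluster)    # index of the nearest center, as in the predicted_cluster column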

Footnotes

[R11] https://en.wikipedia.org/wiki/K-means_clustering
[R12] https://spark.apache.org/docs/1.5.0/mllib-clustering.html#k-means

Examples

Consider the following model, trained and tested on the sample data set in the frame 'frame', which contains two columns.

>>> frame.inspect()
[#]  data  name
===============
[0]   2.0  ab
[1]   1.0  cd
[2]   7.0  ef
[3]   1.0  gh
[4]   9.0  ij
[5]   2.0  kl
[6]   0.0  mn
[7]   6.0  op
[8]   5.0  qr
>>> model = ta.KMeansModel()
[===Job Progress===]
>>> train_output = model.train(frame, ["data"], [1], 3)
[===Job Progress===]
>>> train_output
{u'within_set_sum_of_squared_error': 5.3, u'cluster_size': {u'Cluster:1': 5, u'Cluster:3': 2, u'Cluster:2': 2}}
>>> train_output.has_key('within_set_sum_of_squared_error')
True
>>> predicted_frame = model.predict(frame, ["data"])
[===Job Progress===]
>>> predicted_frame.column_names
[u'data', u'name', u'distance_from_cluster_1', u'distance_from_cluster_2', u'distance_from_cluster_3', u'predicted_cluster']
>>> model.publish()
[===Job Progress===]
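
As a sanity check on the train output above, the within_set_sum_of_squared_error of 5.3 can be reproduced by hand from the 'data' column, assuming the three clusters are {0.0, 1.0, 1.0, 2.0, 2.0}, {5.0, 6.0} and {7.0, 9.0}, which is consistent with the reported cluster sizes of 5, 2 and 2 (this check uses plain NumPy, outside the toolkit):

>>> import numpy as np
>>> clusters = [np.array([0.0, 1.0, 1.0, 2.0, 2.0]),
...             np.array([5.0, 6.0]),
...             np.array([7.0, 9.0])]
>>> wssse = sum(float(np.sum((c - c.mean()) ** 2)) for c in clusters)
>>> round(wssse, 1)   # 2.8 + 0.5 + 2.0
5.3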