K-means clustering

In-database K-means clustering. The interface follows that of sklearn.cluster.KMeans.

Initiate clustering

Create a K-means object

class ibmdbpy.learn.kmeans.KMeans(n_clusters=3, modelname=None, max_iter=5, distance=u'euclidean', random_state=12345, idbased=False, statistics=None)

The K-means algorithm is the most widely used clustering algorithm that uses an explicit distance measure to partition the data set into clusters.

The K-means algorithm represents each cluster by a single vector computed from the training instances assigned to that cluster: the mean value for each numeric attribute and the modal (most frequent) value for each nominal attribute. This representation is called the cluster center.
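
As a rough illustration of the cluster center described above, the following in-memory sketch computes the mean of numeric attributes and the mode of nominal attributes. The function name `cluster_center` and the list-of-dicts layout are assumptions made for this example; the actual computation runs inside the database and is not shown here.

```python
from collections import Counter
from statistics import mean


def cluster_center(rows, nominal_columns):
    """Return one representative value per column for the rows of a cluster.

    rows: list of dicts mapping column name -> value (hypothetical layout).
    nominal_columns: set of column names treated as nominal.
    """
    center = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if column in nominal_columns:
            # Nominal attribute: take the modal (most frequent) value.
            center[column] = Counter(values).most_common(1)[0][0]
        else:
            # Numeric attribute: take the arithmetic mean.
            center[column] = mean(values)
    return center


cluster = [
    {"petal_length": 1.0, "color": "white"},
    {"petal_length": 2.0, "color": "blue"},
    {"petal_length": 3.0, "color": "blue"},
]
center = cluster_center(cluster, nominal_columns={"color"})
```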

The KMeans class provides an interface for using the KMEANS and PREDICT_KMEANS IDAX methods of dashDB/DB2.

__init__(n_clusters=3, modelname=None, max_iter=5, distance=u'euclidean', random_state=12345, idbased=False, statistics=None)

Constructor for K-means clustering.

Parameters:

n_clusters : int, optional, default: 3

The number of cluster centers. Range: > 2

modelname : str, optional

The name of the clustering model that is built. If it is not given, it is generated automatically. If the parameter corresponds to an existing model in the database, it is replaced during the fitting step.

max_iter : int, optional, default: 5

The maximum number of iterations. Range: > 1 and <= 1000

distance : str, default: “euclidean”

The distance function. The following values are allowed: “euclidean” and “norm_euclidean”.

random_state : int, default: 12345

The random seed of the generator.

idbased : bool, optional, default: False

Specifies that the random seed of the generator is based on the value of the ID column.

statistics : str, optional

Indicates the statistics that are collected.

The following values are allowed: ‘none’, ‘columns’, ‘values:n’, and ‘all’:
  • If statistics=‘none’ is specified, no statistics are collected.
  • If statistics=‘columns’ is specified, statistics on the columns of the input table are collected, for example, mean values.
  • If statistics=‘values:n’ is specified, and if n is a positive number, statistics on the columns and the column values are collected.
    Up to <n> column value statistics are collected.
    • If a nominal column contains more than <n> values, only the statistics for the <n> most frequent values are kept.
    • If a numeric column contains more than <n> values, the values are discretized, and the statistics are collected on the discretized values.
  • If statistics=‘all’ is specified, the behavior is identical to statistics=‘values:100’.
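
The options above can be made concrete with a small in-memory sketch. The helpers `parse_statistics` and `most_frequent_values` are hypothetical names introduced for this example; they only mirror the documented rules (‘all’ equals ‘values:100’, and only the n most frequent nominal values are kept) and are not how the stored procedure is implemented.

```python
from collections import Counter


def parse_statistics(option):
    """Return the per-column value limit implied by a statistics option.

    Returns None when no per-value statistics are requested.
    """
    if option in ("none", "columns"):
        return None
    if option == "all":
        return 100  # documented as identical to 'values:100'
    prefix, _, count = option.partition(":")
    if prefix == "values" and count.isdigit() and int(count) > 0:
        return int(count)
    raise ValueError(f"unsupported statistics option: {option!r}")


def most_frequent_values(values, limit):
    """Keep counts for the `limit` most frequent values of a nominal column."""
    return Counter(values).most_common(limit)


species = ["setosa", "virginica", "setosa", "versicolor", "setosa", "virginica"]
kept = most_frequent_values(species, parse_statistics("values:2"))
```

With a limit of 2, the least frequent value ("versicolor") is dropped from the collected statistics.
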
Returns:

The KMeans object, ready to be used for fitting and prediction

Notes

The internal parameters of the model can be inspected with get_params and modified with set_params. However, we recommend creating a new KMeans model rather than modifying an existing one.

Examples

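>>> from ibmdbpy import IdaDataBase, IdaDataFrame
>>> from ibmdbpy.learn import KMeans
>>> idadb = IdaDataBase("DASHDB")  # connect to the database
>>> idadf = IdaDataFrame(idadb, "IRIS", indexer = "ID")
>>> kmeans = KMeans(3) # clustering with 3 clusters
>>> kmeans.fit(idadf)
>>> kmeans.predict(idadf)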

Attributes

centers TODO
cluster_centers_ TODO
withinss TODO
size_clusters TODO
inertia_ TODO

Get parameters

KMeans.get_params()

Return the parameters of the K-means clustering.

Set parameters

KMeans.set_params(**params)

Change the parameters of the K-means clustering.

Methods

Fit and predict

fit

KMeans.fit(idadf, column_id=u'ID', incolumn=None, coldeftype=None, coldefrole=None, colPropertiesTable=None, verbose=False)

Use the KMEANS stored procedure to build a K-means clustering model that clusters the input data into k centers.

Parameters:

idadf : IdaDataFrame

The input IdaDataFrame.

column_id : str, default: “ID”

The column of the input IdaDataFrame that identifies a unique instance ID.

incolumn : dict, optional

The columns of the input table that have specific properties, separated by semicolons (;). Each column is followed by one or more of the following properties:

  • By type nominal (‘:nom’) or by type continuous (‘:cont’). By default, numerical types are continuous, and all other types are nominal.
  • By role ‘:id’, ‘:target’, ‘:input’, or ‘:ignore’.

If this parameter is not specified, all columns of the input table have default properties.
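
To make the notation above concrete, the following sketch joins per-column properties into the semicolon-separated form. The helper `build_incolumn` and its dict-of-lists input are assumptions for illustration only; the parameter is documented as a dict, so check which form fit actually accepts before relying on this.

```python
def build_incolumn(properties):
    """Join per-column properties into 'COL:prop[:prop];COL:prop' form.

    properties: dict mapping column name -> list of property names
    (hypothetical input layout chosen for this example).
    """
    allowed = {"nom", "cont", "id", "target", "input", "ignore"}
    parts = []
    for column, props in properties.items():
        unknown = set(props) - allowed
        if unknown:
            raise ValueError(f"unknown properties for {column}: {sorted(unknown)}")
        # Each column name is followed by its properties, separated by colons.
        parts.append(":".join([column, *props]))
    return ";".join(parts)


spec = build_incolumn({"ID": ["id"], "SPECIES": ["nom", "ignore"]})
```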

coldeftype : dict, optional

The default type of the input table columns. The following values are allowed: ‘nom’ and ‘cont’. If the parameter is not specified, numeric columns are continuous and all other columns are nominal.

coldefrole : dict, optional

The default role of the input table columns. The following values are allowed: ‘input’ and ‘ignore’. If the parameter is not specified, all columns are input columns.

colPropertiesTable : IdaDataFrame, optional

The IdaDataFrame in which the properties of the columns of the input IdaDataFrame (idadf) are stored. If this parameter is not specified, the column properties of the input table are detected automatically.

verbose : bool, default: False

Verbosity mode.

predict

KMeans.predict(idadf, column_id=None, outtable=None)

Apply the K-means clustering model to new data.

Parameters:

idadf : IdaDataFrame

IdaDataFrame to be used as input.

column_id : str

The column of the input table that identifies a unique instance ID. Default: the same ID column that was specified when the model was built.

outtable : str

The name of the output table where the assigned clusters are stored. If this parameter is not specified, it is generated automatically. If the parameter corresponds to an existing table in the database, it is replaced.

Returns:

IdaDataFrame

IdaDataFrame containing the closest cluster for each data point referenced by its ID.
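
The idea of assigning each data point to its closest cluster can be sketched in memory as follows. This is only an illustration using plain Euclidean distance on numeric tuples; the function `closest_cluster` and the sample data are assumptions, and the real assignment is performed in the database by the PREDICT_KMEANS procedure.

```python
import math


def closest_cluster(point, centers):
    """Return the index of the center nearest to `point` (Euclidean distance)."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))


centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = {1: (0.5, 1.0), 2: (6.0, 4.0), 3: (9.0, 1.0)}

# Map each instance ID to the index of its closest cluster center.
labels = {row_id: closest_cluster(p, centers) for row_id, p in points.items()}
```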

fit_predict

KMeans.fit_predict(idadf, column_id=u'ID', incolumn=None, coldeftype=None, coldefrole=None, colPropertiesTable=None, outtable=None, verbose=False)

Convenience function for fitting the model and using it to make predictions on the same dataset. See the fit and predict documentation for an explanation of the parameters.
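
Conceptually, this is equivalent to calling fit and then predict on the same input. The sketch below shows how the arguments might be split between the two calls; `fit_then_predict` and `RecordingModel` are hypothetical stand-ins written for this example, not the library's implementation.

```python
def fit_then_predict(model, idadf, column_id="ID", outtable=None, **fit_options):
    """Fit `model` on `idadf`, then predict on the same data."""
    # Fitting options (incolumn, coldeftype, verbose, ...) go to fit only.
    model.fit(idadf, column_id=column_id, **fit_options)
    # outtable is only meaningful for the prediction step.
    return model.predict(idadf, column_id=column_id, outtable=outtable)


class RecordingModel:
    """Stand-in for a KMeans object that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def fit(self, idadf, **options):
        self.calls.append(("fit", idadf, options))

    def predict(self, idadf, **options):
        self.calls.append(("predict", idadf, options))
        return "labels for " + idadf


model = RecordingModel()
result = fit_then_predict(model, "IRIS", verbose=True, outtable="IRIS_CLUSTERS")
```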

Explore result

describe

KMeans.describe()

Return a description of the K-means clustering, if a prediction was made. Otherwise, this function returns the parameters of the model.

get labels

KMeans.labels_()

Return the corresponding labels for each ID.

_retrieve_KMeans_Model

KMeans._retrieve_KMeans_Model(modelname, verbose=False)

Retrieve information about the model to print the results. The KMEANS IDAX function stores its result in four tables:

  • <MODELNAME>_MODEL
  • <MODELNAME>_COLUMNS
  • <MODELNAME>_COLUMN_STATISTICS
  • <MODELNAME>_CLUSTERS
Parameters:

modelname : str

The name of the model that is retrieved.

verbose : bool, default: False

Verbosity mode.
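
For reference, the names of the four result tables listed above follow directly from the model name. The helper below is a hypothetical sketch that builds those names; it does not query the database, and schema qualification or case handling is not covered.

```python
def model_table_names(modelname):
    """Return the names of the four tables written for a K-means model."""
    suffixes = ("MODEL", "COLUMNS", "COLUMN_STATISTICS", "CLUSTERS")
    return [f"{modelname}_{suffix}" for suffix in suffixes]


tables = model_table_names("KMEANS_IRIS")
```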