K-means clustering
In-database K-means. Mirrors the interface of sklearn.cluster.KMeans.
Initiate clustering
Create a K-means object
class ibmdbpy.learn.kmeans.KMeans(n_clusters=3, modelname=None, max_iter=5, distance=u'euclidean', random_state=12345, idbased=False, statistics=None)
The K-means algorithm is the most widely used clustering algorithm that uses an explicit distance measure to partition the data set into clusters.
The K-means algorithm represents each cluster by a single vector computed from the training instances assigned to that cluster: the mean value for each numeric attribute and the modal (most frequent) value for each nominal attribute. This representation is called the cluster center.
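The cluster center described above can be illustrated with a minimal in-memory sketch. This is not how the KMEANS stored procedure is implemented; the function name `cluster_center` and the sample columns are hypothetical and exist only for illustration.

```python
from collections import Counter
from statistics import mean


def cluster_center(rows, numeric_cols, nominal_cols):
    """Return a dict mapping each column to its center value.

    rows is a list of dicts, one per training instance in the cluster.
    (Hypothetical helper, not part of ibmdbpy.)
    """
    center = {}
    for col in numeric_cols:
        # Numeric attribute: arithmetic mean of the values.
        center[col] = mean(row[col] for row in rows)
    for col in nominal_cols:
        # Nominal attribute: modal (most frequent) value.
        center[col] = Counter(row[col] for row in rows).most_common(1)[0][0]
    return center


# Made-up instances that all belong to the same cluster.
cluster = [
    {"length": 1.4, "color": "white"},
    {"length": 1.6, "color": "white"},
    {"length": 1.5, "color": "pink"},
]
center = cluster_center(cluster, ["length"], ["color"])
```

Here the numeric column contributes its mean and the nominal column its most frequent value, so the center mixes both kinds of attribute in one vector.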
The KMeans class provides an interface for using the KMEANS and PREDICT_KMEANS IDAX methods of dashDB/DB2.
__init__(n_clusters=3, modelname=None, max_iter=5, distance=u'euclidean', random_state=12345, idbased=False, statistics=None)
Constructor for K-means clustering.
Parameters: n_clusters : int, optional, default: 3
The number of cluster centers. Range: > 2
modelname : str, optional
The name of the clustering model that is built. If it is not given, it is generated automatically. If the parameter corresponds to an existing model in the database, it is replaced during the fitting step.
max_iter : int, optional, default: 5
The maximum number of iterations. Range: > 1 and <= 1000
distance : str, default: “euclidean”
The distance function. The following values are allowed: “euclidean” and “norm_euclidean”.
random_state : int, default: 12345
The random seed of the generator.
idbased : bool, optional, default: False
Specifies whether the random seed of the generator is based on the value of the ID column.
statistics : str, optional
Indicates the statistics that are collected. The following values are allowed: ‘none’, ‘columns’, ‘values:n’, and ‘all’:
- If statistics=‘none’ is specified, no statistics are collected.
- If statistics=‘columns’ is specified, statistics on the columns of the input table are collected, for example, mean values.
- If statistics=‘values:n’ is specified, and if n is a positive number, statistics on the columns and the column values are collected. Up to n column value statistics are collected.
  - If a nominal column contains more than n values, only the n most frequent column statistics are kept.
  - If a numeric column contains more than n values, the values are discretized, and the statistics are collected on the discretized values.
- statistics=‘all’ is identical to statistics=‘values:100’.
Returns: The KMeans object, ready to be used for fitting and prediction
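The allowed forms of the statistics parameter can be checked before a model is built. The following is a minimal sketch, assuming that n in ‘values:n’ is a positive integer; `is_valid_statistics` is a hypothetical helper and is not part of ibmdbpy.

```python
import re


def is_valid_statistics(value):
    """Return True if value matches one of the documented forms.

    Hypothetical helper; assumes n in 'values:n' is a positive integer.
    """
    if value in ("none", "columns", "all"):
        return True
    # 'values:n' with n written as decimal digits, for example 'values:100'.
    match = re.fullmatch(r"values:(\d+)", value)
    return bool(match) and int(match.group(1)) > 0


checks = {v: is_valid_statistics(v) for v in ("none", "values:100", "values:0", "values")}
```

A check of this kind only validates the string format; whether the database accepts a given value is still decided by the stored procedure.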
Notes
The internal parameters of the model can be printed and modified by using get_params and set_params. However, we recommend creating a new KMeans model rather than modifying an existing one.
Examples
>>> from ibmdbpy import IdaDataBase, IdaDataFrame
>>> from ibmdbpy.learn import KMeans
>>> idadb = IdaDataBase("DASHDB")
>>> idadf = IdaDataFrame(idadb, "IRIS", indexer="ID")
>>> kmeans = KMeans(3) # clustering with 3 clusters
>>> kmeans.fit(idadf)
>>> kmeans.predict(idadf)
Attributes
centers: TODO
cluster_centers_: TODO
withinss: TODO
size_clusters: TODO
inertia_: TODO
Fit and predict
fit
KMeans.fit(idadf, column_id=u'ID', incolumn=None, coldeftype=None, coldefrole=None, colPropertiesTable=None, verbose=False)
Use the KMEANS stored procedure to build a K-means clustering model that clusters the input data into k centers.
Parameters: idadf : IdaDataFrame
The input IdaDataFrame.
column_id : str, default: “ID”
The column of the input IdaDataFrame that identifies a unique instance ID.
incolumn : dict, optional
The columns of the input table that have specific properties. Entries are separated by a semicolon (;). Each column is followed by one or more of the following properties:
- Type: nominal (‘:nom’) or continuous (‘:cont’). By default, numeric types are continuous and all other types are nominal.
- Role: ‘:id’, ‘:target’, ‘:input’, or ‘:ignore’.
If this parameter is not specified, all columns of the input table have default properties.
coldeftype : dict, optional
The default type of the input table columns. The following values are allowed: ‘nom’ and ‘cont’. If the parameter is not specified, numeric columns are continuous and all other columns are nominal.
coldefrole : dict, optional
The default role of the input table columns. The following values are allowed: ‘input’ and ‘ignore’. If the parameter is not specified, all columns are input columns.
colPropertiesTable : IdaDataFrame, optional
The IdaDataFrame in which the properties of the columns of the input IdaDataFrame (idadf) are stored. If this parameter is not specified, the column properties of the input table are detected automatically.
verbose : bool, default: False
Verbosity mode.
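The column property syntax described for incolumn (column name, then ‘:’-prefixed properties, with entries separated by semicolons) can be sketched as follows. The dictionary shape used here, mapping a column name to a list of properties, is an assumption made for illustration; the source does not specify how ibmdbpy expects the dict to be structured, and `format_incolumn` is a hypothetical helper.

```python
ALLOWED_TYPES = {"nom", "cont"}
ALLOWED_ROLES = {"id", "target", "input", "ignore"}


def format_incolumn(properties):
    """Join {column: [property, ...]} into 'COL:prop[:prop];COL:prop'.

    Hypothetical helper; the input shape is an assumption.
    """
    parts = []
    for column, props in properties.items():
        for prop in props:
            # Reject anything that is not a documented type or role.
            if prop not in ALLOWED_TYPES | ALLOWED_ROLES:
                raise ValueError(f"unknown property {prop!r} for column {column!r}")
        parts.append(":".join([column, *props]))
    return ";".join(parts)


# Made-up column names, loosely based on the IRIS example table.
spec = format_incolumn({"SPECIES": ["nom", "ignore"], "ID": ["id"]})
```

With these inputs the helper yields one entry per column, each entry listing the column name followed by its properties.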
predict
KMeans.predict(idadf, column_id=None, outtable=None)
Apply the K-means clustering model to new data.
Parameters: idadf : IdaDataFrame
IdaDataFrame to be used as input.
column_id : str
The column of the input table that identifies a unique instance ID. Default: the same ID column that was specified to the stored procedure when the model was built.
outtable : str
The name of the output table where the assigned clusters are stored. If this parameter is not specified, it is generated automatically. If the parameter corresponds to an existing table in the database, it is replaced.
Returns: IdaDataFrame
IdaDataFrame containing the closest cluster for each data point referenced by its ID.
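What "the closest cluster for each data point referenced by its ID" means can be shown with a small in-memory sketch that uses plain Euclidean distance. In ibmdbpy the assignment is performed in the database by PREDICT_KMEANS; the function `assign_clusters` and the sample values below are hypothetical.

```python
import math


def assign_clusters(points, centers):
    """Map each instance ID to the index of its closest center.

    points maps an instance ID to a tuple of numeric features.
    (Hypothetical helper, not part of ibmdbpy.)
    """
    assignments = {}
    for point_id, features in points.items():
        # Euclidean distance from this instance to every center.
        distances = [math.dist(features, center) for center in centers]
        assignments[point_id] = distances.index(min(distances))
    return assignments


centers = [(0.0, 0.0), (5.0, 5.0)]
points = {1: (0.2, -0.1), 2: (4.8, 5.3), 3: (1.0, 1.5)}
assignments = assign_clusters(points, centers)
```

The result has the same shape as the documented output: one row per ID, paired with the cluster it falls into.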
fit_predict
KMeans.fit_predict(idadf, column_id=u'ID', incolumn=None, coldeftype=None, coldefrole=None, colPropertiesTable=None, outtable=None, verbose=False)
Convenience function for fitting the model and using it to make predictions on the same dataset. See the fit and predict documentation for an explanation of the parameters.
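The relationship between fit, predict, and fit_predict can be illustrated with a toy in-memory K-means. This is only a sketch of the general technique (Lloyd-style iterations with seeded initialization), not the in-database algorithm; the class `ToyKMeans` is hypothetical, and it reuses the parameter names n_clusters, max_iter, and random_state purely for readability.

```python
import math
import random


class ToyKMeans:
    """In-memory illustration only; not the ibmdbpy implementation."""

    def __init__(self, n_clusters=3, max_iter=5, random_state=12345):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = random_state
        self.centers = None

    def fit(self, points):
        rng = random.Random(self.random_state)
        # Start from n_clusters distinct points chosen by the seeded generator.
        self.centers = rng.sample(points, self.n_clusters)
        for _ in range(self.max_iter):
            labels = self.predict(points)
            for k in range(self.n_clusters):
                members = [p for p, label in zip(points, labels) if label == k]
                if members:  # keep the old center if a cluster is empty
                    self.centers[k] = tuple(
                        sum(dim) / len(members) for dim in zip(*members)
                    )
        return self

    def predict(self, points):
        # Index of the nearest center for every point.
        return [
            min(range(self.n_clusters), key=lambda k: math.dist(p, self.centers[k]))
            for p in points
        ]

    def fit_predict(self, points):
        # Same as calling fit and then predict on the same data.
        return self.fit(points).predict(points)


data = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
labels = ToyKMeans(n_clusters=2).fit_predict(data)
separate = ToyKMeans(n_clusters=2).fit(data).predict(data)
```

Because the seed is fixed, the one-step call and the two-step call produce identical labels on the same data.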
Explore result
describe
_retrieve_KMeans_Model
KMeans._retrieve_KMeans_Model(modelname, verbose=False)
Retrieve information about the model to print the results. The KMEANS IDAX function stores its result in four tables:
- <MODELNAME>_MODEL
- <MODELNAME>_COLUMNS
- <MODELNAME>_COLUMN_STATISTICS
- <MODELNAME>_CLUSTERS
Parameters: modelname : str
The name of the model that is retrieved.
verbose : bool, default: False
Verbosity mode.
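The names of the four result tables listed above follow directly from the model name. A minimal sketch of that naming scheme, using a hypothetical helper `kmeans_result_tables` and a made-up model name (whether the database changes the case of the name or adds a schema qualifier is not covered here):

```python
def kmeans_result_tables(modelname):
    """Return the four table names derived from a model name.

    Hypothetical helper; it only applies the documented suffixes.
    """
    suffixes = ("MODEL", "COLUMNS", "COLUMN_STATISTICS", "CLUSTERS")
    return [f"{modelname}_{suffix}" for suffix in suffixes]


# "KMEANS_IRIS" is a made-up model name used for illustration.
tables = kmeans_result_tables("KMEANS_IRIS")
```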