API Documentation

Milk

Machine learning in Python

Toplevel functions

  • nfoldcrossvalidation: n-fold cross-validation
  • defaultclassifier: get a general-purpose classifier
  • kmeans: k-means clustering

Modules

  • supervised
  • unsupervised
  • measures

Example

import numpy as np
import milk

features = np.random.randn(100, 20)
features[:50] *= 2          # first 50 examples come from a wider distribution
labels = np.repeat((0, 1), 50)

classifier = milk.defaultclassifier()
model = classifier.train(features, labels)
# New examples must have the same number of features (20):
new_label = model.apply(np.random.randn(20))
new_label2 = model.apply(np.random.randn(20) * 2)

milk.kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None)

centroids = kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None, return_assignments=False)
assignments = kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None, return_centroids=False)

k-Means Clustering

Parameters:

fmatrix : ndarray

2-ndarray (Nelements x Nfeatures)

distance: string, optional

one of:
  • ‘euclidean’ : euclidean distance (default)
  • ‘seuclidean’ : standardised euclidean distance. This is equivalent to first normalising the features.
  • ‘mahalanobis’ : mahalanobis distance.

This can make use of the following keyword arguments:
  • ‘icov’ (the inverse of the covariance matrix),
  • ‘covmat’ (the covariance matrix)

If neither is passed, then the function computes the covariance from the feature matrix

max_iter : integer, optional

Maximum number of iterations (default: 1000)

R : source of randomness, optional

return_centroids : boolean, optional

Whether to return centroids (default: True)

return_assignments: boolean, optional

Whether to return centroid assignments (default: True)

centroids: ndarray (optional)

Initial centroids to use for clustering. If not supplied, centroids will be randomly initialized. 2-ndarray (k x Nfeatures)

Returns:

assignments : ndarray

A 1-D array of size len(fmatrix)

centroids : ndarray

An array of k centroids
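
For instance, a minimal usage sketch on synthetic data (the call forms follow the signatures above):

import numpy as np
import milk

# 100 points in 5 dimensions, drawn around two well-separated centres
features = np.vstack([np.random.randn(50, 5),
                      np.random.randn(50, 5) + 4.])

# By default, both assignments and centroids are returned
assignments, centroids = milk.kmeans(features, 2)

# Only the centroids
centroids = milk.kmeans(features, 2, return_assignments=False)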

milk.pdist(X, Y={X}, distance='euclidean2')

Compute distance matrix:

D[i,j] == np.sum( (X[i] - Y[j])**2 )

Parameters:

X : feature matrix

Y : feature matrix (default: use X)

distance : string, optional

one of ‘euclidean’ or ‘euclidean2’ (default)

Returns:

D : matrix of doubles
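
A quick sketch checking the formula above on random data:

import numpy as np
import milk

X = np.random.randn(10, 4)
D = milk.pdist(X)  # squared euclidean distances ('euclidean2', the default)

# D[i,j] matches the definition above
assert np.allclose(D[3, 7], np.sum((X[3] - X[7])**2))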

milk.zscore(features, axis=0, can_have_nans=True, inplace=False)

Returns a copy of features which has been normalised to zscores

Parameters:

features : ndarray

2-D input array

axis : integer, optional

which axis to normalise (default: 0)

can_have_nans : boolean, optional

whether features is allowed to have NaNs (default: True)

inplace : boolean, optional

Whether to operate in place (i.e., potentially change the input array). Default is False

Returns:

features : ndarray

zscored version of features
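
For example, a small sketch (assuming the default axis=0 standardises each column):

import numpy as np
import milk

features = np.random.randn(100, 20) * 5. + 3.
normalised = milk.zscore(features)

# Each column now has mean ~0 and standard deviation ~1
assert np.allclose(normalised.mean(0), 0)
assert np.allclose(normalised.std(0), 1)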

milk.defaultclassifier(mode='medium')

Return the default classifier learner

This is an SVM-based classifier using the 1-vs-1 technique for multi-class problems (by default; see the multi_strategy parameter). The features will first be cleaned up (normalised to [-1, +1]) and then go through SDA feature selection.

Parameters:

mode : string, optional

One of (‘fast’, ‘medium’, ‘slow’, ‘really-slow’). This defines the speed/accuracy trade-off: it essentially determines how large a range of SVM parameters is searched.

multi_strategy : str, optional

One of (‘1-vs-1’, ‘1-vs-rest’, ‘ecoc’). This defines the strategy used to convert the base binary classifier to a multi-class classifier.

expanded : boolean, optional

If true, then instead of a single learner, it returns a list of possible learners.

Returns:

learner : classifier learner object or list

If expanded, then it returns a list

See also

feature_selection_simple
Just perform the feature selection
svm_simple
Perform classification
milk.defaultlearner(mode='medium')

Return the default classifier learner

This is an SVM-based classifier using the 1-vs-1 technique for multi-class problems (by default; see the multi_strategy parameter). The features will first be cleaned up (normalised to [-1, +1]) and then go through SDA feature selection.

Parameters:

mode : string, optional

One of (‘fast’, ‘medium’, ‘slow’, ‘really-slow’). This defines the speed/accuracy trade-off: it essentially determines how large a range of SVM parameters is searched.

multi_strategy : str, optional

One of (‘1-vs-1’, ‘1-vs-rest’, ‘ecoc’). This defines the strategy used to convert the base binary classifier to a multi-class classifier.

expanded : boolean, optional

If true, then instead of a single learner, it returns a list of possible learners.

Returns:

learner : classifier learner object or list

If expanded, then it returns a list

See also

feature_selection_simple
Just perform the feature selection
svm_simple
Perform classification
milk.nfoldcrossvalidation(features, labels, nfolds=None, learner=None, origins=None, return_predictions=False, folds=None, initial_measure=0, classifier=None)

Perform n-fold cross validation

cmatrix, names = nfoldcrossvalidation(features, labels, nfolds=10, learner={defaultclassifier()}, origins=None, return_predictions=False)
cmatrix, names, predictions = nfoldcrossvalidation(features, labels, nfolds=10, learner={defaultclassifier()}, origins=None, return_predictions=True)

cmatrix will be a N x N matrix, where N is the number of classes

cmatrix[i,j] will be the number of times that an element of class i was classified as class j

names[i] will correspond to the label name of class i

Parameters:

features : a sequence

labels : an array of labels, where label[i] is the label corresponding to features[i]

nfolds : integer, optional

Nr of folds. Default: 10

learner : learner object, optional

learner should implement the train() method, which returns a model (something with an apply() method). Default: defaultclassifier(). This parameter used to be called classifier and that name is still supported.

origins : sequence, optional

Origin ID (see foldgenerator)

return_predictions : bool, optional

whether to return predictions (default: False)

folds : sequence of int, optional

which folds to generate

initial_measure : any, optional

what initial value to use for the results reduction (default: 0)

Returns:

cmatrix : ndarray

confusion matrix

names : sequence

sequence of labels so that cmatrix[i,j] corresponds to names[i], names[j]

predictions : sequence

predicted output for each element
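
A typical usage sketch on synthetic data:

import numpy as np
import milk

features = np.random.randn(100, 20)
features[:50] *= 2
labels = np.repeat((0, 1), 50)

# 10 folds with the default learner
cmatrix, names = milk.nfoldcrossvalidation(features, labels)

# Overall accuracy from the confusion matrix
accuracy = cmatrix.trace() / float(cmatrix.sum())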

milk.supervised

This holds the supervised classification modules:

Submodules

  • defaultclassifier: contains a default “good enough” classifier

  • svm: related to SVMs

  • grouped: contains objects to transform single-object classifiers into group classifiers by voting

  • multi: transforms binary classifiers into multi-class classifiers (1-vs-1 or 1-vs-rest)

  • featureselection: feature selection

  • knn: k-nearest neighbours

  • tree: decision tree classifiers

Classifiers

All classifiers have a train function which takes 2 arguments:
  • features : sequence of features
  • labels : sequence of labels

They return a model object, which has an apply function which takes a single input and returns its label.

Note that there are always two objects: the learner and the model, and they are independent. Every time you call learner.train() you get a new model.

Both classifiers and models are pickle()able.

Example

import numpy as np
import milk

features = np.random.randn(100, 20)
features[:50] *= 2          # first 50 examples come from a wider distribution
labels = np.repeat((0, 1), 50)

classifier = milk.defaultclassifier()
model = classifier.train(features, labels)
# New examples must have the same number of features (20):
new_label = model.apply(np.random.randn(20))
new_label2 = model.apply(np.random.randn(20) * 2)

milk.supervised.normaliselabels(labels, multi_label=False)

If not multi_label (the default), normalises the labels to be integers from 0 through N-1. Otherwise, assume that each label is actually a sequence of labels.

normalised is a np.array, while names is a list mapping the indices to the old names.

Parameters:

labels : any iterable of labels

multi_label : bool, optional

Whether labels are actually composed of multiple labels

Returns:

normalised : a numpy ndarray

If not multi_label, this is an array of integers 0 .. N-1; otherwise, it is a boolean array of size len(labels) x N

names : list of label names
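
For example:

import milk.supervised

labels = ['spam', 'ham', 'spam', 'eggs']
normalised, names = milk.supervised.normaliselabels(labels)

# names maps the integer codes back to the original labels
assert [names[v] for v in normalised] == labels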

milk.supervised.defaultclassifier(mode='medium')

Return the default classifier learner

This is an SVM-based classifier using the 1-vs-1 technique for multi-class problems (by default; see the multi_strategy parameter). The features will first be cleaned up (normalised to [-1, +1]) and then go through SDA feature selection.

Parameters:

mode : string, optional

One of (‘fast’, ‘medium’, ‘slow’, ‘really-slow’). This defines the speed/accuracy trade-off: it essentially determines how large a range of SVM parameters is searched.

multi_strategy : str, optional

One of (‘1-vs-1’, ‘1-vs-rest’, ‘ecoc’). This defines the strategy used to convert the base binary classifier to a multi-class classifier.

expanded : boolean, optional

If true, then instead of a single learner, it returns a list of possible learners.

Returns:

learner : classifier learner object or list

If expanded, then it returns a list

See also

feature_selection_simple
Just perform the feature selection
svm_simple
Perform classification
milk.supervised.svm_simple(C, kernel)

Returns a one-against-one SVM-based classifier with the given C and kernel

Parameters:

C : double

C parameter

kernel : kernel

Kernel to use

Returns:

learner : supervised learner

See also

feature_selection_simple
Perform feature selection
defaultlearner
feature selection and gridsearch for SVM parameters
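
A minimal sketch; using np.dot as a plain linear kernel is an illustrative assumption (milk also ships kernel objects in milk.supervised.svm, depending on the version):

import numpy as np
import milk.supervised

features = np.random.randn(100, 20)
features[:50] *= 2
labels = np.repeat((0, 1), 50)

# np.dot acts as a linear kernel: k(x, y) = <x, y>
learner = milk.supervised.svm_simple(C=1., kernel=np.dot)
model = learner.train(features, labels)
prediction = model.apply(np.random.randn(20))
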
class milk.supervised.gridsearch(base, measure=accuracy, nfolds=10, params={ param1 : [...], param2 : [...]}, annotate=False)

Perform a grid search for the best parameter values.

When G.train() is called, then for each combination of p1 in param1, p2 in param2, ... it performs (effectively):

base.param1 = p1
base.param2 = p2
...
value[p1, p2,...] = measure(nfoldcrossvalidation(..., learner=base))

It then picks the best-performing set of parameters and re-learns a model on the whole data.

Parameters:

base : classifier to use

measure : function, optional

a function that takes labels and outputs and returns the loss. Default: 0/1 loss. This must be an additive function.

nfolds : integer, optional

Nr of folds

params : dictionary

Maps each parameter name to the sequence of values to try (as in the signature above)

annotate : boolean

Whether to annotate the returned model with arguments and value fields with the result of cross-validation. Defaults to False.

All of the above can be passed as parameters to the constructor or set as attributes.

See also

gridminimise
Implements the basic functionality behind this object
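
A short sketch, using tree_learner (documented below) and its min_split attribute as an illustrative base learner:

import numpy as np
import milk.supervised

features = np.random.randn(100, 20)
features[:50] *= 2
labels = np.repeat((0, 1), 50)

# Search over min_split; the best value is kept and a final model
# is trained on the whole data
G = milk.supervised.gridsearch(
        milk.supervised.tree_learner(),
        params={'min_split': [2, 4, 8, 16]})
model = G.train(features, labels)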

milk.supervised.lasso(X, Y, B={np.zeros()}, lam=1., max_iter={1024}, tol={1e-5})

Solve LASSO Optimisation

B* = argmin_B (1/(2n)) ||Y - BX||₂² + λ||B||₁

where n is the number of samples.

Milk uses coordinate descent, looping through the coordinates in order (with an active set strategy to update only non-zero βs, if possible). The problem is convex and the solution is guaranteed to be optimal (within floating point accuracy).

Parameters:

X : ndarray

Design matrix

Y : ndarray

Matrix of outputs

B : ndarray, optional

Starting values for approximation. This can be used for a warm start if you have an estimate of where the solution should be. If used, the solution might be written in-place (if the array has the right format).

lam : float, optional

λ (default: 1.0)

max_iter : int, optional

Maximum nr of iterations (default: 1024)

tol : float, optional

Tolerance. Whenever a parameter is to be updated by a value smaller than tolerance, that is considered a null update. Be careful that if the value is too small, performance will degrade horribly. (default: 1e-5)

Returns:

B : ndarray
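
A sketch under an assumed shape convention read off the formula above (X of shape (p, n), Y of shape (q, n), B of shape (q, p); verify against your version of milk):

import numpy as np
from milk.supervised import lasso

p, n, q = 10, 50, 1
X = np.random.randn(p, n)                 # design matrix
B_true = np.zeros((q, p))
B_true[0, :3] = [1., -2., .5]             # sparse ground truth
Y = np.dot(B_true, X) + .01 * np.random.randn(q, n)

B = lasso(X, Y, lam=.1)                   # most entries of B should be zero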

milk.supervised.lasso_walk(X, Y, B={np.zeros()}, nr_steps={64}, start={automatically inferred}, step={.9}, tol=None, return_lams=False)

Bs, lams = lasso_walk(X, Y, B={np.zeros()}, nr_steps={64}, start={automatically inferred}, step={.9}, tol=None, return_lams=True)

Repeatedly solve LASSO Optimisation

B* = argmin_B (1/(2n)) ||Y - BX||₂² + λ||B||₁

for different values of λ.

Parameters:

X : ndarray

Design matrix

Y : ndarray

Matrix of outputs

B : ndarray, optional

Starting values for approximation. This can be used for a warm start if you have an estimate of where the solution should be.

start : float, optional

first λ to use (default is np.abs(Y).max())

nr_steps : int, optional

How many steps in the path (default is 64)

step : float, optional

Multiplicative step to take (default is 0.9)

tol : float, optional

This is the tolerance parameter. It is passed to the lasso function unmodified.

return_lams : bool, optional

Whether to return the values of λ used (default: False)

Returns:

Bs : ndarray
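
A sketch (same assumed shape convention as the lasso example above):

import numpy as np
from milk.supervised import lasso_walk

X = np.random.randn(10, 50)
Y = np.random.randn(1, 50)

# One coefficient matrix per value of lambda; lams holds the lambdas used
Bs, lams = lasso_walk(X, Y, nr_steps=16, return_lams=True)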

class milk.supervised.tree_learner()

tree = tree_learner()
model = tree.train(features, labels)
model2 = tree.train(features, labels, weights=weights)
predicted = model.apply(testfeatures)

A decision tree classifier (currently, implements the greedy ID3 algorithm without any pruning).

Attributes

criterion (function, optional) criterion to use for tree construction; this should be a function that receives a set of labels (default: information_gain).
min_split (integer, optional) minimum size to split on (default: 4).
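
A minimal sketch, setting min_split as an attribute as documented above:

import numpy as np
from milk.supervised import tree_learner

features = np.random.randn(100, 20)
features[:50] *= 2
labels = np.repeat((0, 1), 50)

tree = tree_learner()
tree.min_split = 8
model = tree.train(features, labels)
predicted = model.apply(np.random.randn(20))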

milk.unsupervised

Unsupervised Learning

  • kmeans: This is a highly optimised implementation of kmeans
  • PCA: Simple implementation
  • Non-negative matrix factorisation: both direct and with sparsity constraints

milk.unsupervised.center(features, axis=0, can_have_nans=True, inplace=False)

Center data

Parameters:

features : ndarray

2-D input array

axis : integer, optional

which axis to normalise (default: 0)

can_have_nans : boolean, optional

whether features is allowed to have NaNs (default: True)

inplace : boolean, optional

Whether to operate in place (i.e., potentially change the input array). Default is False

Returns:

features : ndarray

centered version of features

mean : ndarray

mean values

milk.unsupervised.kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None)

centroids = kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None, return_assignments=False)
assignments = kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None, return_centroids=False)

k-Means Clustering

Parameters:

fmatrix : ndarray

2-ndarray (Nelements x Nfeatures)

distance: string, optional

one of:
  • ‘euclidean’ : euclidean distance (default)
  • ‘seuclidean’ : standardised euclidean distance. This is equivalent to first normalising the features.
  • ‘mahalanobis’ : mahalanobis distance.

This can make use of the following keyword arguments:
  • ‘icov’ (the inverse of the covariance matrix),
  • ‘covmat’ (the covariance matrix)

If neither is passed, then the function computes the covariance from the feature matrix

max_iter : integer, optional

Maximum number of iterations (default: 1000)

R : source of randomness, optional

return_centroids : boolean, optional

Whether to return centroids (default: True)

return_assignments: boolean, optional

Whether to return centroid assignments (default: True)

centroids: ndarray (optional)

Initial centroids to use for clustering. If not supplied, centroids will be randomly initialized. 2-ndarray (k x Nfeatures)

Returns:

assignments : ndarray

A 1-D array of size len(fmatrix)

centroids : ndarray

An array of k centroids

milk.unsupervised.mds(features, ndims, zscore=False)

Euclidean Multi-dimensional Scaling

Parameters:

features : ndarray

data matrix

ndims : int

Number of dimensions to return

zscore : boolean, optional

Whether to zscore the features (default: False)

Returns:

X : ndarray

array of size (m, ndims) where m = len(features)

See also

mds_dists
Multi-dimensional scaling starting from a distance matrix
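
For instance, a small sketch projecting random data to two dimensions:

import numpy as np
import milk.unsupervised

features = np.random.randn(100, 20)

# Embed in 2 dimensions (e.g., for plotting)
X2 = milk.unsupervised.mds(features, 2)

# mds_dists (below) starts from a precomputed distance matrix instead
D = milk.unsupervised.pdist(features, distance='euclidean')
X2_d = milk.unsupervised.mds_dists(D, 2)
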
milk.unsupervised.mds_dists(distances, ndims)

Euclidean Multi-dimensional Scaling based on a distance matrix

Parameters:

distances : ndarray

data matrix

ndims : int

Number of dimensions to return

Returns:

X : ndarray

array of size (m, ndims) where m = len(distances)

See also

mds
Multi-dimensional scaling on a feature matrix
milk.unsupervised.pca(X, zscore=True)

Principal Component Analysis

Performs principal component analysis. Returns transformed matrix and principal components

Parameters:

X : 2-dimensional ndarray

data matrix

zscore : boolean, optional

whether to normalise to zscores (default: True)

Returns:

Y : ndarray

Transformed matrix (of same dimension as X)

V : ndarray

principal components
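
For example:

import numpy as np
import milk.unsupervised

X = np.random.randn(100, 20)
Y, V = milk.unsupervised.pca(X)

# Y has the same shape as X; keep the first 3 components
# for a reduced representation
X_reduced = Y[:, :3]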

milk.unsupervised.pdist(X, Y={X}, distance='euclidean2')

Compute distance matrix:

D[i,j] == np.sum( (X[i] - Y[j])**2 )

Parameters:

X : feature matrix

Y : feature matrix (default: use X) distance : one of ‘euclidean’ or ‘euclidean2’ (default)

Returns:

D : matrix of doubles

milk.unsupervised.plike(X, sigma2={guess based on X})

Compute likelihood that any two objects come from the same distribution under a Gaussian distribution hypothesis:

L[i,j] = exp( -||X[i] - X[j]||^2 / sigma2 )

Parameters:

X : ndarray

feature matrix

sigma2 : float, optional

bandwidth

Returns:

L : ndarray

likelihood matrix

See also

pdist
Compute distances between objects
milk.unsupervised.repeated_kmeans(fmatrix, k, repeats, distance='euclidean', max_iter=1000, **kwargs)

Runs kmeans repeats times and returns the best result as evaluated according to distance

Parameters:

fmatrix : feature matrix

k : nr of centroids

repeats : Nr of repetitions

distance : ‘euclidean’ (default) or ‘seuclidean’

max_iter : Max nr of iterations per kmeans run

R : random source

Returns:

assignments : 1-D array of assignments

centroids : centroids

These are the same returns as the kmeans function

See also

kmeans
runs kmeans once
milk.unsupervised.select_best_kmeans(fmatrix, ks, repeats=1, method='AIC', R=None, **kwargs)

Runs kmeans repeats times for each value of k in ks and returns the best result as evaluated by method

Parameters:

fmatrix : feature matrix

ks : sequence of integers

nr of centroids to try

repeats : integer, optional

Nr of repetitions for each value of k

R : random source, optional

Returns:

assignments : 1-D array of assignments

centroids : 2-D ndarray

centroids

These are the same returns as the kmeans function

See also

kmeans
runs kmeans once
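
A small sketch choosing k by AIC:

import numpy as np
import milk.unsupervised

features = np.vstack([np.random.randn(50, 5),
                      np.random.randn(50, 5) + 4.])

# Try k = 2..7, three runs each, and keep the AIC-best clustering
assignments, centroids = milk.unsupervised.select_best_kmeans(
        features, ks=range(2, 8), repeats=3)
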
milk.unsupervised.som(data, shape, iterations=1000, L=0.2, radius=4, R=None)

grid = som(data, shape, iterations=1000, L=.2, radius=4, R=None)

Self-organising maps

Parameters:

data : ndarray

input data

shape : tuple

Desired shape of output. Must be 2-dimensional.

L : float, optional

How much to influence neighbouring points (default: .2)

radius : integer, optional

Maximum radius of influence (in L_1 distance, default: 4)

iterations : integer, optional

Number of iterations

R : source of randomness

Returns:

grid : ndarray

Map
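
A brief sketch (the exact shape of the returned grid is an assumption here: one codebook vector per grid cell):

import numpy as np
import milk.unsupervised

data = np.random.randn(1000, 3)

# Fit an 8x8 self-organising map to 3-dimensional data
grid = milk.unsupervised.som(data, (8, 8), iterations=1000)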

milk.unsupervised.zscore(features, axis=0, can_have_nans=True, inplace=False)

Returns a copy of features which has been normalised to zscores

Parameters:

features : ndarray

2-D input array

axis : integer, optional

which axis to normalise (default: 0)

can_have_nans : boolean, optional

whether features is allowed to have NaNs (default: True)

inplace : boolean, optional

Whether to operate in place (i.e., potentially change the input array). Default is False

Returns:

features : ndarray

zscored version of features

milk.unsupervised.lee_seung(X, r, cost='norm2', max_iter=10000, tol=1e-8, R=None)

Implement Lee & Seung’s algorithm for non-negative matrix factorisation

Parameters:

X : 2-ndarray

input matrix

r : integer

nr of latent features

cost : one of:

  • ‘norm2’ : minimise || X - AS ||₂ (default)
  • ‘i-div’ : minimise D(X||AS), where D is I-divergence (a generalisation of K-L divergence)

max_iter : integer, optional

maximum number of iterations (default: 10000)

tol : double

tolerance threshold for early exit (when the update factor is within tol of 1., the function exits)

R : integer, optional

random seed

Returns:

A : 2-ndarray

S : 2-ndarray

milk.unsupervised.sparse_nnmf(V, r, sparsenessW=None, sparsenessH=None, max_iter=10000, R=None)

Implement sparse nonnegative matrix factorisation.

Parameters:

V : 2-D matrix

input feature matrix

r : integer

number of latent features

sparsenessW : double, optional

sparseness constraint on W (default: no sparsity constraint)

sparsenessH : double, optional

sparseness constraint on H (default: no sparsity constraint)

max_iter : integer, optional

maximum nr of iterations (default: 10000)

R : integer, optional

source of randomness

Returns:

W : 2-ndarray

H : 2-ndarray
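
For example, a minimal sketch (the sparseness value .5 is purely illustrative):

import numpy as np
import milk.unsupervised

# Nonnegative input matrix
V = np.abs(np.random.randn(50, 40))

# Factor V ~ W H with 8 latent features and a sparseness constraint on H
W, H = milk.unsupervised.sparse_nnmf(V, 8, sparsenessH=.5, max_iter=2000)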