API Documentation¶
Milk: Machine learning in Python
Toplevel functions¶
- nfoldcrossvalidation: n-fold cross-validation
- defaultclassifier: get a general-purpose classifier
- kmeans: k-means clustering
Modules¶
- supervised
- unsupervised
- measures
Example¶
import numpy as np
import milk

features = np.random.randn(100, 20)
features[:50] *= 2
labels = np.repeat((0, 1), 50)
classifier = milk.defaultclassifier()
model = classifier.train(features, labels)
new_label = model.apply(np.random.randn(20))
new_label2 = model.apply(np.random.randn(20)*2)
milk.kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None)¶
centroids = kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None, return_assignments=False)
assignments = kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None, return_centroids=False)
k-Means Clustering
Parameters: fmatrix : ndarray
2-ndarray (Nelements x Nfeatures)
distance : string, optional
one of:
- ‘euclidean’ : euclidean distance (default)
- ‘seuclidean’ : standardised euclidean distance. This is equivalent to first normalising the features.
- ‘mahalanobis’ : mahalanobis distance. This can make use of the following keyword arguments: ‘icov’ (the inverse of the covariance matrix) and ‘covmat’ (the covariance matrix). If neither is passed, the function computes the covariance from the feature matrix.
max_iter : integer, optional
Maximum number of iterations (default: 1000)
R : source of randomness, optional
return_centroids : boolean, optional
Whether to return centroids (default: True)
return_assignments: boolean, optional
Whether to return centroid assignments (default: True)
centroids: ndarray (optional)
Initial centroids to use for clustering. If not supplied, centroids will be randomly initialized. 2-ndarray (k x Nfeatures)
Returns: assignments : ndarray
A 1-D array of size len(fmatrix)
centroids : ndarray
An array of k centroids
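Example¶
A minimal usage sketch based on the signature above; the two-cluster data is illustrative and, per the Returns section, assignments come first and centroids second:

import numpy as np
import milk

features = np.vstack([np.random.randn(50, 4), np.random.randn(50, 4) + 5])
assignments, centroids = milk.kmeans(features, 2)
# assignments has one entry per row of features; centroids is a (2 x 4) array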
milk.pdist(X, Y={X}, distance='euclidean2')¶
Compute distance matrix:
D[i,j] == np.sum( (X[i] - Y[j])**2 )
Parameters: X : feature matrix
Y : feature matrix (default: use X)
distance : one of ‘euclidean’ or ‘euclidean2’ (default)
Returns: D : matrix of doubles
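Example¶
A short sketch; note that the default ‘euclidean2’ distance returns squared distances, matching the formula above:

import numpy as np
import milk

X = np.random.randn(10, 3)
D = milk.pdist(X)
# D is a 10 x 10 matrix; D[i,j] is the squared euclidean distance between X[i] and X[j]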
milk.zscore(features, axis=0, can_have_nans=True, inplace=False)¶
Returns a copy of features which has been normalised to zscores
Parameters: features : ndarray
2-D input array
axis : integer, optional
which axis to normalise (default: 0)
can_have_nans : boolean, optional
whether features is allowed to have NaNs (default: True)
inplace : boolean, optional
Whether to operate inline (i.e., potentially change the input array). Default is False
Returns: features : ndarray
zscored version of features
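Example¶
A minimal sketch of the default behaviour (normalising along axis 0, i.e. each column):

import numpy as np
import milk

features = np.random.randn(100, 20) * 5. + 3.
zscored = milk.zscore(features)
# each column of zscored now has approximately zero mean and unit variance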
milk.defaultclassifier(mode='medium')¶
Return the default classifier learner
This is an SVM-based classifier using the 1-vs-1 technique for multi-class problems (by default; see the multi_strategy parameter). The features are first cleaned up (normalised to [-1, +1]) and passed through SDA feature selection.
Parameters: mode : string, optional
One of (‘fast’, ‘medium’, ‘slow’, ‘really-slow’). This defines the speed/accuracy trade-off. It essentially defines how large the SVM parameter range is.
multi_strategy : str, optional
One of (‘1-vs-1’, ‘1-vs-rest’, ‘ecoc’). This defines the strategy used to convert the base binary classifier to a multi-class classifier.
expanded : boolean, optional
If true, then instead of a single learner, it returns a list of possible learners.
Returns: learner : classifier learner object or list
If expanded, then it returns a list
See also
feature_selection_simple
- Just perform the feature selection
svm_simple
- Perform classification
milk.defaultlearner(mode='medium')¶
Return the default classifier learner
This is an SVM-based classifier using the 1-vs-1 technique for multi-class problems (by default; see the multi_strategy parameter). The features are first cleaned up (normalised to [-1, +1]) and passed through SDA feature selection.
Parameters: mode : string, optional
One of (‘fast’, ‘medium’, ‘slow’, ‘really-slow’). This defines the speed/accuracy trade-off. It essentially defines how large the SVM parameter range is.
multi_strategy : str, optional
One of (‘1-vs-1’, ‘1-vs-rest’, ‘ecoc’). This defines the strategy used to convert the base binary classifier to a multi-class classifier.
expanded : boolean, optional
If true, then instead of a single learner, it returns a list of possible learners.
Returns: learner : classifier learner object or list
If expanded, then it returns a list
See also
feature_selection_simple
- Just perform the feature selection
svm_simple
- Perform classification
milk.nfoldcrossvalidation(features, labels, nfolds=None, learner=None, origins=None, return_predictions=False, folds=None, initial_measure=0, classifier=None)¶
Perform n-fold cross validation

cmatrix,names = nfoldcrossvalidation(features, labels, nfolds=10, learner={defaultclassifier()}, origins=None, return_predictions=False)
cmatrix,names,predictions = nfoldcrossvalidation(features, labels, nfolds=10, learner={defaultclassifier()}, origins=None, return_predictions=True)
cmatrix will be an N x N matrix, where N is the number of classes
cmatrix[i,j] will be the number of times that an element of class i was classified as class j
names[i] will correspond to the label name of class i
Parameters: features : a sequence
labels : an array of labels, where label[i] is the label corresponding to features[i]
nfolds : integer, optional
Nr of folds. Default: 10
learner : learner object, optional
learner should implement the train() method to return a model (something with an apply() method). Default: defaultclassifier(). This parameter used to be called classifier and that name is still supported.
origins : sequence, optional
Origin ID (see foldgenerator)
return_predictions : bool, optional
whether to return predictions (default: False)
folds : sequence of int, optional
which folds to generate
initial_measure : any, optional
what initial value to use for the results reduction (default: 0)
Returns: cmatrix : ndarray
confusion matrix
names : sequence
sequence of labels so that cmatrix[i,j] corresponds to names[i], names[j]
predictions : sequence
predicted output for each element
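Example¶
A minimal sketch using the default learner (data as in the Example section above); overall accuracy can be read off the confusion matrix:

import numpy as np
import milk

features = np.random.randn(100, 20)
features[:50] *= 2
labels = np.repeat((0, 1), 50)
cmatrix, names = milk.nfoldcrossvalidation(features, labels, nfolds=10)
accuracy = cmatrix.trace() / float(cmatrix.sum())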
milk.supervised
This holds the supervised classification modules:
Submodules¶
- defaultclassifier: contains a default “good enough” classifier
- svm: related to SVMs
- grouped: contains objects to transform single-object classifiers into group classifiers by voting
- multi: transforms binary classifiers into multi-class classifiers (1-vs-1 or 1-vs-rest)
- featureselection: feature selection
- knn: k-nearest neighbours
- tree: decision tree classifiers
Classifiers¶
All classifiers have a train function which takes 2 arguments:
- features : sequence of features
- labels : sequence of labels
They return a model object, which has an apply function which takes a single input and returns its label.
Note that there are always two objects: the learner and the model, and they are independent. Every time you call learner.train() you get a new model.
Both classifiers and models are pickle()able.
Example¶
import numpy as np
import milk

features = np.random.randn(100, 20)
features[:50] *= 2
labels = np.repeat((0, 1), 50)
classifier = milk.defaultclassifier()
model = classifier.train(features, labels)
new_label = model.apply(np.random.randn(20))
new_label2 = model.apply(np.random.randn(20)*2)
milk.supervised.normaliselabels(labels, multi_label=False)¶
If not multi_label (the default), normalises the labels to be integers from 0 through N-1. Otherwise, assumes that each label is actually a sequence of labels.
normalised is a np.array, while names is a list mapping the indices to the old names.
Parameters: labels : any iterable of labels
multi_label : bool, optional
Whether labels are actually composed of multiple labels
Returns: normalised : a numpy ndarray
If not multi_label, this is an array of integers 0 .. N-1; otherwise, it is a boolean array of size len(labels) x N
names : list of label names
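Example¶
A small sketch; the exact integer assigned to each label depends on the ordering milk chooses, so the values below are only indicative:

from milk.supervised import normaliselabels

labels = ['dog', 'cat', 'dog', 'bird']
normalised, names = normaliselabels(labels)
# normalised is an integer array such as array([2, 1, 2, 0]);
# names[i] gives back the original label for index i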
milk.supervised.defaultclassifier(mode='medium')¶
Return the default classifier learner
This is an SVM-based classifier using the 1-vs-1 technique for multi-class problems (by default; see the multi_strategy parameter). The features are first cleaned up (normalised to [-1, +1]) and passed through SDA feature selection.
Parameters: mode : string, optional
One of (‘fast’, ‘medium’, ‘slow’, ‘really-slow’). This defines the speed/accuracy trade-off. It essentially defines how large the SVM parameter range is.
multi_strategy : str, optional
One of (‘1-vs-1’, ‘1-vs-rest’, ‘ecoc’). This defines the strategy used to convert the base binary classifier to a multi-class classifier.
expanded : boolean, optional
If true, then instead of a single learner, it returns a list of possible learners.
Returns: learner : classifier learner object or list
If expanded, then it returns a list
See also
feature_selection_simple
- Just perform the feature selection
svm_simple
- Perform classification
milk.supervised.svm_simple(C, kernel)¶
Returns a one-against-one SVM based classifier with C and kernel
Parameters: C : double
C parameter
kernel : kernel
Kernel to use
Returns: learner : supervised learner
See also
feature_selection_simple
- Perform feature selection
defaultlearner
- feature selection and gridsearch for SVM parameters
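Example¶
A usage sketch. The rbf_kernel object and its constructor argument are assumptions about milk.supervised.svm, not confirmed by this page; check the kernels available in your version:

import numpy as np
import milk.supervised
from milk.supervised.svm import rbf_kernel  # kernel name assumed

features = np.random.randn(100, 20)
features[:50] *= 2
labels = np.repeat((0, 1), 50)
learner = milk.supervised.svm_simple(C=1., kernel=rbf_kernel(1.))  # argument assumed to be the kernel width
model = learner.train(features, labels)
print(model.apply(np.random.randn(20)))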
class milk.supervised.gridsearch(base, measure=accuracy, nfolds=10, params={param1 : [...], param2 : [...]}, annotate=False)¶
Perform a grid search for the best parameter values.
When G.train() is called, then for each combination of p1 in param1, p2 in param2, ... it performs (effectively):
base.param1 = p1
base.param2 = p2
...
value[p1, p2, ...] = measure(nfoldcrossvalidation(..., learner=base))
It then picks the best-scoring set of parameters and re-learns a model on the whole data.
Parameters: base : classifier to use
measure : function, optional
a function that takes labels and outputs and returns the loss. Default: 0/1 loss. This must be an additive function.
nfolds : integer, optional
Nr of folds
params : dictionary
annotate : boolean
Whether to annotate the returned model with arguments and value fields with the result of cross-validation. Defaults to False.
All of the above can be passed as parameters to the constructor or set as attributes.
See also
gridminimise
- Implements the basic functionality behind this object
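Example¶
A sketch using tree_learner (documented below), whose min_split attribute makes it directly tunable this way:

import numpy as np
import milk.supervised

features = np.random.randn(100, 20)
features[:50] *= 2
labels = np.repeat((0, 1), 50)
G = milk.supervised.gridsearch(milk.supervised.tree_learner(),
                               params={'min_split': [2, 4, 8, 16]})
model = G.train(features, labels)
print(model.apply(np.random.randn(20)))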
milk.supervised.lasso(X, Y, B={np.zeros()}, lam=1., max_iter={1024}, tol={1e-5})¶
Solve LASSO Optimisation

B* = arg min_B ½/n || Y - BX ||₂² + λ||B||₁

where n is the number of samples.
Milk uses coordinate descent, looping through the coordinates in order (with an active set strategy to update only non-zero βs, if possible). The problem is convex and the solution is guaranteed to be optimal (within floating point accuracy).
Parameters: X : ndarray
Design matrix
Y : ndarray
Matrix of outputs
B : ndarray, optional
Starting values for approximation. This can be used for a warm start if you have an estimate of where the solution should be. If used, the solution might be written in-place (if the array has the right format).
lam : float, optional
λ (default: 1.0)
max_iter : int, optional
Maximum nr of iterations (default: 1024)
tol : float, optional
Tolerance. Whenever a parameter is to be updated by a value smaller than tolerance, that is considered a null update. Be careful: if the value is too small, performance will degrade horribly. (default: 1e-5)
Returns: B : ndarray
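Example¶
A minimal sketch following the Y ≈ BX formulation above; the shapes (features x samples) are illustrative:

import numpy as np
from milk.supervised import lasso

np.random.seed(0)
X = np.random.randn(5, 100)               # design matrix
B_true = np.zeros((3, 5))
B_true[0, 0] = 1.                         # sparse true coefficients
Y = np.dot(B_true, X) + .01 * np.random.randn(3, 100)
B = lasso(X, Y, lam=.1)                   # B should be a sparse estimate of B_true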
milk.supervised.lasso_walk(X, Y, B={np.zeros()}, nr_steps={64}, start={automatically inferred}, step={.9}, tol=None, return_lams=False)¶
Bs, lams = lasso_walk(X, Y, B={np.zeros()}, nr_steps={64}, start={automatically inferred}, step={.9}, tol=None, return_lams=True)
Repeatedly solve LASSO Optimisation

B* = arg min_B ½/n || Y - BX ||₂² + λ||B||₁

for different values of λ.
Parameters: X : ndarray
Design matrix
Y : ndarray
Matrix of outputs
B : ndarray, optional
Starting values for approximation. This can be used for a warm start if you have an estimate of where the solution should be.
start : float, optional
first λ to use (default is np.abs(Y).max())
nr_steps : int, optional
How many steps in the path (default is 64)
step : float, optional
Multiplicative step to take (default is 0.9)
tol : float, optional
This is the tolerance parameter. It is passed to the lasso function unmodified.
return_lams : bool, optional
Whether to return the values of λ used (default: False)
Returns: Bs : ndarray
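Example¶
A sketch of walking down the regularisation path; data dimensions are illustrative:

import numpy as np
from milk.supervised import lasso_walk

np.random.seed(0)
X = np.random.randn(5, 100)
Y = np.random.randn(3, 100)
Bs, lams = lasso_walk(X, Y, nr_steps=16, return_lams=True)
# Bs stacks one coefficient matrix per value of λ; lams holds those λ values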
class milk.supervised.tree_learner()¶
model = tree.train(features, labels)
model2 = tree.train(features, labels, weights=weights)
predicted = model.apply(testfeatures)
A decision tree classifier (currently, implements the greedy ID3 algorithm without any pruning).
Attributes
criterion : function, optional
criterion to use for tree construction; this should be a function that receives a set of labels (default: information_gain).
min_split : integer, optional
minimum size to split on (default: 4).
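Example¶
A runnable version of the call pattern shown in the signature above:

import numpy as np
import milk.supervised

features = np.random.randn(100, 20)
features[:50] *= 2
labels = np.repeat((0, 1), 50)
tree = milk.supervised.tree_learner()
model = tree.train(features, labels)
predicted = model.apply(np.random.randn(20))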
milk.unsupervised
Unsupervised Learning¶
- kmeans: This is a highly optimised implementation of kmeans
- PCA: Simple implementation
- Non-negative matrix factorisation: both direct and with sparsity constraints
milk.unsupervised.center(features, axis=0, inplace=False)¶
Center data
Parameters: features : ndarray
2-D input array
axis : integer, optional
which axis to normalise (default: 0)
can_have_nans : boolean, optional
whether features is allowed to have NaNs (default: True)
inplace : boolean, optional
Whether to operate inline (i.e., potentially change the input array). Default is False
Returns: features : ndarray
centered version of features
mean : ndarray
mean values
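Example¶
A short sketch; both the centered data and the removed means are returned:

import numpy as np
import milk.unsupervised

features = np.random.randn(100, 20) + 3.
centered, mean = milk.unsupervised.center(features)
# columns of centered have zero mean; mean holds the per-column means that were subtracted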
milk.unsupervised.kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None)¶
centroids = kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None, return_assignments=False)
assignments = kmeans(fmatrix, k, distance='euclidean', max_iter=1000, R=None, icov=None, covmat=None, return_centroids=False)
k-Means Clustering
Parameters: fmatrix : ndarray
2-ndarray (Nelements x Nfeatures)
distance : string, optional
one of:
- ‘euclidean’ : euclidean distance (default)
- ‘seuclidean’ : standardised euclidean distance. This is equivalent to first normalising the features.
- ‘mahalanobis’ : mahalanobis distance. This can make use of the following keyword arguments: ‘icov’ (the inverse of the covariance matrix) and ‘covmat’ (the covariance matrix). If neither is passed, the function computes the covariance from the feature matrix.
max_iter : integer, optional
Maximum number of iterations (default: 1000)
R : source of randomness, optional
return_centroids : boolean, optional
Whether to return centroids (default: True)
return_assignments: boolean, optional
Whether to return centroid assignments (default: True)
centroids: ndarray (optional)
Initial centroids to use for clustering. If not supplied, centroids will be randomly initialized. 2-ndarray (k x Nfeatures)
Returns: assignments : ndarray
A 1-D array of size len(fmatrix)
centroids : ndarray
An array of k centroids
milk.unsupervised.mds(features, ndims, zscore=False)¶
Euclidean Multi-dimensional Scaling
Parameters: features : ndarray
data matrix
ndims : int
Number of dimensions to return
zscore : boolean, optional
Whether to zscore the features (default: False)
Returns: X : ndarray
array of size (m, ndims) where m = len(features)
See also
mds_dists
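Example¶
A minimal sketch projecting random data down to two dimensions:

import numpy as np
import milk.unsupervised

features = np.random.randn(100, 20)
X = milk.unsupervised.mds(features, 2)
# X has shape (100, 2)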
milk.unsupervised.mds_dists(distances, ndims)¶
Euclidean Multi-dimensional Scaling based on a distance matrix
Parameters: distances : ndarray
distance matrix
ndims : int
Number of dimensions to return
Returns: X : ndarray
array of size (m, ndims) where m = len(distances)
See also
mds
milk.unsupervised.pca(X, zscore=True)¶
Principal Component Analysis
Performs principal component analysis. Returns transformed matrix and principal components
Parameters: X : 2-dimensional ndarray
data matrix
zscore : boolean, optional
whether to normalise to zscores (default: True)
Returns: Y : ndarray
Transformed matrix (of same dimension as X)
V : ndarray
principal components
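Example¶
A short sketch; Y is the data expressed in the principal-component basis:

import numpy as np
import milk.unsupervised

X = np.random.randn(100, 20)
Y, V = milk.unsupervised.pca(X)
# Y has the same shape as X; V holds the principal components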
milk.unsupervised.pdist(X, Y={X}, distance='euclidean2')¶
Compute distance matrix:
D[i,j] == np.sum( (X[i] - Y[j])**2 )
Parameters: X : feature matrix
Y : feature matrix (default: use X)
distance : one of ‘euclidean’ or ‘euclidean2’ (default)
Returns: D : matrix of doubles
milk.unsupervised.plike(X, sigma2={guess based on X})¶
Compute likelihood that any two objects come from the same distribution under a Gaussian distribution hypothesis:
L[i,j] = exp( -||X[i] - X[j]||^2 / sigma2 )
Parameters: X : ndarray
feature matrix
sigma2 : float, optional
bandwidth
Returns: L : ndarray
likelihood matrix
See also
pdist
- Compute distances between objects
milk.unsupervised.repeated_kmeans(fmatrix, k, repeats, distance='euclidean', max_iter=1000, **kwargs)¶
Runs kmeans repeats times and returns the best result as evaluated according to distance
Parameters: fmatrix : feature matrix
k : nr of centroids
repeats : Nr of repetitions
distance : ‘euclidean’ (default) or ‘seuclidean’
max_iter : Max nr of iterations per kmeans run
R : random source
Returns: assignments : 1-D array of assignments
centroids : centroids
These are the same returns as the kmeans function
See also
kmeans
- runs kmeans once
milk.unsupervised.select_best_kmeans(fmatrix, ks, repeats=1, method='AIC', R=None, **kwargs)¶
Runs kmeans repeats times for each value of k in ks and returns the best result as evaluated according to method
Parameters: fmatrix : feature matrix
ks : sequence of integers
nr of centroids to try
repeats : integer, optional
Nr of repetitions for each value of k
R : random source, optional
Returns: assignments : 1-D array of assignments
centroids : 2-D ndarray
centroids
These are the same returns as the kmeans function
See also
kmeans
- runs kmeans once
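Example¶
A sketch of model selection over several values of k (method='AIC' is the default criterion per the signature):

import numpy as np
import milk.unsupervised

features = np.vstack([np.random.randn(50, 4), np.random.randn(50, 4) + 5])
assignments, centroids = milk.unsupervised.select_best_kmeans(features, ks=[2, 3, 4, 5], repeats=3)
# the best-scoring k is chosen; two well-separated blobs should favour k=2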
milk.unsupervised.som(data, shape, iterations=1000, L=0.2, radius=4, R=None)¶
grid = som(data, shape, iterations=1000, L=.2, radius=4, R=None)
Self-organising maps
Parameters: data : ndarray
data to feed to array
shape : tuple
Desired shape of output. Must be 2-dimensional.
L : float, optional
How much to influence neighbouring points (default: .2)
radius : integer, optional
Maximum radius of influence (in L_1 distance, default: 4)
iterations : integer, optional
Number of iterations
R : source of randomness
Returns: grid : ndarray
Map
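Example¶
A minimal sketch; the 8x8 grid shape is illustrative:

import numpy as np
import milk.unsupervised

data = np.random.randn(500, 3)
grid = milk.unsupervised.som(data, (8, 8), iterations=1000)
# grid is an 8x8 map of prototype vectors fitted to the data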
milk.unsupervised.zscore(features, axis=0, can_have_nans=True, inplace=False)¶
Returns a copy of features which has been normalised to zscores
Parameters: features : ndarray
2-D input array
axis : integer, optional
which axis to normalise (default: 0)
can_have_nans : boolean, optional
whether features is allowed to have NaNs (default: True)
inplace : boolean, optional
Whether to operate inline (i.e., potentially change the input array). Default is False
Returns: features : ndarray
zscored version of features
milk.unsupervised.lee_seung(X, r, cost='norm2', tol=1e-8, R=None)¶
Implement Lee & Seung’s algorithm
Parameters: X : 2-ndarray
input matrix
r : integer
nr of latent features
cost : one of:
‘norm2’ : minimise || X - AS ||_2 (default)
‘i-div’ : minimise D(X||AS), where D is I-divergence (generalisation of K-L divergence)
max_iter : integer, optional
maximum number of iterations (default: 10000)
tol : double
tolerance threshold for early exit (when the update factor is within tol of 1., the function exits)
R : integer, optional
random seed
Returns: A : 2-ndarray
S : 2-ndarray
milk.unsupervised.sparse_nnmf(V, r, sparsenessW=None, sparsenessH=None, max_iter=10000, R=None)¶
Implement sparse nonnegative matrix factorisation.
Parameters: V : 2-D matrix
input feature matrix
r : integer
number of latent features
sparsenessW : double, optional
sparseness constraint on W (default: no sparsity constraint)
sparsenessH : double, optional
sparseness constraint on H (default: no sparsity constraint)
max_iter : integer, optional
maximum nr of iterations (default: 10000)
R : integer, optional
source of randomness
Returns: W : 2-ndarray
H : 2-ndarray
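Example¶
A minimal sketch; the input matrix must be non-negative:

import numpy as np
import milk.unsupervised

V = np.abs(np.random.randn(50, 40))   # non-negative input
W, H = milk.unsupervised.sparse_nnmf(V, 8, max_iter=2000)
# np.dot(W, H) approximates V with 8 latent features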