Cross-validation
Cross-validation is one of the best ways to evaluate the performance of supervised classification.
Cross-validation consists of separating the data into folds (hence the name _n_-fold cross-validation, where _n_ is a positive integer). For the purpose of this discussion, we consider 10 folds. In the first round, we leave the first fold out: we train on the other 9 folds and then evaluate the model on the left-out fold. In the second round, we leave the second fold out. This continues until every fold has been left out exactly once.
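The procedure above can be sketched in plain NumPy, independently of milk: each round builds a boolean mask for the held-out fold and trains on its complement. (The round-robin fold assignment here is only for illustration; milk's own assignment is stratified.)

```python
import numpy as np

# Assign each of 20 datapoints to one of 10 folds (round-robin sketch).
fold_of = np.arange(20) % 10

rounds = []
for fold in range(10):
    test = (fold_of == fold)   # boolean mask for the left-out fold
    train = ~test              # train on the remaining 9 folds
    rounds.append((train, test))

# Every datapoint is left out exactly once across the 10 rounds.
times_left_out = sum(test.astype(int) for _, test in rounds)
```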
Milk supports what is often explicitly called stratified cross-validation, which means that it takes the class distributions into account (so that, in 10-fold cross-validation, each fold contains 10% of each class).
An additional piece of functionality, not normally found in machine learning packages or in machine learning theory, but very useful in practice, is the origins parameter. Every datapoint can have an associated origin. This is an integer, and its meaning is the following: all examples with the same origin will be in the same fold, so the model is never tested on an object whose origin also appeared in training.
This can model cases such as the following: you have collected patient data, which includes both some health measurements and an outcome of interest (for example, how the patient was doing a year after the initial exam). You wish to evaluate a supervised classification algorithm for predicting outcomes. In particular, you want an estimate of how well the system would perform on patients at any location (you know that the data collection has some site effects, perhaps because each person runs the test a little bit differently). Fortunately, you have the data to test this: the patients come from several clinics. You set each patient's origin to be the ID of their clinic and evaluate the per-patient accuracy.
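The clinic scenario can be sketched in a few lines of plain Python (folds_by_origin is a hypothetical helper written for this illustration, not part of milk): by assigning folds at the level of origins rather than individual datapoints, no clinic ever contributes to both training and testing in the same round.

```python
import numpy as np

def folds_by_origin(origins, nfolds):
    # Map each distinct origin (e.g. a clinic ID) to a fold, so that
    # all examples sharing an origin land in the same fold.
    origins = np.asarray(origins)
    origin_fold = {org: i % nfolds for i, org in enumerate(np.unique(origins))}
    return np.array([origin_fold[o] for o in origins])

# Six patients from three clinics; each patient's origin is their clinic ID.
clinic_ids = [101, 101, 205, 205, 309, 309]
fold_of = folds_by_origin(clinic_ids, nfolds=3)
```

Patients from the same clinic always receive the same fold number, so testing on one fold never uses a clinic that was seen in training.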
API Documentation
- milk.measures.nfoldcrossvalidation.foldgenerator(labels, nfolds=None, origins=None, folds=None, multi_label=False)

  for train, test in foldgenerator(labels, nfolds=None, origins=None):
      ...
This generator breaks up the data into n folds (default 10).
If origins is given, then all elements that share the same origin will either be in testing or in training (never in both). This is useful when you have several replicates that shouldn't be mixed between training and testing but that can otherwise be treated as independent for learning.
Parameters:
- labels : a sequence
  the labels
- nfolds : integer, optional
  number of folds (default 10 or minimum label size)
- origins : sequence, optional
  if present, must be an array of indices of the same size as labels
- folds : sequence of int, optional
  which folds to generate

Returns:
- iterator over (train, test) pairs of boolean arrays
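The documented semantics can be mimicked with a simplified pure-NumPy generator (a sketch under stated assumptions, not milk's implementation: no stratification, and a naive round-robin assignment of origins to folds):

```python
import numpy as np

def simple_foldgenerator(labels, nfolds=10, origins=None):
    # Yield (train, test) boolean arrays; shared origins stay together.
    labels = np.asarray(labels)
    if origins is None:
        origins = np.arange(len(labels))  # each point is its own origin
    origins = np.asarray(origins)
    origin_fold = {org: i % nfolds for i, org in enumerate(np.unique(origins))}
    fold_of = np.array([origin_fold[o] for o in origins])
    for fold in range(nfolds):
        test = (fold_of == fold)
        yield ~test, test

labels = [0, 0, 1, 1, 0, 1]
masks = list(simple_foldgenerator(labels, nfolds=3))
```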
- milk.measures.nfoldcrossvalidation.getfold(labels, fold, nfolds=None, origins=None)

  Get the training and testing set for fold number fold out of nfolds folds.
Arguments are the same as for foldgenerator
Parameters:
- labels : ndarray of labels
- fold : integer
  which fold to return
- nfolds : integer, optional
  number of folds (default 10 or size of smallest class)
- origins : sequence, optional
  if given, then objects with the same origin are not scattered across folds
- milk.measures.nfoldcrossvalidation.nfoldcrossvalidation(features, labels, nfolds=None, learner=None, origins=None, return_predictions=False, folds=None, initial_measure=0, classifier=None)

  Perform n-fold cross-validation.

  cmatrix, names = nfoldcrossvalidation(features, labels, nfolds=10, learner={defaultclassifier()}, origins=None, return_predictions=False)
  cmatrix, names, predictions = nfoldcrossvalidation(features, labels, nfolds=10, learner={defaultclassifier()}, origins=None, return_predictions=True)

  cmatrix will be an N x N matrix, where N is the number of classes;
  cmatrix[i,j] will be the number of times that an element of class i was classified as class j;
  names[i] will correspond to the label name of class i.
Parameters:
- features : a sequence
- labels : an array of labels, where labels[i] is the label corresponding to features[i]
- nfolds : integer, optional
  number of folds (default: 10)
- learner : learner object, optional
  learner should implement the train() method to return a model (something with an apply() method); defaultclassifier() by default. This parameter used to be called classifier, and that name is still supported.
- origins : sequence, optional
  origin IDs (see foldgenerator)
- return_predictions : bool, optional
  whether to return predictions (default: False)
- folds : sequence of int, optional
  which folds to generate
- initial_measure : any, optional
  initial value to use for the results reduction (default: 0)

Returns:
- cmatrix : ndarray
  confusion matrix
- names : sequence
  sequence of labels so that cmatrix[i,j] corresponds to names[i], names[j]
- predictions : sequence
  predicted output for each element (only if return_predictions is True)
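To make the cmatrix semantics concrete, here is a small self-contained confusion-matrix builder following the documented convention (a sketch for illustration; nfoldcrossvalidation builds its matrix internally, and these label values are made up):

```python
import numpy as np

def confusion_matrix(true_labels, predicted, names):
    # cmatrix[i, j] counts elements of class names[i] predicted as names[j].
    index = {name: i for i, name in enumerate(names)}
    cmatrix = np.zeros((len(names), len(names)), dtype=int)
    for t, p in zip(true_labels, predicted):
        cmatrix[index[t], index[p]] += 1
    return cmatrix

names = ['healthy', 'sick']
true_labels = ['healthy', 'healthy', 'sick', 'sick']
predicted   = ['healthy', 'sick',    'sick', 'sick']
cmatrix = confusion_matrix(true_labels, predicted, names)
# cmatrix[0, 1] == 1: one element of class 'healthy' was classified as 'sick'
```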