Cross validation is one of the better ways to evaluate the performance of supervised classification.
Cross validation consists of separating the data into folds (hence the name _n_-fold cross-validation, where _n_ is a positive integer). For the purposes of this discussion, we consider 10 folds. In the first round, we leave the first fold out: we train on the other 9 folds and then evaluate the model on the left-out fold. In the second round, we leave the second fold out. This continues until every fold has been left out exactly once.
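The rounds described above can be sketched in a few lines of plain Python. This is only an illustration of the procedure, not milk's implementation: `kfold_indices` is a hypothetical helper, and the round-robin assignment of samples to folds is an assumption made for simplicity.

```python
def kfold_indices(n_samples, n_folds=10):
    """Yield (train, test) index lists, leaving each fold out exactly once.

    Hypothetical helper for illustration; not part of milk.
    """
    # Assumption: sample i goes to fold i % n_folds (round-robin).
    fold_of = [i % n_folds for i in range(n_samples)]
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if fold_of[i] == fold]
        train = [i for i in range(n_samples) if fold_of[i] != fold]
        yield train, test


# With 20 samples and 10 folds, each round trains on 18 samples and tests on 2.
for round_number, (train, test) in enumerate(kfold_indices(20, n_folds=10)):
    print(round_number, "train on", len(train), "samples; test on", test)
```

Across all rounds, every sample appears in a test set exactly once.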
Milk supports what is often explicitly called stratified cross validation, which means that it takes the class distributions into account (so that, in 10-fold cross validation, each fold will contain roughly 10% of the examples of each class).
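To make the idea of stratification concrete, here is a minimal sketch that deals the members of each class across the folds in turn. The function name `stratified_fold_assignment` is hypothetical and the dealing scheme is an assumption; milk's own assignment may differ in detail.

```python
from collections import defaultdict


def stratified_fold_assignment(labels, n_folds=10):
    """Return the fold index of each sample, balancing every class across folds.

    Hypothetical helper for illustration; not part of milk.
    """
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fold_of = [None] * len(labels)
    for indices in by_class.values():
        # Deal the members of this class round-robin across the folds.
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of


# 20 examples of class "a" and 10 of class "b", split into 10 folds:
# each fold receives 2 examples of "a" and 1 of "b".
labels = ["a"] * 20 + ["b"] * 10
fold_of = stratified_fold_assignment(labels, n_folds=10)
```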
An additional piece of functionality, not normally found in machine learning packages or in machine learning theory but very useful in practice, is the origins parameter. Every datapoint can have an associated origin. This is an integer, and its meaning is the following: all examples with the same origin will be in the same fold (so a model is never tested on an object whose origin was also present in its training data).
This can model cases such as the following: you have collected patient data, which includes both some health measurements and an outcome of interest (for example, how the patient was doing a year after the initial exam). You wish to evaluate a supervised classification algorithm for predicting outcomes. In particular, you want an estimate of how well the system would perform on patients in any location (you know that the data collection has some site effects, perhaps because each person runs the test a little bit differently). Fortunately, you have the data to test this: the patients come from several clinics. Now, you set each patient's origin to be the ID of the clinic and evaluate the per-patient accuracy.
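The clinic scenario can be sketched as follows. For simplicity, this example assumes one fold per clinic (leave-one-clinic-out); the helper `folds_by_origin` and the sample clinic IDs are made up for illustration, and milk may group several origins into one fold.

```python
def folds_by_origin(origins):
    """Map each sample to a fold so that samples sharing an origin share a fold.

    Hypothetical helper for illustration; not part of milk.
    Assumption: each distinct origin gets its own fold.
    """
    fold_of_origin = {}
    for origin in origins:
        if origin not in fold_of_origin:
            fold_of_origin[origin] = len(fold_of_origin)
    return [fold_of_origin[origin] for origin in origins]


# One entry per patient: the (made-up) ID of the clinic that patient came from.
clinic_ids = [0, 0, 1, 2, 1, 2, 0]
fold_of = folds_by_origin(clinic_ids)

for fold in sorted(set(fold_of)):
    test = [i for i, f in enumerate(fold_of) if f == fold]
    train = [i for i, f in enumerate(fold_of) if f != fold]
    # No clinic contributes patients to both sides of the split.
    train_clinics = {clinic_ids[i] for i in train}
    test_clinics = {clinic_ids[i] for i in test}
    print("fold", fold, "tests clinics", test_clinics, "trains on", train_clinics)
```

Because every test patient comes from a clinic the model never saw during training, the resulting accuracy reflects performance at a new site rather than at the sites already sampled.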
This generator breaks up the data into n folds (default 10).
If origins is given, then all elements that share the same origin will be either in testing or in training (never in both). This is useful when you have several replicates that shouldn’t be mixed between training and testing but that can otherwise be treated as independent for learning.
| Parameters : | labels : a sequence; nfolds : integer; origins : sequence, optional; folds : sequence of int, optional |
|---|---|
| Returns : | iterator over `train, test`, two boolean arrays |
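The contract described above (an iterator over pairs of boolean masks that respects origins) can be illustrated with a small stand-alone generator. This is a sketch under stated assumptions, not milk's code: `fold_masks` is a hypothetical name, plain lists of booleans stand in for boolean arrays, origins are dealt to folds round-robin, and stratification by label is ignored.

```python
def fold_masks(labels, nfolds=10, origins=None):
    """Yield (train, test) pairs of boolean lists, one pair per fold.

    Hypothetical stand-in for illustration; not milk's foldgenerator.
    Assumption: labels are used only for their length (no stratification).
    """
    n_samples = len(labels)
    if origins is None:
        # Without origins, treat every sample as its own origin.
        origins = list(range(n_samples))

    # Assign each distinct origin to a fold round-robin, in order of appearance.
    fold_of_origin = {}
    for origin in origins:
        fold_of_origin.setdefault(origin, len(fold_of_origin) % nfolds)

    for fold in range(nfolds):
        test = [fold_of_origin[origin] == fold for origin in origins]
        train = [not in_test for in_test in test]
        yield train, test


labels = [0, 0, 0, 1, 1, 1]
origins = [10, 10, 11, 11, 12, 12]
for train, test in fold_masks(labels, nfolds=3, origins=origins):
    print("train:", train, "test:", test)
```

Each mask has one entry per sample, so it can be used directly to select the matching rows of the feature data. If there are fewer distinct origins than folds, some folds in this sketch have an empty test set.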
Get the training and testing sets for fold `fold` out of `nfolds`.
Arguments are the same as for `foldgenerator`.
| Parameters : | labels : ndarray of labels; fold : integer; nfolds : integer; origins : sequence, optional |
|---|---|
Perform n-fold cross validation
`cmatrix,names = nfoldcrossvalidation(features, labels, nfolds=10, learner={defaultclassifier()}, origins=None, return_predictions=False)`

`cmatrix,names,predictions = nfoldcrossvalidation(features, labels, nfolds=10, learner={defaultclassifier()}, origins=None, return_predictions=True)`
cmatrix will be an N x N matrix, where N is the number of classes
cmatrix[i,j] will be the number of times that an element of class i was classified as class j
names[i] will correspond to the label name of class i
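To show how a matrix with this layout is read, the following sketch builds one from true and predicted labels and derives the overall accuracy from its diagonal. The helpers `confusion_matrix` and `accuracy` and the toy labels are hypothetical; the assumption that `names` is sorted is made here for illustration and is not stated by the source.

```python
def confusion_matrix(true_labels, predicted_labels):
    """Return (cmatrix, names) where cmatrix[i][j] counts class i predicted as class j.

    Hypothetical helper for illustration; not part of milk.
    """
    # Assumption: class names are listed in sorted order.
    names = sorted(set(true_labels) | set(predicted_labels))
    index_of = {name: i for i, name in enumerate(names)}
    n_classes = len(names)
    cmatrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, predicted in zip(true_labels, predicted_labels):
        cmatrix[index_of[truth]][index_of[predicted]] += 1
    return cmatrix, names


def accuracy(cmatrix):
    """Fraction of examples on the diagonal, i.e. classified correctly."""
    correct = sum(cmatrix[i][i] for i in range(len(cmatrix)))
    total = sum(sum(row) for row in cmatrix)
    return correct / total


true_labels = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "cat"]
cmatrix, names = confusion_matrix(true_labels, predicted)
print(names)              # ['cat', 'dog']
print(cmatrix)            # [[1, 1], [1, 2]]
print(accuracy(cmatrix))  # 0.6
```

Row `i` sums to the number of examples whose true class is `names[i]`, so per-class recall is `cmatrix[i][i]` divided by the sum of row `i`.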
| Parameters : | features : a sequence; labels : an array of labels, where labels[i] is the label corresponding to features[i]; nfolds : integer, optional; learner : learner object, optional; origins : sequence, optional; return_predictions : bool, optional; folds : sequence of int, optional; initial_measure : any, optional |
|---|---|
| Returns : | cmatrix : ndarray; names : sequence; predictions : sequence |