7.6.5. mclearn.preprocessing.balanced_train_test_split

mclearn.preprocessing.balanced_train_test_split(data, features, target, train_size, test_size, random_state=None)

Split the data into a balanced training set and a balanced test set of the given sizes.

For a dataset with an unequal number of samples in each class, one useful procedure is to split the data into a training set and a test set so that every class contributes the same number of samples to each set.

Parameters:
  • data (DataFrame, shape = [n_samples, n_features]) – Where each row is a sample point and each column is a feature.
  • features (array, shape = [n_features]) – The names of the columns in data that are used as feature vectors.
  • target (str) – The name of the column in data that is used as the target vector.
  • train_size (int) – Number of sample points from each class in the training set.
  • test_size (int) – Number of sample points from each class in the test set.
  • random_state (int, optional (default=None)) – Seed for the random number generator used to draw the samples.
Returns:
  • X_train (array) – The feature vectors (stored as columns) in the training set.
  • X_test (array) – The feature vectors (stored as columns) in the test set.
  • y_train (array) – The target vector in the training set.
  • y_test (array) – The target vector in the test set.
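
The procedure described above can be illustrated with a minimal sketch. This is not mclearn's source code: the function name balanced_split_sketch, the toy DataFrame, and its column names are hypothetical, and the sketch assumes that sampling is done without replacement and that the training and test sets do not overlap. It returns feature arrays with one row per sample, which may differ from the layout mclearn actually uses.

```python
import numpy as np
import pandas as pd


def balanced_split_sketch(data, features, target, train_size, test_size,
                          random_state=None):
    """Illustrative stand-in for a balanced split (hypothetical, not mclearn's code)."""
    rng = np.random.RandomState(random_state)
    labels = data[target].to_numpy()
    train_pos, test_pos = [], []

    for label in np.unique(labels):
        # Row positions of every sample that belongs to this class.
        positions = np.flatnonzero(labels == label)
        if len(positions) < train_size + test_size:
            raise ValueError(
                "class %r has only %d samples" % (label, len(positions)))
        # Shuffle once, then take disjoint slices for the two sets.
        shuffled = rng.permutation(positions)
        train_pos.extend(shuffled[:train_size])
        test_pos.extend(shuffled[train_size:train_size + test_size])

    X = data[list(features)].to_numpy()
    return X[train_pos], X[test_pos], labels[train_pos], labels[test_pos]


# Toy imbalanced dataset: 8 samples of one class, 4 of another.
toy = pd.DataFrame({
    "u": np.arange(12, dtype=float),
    "g": np.arange(12, dtype=float) * 2,
    "label": ["star"] * 8 + ["galaxy"] * 4,
})

X_train, X_test, y_train, y_test = balanced_split_sketch(
    toy, ["u", "g"], "label", train_size=2, test_size=1, random_state=0)
```

With train_size=2 and test_size=1, each of the two classes contributes two rows to the training set and one row to the test set, regardless of how many samples the class has in the full data.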