3. Data Preprocessing¶
3.1. Normalisation¶
If a dataset contains features with vastly different scales, it is advisible to normalise the features first. There are a few option:
- Normalise the features to zero mean and unit variance.
- Normalise the features to unit variance.
- Normalise the features to unit interval.
3.2. Balanced Train-Test Split¶
Often, the class distribution in a dataset is not balanced.
For example, in the SDSS dataset, we have three times as many
galaxies as quasars. To correct for this bias, we might want
to select a balanced training and test set. This is achieved
by balanced_train_test_split()
.