Pebl currently includes only one discretization implementation, though more may be added. Discretization and other data pre-processing steps can substantially affect the final results.
Performs a maximum-entropy discretization of data in-place.
Requirements for this implementation:
- Try to make all bins equal-sized (this maximizes the entropy).
- If datum x==y in the original dataset, then disc(x)==disc(y). For example, all datapoints with value 3.245 discretize to 1, even if that violates the first requirement.
Example:
input:  [3,7,4,4,4,5]
output: [0,1,0,0,0,1]
Note that all 4s discretize to 0, which makes bin sizes unequal.
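The two requirements can be illustrated with a minimal sketch. This is not Pebl's implementation: the function name `maxentropy_discretize_sketch` and its `numbins` parameter are hypothetical, and unlike the routine described above it returns a new list rather than modifying the data in-place. It assumes one reasonable strategy: place each cut at the boundary between distinct values that lies closest to the ideal equal-size position.

```python
from bisect import bisect_left
from collections import Counter


def maxentropy_discretize_sketch(values, numbins=2):
    """Illustrative only: roughly equal-sized bins, equal values share a bin."""
    if not values:
        return []

    counts = Counter(values)
    distinct = sorted(counts)

    # cumulative[i] = number of data points <= distinct[i]
    cumulative = []
    total = 0
    for v in distinct:
        total += counts[v]
        cumulative.append(total)

    # A cut may only fall after a whole run of equal values. For each ideal
    # cut position k*n/numbins, pick the closest such boundary.
    cuts = set()
    for k in range(1, numbins):
        ideal = k * total / numbins
        i = min(range(len(distinct)), key=lambda j: abs(cumulative[j] - ideal))
        cuts.add(distinct[i])
    cuts = sorted(cuts)

    # Each cut value is the largest value of its bin, so a value's bin is the
    # number of cut values strictly below it.
    return [bisect_left(cuts, v) for v in values]


print(maxentropy_discretize_sketch([3, 7, 4, 4, 4, 5]))
```

For the example input, the ideal cut would fall after the third sorted value, which is in the middle of the run of 4s. The sketch moves the cut to just after the 4s, the nearest position that keeps equal values together, so the bins end up with four and two members.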