Frame bin_column_equal_depth


bin_column_equal_depth(self, column_name, num_bins=None, bin_column_name=None)

Classify column into groups with the same frequency.

Parameters:

column_name : unicode

The column whose values are to be binned.

num_bins : int32 (default=None)

The maximum number of bins. The default is the square-root choice, \lfloor \sqrt{m} \rfloor, where m is the number of rows.

bin_column_name : unicode (default=None)

The name for the new column holding the grouping labels. Default is <column_name>_binned.

Returns:

: dict

A list containing the edges of each bin.

Group rows of data based on the value in a single column and add a label to identify grouping.

Equal depth binning attempts to label rows such that each bin contains the same number of elements. For n bins of a column C of length m, the bin number is determined by:

\lceil n * \frac { f(C) }{ m } \rceil

where f is a tie-adjusted ranking function over values of C. If there are multiples of the same value in C, then their tie-adjusted rank is the average of their ordered rank values.
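As an illustration of the formula above, the following is a minimal sketch in plain Python. It is not the library's implementation; the helper name formula_bins is hypothetical and exists only for this example. It assumes ranks are 1-based, so the formula yields bin numbers from 1 to n, whereas the labels shown in the Examples section start at 0.

```python
import math
from collections import defaultdict


def formula_bins(values, num_bins):
    """Hypothetical helper: apply ceil(n * f(C) / m) to every value.

    f is the tie-adjusted rank: the mean of the 1-based positions that
    equal values occupy in sorted order.
    """
    # Collect the 1-based sorted positions held by each distinct value.
    positions = defaultdict(list)
    for rank, value in enumerate(sorted(values), start=1):
        positions[value].append(rank)

    m = len(values)
    bins = []
    for value in values:
        ranks = positions[value]
        tie_adjusted_rank = sum(ranks) / float(len(ranks))
        bins.append(int(math.ceil(num_bins * tie_adjusted_rank / m)))
    return bins


column_a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
bins = formula_bins(column_a, 5)
```

For this eleven-row column the two rows holding the value 1 share the rank 1.5, so both fall into the same bin.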

Notes

  1. Unicode in column names is not supported and will likely cause the drop_frames() method (and others) to fail!
  2. The num_bins parameter is considered to be the maximum permissible number of bins because the data may dictate fewer bins. For example, if the column to be binned has X elements with only 2 distinct values and the num_bins parameter is greater than 2, then the actual number of bins will only be 2. This is due to a restriction that elements with an identical value must belong to the same bin.
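The effect described in note 2 can be seen with a small sketch that applies the ranking formula to made-up data. This is an illustration under stated assumptions, not the library's code: the column contents and the helper tie_adjusted_rank are hypothetical, and the real method may number the resulting bins differently. Only the count of distinct bins is the point here.

```python
import math

column = [7, 7, 7, 9, 9, 9]  # six rows, two distinct values (made-up data)
num_bins = 4                 # more bins requested than distinct values
m = len(column)
ordered = sorted(column)


def tie_adjusted_rank(value):
    """Mean of the 1-based sorted positions occupied by `value`."""
    first = ordered.index(value) + 1
    last = first + ordered.count(value) - 1
    return (first + last) / 2.0


# Equal values share one rank, so they can never be split across bins.
labels = [int(math.ceil(num_bins * tie_adjusted_rank(v) / m)) for v in column]
distinct_bins = len(set(labels))
```

Although four bins were requested, only two distinct bin numbers are produced.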

Examples

Given a frame with column a accessed by a Frame object my_frame:

>>> my_frame.inspect( n=11 )
[##]  a
========
[0]    1
[1]    1
[2]    2
[3]    3
[4]    5
[5]    8
[6]   13
[7]   21
[8]   34
[9]   55
[10]  89

Modify the frame, adding a column showing which bin each row falls in. The data should be grouped into a maximum of five bins. Each bin receives as nearly equal a number of members as the data allows:

>>> cutoffs = my_frame.bin_column_equal_depth('a', 5, 'aEDBinned')
[===Job Progress===]
>>> my_frame.inspect( n=11 )
[##]  a   aEDBinned
===================
[0]    1          0
[1]    1          0
[2]    2          1
[3]    3          1
[4]    5          2
[5]    8          2
[6]   13          3
[7]   21          3
[8]   34          4
[9]   55          4
[10]  89          4
>>> print cutoffs
[1.0, 2.0, 5.0, 13.0, 34.0, 89.0]
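
The returned cutoffs can be related back to the bin labels with a short sketch. It assumes each bin includes its lower edge and excludes its upper edge, except the last bin, which includes both. That convention is inferred from the output above rather than stated by the library, and the helper bin_for is hypothetical.

```python
from bisect import bisect_right

cutoffs = [1.0, 2.0, 5.0, 13.0, 34.0, 89.0]


def bin_for(value, edges):
    """Hypothetical helper: index of the bin whose range contains `value`.

    Assumes lower-inclusive bins, with the final bin also including the
    last edge. Values below the first edge are not handled here.
    """
    index = bisect_right(edges, value) - 1
    # Clamp so that a value equal to the last edge lands in the last bin.
    return min(index, len(edges) - 2)


column_a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
labels = [bin_for(v, cutoffs) for v in column_a]
```

Under that assumption, the labels computed from the cutoffs agree with the aEDBinned column shown above.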