VertexFrame column_median


column_median(self, data_column, weights_column=None)

Calculate the (weighted) median of a column.

Parameters:

data_column : unicode

The column whose median is to be calculated.

weights_column : unicode (default=None)

The column that provides weights (frequencies) for the median calculation. Must contain numerical data. By default, every item has a weight of 1.

Returns:

: varies

The median of the values. If a weight column is provided and no weight is a finite number greater than 0, None is returned. The median has the same type as the contents of the data column: a column of Longs yields a Long median, and a column of Floats yields a Float median.

The median is the least value X in the range of the distribution such that the cumulative weight of values strictly below X is strictly less than half of the total weight, and the cumulative weight of values up to and including X is greater than or equal to half of the total weight.

All data elements whose weight is less than or equal to 0, NaN, or infinite are excluded from the calculation.
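The definition above can be sketched in plain Python. This is only an illustration of the stated rule, not the library's implementation: `weighted_median` is a hypothetical helper that is not part of trustedanalytics, and returning None for an empty input without weights is an assumption made here for convenience.

```python
import math


def weighted_median(values, weights=None):
    """Hypothetical helper illustrating the definition; not a trustedanalytics API."""
    if weights is None:
        # No weights column: every item has a weight of 1.
        weights = [1] * len(values)

    # Exclude items whose weight is NaN, infinite, or not greater than 0.
    pairs = [(v, w) for v, w in zip(values, weights)
             if not math.isnan(w) and not math.isinf(w) and w > 0]
    if not pairs:
        # No usable weights (assumption: also covers an empty input).
        return None

    pairs.sort(key=lambda p: p[0])
    half = sum(w for _, w in pairs) / 2.0

    # Walk the values in ascending order; the first value at which the
    # running weight reaches half of the total is the median.
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]  # guard against floating-point rounding
```

Because the function returns one of the input values unchanged, the result has the same type as the data, which mirrors the type behavior described under Returns.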

Examples

Given a frame with column ‘a’ accessed by a Frame object ‘my_frame’:

>>> import trustedanalytics as ta
>>> ta.connect()
Connected ...
>>> data = [[2],[3],[3],[5],[7],[10],[30]]
>>> schema = [('a', ta.int32)]
>>> my_frame = ta.Frame(ta.UploadRows(data, schema))
[===Job Progress===]

Inspect my_frame:

>>> my_frame.inspect()
[#]  a
=======
[0]   2
[1]   3
[2]   3
[3]   5
[4]   7
[5]  10
[6]  30

Compute and return the median of the values in column ‘a’:

>>> median = my_frame.column_median('a')
[===Job Progress===]
>>> print median
5

Given a frame with a data column ‘a’ and a weights column ‘w’, accessed by a Frame object ‘my_frame’:

>>> data = [[2,1.7],[3,0.5],[3,1.2],[5,0.8],[7,1.1],[10,0.8],[30,0.1]]
>>> schema = [('a', ta.int32), ('w', ta.float32)]
>>> my_frame = ta.Frame(ta.UploadRows(data, schema))
[===Job Progress===]

Inspect my_frame:

>>> my_frame.inspect()
[#]  a   w
=======================
[0]   2   1.70000004768
[1]   3             0.5
[2]   3   1.20000004768
[3]   5  0.800000011921
[4]   7   1.10000002384
[5]  10  0.800000011921
[6]  30   0.10000000149

Compute and return the median of the values in column ‘a’, weighted by column ‘w’:

>>> median = my_frame.column_median('a', weights_column='w')
[===Job Progress===]
>>> print median
3
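To see why the weighted result is 3 rather than 5, it may help to tabulate the cumulative weights by hand. The total weight is 6.2, so half is 3.1; the weight strictly below 3 is 1.7, and the weight up to and including 3 is 1.7 + 0.5 + 1.2 = 3.4. The following standalone sketch reproduces that arithmetic in plain Python, using the same data as the example; it does not call trustedanalytics, and the variable names are illustrative only.

```python
from itertools import groupby

values = [2, 3, 3, 5, 7, 10, 30]
weights = [1.7, 0.5, 1.2, 0.8, 1.1, 0.8, 0.1]

half = sum(weights) / 2.0  # half of the total weight, about 3.1

# For each distinct value, record the weight strictly below it and the
# weight up to and including it.
rows = []
below = 0.0
for value, group in groupby(sorted(zip(values, weights)), key=lambda p: p[0]):
    through = below + sum(w for _, w in group)
    rows.append((value, below, through))
    below = through

# The median is the least value with below < half <= through.
median = next(v for v, lo, hi in rows if lo < half <= hi)
print(median)
```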