chicken_turtle_util.data_frame

Extensions to pandas.DataFrame

Warning

Module contents have only been tested on DataFrames with a plain Index. DataFrames using a MultiIndex may not work with this module’s functions.

assert_equals           Assert 2 data frames are equal
equals                  Get whether 2 data frames are equal
replace_na_with_none    Replace NaN values in pd.DataFrame with None
split_array_like        Split cells with array_like values along row axis.
chicken_turtle_util.data_frame.assert_equals(df1, df2, ignore_order=set(), ignore_indices=set(), all_close=False, _return_reason=False)

Assert 2 data frames are equal

Like assert equals(df1, df2, ...), but with better hints at where the data frames differ. See chicken_turtle_util.data_frame.equals() for detailed parameter doc.

Parameters:

df1, df2 : pd.DataFrame

ignore_order : {int}

ignore_indices : {int}

all_close : bool

chicken_turtle_util.data_frame.equals(df1, df2, ignore_order=set(), ignore_indices=set(), all_close=False, _return_reason=False)

Get whether 2 data frames are equal

NaNs are considered equal (which is consistent with pandas.DataFrame.equals). None is considered equal to NaN.
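
The NaN behaviour of pandas.DataFrame.equals that this function mirrors can be checked with pandas alone. The snippet below is a minimal illustration and does not call chicken_turtle_util; the variable names are chosen here for the example.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'x': [1.0, np.nan]})
right = pd.DataFrame({'x': [1.0, np.nan]})

# A plain comparison never considers NaN equal to itself.
nan_equals_itself = np.nan == np.nan

# Element-wise == therefore reports a mismatch in the NaN cell.
elementwise_all_equal = bool((left == right).all().all())

# DataFrame.equals treats NaNs in the same location as equal.
frames_equal = left.equals(right)
```

Here nan_equals_itself and elementwise_all_equal are False, while frames_equal is True.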

Parameters:

df1, df2 : pd.DataFrame

Data frames to compare

ignore_order : {int}

Axes in which to ignore order.

ignore_indices : {int}

Axes whose index to ignore. E.g. {1} allows df.columns.name to differ and df.columns.equals(df2.columns) to be False.

all_close : bool

If False, values must match exactly. If True, floats are compared as if with np.isclose.

_return_reason : bool

Internal. If True, equals returns a tuple (equal, reason); otherwise it returns only a bool indicating equality (or rather, equivalence).

Returns:

equal : bool

Whether they’re equal (after ignoring according to the parameters)

reason : str or None

None if equal, otherwise a short explanation of why the data frames aren’t equal. Omitted if _return_reason is False.
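
To see why the all_close parameter matters, the following sketch contrasts exact comparison with np.isclose on two floats. It uses NumPy only and is not taken from the library; how equals applies np.isclose internally is not shown here.

```python
import numpy as np

computed = 0.1 + 0.2  # floating-point rounding: not exactly 0.3
expected = 0.3

# Exact comparison, analogous to all_close=False.
exact_match = computed == expected

# Tolerant comparison, analogous to all_close=True.
close_match = bool(np.isclose(computed, expected))
```

exact_match is False and close_match is True, so frames holding such values would presumably only compare equal with all_close=True.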

Notes

All values (including those of indices) must be copyable, and __eq__ must be such that a copy equals its original. A value must equal itself unless it’s np.nan. Values needn’t be orderable or hashable (however, pandas requires index values to be orderable and hashable). As a consequence, this function is flexible but not efficient.

Examples

>>> from chicken_turtle_util import data_frame as df_
>>> import pandas as pd
>>> df = pd.DataFrame([
...        [1, 2, 3],
...        [4, 5, 6],
...        [7, 8, 9]
...    ],
...    index=pd.Index(('i1', 'i2', 'i3'), name='index1'),
...    columns=pd.Index(('c1', 'c2', 'c3'), name='columns1')
... )
>>> df
columns1  c1  c2  c3
index1              
i1         1   2   3
i2         4   5   6
i3         7   8   9
>>> df2 = df.reindex(('i3', 'i1', 'i2'), columns=('c2', 'c1', 'c3'))
>>> df2
columns1  c2  c1  c3
index1              
i3         8   7   9
i1         2   1   3
i2         5   4   6
>>> df_.equals(df, df2)
False
>>> df_.equals(df, df2, ignore_order=(0,1))
True
>>> df2 = df.copy()
>>> df2.index = [1,2,3]
>>> df2
columns1  c1  c2  c3
1          1   2   3
2          4   5   6
3          7   8   9
>>> df_.equals(df, df2)
False
>>> df_.equals(df, df2, ignore_indices={0})
True
>>> df2 = df.reindex(('i3', 'i1', 'i2'))
>>> df2
columns1  c1  c2  c3
index1              
i3         7   8   9
i1         1   2   3
i2         4   5   6
>>> df_.equals(df, df2, ignore_indices={0})  # does not ignore row order!
False
>>> df_.equals(df, df2, ignore_order={0})
True
>>> df2 = df.copy()
>>> df2.index.name = 'other'
>>> df_.equals(df, df2)  # df.index.name must match as well, same goes for df.columns.name
False
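
For comparison, the effect of ignore_order={0, 1} can be roughly approximated with pandas alone by sorting both axes before comparing. This is only a sketch under the assumption that all labels are orderable; equals_ignoring_order is a hypothetical helper, not part of chicken_turtle_util, and it does not reproduce the library's handling of index names or unorderable values.

```python
import pandas as pd

def equals_ignoring_order(left, right):
    """Compare two frames after sorting rows and columns by label (illustrative helper)."""
    def normalise(frame):
        return frame.sort_index(axis=0).sort_index(axis=1)
    return normalise(left).equals(normalise(right))

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=pd.Index(['i1', 'i2', 'i3'], name='index1'),
    columns=pd.Index(['c1', 'c2', 'c3'], name='columns1'),
)
# Same data, with rows and columns in a different order.
shuffled = df.reindex(index=['i3', 'i1', 'i2'], columns=['c2', 'c1', 'c3'])

strict = df.equals(shuffled)                    # order-sensitive
relaxed = equals_ignoring_order(df, shuffled)   # order-insensitive
```

strict is False and relaxed is True, matching the first two results in the examples above.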
chicken_turtle_util.data_frame.replace_na_with_none(df)

Replace NaN values in pd.DataFrame with None

Parameters:

df : pd.DataFrame

DataFrame whose NaN values to replace

Returns:

pd.DataFrame

df with NaN values replaced by None

Notes

Like DataFrame.fillna, but replaces NaN values with None, which DataFrame.fillna cannot do.

These None values will not be treated as NA by DataFrame, because the dtypes of the affected columns are set to object.
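
One way to get a similar result with pandas alone is to cast to object before substituting None. The sketch below is an assumption about how such a replacement can be written, not the library's implementation; replace_nan_with_none_sketch is a hypothetical name.

```python
import numpy as np
import pandas as pd

def replace_nan_with_none_sketch(df):
    """Return a copy of df with NaN replaced by None (illustrative helper)."""
    # Cast to object first so that None is stored as-is rather than
    # being coerced back to NaN in a float column.
    return df.astype(object).where(df.notna(), None)

frame = pd.DataFrame({'x': [1.0, np.nan], 'n': [1, 2]})
result = replace_nan_with_none_sketch(frame)

missing_cell = result.loc[1, 'x']   # the former NaN
kept_cell = result.loc[0, 'x']      # an untouched value
```

After the call, every column of result has dtype object and missing_cell is None.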

chicken_turtle_util.data_frame.split_array_like(df, columns=None)

Split cells with array_like values along row axis.

Column names are maintained. The index is dropped, but this may change in the future.

Parameters:

df : pd.DataFrame

Data frame. df[columns] should have array_like cell values.

columns : iterable(str) or str or None

Columns (or column) whose values to split. If None, df.columns is used.

Returns:

pd.DataFrame

Data frame with array_like values in df[columns] split across rows, and corresponding values in other columns repeated.

Examples

>>> df = pd.DataFrame([[1,[1,2],[1]],[1,[1,2],[3,4,5]],[2,[1],[1,2]]], columns=('check', 'a', 'b'))
>>> df
   check       a          b
0      1  [1, 2]        [1]
1      1  [1, 2]  [3, 4, 5]
2      2     [1]     [1, 2]
>>> df_.split_array_like(df, ['a', 'b'])
   check  a  b
0      1  1  1
1      1  2  1
2      1  1  3
3      1  1  4
4      1  1  5
5      1  2  3
6      1  2  4
7      1  2  5
8      2  1  1
9      2  1  2
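
The row expansion shown above behaves like a per-row Cartesian product over the split columns. The following sketch reproduces the example output with itertools.product; split_array_like_sketch is a hypothetical helper written for illustration, and the library may well implement the split differently.

```python
from itertools import product

import pandas as pd

def split_array_like_sketch(df, columns):
    """Expand array-like cells in `columns` across rows (illustrative helper)."""
    rows = []
    for _, row in df.iterrows():
        # One output row per combination of elements from the split columns.
        for combo in product(*(row[c] for c in columns)):
            new_row = row.to_dict()
            new_row.update(zip(columns, combo))  # other columns are repeated
            rows.append(new_row)
    # Building a new frame drops the original index, as documented above.
    return pd.DataFrame(rows, columns=df.columns)

df = pd.DataFrame(
    [[1, [1, 2], [1]], [1, [1, 2], [3, 4, 5]], [2, [1], [1, 2]]],
    columns=('check', 'a', 'b'),
)
result = split_array_like_sketch(df, ['a', 'b'])
```

For this input, result has ten rows whose check, a and b columns match the output listed in the example.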