pandas-ply: functional data manipulation for pandas

pandas-ply is a thin layer which makes it easier to manipulate data with pandas. In particular, it provides elegant, functional, chainable syntax in cases where pandas would require mutation, saved intermediate values, or other awkward constructions. In this way, it aims to move pandas closer to the “grammar of data manipulation” provided by the dplyr package for R.

For example, take the dplyr code below:

flights %>%
  group_by(year, month, day) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 & dep > 30)

The most common way to express this in pandas is probably:

grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]

pandas-ply lets you instead write:

(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))

In our opinion, this pandas-ply code is cleaner, more expressive, more readable, more concise, and less error-prone than the original pandas code.

Explanatory notes on the pandas-ply code sample above:

  • pandas-ply’s methods (like ply_select and ply_where above) are attached directly to pandas objects and can be used immediately, without any wrapping or redirection. They start with a ply_ prefix to distinguish them from built-in pandas methods.
  • pandas-ply’s methods are named for (and modelled after) SQL’s operators. (But keep in mind that these operators will not always appear in the same order as they do in a SQL statement: SELECT a FROM b WHERE c GROUP BY d probably maps to b.ply_where(c).groupby(d).ply_select(a).)
  • pandas-ply includes a simple system for building “symbolic expressions” to provide as arguments to its methods. X above is an instance of ply.symbolic.Symbol. Operations on this symbol produce larger compound symbolic expressions. When pandas-ply receives a symbolic expression as an argument, it converts it into a function. So, for instance, X.arr > 30 in the above code could have instead been provided as lambda x: x.arr > 30. Use of symbolic expressions allows the lambda x: to be left off, resulting in less cluttered code.
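
To make the last note concrete, here is a minimal sketch of how a symbolic expression can stand in for a lambda. The Sym class below is hypothetical and is not pandas-ply’s implementation: it supports only attribute access and >, and it is applied to a plain namespace object rather than a dataframe.

```python
from types import SimpleNamespace


class Sym:
    """Hypothetical stand-in for a symbolic expression (not pandas-ply's Symbol)."""

    def __init__(self, fn=lambda x: x):
        # fn maps the eventual input to this expression's value.
        self._fn = fn

    def __getattr__(self, name):
        # Only reached for attributes not found normally: record the lookup.
        return Sym(lambda x: getattr(self._fn(x), name))

    def __gt__(self, other):
        # Record the comparison instead of performing it now.
        return Sym(lambda x: self._fn(x) > other)

    def to_function(self):
        return self._fn


X = Sym()
row = SimpleNamespace(arr=45, dep=10)

symbolic = (X.arr > 30).to_function()   # built from the symbolic expression
explicit = lambda x: x.arr > 30         # the equivalent hand-written lambda

print(symbolic(row), explicit(row))
```

Both functions give the same answer for the same input; the symbolic form just builds the function for you.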

Warning

pandas-ply is new, and in an experimental stage of its development. The API is not yet stable. Expect the unexpected.

(Pull requests are welcome. Feel free to contact us at pandas-ply@coursera.org.)

Using pandas-ply

Install pandas-ply with:

$ pip install pandas-ply

Typical use of pandas-ply starts with:

import pandas as pd
from pandas_ply import install_ply, X, sym_call

install_ply(pd)

After calling install_ply, all pandas objects have pandas-ply’s methods attached.
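
Attaching methods to existing classes is ordinary Python monkey-patching. The sketch below shows the general idea on a toy class. Table, ply_head, and install_methods are hypothetical names chosen for illustration, and pandas-ply’s actual installation code may work differently.

```python
class Table:
    """Toy stand-in for a pandas class."""

    def __init__(self, rows):
        self.rows = rows


def ply_head(self, n=2):
    # An extension method, written as a plain function that takes self.
    return Table(self.rows[:n])


def install_methods(cls, *functions):
    # Attach each function to the class under the function's own name.
    for fn in functions:
        setattr(cls, fn.__name__, fn)


existing = Table([1, 2, 3])         # created before installation
install_methods(Table, ply_head)
print(existing.ply_head().rows)     # earlier instances see the new method too
```

Because the method is set on the class, objects created before the install call also gain it.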

API reference

pandas extensions

This module contains the pandas-ply methods which are designed to be added to pandas objects. The methods in this module should not be used directly. Instead, the function install_ply should be used to attach them to the pandas classes.

class ply.methods.PlyDataFrame

The following methods are added to pandas.DataFrame:

ply_select(*args, **kwargs)

Transform a dataframe by selecting old columns and new (computed) columns.

Analogous to SQL’s SELECT, or dplyr’s select / rename / mutate / transmute.

Parameters:
  • *args

    Each should be one of:

    '*'
    says that all columns in the input dataframe should be included
    'column_name'
    says that column_name in the input dataframe should be included
    '-column_name'
    says that column_name in the input dataframe should be excluded.

    If any ‘-column_name’ is present, then ‘*’ should be present, and if ‘*’ is present, no ‘column_name’ should be present. Column-includes and column-excludes should not overlap.

  • **kwargs – Each argument name will be the name of a new column in the output dataframe, with the column’s contents determined by the argument contents. These contents can be given as a dataframe, a function (taking the input dataframe as its single argument), or a symbolic expression (taking the input dataframe as Symbol(0)). kwarg-provided columns override arg-provided columns.

Example

>>> flights.ply_select('*',
...     gain = X.arr_delay - X.dep_delay,
...     speed = X.distance / X.air_time * 60)
[ original dataframe, with two new computed columns added ]
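
The rules for the positional specifiers of ply_select can be sketched in plain Python as follows. This is an illustration only: resolve_columns is a hypothetical helper, and raising ValueError on an invalid combination is an assumption, since the description above only says which combinations should not occur.

```python
def resolve_columns(all_columns, *args):
    """Return the input columns kept by ply_select-style specifiers."""
    star = "*" in args
    excludes = [a[1:] for a in args if a.startswith("-")]
    includes = [a for a in args if a != "*" and not a.startswith("-")]

    # Assumed behaviour: reject the combinations the rules advise against.
    if excludes and not star:
        raise ValueError("'-column_name' requires '*' to be present")
    if star and includes:
        raise ValueError("'*' cannot be combined with explicit column names")

    if star:
        return [c for c in all_columns if c not in excludes]
    return includes


columns = ["year", "month", "day", "arr_delay"]
print(resolve_columns(columns, "*", "-day"))
print(resolve_columns(columns, "year", "arr_delay"))
```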

ply_where(*conditions)

Filter a dataframe/series to only include rows/entries satisfying a given set of conditions.

Analogous to SQL’s WHERE, or dplyr’s filter.

Parameters: *conditions – Each should be a dataframe/series of booleans, a function returning such an object when run on the input dataframe, or a symbolic expression yielding such an object when evaluated with Symbol(0) mapped to the input dataframe. The input dataframe will be filtered by the AND of all the conditions.

Example

>>> flights.ply_where(X.month == 1, X.day == 1)
[ same result as `flights[(flights.month == 1) & (flights.day == 1)]` ]
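
The way several conditions are combined can be illustrated with a plain-Python analogue that uses lists of dictionaries in place of a dataframe. The where function below is a hypothetical sketch of the semantics described above, not pandas-ply’s code, and it leaves out symbolic expressions.

```python
def where(rows, *conditions):
    """Keep the rows for which every condition holds."""
    mask = [True] * len(rows)
    for cond in conditions:
        # A callable is run on the whole input to produce a list of booleans.
        flags = cond(rows) if callable(cond) else cond
        # AND the new flags into the running mask.
        mask = [m and bool(f) for m, f in zip(mask, flags)]
    return [row for row, keep in zip(rows, mask) if keep]


flights = [
    {"month": 1, "day": 1},
    {"month": 1, "day": 2},
    {"month": 2, "day": 1},
]
jan_first = where(
    flights,
    [f["month"] == 1 for f in flights],          # precomputed booleans
    lambda rows: [r["day"] == 1 for r in rows],  # function of the input
)
print(jan_first)
```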

class ply.methods.PlyDataFrameGroupBy

The following methods are added to pandas.core.groupby.DataFrameGroupBy:

ply_select(**kwargs)

Summarize a grouped dataframe or series.

Analogous to SQL’s SELECT (when a GROUP BY is present), or dplyr’s summarise.

Parameters: **kwargs – Each argument name will be the name of a new column in the output dataframe, with the column’s contents determined by the argument contents. These contents can be given as a dataframe, a function (taking the input grouped dataframe as its single argument), or a symbolic expression (taking the input grouped dataframe as Symbol(0)).

class ply.methods.PlySeries

The following methods are added to pandas.Series:

ply_where(*conditions)

Filter a dataframe/series to only include rows/entries satisfying a given set of conditions.

Analogous to SQL’s WHERE, or dplyr’s filter.

Parameters: *conditions – Each should be a dataframe/series of booleans, a function returning such an object when run on the input dataframe, or a symbolic expression yielding such an object when evaluated with Symbol(0) mapped to the input dataframe. The input dataframe will be filtered by the AND of all the conditions.

Example

>>> flights.ply_where(X.month == 1, X.day == 1)
[ same result as `flights[(flights.month == 1) & (flights.day == 1)]` ]

class ply.methods.PlySeriesGroupBy

The following methods are added to pandas.core.groupby.SeriesGroupBy:

ply_select(**kwargs)

Summarize a grouped dataframe or series.

Analogous to SQL’s SELECT (when a GROUP BY is present), or dplyr’s summarise.

Parameters: **kwargs – Each argument name will be the name of a new column in the output dataframe, with the column’s contents determined by the argument contents. These contents can be given as a dataframe, a function (taking the input grouped dataframe as its single argument), or a symbolic expression (taking the input grouped dataframe as Symbol(0)).

ply.methods.install_ply(pandas_to_use)

Install pandas-ply’s methods onto the classes of the given pandas module (pandas_to_use).

ply.symbolic

ply.symbolic is a simple system for building “symbolic expressions” to provide as arguments to pandas-ply’s methods (in place of lambda expressions).

class ply.symbolic.Call(func, args=[], kwargs={})

Bases: ply.symbolic.Expression

Call(func, args, kwargs) is a symbolic expression representing the result of func(*args, **kwargs). (func, each member of the args iterable, and each value in the kwargs dictionary can themselves be symbolic.)

_eval(context, **options)

class ply.symbolic.Expression

Bases: object

Expression is the (abstract) base class for symbolic expressions. Symbolic expressions are encoded representations of Python expressions, kept on ice until you are ready to evaluate them. Operations on symbolic expressions (like my_expr.some_attr or my_expr(some_arg) or my_expr + 7) are automatically turned into symbolic representations thereof – nothing is actually done until the special evaluation method _eval is called.

__call__(*args, **kwargs)

Construct a symbolic representation of self(*args, **kwargs).

__getattr__(name)

Construct a symbolic representation of getattr(self, name).

_eval(context, **options)

Evaluate a symbolic expression.

Parameters:
  • context – The context object for evaluation. Currently, this is a dictionary mapping symbol names to values.
  • **options – Options for evaluation. Currently, the only option is log, which results in some debug output during evaluation if it is set to True.
Returns:

anything
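
As a rough illustration of this deferred-evaluation design, the sketch below rebuilds a small expression tree and evaluates it against a context dictionary. It reuses the class names from this reference, but it is a simplified reimplementation written for this example, not the library’s source. The log option is omitted, only + is supported as an operator, and the evaluate helper is a hypothetical name.

```python
class Expression:
    """Base class: operations build bigger expressions instead of running."""

    def __getattr__(self, name):
        # Reached only for attributes not found normally.
        return GetAttr(self, name)

    def __call__(self, *args, **kwargs):
        return Call(self, args, kwargs)

    def __add__(self, other):
        return Call(lambda a, b: a + b, (self, other))


def evaluate(obj, context):
    # Evaluate symbolic expressions; pass anything else through unchanged.
    return obj._eval(context) if isinstance(obj, Expression) else obj


class Symbol(Expression):
    def __init__(self, name):
        self._name = name

    def _eval(self, context):
        return context[self._name]


class GetAttr(Expression):
    def __init__(self, obj, name):
        self._obj, self._attr = obj, name

    def _eval(self, context):
        return getattr(evaluate(self._obj, context), evaluate(self._attr, context))


class Call(Expression):
    def __init__(self, func, args=(), kwargs=None):
        self._func, self._args, self._kwargs = func, args, kwargs or {}

    def _eval(self, context):
        func = evaluate(self._func, context)
        args = [evaluate(a, context) for a in self._args]
        kwargs = {k: evaluate(v, context) for k, v in self._kwargs.items()}
        return func(*args, **kwargs)


# Nothing runs while the expression is being built...
expr = Symbol("s").upper() + "!"
# ...until _eval supplies a value for the symbol.
print(expr._eval({"s": "hi"}))
```

Storing state in underscore-prefixed attributes keeps ordinary attribute names free to be captured symbolically by __getattr__.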

class ply.symbolic.GetAttr(obj, name)

Bases: ply.symbolic.Expression

GetAttr(obj, name) is a symbolic expression representing the result of getattr(obj, name). (obj and name can themselves be symbolic.)

_eval(context, **options)

class ply.symbolic.Symbol(name)

Bases: ply.symbolic.Expression

Symbol(name) is an atomic symbolic expression, labelled with an arbitrary name.

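_eval(context, **options)

ply.symbolic.X

A Symbol standing for the first positional argument, that is, Symbol(0), provided for convenience. In pandas-ply’s methods it refers to the input dataframe (or grouped dataframe).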

ply.symbolic._get_sym_magic_method(name)

ply.symbolic.eval_if_symbolic(obj, context, **options)

Evaluate an object if it is a symbolic expression; otherwise, return it unchanged.

Parameters:
  • obj – Either a symbolic expression, or anything else (in which case this is a noop).
  • context – Passed as an argument to obj._eval if obj is symbolic.
  • **options – Passed as arguments to obj._eval if obj is symbolic.
Returns:

anything

Examples

>>> eval_if_symbolic(Symbol('x'), {'x': 10})
10
>>> eval_if_symbolic(7, {'x': 10})
7

ply.symbolic.sym_call(func, *args, **kwargs)

Construct a symbolic representation of func(*args, **kwargs).

This is necessary because func(symbolic) will not (ordinarily) know to construct a symbolic expression when it receives the symbolic expression symbolic as a parameter (if func is not itself symbolic). So instead, we write sym_call(func, symbolic).

Tip: If the main argument of the function is a (symbolic) DataFrame, then pandas’ pipe method takes care of this problem without sym_call. For instance, while np.sqrt(X) won’t work, X.pipe(np.sqrt) will.

Parameters:
  • func – Function to call on evaluation (can be symbolic).
  • *args – Arguments to provide to func on evaluation (can be symbolic).
  • **kwargs – Keyword arguments to provide to func on evaluation (can be symbolic).
Returns:

ply.symbolic.Expression

Example

>>> sym_call(math.sqrt, Symbol('x'))._eval({'x': 16})
4.0
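
The problem that sym_call addresses can be reproduced without pandas-ply. In the sketch below, Deferred, symbol, and deferred_call are hypothetical stand-ins for a symbolic expression, Symbol, and sym_call respectively. Keyword arguments are left out for brevity.

```python
import math


class Deferred:
    """Hypothetical deferred value: wraps a function of a context dict."""

    def __init__(self, fn):
        self._fn = fn

    def _eval(self, context):
        return self._fn(context)


def symbol(name):
    return Deferred(lambda context: context[name])


def deferred_call(func, *args):
    # Evaluate deferred arguments first, then apply func to the results.
    def run(context):
        values = [a._eval(context) if isinstance(a, Deferred) else a for a in args]
        return func(*values)
    return Deferred(run)


x = symbol("x")

try:
    math.sqrt(x)   # math.sqrt wants a real number immediately
except TypeError as err:
    print("direct call fails:", type(err).__name__)

print(deferred_call(math.sqrt, x)._eval({"x": 16}))
```

An ordinary function has no way of knowing that its argument is a placeholder, so the call itself has to be wrapped and postponed.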

ply.symbolic.to_callable(obj)

Turn an object into a callable.

Parameters: obj – This can be

  • a symbolic expression, in which case the output callable evaluates the expression with symbols taking values from the callable’s arguments (listed arguments named according to their numerical index, keyword arguments named according to their string keys),
  • a callable, in which case the output callable is just the input object, or
  • anything else, in which case the output callable is a constant function which always returns the input object.

Returns: callable

Examples

>>> to_callable(Symbol(0) + Symbol('x'))(3, x=4)
7
>>> to_callable(lambda x: x + 1)(10)
11
>>> to_callable(12)(3, x=4)
12
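
The three cases above can be sketched as follows. This is a simplified reimplementation for illustration, not the library’s code: Sym and symbol are hypothetical stand-ins for a symbolic expression and Symbol, and only + is supported.

```python
class Sym:
    """Hypothetical minimal symbolic expression supporting only `+`."""

    def __init__(self, fn):
        self._fn = fn  # maps a context dict to a value

    def __add__(self, other):
        def run(ctx):
            right = other._fn(ctx) if isinstance(other, Sym) else other
            return self._fn(ctx) + right
        return Sym(run)


def symbol(name):
    return Sym(lambda ctx: ctx[name])


def to_callable(obj):
    # Check for symbolic expressions first: in a fuller implementation they
    # may themselves be callable, so the order of these tests matters.
    if isinstance(obj, Sym):
        def evaluate(*args, **kwargs):
            # Positional arguments are keyed by index, keyword arguments by name.
            context = dict(enumerate(args))
            context.update(kwargs)
            return obj._fn(context)
        return evaluate
    if callable(obj):
        return obj
    # Anything else becomes a constant function.
    return lambda *args, **kwargs: obj


print(to_callable(symbol(0) + symbol("x"))(3, x=4))
print(to_callable(lambda x: x + 1)(10))
print(to_callable(12)(3, x=4))
```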