Welcome to dplython’s documentation!

Welcome to Dplython: Dplyr for Python.

Dplyr is a library for the language R designed to make data analysis fast and easy. The philosophy of Dplyr is to constrain data manipulation to a few simple functions that correspond to the most common tasks. This brings the code you write closer to the way you think about an analysis, helping you analyze data at the “speed of thought”.

The goal of this project is to implement the functionality of the R package Dplyr on top of Python’s pandas.

This is version 0.0.4. It’s experimental and subject to change.

Installation

To install, use pip:

pip install dplython

To get the latest development version, you can clone this repo or use the command:

pip install git+https://github.com/dodger487/dplython.git

Example usage

import pandas
from dplython import (DplyFrame, X, diamonds, select, sift,
  sample_n, sample_frac, head, arrange, mutate, group_by,
  summarize, DelayFunction)

# The example `diamonds` DataFrame is included in this package, but
# you can cast a DataFrame to a DplyFrame in this simple way:
# diamonds = DplyFrame(pandas.read_csv('./diamonds.csv'))

# Select specific columns of the DataFrame using select, and
#   get the first few using head
diamonds >> select(X.carat, X.cut, X.price) >> head(5)
"""
Out:
   carat        cut  price
0   0.23      Ideal    326
1   0.21    Premium    326
2   0.23       Good    327
3   0.29    Premium    334
4   0.31       Good    335
"""

# Filter rows using sift (rows that meet the condition are kept)
diamonds >> sift(X.carat > 4) >> select(X.carat, X.cut,
                                        X.depth, X.price)
"""
Out:
       carat      cut  depth  price
25998   4.01  Premium   61.0  15223
25999   4.01  Premium   62.5  15223
27130   4.13     Fair   64.8  17329
27415   5.01     Fair   65.5  18018
27630   4.50     Fair   65.8  18531
"""

# Sample with sample_n or sample_frac, sort with arrange
(diamonds >>
  sample_n(10) >>
  arrange(X.carat) >>
  select(X.carat, X.cut, X.depth, X.price))
"""
Out:
       carat        cut  depth  price
37277   0.23  Very Good   61.5    484
17728   0.30  Very Good   58.8    614
33255   0.32      Ideal   61.1    825
38911   0.33      Ideal   61.6   1052
31491   0.34    Premium   60.3    765
37227   0.40    Premium   61.9    975
2578    0.81    Premium   60.8   3213
15888   1.01       Fair   64.6   6353
26594   1.74      Ideal   62.9  16316
25727   2.38    Premium   62.4  14648
"""

# You can:
#   add columns with mutate (referencing other columns!)
#   group rows into dplyr-style groups with group_by
#   collapse rows into single rows using summarize
(diamonds >>
  mutate(carat_bin=X.carat.round()) >>
  group_by(X.cut, X.carat_bin) >>
  summarize(avg_price=X.price.mean()))
"""
Out:
       avg_price  carat_bin        cut
0     863.908535          0      Ideal
1    4213.864948          1      Ideal
2   12838.984078          2      Ideal
...
27  13466.823529          3       Fair
28  15842.666667          4       Fair
29  18018.000000          5       Fair
"""

# If you have column names that don't work as attributes, you can use an
# alternate "get item" notation with X.
diamonds["column w/ spaces"] = range(len(diamonds))
diamonds >> select(X["column w/ spaces"]) >> head()
"""
Out:
   column w/ spaces
0                 0
1                 1
2                 2
3                 3
4                 4
5                 5
6                 6
7                 7
8                 8
9                 9
"""

# It's possible to pass the entire dataframe using X._
diamonds >> sample_n(6) >> select(X.carat, X.price) >> X._.T
"""
Out:
         18966    19729   9445   49951    3087    33128
carat     1.16     1.52     0.9    0.3     0.74    0.31
price  7803.00  8299.00  4593.0  540.0  3315.00  816.00
"""

# To pass the DataFrame or columns into functions, apply @DelayFunction
@DelayFunction
def PairwiseGreater(series1, series2):
  # Element-wise maximum of two columns, keeping the original row index.
  return pandas.Series([max(s1, s2) for s1, s2 in zip(series1, series2)],
                       index=series1.index)

diamonds >> PairwiseGreater(X.x, X.y)
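To picture what a decorator like DelayFunction has to do, consider the following minimal sketch in plain Python. Instead of running the function immediately, the decorated call returns an object that waits for a table, resolves the column references against it, and only then calls the original function. A dict of lists stands in for a DataFrame, the values are made up, and `ColumnRef`, `delay_function` and `pairwise_greater` are hypothetical names for illustration; this is not dplython's implementation.

```python
import functools


class ColumnRef:
    """Hypothetical placeholder for a column that is looked up later."""

    def __init__(self, name):
        self.name = name

    def resolve(self, table):
        return table[self.name]


def delay_function(func):
    """Wrap func so that calling it returns a 'run me later' callable."""

    @functools.wraps(func)
    def build(*args):
        def run(table):
            # Swap each placeholder for the real column, then call func.
            resolved = [a.resolve(table) if isinstance(a, ColumnRef) else a
                        for a in args]
            return func(*resolved)
        return run
    return build


@delay_function
def pairwise_greater(col1, col2):
    return [max(a, b) for a, b in zip(col1, col2)]


# Made-up values; a dict of lists stands in for a DataFrame.
table = {"x": [3.95, 4.05, 4.34], "y": [3.98, 3.84, 4.35]}
delayed = pairwise_greater(ColumnRef("x"), ColumnRef("y"))  # nothing runs yet
result = delayed(table)                                     # runs now
print(result)  # [3.98, 4.05, 4.35]
```

The key point is that the call with column placeholders only builds a recipe; the work happens once a table is supplied.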

API reference

Dplyr-style operations on top of pandas DataFrame.

class dplython.dplython.DplyFrame(*args, **kwargs)[source]

A subclass of the pandas DataFrame with methods for function piping.

This class implements two main features on top of the pandas DataFrame. The first is dplyr-style groups: in contrast to SQL-style or pandas-style groups, rows are not collapsed and replaced with a function value. The second is piping: >> is overloaded on the DataFrame so that the function on the right-hand side of the operator is called on the object. For example,

>>> df >> select(X.carat)

will call a function (created from the “select” call) on df.
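The piping mechanism can be illustrated without pandas. Python evaluates `a >> b` by calling `a.__rshift__(b)`, so a class that defines `__rshift__` can hand itself to whatever appears on the right. The sketch below assumes a toy `ToyFrame` class (a list of row dicts) and two hypothetical helpers, `toy_select` and `toy_head`; it shows the general technique, not dplython's actual code.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: a list of row dicts."""

    def __init__(self, rows):
        self.rows = rows

    def __rshift__(self, func):
        # `frame >> func` calls func(frame) and returns its result,
        # so several steps can be chained left to right.
        return func(self)


def toy_select(*names):
    """Return a function that keeps only the named columns, in order."""
    def apply(frame):
        return ToyFrame([{n: row[n] for n in names} for row in frame.rows])
    return apply


def toy_head(n=5):
    """Return a function that keeps the first n rows."""
    def apply(frame):
        return ToyFrame(frame.rows[:n])
    return apply


frame = ToyFrame([
    {"carat": 0.23, "cut": "Ideal", "price": 326},
    {"carat": 0.21, "cut": "Premium", "price": 326},
    {"carat": 0.23, "cut": "Good", "price": 327},
])
out = frame >> toy_select("price", "carat") >> toy_head(2)
print(out.rows)
# [{'price': 326, 'carat': 0.23}, {'price': 326, 'carat': 0.21}]
```

Because each step returns another `ToyFrame`, the result of one `>>` can be piped straight into the next.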

Currently, the right-hand side of >> must be one of the following:

  • A “Later”
  • The “ungroup” function call
  • A function that returns a pandas DataFrame or DplyFrame.

class dplython.dplython.Later(name)[source]

Object which represents a computation to be carried out later.

The Later object allows us to save computation that cannot currently be executed. It will later receive a DataFrame as an input, and all computation will be carried out upon this DataFrame object.

Thus, we can refer to columns of the DataFrame as inputs to functions without having the DataFrame currently available:

>>> diamonds >> sift(X.carat > 4) >> select(X.carat, X.price)
Out:
       carat  price
25998   4.01  15223
25999   4.01  15223
27130   4.13  17329
27415   5.01  18018
27630   4.50  18531
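The idea of saving a computation until a DataFrame is available can be sketched in a few lines of plain Python. In this sketch, `ToyLater` and `column` are hypothetical names and a dict of lists stands in for a DataFrame. The comparison operator does not compare anything right away; it builds a new delayed expression that is evaluated once a table is supplied. This illustrates the technique only and is not the library's implementation.

```python
class ToyLater:
    """Hypothetical delayed expression: stores a function of the table."""

    def __init__(self, func):
        self._func = func

    def evaluate(self, table):
        return self._func(table)

    def __gt__(self, other):
        # Build a new delayed expression instead of comparing right away.
        return ToyLater(lambda table: [v > other for v in self.evaluate(table)])


def column(name):
    """Delayed lookup of one column in a dict-of-lists table."""
    return ToyLater(lambda table: table[name])


condition = column("carat") > 4      # no table exists yet
table = {"carat": [0.23, 4.01, 5.01], "price": [326, 15223, 18018]}
mask = condition.evaluate(table)     # the comparison runs here
print(mask)  # [False, True, True]
```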

The special Later name "_" refers to the entire DataFrame. For example,

>>> diamonds >> sample_n(6) >> select(X.carat, X.price) >> X._.T
Out:
         18966    19729   9445   49951    3087    33128
carat     1.16     1.52     0.9    0.3     0.74    0.31
price  7803.00  8299.00  4593.0  540.0  3315.00  816.00

class dplython.dplython.Manager[source]

Object which helps create a delayed computational unit.

It is typically bound to a global variable X. X.foo then refers to the "foo" column of the DataFrame to which it is later applied.

Manager can be used in two ways:

  1. attribute notation: X.foo
  2. item notation: X["foo"]

Attribute notation is preferred, but item notation can be used when column names contain characters that are not valid in Python identifiers, such as spaces and periods.
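Both notations can be supported by one object through Python's `__getattr__` and `__getitem__` hooks. The sketch below uses a hypothetical `ToyManager` class and a dict of lists in place of a DataFrame; it is meant to show why the two notations are interchangeable, under the assumption that both simply produce a delayed column lookup.

```python
class ToyManager:
    """Hypothetical factory for delayed column lookups."""

    def __getattr__(self, name):
        # Called for X.foo; only works when `foo` is a valid identifier.
        return lambda table: table[name]

    def __getitem__(self, name):
        # Called for X["any string"], including names with spaces.
        return lambda table: table[name]


X = ToyManager()
table = {"carat": [0.23, 0.21], "column w/ spaces": [0, 1]}

by_attribute = X.carat
by_item = X["column w/ spaces"]
print(by_attribute(table))  # [0.23, 0.21]
print(by_item(table))       # [0, 1]
print(X["carat"](table) == X.carat(table))  # True
```

A name such as `column w/ spaces` cannot be written after a dot, which is why the item form is needed for it.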

dplython.dplython.arrange(*args, **kwargs)[source]

Sort DataFrame by the input column arguments.

>>> diamonds >> sample_n(5) >> arrange(X.price) >> select(X.depth, X.price)
Out:
       depth  price
28547   61.0    675
35132   59.1    889
42526   61.3   1323
3468    61.6   3392
23829   62.0  11903

dplython.dplython.head(*args, **kwargs)[source]

Returns the first n rows.

dplython.dplython.mutate(*args, **kwargs)[source]

Adds a column to the DataFrame.

This can use existing columns of the DataFrame as input.

>>> (diamonds >> 
        mutate(carat_bin=X.carat.round()) >> 
        group_by(X.cut, X.carat_bin) >> 
        summarize(avg_price=X.price.mean()))
Out:
       avg_price  carat_bin        cut
0     863.908535          0      Ideal
1    4213.864948          1      Ideal
2   12838.984078          2      Ideal
...
27  13466.823529          3       Fair
28  15842.666667          4       Fair
29  18018.000000          5       Fair

dplython.dplython.sample(*args, **kwargs)[source]

Convenience method that calls into the pandas DataFrame’s sample method.

dplython.dplython.sample_frac(*args, **kwargs)[source]

Randomly sample a fraction frac of the DataFrame’s rows.

dplython.dplython.sample_n(*args, **kwargs)[source]

Randomly sample n rows from the DataFrame.
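The two sampling functions differ only in how the sample size is given: as a row count or as a fraction of the rows. The sketch below shows that relationship with the standard library's `random` module on a plain list. The names `toy_sample_n` and `toy_sample_frac`, the `seed` parameter, and the rounding rule are assumptions made for illustration; pandas may round differently.

```python
import random


def toy_sample_n(rows, n, seed=None):
    """Pick n distinct rows at random (without replacement)."""
    return random.Random(seed).sample(rows, n)


def toy_sample_frac(rows, frac, seed=None):
    """Pick a fraction of the rows; the count is rounded to an integer."""
    return toy_sample_n(rows, round(frac * len(rows)), seed)


rows = list(range(20))                   # 20 toy "rows"
print(len(toy_sample_n(rows, 6)))        # 6
print(len(toy_sample_frac(rows, 0.25)))  # 5
```

Passing a fixed `seed` makes the sample repeatable, which is handy when an example or test needs stable output.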

dplython.dplython.select(*args, **kwargs)[source]

Select specific columns from DataFrame.

The output is a DplyFrame. Columns appear in the order in which they are passed to select.

>>> diamonds >> select(X.color, X.carat) >> head(3)
Out:
  color  carat
0     E   0.23
1     E   0.21
2     E   0.23

dplython.dplython.sift(*args, **kwargs)[source]

Keeps the rows of the data that meet the input criteria.

Giving multiple arguments to sift is equivalent to a logical “and”.

>>> df >> sift(X.carat > 4, X.cut == "Premium")
# Out:
# carat      cut color clarity  depth  table  price      x  ...
#  4.01  Premium     I      I1   61.0     61  15223  10.14
#  4.01  Premium     J      I1   62.5     62  15223  10.02

As in pandas, combine conditions with the bitwise operators | and &, wrapping each condition in parentheses:

>>> df >> sift((X.carat > 4) | (X.cut == "Ideal")) >> head(2)
# Out:  carat    cut color clarity  depth ...
#        0.23  Ideal     E     SI2   61.5     
#        0.23  Ideal     J     VS1   62.8
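The reason for bitwise operators is that Python does not let a class redefine `and` and `or`, but it does allow `&` and `|` to be customized through `__and__` and `__or__`. Because `&` and `|` bind more tightly than comparisons such as `>` and `==`, each condition needs its own parentheses. The sketch below uses a hypothetical `Mask` class and `toy_sift` function on a list of row dicts to show element-wise combination, and that passing several conditions amounts to and-ing them. It illustrates the technique and is not dplython's code.

```python
class Mask:
    """Hypothetical element-wise boolean column."""

    def __init__(self, values):
        self.values = list(values)

    def __and__(self, other):
        return Mask(a and b for a, b in zip(self.values, other.values))

    def __or__(self, other):
        return Mask(a or b for a, b in zip(self.values, other.values))


def toy_sift(rows, *masks):
    """Keep the rows for which every mask is True (a logical "and")."""
    return [row for i, row in enumerate(rows)
            if all(m.values[i] for m in masks)]


rows = [
    {"carat": 4.01, "cut": "Premium"},
    {"carat": 4.13, "cut": "Fair"},
    {"carat": 0.23, "cut": "Ideal"},
]
big = Mask(r["carat"] > 4 for r in rows)
premium = Mask(r["cut"] == "Premium" for r in rows)
ideal = Mask(r["cut"] == "Ideal" for r in rows)

print(toy_sift(rows, big, premium) == toy_sift(rows, big & premium))  # True
print(len(toy_sift(rows, big | ideal)))  # 3
```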