Welcome to dplython’s documentation!
Welcome to Dplython: Dplyr for Python.
Dplyr is a library for the language R designed to make data analysis fast and easy. The philosophy of Dplyr is to constrain data manipulation to a few simple functions that correspond to the most common tasks. This brings the way you think about a task closer to the code you write, helping you analyze data at the “speed of thought”.
The goal of this project is to implement the functionality of the R package Dplyr on top of Python’s pandas.
This is version 0.0.4. It’s experimental and subject to change.
Installation
To install, use pip:
pip install dplython
To get the latest development version, you can clone this repo or use the command:
pip install git+https://github.com/dodger487/dplython.git
Example usage
import pandas
from dplython import (DplyFrame, X, diamonds, select, sift,
sample_n, sample_frac, head, arrange, mutate, group_by,
summarize, DelayFunction)
# The example `diamonds` DataFrame is included in this package, but
# you can cast a DataFrame to a DplyFrame in this simple way:
# diamonds = DplyFrame(pandas.read_csv('./diamonds.csv'))
# Select specific columns of the DataFrame using select, and
# get the first few using head
diamonds >> select(X.carat, X.cut, X.price) >> head(5)
"""
Out:
carat cut price
0 0.23 Ideal 326
1 0.21 Premium 326
2 0.23 Good 327
3 0.29 Premium 334
4 0.31 Good 335
"""
# Filter out rows using sift
diamonds >> sift(X.carat > 4) >> select(X.carat, X.cut,
X.depth, X.price)
"""
Out:
carat cut depth price
25998 4.01 Premium 61.0 15223
25999 4.01 Premium 62.5 15223
27130 4.13 Fair 64.8 17329
27415 5.01 Fair 65.5 18018
27630 4.50 Fair 65.8 18531
"""
# Sample with sample_n or sample_frac, sort with arrange
(diamonds >>
sample_n(10) >>
arrange(X.carat) >>
select(X.carat, X.cut, X.depth, X.price))
"""
Out:
carat cut depth price
37277 0.23 Very Good 61.5 484
17728 0.30 Very Good 58.8 614
33255 0.32 Ideal 61.1 825
38911 0.33 Ideal 61.6 1052
31491 0.34 Premium 60.3 765
37227 0.40 Premium 61.9 975
2578 0.81 Premium 60.8 3213
15888 1.01 Fair 64.6 6353
26594 1.74 Ideal 62.9 16316
25727 2.38 Premium 62.4 14648
"""
# You can:
# add columns with mutate (referencing other columns!)
# group rows into dplyr-style groups with group_by
# collapse rows into single rows using summarize
(diamonds >>
mutate(carat_bin=X.carat.round()) >>
group_by(X.cut, X.carat_bin) >>
summarize(avg_price=X.price.mean()))
"""
Out:
avg_price carat_bin cut
0 863.908535 0 Ideal
1 4213.864948 1 Ideal
2 12838.984078 2 Ideal
...
27 13466.823529 3 Fair
28 15842.666667 4 Fair
29 18018.000000 5 Fair
"""
# If you have column names that don't work as attributes, you can use an
# alternate "get item" notation with X.
diamonds["column w/ spaces"] = range(len(diamonds))
diamonds >> select(X["column w/ spaces"]) >> head()
"""
Out:
column w/ spaces
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
"""
# It's possible to pass the entire dataframe using X._
diamonds >> sample_n(6) >> select(X.carat, X.price) >> X._.T
"""
Out:
18966 19729 9445 49951 3087 33128
carat 1.16 1.52 0.9 0.3 0.74 0.31
price 7803.00 8299.00 4593.0 540.0 3315.00 816.00
"""
# To pass the DataFrame or columns into functions, apply @DelayFunction
@DelayFunction
def PairwiseGreater(series1, series2):
    index = series1.index
    newSeries = pandas.Series([max(s1, s2) for s1, s2 in zip(series1, series2)])
    newSeries.index = index
    return newSeries
diamonds >> PairwiseGreater(X.x, X.y)
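The body of PairwiseGreater can be checked on its own, outside a pipeline. The sketch below repeats the same logic as a plain function (no @DelayFunction, no dplython import) and calls it on two small hand-made Series. The name `pairwise_greater` and the sample values are assumptions for illustration only.

```python
import pandas as pd


def pairwise_greater(series1, series2):
    # Same logic as PairwiseGreater above, minus the decorator:
    # take the element-wise maximum and keep the index of the first input.
    new_series = pd.Series([max(s1, s2) for s1, s2 in zip(series1, series2)])
    new_series.index = series1.index
    return new_series


# Hypothetical values standing in for the x and y columns of diamonds.
x = pd.Series([3.95, 4.05, 4.20], index=[10, 11, 12])
y = pd.Series([3.98, 3.84, 4.23], index=[10, 11, 12])

result = pairwise_greater(x, y)
print(result.tolist())  # [3.98, 4.05, 4.23]
```

Resetting the index matters because a Series built from a list gets a fresh 0..n-1 index, which would not line up with the rows it was computed from.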
API reference
Dplyr-style operations on top of pandas DataFrame.
class dplython.dplython.DplyFrame(*args, **kwargs)
A subclass of the pandas DataFrame with methods for function piping.
This class implements two main features on top of the pandas DataFrame. First, dplyr-style groups. In contrast to SQL-style or pandas-style groups, rows are not collapsed and replaced with a function value. Second, >> is overloaded on the DataFrame so that functions on the right-hand side of the operator are called on the object. For example,
>>> df >> select(X.carat)
will call a function (created from the “select” call) on df.
Currently, the right-hand side of >> needs to be one of the following:
- A “Later”
- The “ungroup” function call
- A function that returns a pandas DataFrame or DplyFrame.
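The piping described above relies on Python's `__rshift__` hook. The following is a minimal sketch of that idea on a plain list rather than a DataFrame; `PipeList`, `keep`, and `first` are hypothetical names invented here, and this is not dplython's actual implementation.

```python
class PipeList(list):
    """Toy stand-in for DplyFrame: a list whose >> applies a function to it."""

    def __rshift__(self, func):
        # `data >> func` calls func(data) and re-wraps the result so that
        # further >> steps can be chained.
        return PipeList(func(self))


def keep(predicate):
    """Build a pipeline step that keeps the items matching the predicate."""
    return lambda items: [item for item in items if predicate(item)]


def first(n):
    """Build a pipeline step that keeps the first n items."""
    return lambda items: items[:n]


carats = PipeList([0.23, 4.01, 0.31, 5.01, 4.5])
result = carats >> keep(lambda c: c > 4) >> first(2)
print(result)  # [4.01, 5.01]
```

Each call such as `keep(...)` returns a function, which mirrors the third kind of accepted input listed above: something that takes the data and returns new data.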
class dplython.dplython.Later(name)
Object which represents a computation to be carried out later.
The Later object allows us to save computation that cannot currently be executed. It will later receive a DataFrame as an input, and all computation will be carried out upon this DataFrame object.
Thus, we can refer to columns of the DataFrame as inputs to functions without having the DataFrame currently available:
>>> diamonds >> sift(X.carat > 4) >> select(X.carat, X.price)
Out:
       carat  price
25998   4.01  15223
25999   4.01  15223
27130   4.13  17329
27415   5.01  18018
27630   4.50  18531
The special Later name, "_", will refer to the entire DataFrame. For example,
>>> diamonds >> sample_n(6) >> select(X.carat, X.price) >> X._.T
Out:
         18966    19729    9445  49951     3087   33128
carat     1.16     1.52     0.9    0.3     0.74    0.31
price  7803.00  8299.00  4593.0  540.0  3315.00  816.00
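To make the idea of a saved computation concrete, here is a small sketch of a deferred expression evaluated against a table that is supplied afterwards. It uses a dict of lists instead of a DataFrame, and `Deferred`, `column`, and `applied_to` are hypothetical names; the real Later class may work quite differently.

```python
class Deferred:
    """Toy deferred expression: stores a function to run on a table later."""

    def __init__(self, evaluate):
        self._evaluate = evaluate

    def __gt__(self, threshold):
        # Build a new deferred expression instead of comparing right away.
        return Deferred(
            lambda table: [value > threshold for value in self._evaluate(table)]
        )

    def applied_to(self, table):
        """Run the stored computation on an actual table."""
        return self._evaluate(table)


def column(name):
    """Deferred reference to a column of a table given as a dict of lists."""
    return Deferred(lambda table: table[name])


big = column("carat") > 4            # no table exists yet
table = {"carat": [0.23, 4.01, 5.01]}
print(big.applied_to(table))         # [False, True, True]
```

The comparison `column("carat") > 4` produces another `Deferred` rather than a boolean, which is what lets an expression like `X.carat > 4` be written before any DataFrame is available.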
class dplython.dplython.Manager
Object which helps create a delayed computational unit.
Typically will be set as a global variable X. X.foo will refer to the "foo" column of the DataFrame in which it is later applied.
Manager can be used in two ways:
- attribute notation: X.foo
- item notation: X["foo"]
Attribute notation is preferred, but item notation can be used when a column name contains characters that are not valid in a Python attribute name, such as spaces, periods, and so forth.
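The two notations map naturally onto Python's `__getattr__` and `__getitem__` hooks. The sketch below shows one way such an object could be built; `ToyManager`, `ColumnRef`, and `resolve` are hypothetical names, and a dict of lists stands in for the DataFrame.

```python
class ColumnRef:
    """Toy placeholder naming a column to look up later."""

    def __init__(self, name):
        self.name = name

    def resolve(self, table):
        return table[self.name]


class ToyManager:
    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, e.g. X_toy.carat.
        return ColumnRef(name)

    def __getitem__(self, name):
        # Item notation accepts any string, including ones with spaces.
        return ColumnRef(name)


X_toy = ToyManager()
table = {"carat": [0.23, 0.21], "column w/ spaces": [0, 1]}

print(X_toy.carat.resolve(table))                # [0.23, 0.21]
print(X_toy["column w/ spaces"].resolve(table))  # [0, 1]
```

Both routes end in the same `ColumnRef`, so the choice between them is purely about which column names can be typed as attributes.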
dplython.dplython.arrange(*args, **kwargs)
Sort DataFrame by the input column arguments.
>>> diamonds >> sample_n(5) >> arrange(X.price) >> select(X.depth, X.price)
Out:
       depth  price
28547   61.0    675
35132   59.1    889
42526   61.3   1323
3468    61.6   3392
23829   62.0  11903
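For comparison, sorting by a column in plain pandas is done with `sort_values`. The snippet below is only a pandas-side illustration on a small made-up frame; it does not claim to show how arrange is implemented.

```python
import pandas as pd

# Hypothetical three-row frame standing in for a sample of diamonds.
df = pd.DataFrame({"depth": [61.6, 61.0, 62.0], "price": [3392, 675, 11903]})

# Sort rows by ascending price; the original row labels are kept.
by_price = df.sort_values("price")
print(by_price["price"].tolist())  # [675, 3392, 11903]
print(by_price.index.tolist())     # [1, 0, 2]
```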
dplython.dplython.mutate(*args, **kwargs)
Adds a column to the DataFrame.
This can use existing columns of the DataFrame as input.
>>> (diamonds >> mutate(carat_bin=X.carat.round()) >> group_by(X.cut, X.carat_bin) >> summarize(avg_price=X.price.mean()))
Out:
       avg_price  carat_bin    cut
0     863.908535          0  Ideal
1    4213.864948          1  Ideal
2   12838.984078          2  Ideal
...
27  13466.823529          3   Fair
28  15842.666667          4   Fair
29  18018.000000          5   Fair
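Readers coming from pandas may find it useful to see roughly the same computation written without the pipe. This is a sketch using `assign`, `groupby`, and named aggregation on a tiny made-up frame; row order and column order will not necessarily match dplython's output.

```python
import pandas as pd

# Hypothetical four-row frame standing in for diamonds.
df = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Fair", "Fair"],
    "carat": [0.2, 0.3, 1.1, 0.9],
    "price": [300, 500, 4000, 3000],
})

result = (
    df.assign(carat_bin=df["carat"].round())          # like mutate
      .groupby(["cut", "carat_bin"], as_index=False)  # like group_by
      .agg(avg_price=("price", "mean"))               # like summarize
)
print(result)
```

Here `assign` plays the role of mutate, and the group keys are sorted by pandas, so the "Fair" group comes before "Ideal".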
dplython.dplython.sample(*args, **kwargs)
Convenience method that calls into pandas DataFrame’s sample method.
dplython.dplython.sample_frac(*args, **kwargs)
Randomly sample a fraction, frac, of the rows of the DataFrame.
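Since sample is described as calling into pandas, the underlying `DataFrame.sample` method is worth a quick look. The sketch below shows sampling by count and by fraction in plain pandas; the frame and the `random_state` seed are arbitrary choices for illustration, and how dplython forwards its arguments is not shown here.

```python
import pandas as pd

# Hypothetical ten-row frame.
df = pd.DataFrame({"carat": [round(0.1 * i, 1) for i in range(1, 11)]})

by_count = df.sample(n=3, random_state=0)       # exactly three rows
by_frac = df.sample(frac=0.5, random_state=0)   # half of the rows

print(len(by_count), len(by_frac))  # 3 5
```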
dplython.dplython.select(*args, **kwargs)
Select specific columns from DataFrame.
Output will be DplyFrame type. Order of columns will be the same as input into select.
>>> diamonds >> select(X.color, X.carat) >> head(3)
Out:
  color  carat
0     E   0.23
1     E   0.21
2     E   0.23
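The column-ordering behavior matches what plain pandas does when indexing with a list of names. The snippet below is a pandas-only illustration on a made-up frame, not a view into select itself.

```python
import pandas as pd

# Hypothetical frame whose columns are stored in a different order.
df = pd.DataFrame({
    "carat": [0.23, 0.21],
    "cut": ["Ideal", "Premium"],
    "color": ["E", "E"],
})

# Columns come back in the order requested, not the order stored.
selected = df[["color", "carat"]]
print(list(selected.columns))  # ['color', 'carat']
```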
dplython.dplython.sift(*args, **kwargs)
Filters rows of the data that meet input criteria.
Giving multiple arguments to sift is equivalent to a logical “and”.
>>> df >> sift(X.carat > 4, X.cut == "Premium")
# Out:
# carat cut color clarity depth table price x ...
# 4.01 Premium I I1 61.0 61 15223 10.14
# 4.01 Premium J I1 62.5 62 15223 10.02
As in pandas, use bitwise logical operators like | and &:
>>> df >> sift((X.carat > 4) | (X.cut == "Ideal")) >> head(2)
# Out: carat cut color clarity depth ...
# 0.23 Ideal E SI2 61.5
# 0.23 Ideal J VS1 62.8
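The same filters can be written as boolean masks in plain pandas, which may help when reasoning about what the two sift calls above select. This sketch uses a small made-up frame; the parentheses around each comparison are required because & and | bind more tightly than > and ==.

```python
import pandas as pd

# Hypothetical four-row frame standing in for diamonds.
df = pd.DataFrame({
    "carat": [4.01, 0.23, 4.5, 0.3],
    "cut": ["Premium", "Ideal", "Fair", "Premium"],
})

# Logical "and": both conditions must hold.
both = df[(df["carat"] > 4) & (df["cut"] == "Premium")]

# Logical "or": either condition may hold.
either = df[(df["carat"] > 4) | (df["cut"] == "Ideal")]

print(both.index.tolist())    # [0]
print(either.index.tolist())  # [0, 1, 2]
```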