datascience Tables
-
class prob140.Table(labels=None, _deprecated=None, *, formatter=<datascience.formats.Formatter object>)[source] A sequence of string-labeled columns.
-
class Rows(table)[source] An iterable view over the rows in a table.
-
Table.append(row_or_table)[source] Append a row or all rows of a table. An appended table must have all columns of self.
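The original entry has no example; as an illustrative sketch (not from the library's docstring), appending a single row in place might look like the following, with the printed layout approximate:
>>> t = Table().with_columns(
...     'letter', make_array('a', 'b'),
...     'count', make_array(1, 2))
>>> t.append(['c', 3])
letter | count
a      | 1
b      | 2
c      | 3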
-
Table.append_column(label, values)[source] Appends a column to the table or replaces a column.
__setitem__ is aliased to this method: table.append_column('new_col', make_array(1, 2, 3)) is equivalent to table['new_col'] = make_array(1, 2, 3).
- Args:
label (str): The label of the new column.
values (single value or list/array): If a single value, every value in the new column is values. If a list or array, the new column contains the values in values, which must be the same length as the table.
- Returns:
- Original table with new or replaced column
- Raises:
ValueError: If label is not a string, or if values is a list/array that does not have the same length as the number of rows in the table.
>>> table = Table().with_columns( ... 'letter', make_array('a', 'b', 'c', 'z'), ... 'count', make_array(9, 3, 3, 1), ... 'points', make_array(1, 2, 2, 10)) >>> table letter | count | points a | 9 | 1 b | 3 | 2 c | 3 | 2 z | 1 | 10 >>> table.append_column('new_col1', make_array(10, 20, 30, 40)) >>> table letter | count | points | new_col1 a | 9 | 1 | 10 b | 3 | 2 | 20 c | 3 | 2 | 30 z | 1 | 10 | 40 >>> table.append_column('new_col2', 'hello') >>> table letter | count | points | new_col1 | new_col2 a | 9 | 1 | 10 | hello b | 3 | 2 | 20 | hello c | 3 | 2 | 30 | hello z | 1 | 10 | 40 | hello >>> table.append_column(123, make_array(1, 2, 3, 4)) Traceback (most recent call last): ... ValueError: The column label must be a string, but a int was given >>> table.append_column('bad_col', [1, 2]) Traceback (most recent call last): ... ValueError: Column length mismatch. New column does not have the same number of rows as table.
-
Table.apply(fn, *column_or_columns)[source] Apply
fn to each element or elements of column_or_columns. If no column_or_columns are provided, fn is applied to each row.
- Args:
fn (function) – The function to apply.
column_or_columns: Columns containing the arguments to fn as either column labels (str) or column indices (int). The number of columns must match the number of arguments that fn expects.
- Raises:
ValueError – if column_label is not an existing column in the table.
TypeError – if an insufficient number of column_label arguments is passed to fn.
- Returns:
- An array consisting of the results of applying fn to the elements specified by column_label in each row.
>>> t = Table().with_columns( ... 'letter', make_array('a', 'b', 'c', 'z'), ... 'count', make_array(9, 3, 3, 1), ... 'points', make_array(1, 2, 2, 10)) >>> t letter | count | points a | 9 | 1 b | 3 | 2 c | 3 | 2 z | 1 | 10 >>> t.apply(lambda x: x - 1, 'points') array([0, 1, 1, 9]) >>> t.apply(lambda x, y: x * y, 'count', 'points') array([ 9, 6, 6, 10]) >>> t.apply(lambda x: x - 1, 'count', 'points') Traceback (most recent call last): ... TypeError: <lambda>() takes 1 positional argument but 2 were given >>> t.apply(lambda x: x - 1, 'counts') Traceback (most recent call last): ... ValueError: The column "counts" is not in the table. The table contains these columns: letter, count, points
Whole rows are passed to the function if no columns are specified.
>>> t.apply(lambda row: row[1] * 2) array([18, 6, 6, 2])
-
Table.as_html(max_rows=0)[source] Format table as HTML.
-
Table.as_text(max_rows=0, sep=' | ')[source] Format table as text.
-
Table.bar(column_for_categories=None, select=None, overlay=True, width=6, height=4, **vargs)[source] Plot bar charts for the table.
Each plot is labeled using the values in column_for_categories and one plot is produced for every other column (or for the columns designated by select).
Every selected column except the column for column_for_categories must be numerical.
- Args:
- column_for_categories (str): A column containing x-axis categories
- Kwargs:
- overlay (bool): create a chart with one color per data column;
- if False, each will be displayed separately.
- vargs: Additional arguments that get passed into plt.bar.
- See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.bar for additional arguments that can be passed into vargs.
-
Table.barh(column_for_categories=None, select=None, overlay=True, width=6, **vargs)[source] Plot horizontal bar charts for the table.
- Args:
column_for_categories (str): A column containing y-axis categories used to create buckets for the bar chart.
- Kwargs:
- overlay (bool): create a chart with one color per data column;
- if False, each will be displayed separately.
- vargs: Additional arguments that get passed into plt.barh.
- See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.barh for additional arguments that can be passed into vargs.
- Raises:
- ValueError – Every selected column except the column for column_for_categories must be numerical.
- Returns:
- Horizontal bar graph with buckets specified by column_for_categories. Each plot is labeled using the values in column_for_categories and one plot is produced for every other column (or for the columns designated by select).
>>> furniture_table = Table().with_columns( ... 'Furniture', make_array('chairs', 'tables', 'desks'), ... 'Count', make_array(6, 1, 2), ... 'Price', make_array(10, 20, 30) ... ) >>> furniture_table Furniture | Count | Price chairs | 6 | 10 tables | 1 | 20 desks | 2 | 30 >>> furniture_table.barh('Furniture') <bar graph with furniture as categories and bars for count and price> >>> furniture_table.barh('Furniture', 'Price') <bar graph with furniture as categories and bars for price> >>> furniture_table.barh('Furniture', make_array(1, 2)) <bar graph with furniture as categories and bars for count and price>
-
Table.bin(*columns, **vargs)[source] Group values by bin and compute counts per bin by column.
By default, bins are chosen to contain all values in all columns. The following named arguments from numpy.histogram can be applied to specialize bin widths:
If the original table has n columns, the resulting binned table has n+1 columns, where column 0 contains the lower bound of each bin.
- Args:
columns (str or int): Labels or indices of columns to be binned. If empty, all columns are binned.
bins (int or sequence of scalars): If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines the bin edges, including the rightmost edge, allowing for non-uniform bin widths.
range ((float, float)): The lower and upper range of the bins. If not provided, range contains all values in the table. Values outside the range are ignored.
density (bool): If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.
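As a hedged usage sketch (not from the original docstring), binning a single column with explicit edges might look like the following; the exact column labels and layout of the result may differ:
>>> t = Table().with_column('value', make_array(1, 2, 2, 3, 7))
>>> t.bin('value', bins=make_array(0, 4, 8))
<table with a column of bin lower bounds (0, 4, 8) and a 'value count' column (4, 1, 0)>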
-
Table.boxplot(**vargs)[source] Plots a boxplot for the table.
Every column must be numerical.
- Kwargs:
- vargs: Additional arguments that get passed into plt.boxplot.
- See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.boxplot for additional arguments that can be passed into vargs. These include vert and showmeans.
- Returns:
- None
- Raises:
- ValueError: The Table contains columns with non-numerical values.
>>> table = Table().with_columns( ... 'test1', make_array(92.5, 88, 72, 71, 99, 100, 95, 83, 94, 93), ... 'test2', make_array(89, 84, 74, 66, 92, 99, 88, 81, 95, 94)) >>> table test1 | test2 92.5 | 89 88 | 84 72 | 74 71 | 66 99 | 92 100 | 99 95 | 88 83 | 81 94 | 95 93 | 94 >>> table.boxplot() <boxplot of test1 and boxplot of test2 side-by-side on the same figure>
-
Table.cdf(x) Finds the CDF of the distribution at x
Parameters: x : float
Value in distribution
Returns: float
The value of P(X <= x)
Examples
>>> dist = Table().with_columns('Value',make_array(2, 3, 4),'Probability',make_array(0.25, 0.5, 0.25)) >>> dist.cdf(0) 0 >>> dist.cdf(2) 0.25 >>> dist.cdf(3.5) 0.75 >>> dist.cdf(1000) 1
-
Table.column(index_or_label)[source] Return the values of a column as an array.
table.column(label) is equivalent to table[label].
>>> tiles = Table().with_columns( ... 'letter', make_array('c', 'd'), ... 'count', make_array(2, 4), ... )
>>> tiles.column('letter') array(['c', 'd'], dtype='<U1') >>> tiles.column(1) array([2, 4])
- Args:
- label (int or str): The index or label of a column
- Returns:
- An instance of numpy.array.
- Raises:
ValueError: When the index_or_label is not in the table.
-
Table.column_index(label)[source] Return the index of a column by looking up its label.
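For illustration (not part of the original entry):
>>> tiles = Table().with_columns(
...     'letter', make_array('c', 'd'),
...     'count', make_array(2, 4))
>>> tiles.column_index('count')
1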
-
Table.column_labels Return a tuple of column labels. [Deprecated]
-
Table.copy(*, shallow=False)[source] Return a copy of a table.
-
Table.drop(*column_or_columns)[source] Return a Table with only columns other than selected label or labels.
- Args:
column_or_columns (string or list of strings): The header names or indices of the columns to be dropped. column_or_columns must be an existing header name, or a valid column index.
- Returns:
- An instance of
Table with the given columns removed.
>>> t = Table().with_columns( ... 'burgers', make_array('cheeseburger', 'hamburger', 'veggie burger'), ... 'prices', make_array(6, 5, 5), ... 'calories', make_array(743, 651, 582)) >>> t burgers | prices | calories cheeseburger | 6 | 743 hamburger | 5 | 651 veggie burger | 5 | 582 >>> t.drop('prices') burgers | calories cheeseburger | 743 hamburger | 651 veggie burger | 582 >>> t.drop(['burgers', 'calories']) prices 6 5 5 >>> t.drop('burgers', 'calories') prices 6 5 5 >>> t.drop([0, 2]) prices 6 5 5 >>> t.drop(0, 2) prices 6 5 5 >>> t.drop(1) burgers | calories cheeseburger | 743 hamburger | 651 veggie burger | 582
-
classmethod Table.empty(labels=None)[source] Creates an empty table. Column labels are optional. [Deprecated]
- Args:
labels (None or list): If None, a table with 0 columns is created. If a list, each element is a column label in a table with 0 rows.
- Returns:
- A new instance of
Table.
-
Table.event(x) Shows the probability that the distribution takes on the value x or each value in a list x.
Parameters: x : float or Iterable
An event represented either as a specific value in the domain or a subset of the domain
Returns: Table
Shows the probabilities of each value in the event
Examples
>>> dist = Table().values([1,2,3,4]).probability([1/4,1/4,1/4,1/4]) >>> dist.event(2) Domain | Probability 2 | 0.25
>>> dist.event([2,3]) Domain | Probability 2 | 0.25 3 | 0.25
-
Table.exclude()[source] Return a new Table without a sequence of rows excluded by number.
- Args:
row_indices_or_slice (integer or list of integers or slice): The row index, list of row indices, or slice of row indices to be excluded.
- Returns:
- A new instance of
Table.
>>> t = Table().with_columns( ... 'letter grade', make_array('A+', 'A', 'A-', 'B+', 'B', 'B-'), ... 'gpa', make_array(4, 4, 3.7, 3.3, 3, 2.7)) >>> t letter grade | gpa A+ | 4 A | 4 A- | 3.7 B+ | 3.3 B | 3 B- | 2.7 >>> t.exclude(4) letter grade | gpa A+ | 4 A | 4 A- | 3.7 B+ | 3.3 B- | 2.7 >>> t.exclude(-1) letter grade | gpa A+ | 4 A | 4 A- | 3.7 B+ | 3.3 B | 3 >>> t.exclude(make_array(1, 3, 4)) letter grade | gpa A+ | 4 A- | 3.7 B- | 2.7 >>> t.exclude(range(3)) letter grade | gpa B+ | 3.3 B | 3 B- | 2.7
Note that
exclude also supports NumPy-like indexing and slicing:
>>> t.exclude[:3] letter grade | gpa B+ | 3.3 B | 3 B- | 2.7
>>> t.exclude[1, 3, 4] letter grade | gpa A+ | 4 A- | 3.7 B- | 2.7
-
Table.expected_value() Finds the expected value of the distribution
Returns: float
Expected value
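A small illustrative example (not from the original entry), reusing the distribution constructed in the cdf example above; the result is 2*0.25 + 3*0.5 + 4*0.25:
>>> dist = Table().with_columns('Value', make_array(2, 3, 4), 'Probability', make_array(0.25, 0.5, 0.25))
>>> dist.expected_value()
3.0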
-
classmethod Table.from_array(arr)[source] Convert a structured NumPy array into a Table.
-
classmethod Table.from_columns_dict(columns)[source] Create a table from a mapping of column labels to column values. [Deprecated]
-
classmethod Table.from_df(df)[source] Convert a Pandas DataFrame into a Table.
-
classmethod Table.from_records(records)[source] Create a table from a sequence of records (dicts with fixed keys).
-
classmethod Table.from_rows(rows, labels)[source] Create a table from a sequence of rows (fixed-length sequences). [Deprecated]
-
Table.group(column_or_label, collect=None)[source] Group rows by unique values in a column; count or aggregate others.
- Args:
column_or_label: values to group (column label or index, or array)
collect: a function applied to values in other columns for each group
- Returns:
- A Table with each row corresponding to a unique value in
column_or_label, where the first column contains the unique values from column_or_label, and the second contains counts for each of the unique values. If collect is provided, a Table is returned with all original columns, each containing values calculated by first grouping rows according to column_or_label, then applying collect to each set of grouped values in the other columns.
- Note:
- The grouped column will appear first in the result table. If
collect does not accept arguments with one of the column types, that column will be empty in the resulting table.
>>> marbles = Table().with_columns( ... "Color", make_array("Red", "Green", "Blue", "Red", "Green", "Green"), ... "Shape", make_array("Round", "Rectangular", "Rectangular", "Round", "Rectangular", "Round"), ... "Amount", make_array(4, 6, 12, 7, 9, 2), ... "Price", make_array(1.30, 1.30, 2.00, 1.75, 1.40, 1.00)) >>> marbles Color | Shape | Amount | Price Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.3 Blue | Rectangular | 12 | 2 Red | Round | 7 | 1.75 Green | Rectangular | 9 | 1.4 Green | Round | 2 | 1 >>> marbles.group("Color") # just gives counts Color | count Blue | 1 Green | 3 Red | 2 >>> marbles.group("Color", max) # takes the max of each grouping, in each column Color | Shape max | Amount max | Price max Blue | Rectangular | 12 | 2 Green | Round | 9 | 1.4 Red | Round | 7 | 1.75 >>> marbles.group("Shape", sum) # sum doesn't make sense for strings Shape | Color sum | Amount sum | Price sum Rectangular | | 27 | 4.7 Round | | 13 | 4.05
-
Table.groups(labels, collect=None)[source] Group rows by multiple columns, count or aggregate others.
- Args:
labels: list of column names (or indices) to group on
collect: a function applied to values in other columns for each group
- Returns: A Table with each row corresponding to a unique combination of values in
- the columns specified in
labels, where the first columns are those specified in labels, followed by a column of counts for each of the unique values. If collect is provided, a Table is returned with all original columns, each containing values calculated by first grouping rows according to the values in the labels columns, then applying collect to each set of grouped values in the other columns.
- Note:
- The grouped columns will appear first in the result table. If
collect does not accept arguments with one of the column types, that column will be empty in the resulting table.
>>> marbles = Table().with_columns( ... "Color", make_array("Red", "Green", "Blue", "Red", "Green", "Green"), ... "Shape", make_array("Round", "Rectangular", "Rectangular", "Round", "Rectangular", "Round"), ... "Amount", make_array(4, 6, 12, 7, 9, 2), ... "Price", make_array(1.30, 1.30, 2.00, 1.75, 1.40, 1.00)) >>> marbles Color | Shape | Amount | Price Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.3 Blue | Rectangular | 12 | 2 Red | Round | 7 | 1.75 Green | Rectangular | 9 | 1.4 Green | Round | 2 | 1 >>> marbles.groups(["Color", "Shape"]) Color | Shape | count Blue | Rectangular | 1 Green | Rectangular | 2 Green | Round | 1 Red | Round | 2 >>> marbles.groups(["Color", "Shape"], sum) Color | Shape | Amount sum | Price sum Blue | Rectangular | 12 | 2 Green | Rectangular | 15 | 2.7 Green | Round | 2 | 1 Red | Round | 11 | 3.05
-
Table.hist(*columns, overlay=True, bins=None, bin_column=None, unit=None, counts=None, width=6, height=4, **vargs)[source] Plots one histogram for each column in columns. If no column is specified, plots all columns.
- Kwargs:
- overlay (bool): If True, plots 1 chart with all the histograms
- overlaid on top of each other (instead of the default behavior of one histogram for each column in the table). Also adds a legend that matches each bar color to its column.
- bins (list or int): Lower bound for each bin in the
- histogram or number of bins. If None, bins will be chosen automatically.
- bin_column (column name or index): A column of bin lower bounds.
- All other columns are treated as counts of these bins. If None, each value in each row is assigned a count of 1.
counts (column name or index): Deprecated name for bin_column.
- vargs: Additional arguments that get passed into plt.hist.
- See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist for additional arguments that can be passed into vargs. These include: range, normed, cumulative, and orientation, to name a few.
>>> t = Table().with_columns( ... 'count', make_array(9, 3, 3, 1), ... 'points', make_array(1, 2, 2, 10)) >>> t count | points 9 | 1 3 | 2 3 | 2 1 | 10 >>> t.hist() <histogram of values in count> <histogram of values in points>
>>> t = Table().with_columns( ... 'value', make_array(101, 102, 103), ... 'proportion', make_array(0.25, 0.5, 0.25)) >>> t.hist(bin_column='value') <histogram of values weighted by corresponding proportions>
-
Table.index_by(column_or_label)[source] Return a dict keyed by values in a column that contains lists of rows corresponding to each value.
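For illustration (not from the original entry); the exact Row representation in the returned dict may differ:
>>> t = Table().with_columns(
...     'letter', make_array('a', 'b', 'a'),
...     'count', make_array(9, 3, 3))
>>> t.index_by('letter')
{'a': [Row(letter='a', count=9), Row(letter='a', count=3)], 'b': [Row(letter='b', count=3)]}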
-
Table.join(column_label, other, other_label=None)[source] Creates a new table with the columns of self and other, containing rows for all values of a column that appear in both tables.
- Args:
column_label (str): label of the column in self that is used to join rows of other.
other: Table object to join with self on matching values of column_label.
- Kwargs:
other_label (str): default None, assumes column_label. Otherwise, the label of the column in other used to join rows.
- Returns:
- New table self joined with
other by matching values in column_label and other_label. If the resulting join is empty, returns None. If a join value appears more than once in self, each row with that value will appear in the resulting join, but in other, only the first row with that value will be used.
>>> table = Table().with_columns('a', make_array(9, 3, 3, 1), ... 'b', make_array(1, 2, 2, 10), ... 'c', make_array(3, 4, 5, 6)) >>> table a | b | c 9 | 1 | 3 3 | 2 | 4 3 | 2 | 5 1 | 10 | 6 >>> table2 = Table().with_columns( 'a', make_array(9, 1, 1, 1), ... 'd', make_array(1, 2, 2, 10), ... 'e', make_array(3, 4, 5, 6)) >>> table2 a | d | e 9 | 1 | 3 1 | 2 | 4 1 | 2 | 5 1 | 10 | 6 >>> table.join('a', table2) a | b | c | d | e 1 | 10 | 6 | 2 | 4 9 | 1 | 3 | 1 | 3 >>> table.join('a', table2, 'a') # Equivalent to previous join a | b | c | d | e 1 | 10 | 6 | 2 | 4 9 | 1 | 3 | 1 | 3 >>> table.join('a', table2, 'd') # Repeat column labels relabeled a | b | c | a_2 | e 1 | 10 | 6 | 9 | 3 >>> table2 #table2 has three rows with a = 1 a | d | e 9 | 1 | 3 1 | 2 | 4 1 | 2 | 5 1 | 10 | 6 >>> table #table has only one row with a = 1 a | b | c 9 | 1 | 3 3 | 2 | 4 3 | 2 | 5 1 | 10 | 6 >>> table2.join('a', table) # When we join, we get all three rows in table2 where a = 1 a | d | e | b | c 1 | 2 | 4 | 10 | 6 1 | 2 | 5 | 10 | 6 1 | 10 | 6 | 10 | 6 9 | 1 | 3 | 1 | 3 >>> table.join('a', table2) # Opposite join only keeps first row in table2 with a = 1 a | b | c | d | e 1 | 10 | 6 | 2 | 4 9 | 1 | 3 | 1 | 3
-
Table.labels Return a tuple of column labels.
-
Table.move_to_end(column_label)[source] Move a column to the last in order.
-
Table.move_to_start(column_label)[source] Move a column to the first in order.
-
Table.normalized() Returns the distribution with the probabilities normalized so that they sum to 1
Returns: Table
A distribution with the probabilities normalized
Examples
>>> Table().values([1,2,3]).probability([1,1,1]) Value | Probability 1 | 1 2 | 1 3 | 1 >>> Table().values([1,2,3]).probability([1,1,1]).normalized() Value | Probability 1 | 0.333333 2 | 0.333333 3 | 0.333333
-
Table.num_columns Number of columns.
-
Table.num_rows Number of rows.
-
Table.percentile(p)[source] Return a new table with one row containing the pth percentile for each column.
Assumes that each column only contains one type of value.
Returns a new table with one row and the same column labels. The row contains the pth percentile of the original column, where the pth percentile of a column is the smallest value that is at least as large as p% of the numbers in the column.
>>> table = Table().with_columns( ... 'count', make_array(9, 3, 3, 1), ... 'points', make_array(1, 2, 2, 10)) >>> table count | points 9 | 1 3 | 2 3 | 2 1 | 10 >>> table.percentile(80) count | points 9 | 10
-
Table.pivot(columns, rows, values=None, collect=None, zero=None)[source] Generate a table with a column for each unique value in
columns, with rows for each unique value in rows. Each row counts/aggregates the values that match both row and column based on collect.
- Args:
columns – a single column label or index (str or int), used to create new columns based on its unique values.
rows – row labels or indices (str or int or list), used to create new rows based on their unique values.
values – column label in table for use in aggregation. Default None.
collect – aggregation function, used to group values over row-column combinations. Default None.
zero – zero value for non-existent row-column combinations.
- Raises:
- TypeError – if collect is passed in and values is not, or vice versa.
- Returns:
- New pivot table, with row-column combinations, as specified, with
aggregated
values by collect across the intersection of columns and rows. Simple counts are provided if values and collect are None, as default.
>>> titanic = Table().with_columns('age', make_array(21, 44, 56, 89, 95 ... , 40, 80, 45), 'survival', make_array(0,0,0,1, 1, 1, 0, 1), ... 'gender', make_array('M', 'M', 'M', 'M', 'F', 'F', 'F', 'F'), ... 'prediction', make_array(0, 0, 1, 1, 0, 1, 0, 1)) >>> titanic age | survival | gender | prediction 21 | 0 | M | 0 44 | 0 | M | 0 56 | 0 | M | 1 89 | 1 | M | 1 95 | 1 | F | 0 40 | 1 | F | 1 80 | 0 | F | 0 45 | 1 | F | 1 >>> titanic.pivot('survival', 'gender') gender | 0 | 1 F | 1 | 3 M | 3 | 1 >>> titanic.pivot('prediction', 'gender') gender | 0 | 1 F | 2 | 2 M | 2 | 2 >>> titanic.pivot('survival', 'gender', values='age', collect = np.mean) gender | 0 | 1 F | 80 | 60 M | 40.3333 | 89 >>> titanic.pivot('survival', make_array('prediction', 'gender')) prediction | gender | 0 | 1 0 | F | 1 | 1 0 | M | 2 | 0 1 | F | 0 | 2 1 | M | 1 | 1 >>> titanic.pivot('survival', 'gender', values = 'age') Traceback (most recent call last): ... TypeError: values requires collect to be specified >>> titanic.pivot('survival', 'gender', collect = np.mean) Traceback (most recent call last): ... TypeError: collect requires values to be specified
-
Table.pivot_bin(pivot_columns, value_column, bins=None, **vargs)[source] Form a table with columns formed by the unique tuples in pivot_columns containing counts per bin of the values associated with each tuple in the value_column.
By default, bins are chosen to contain all values in the value_column. The following named arguments from numpy.histogram can be applied to specialize bin widths:
- Args:
bins (int or sequence of scalars): If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines the bin edges, including the rightmost edge, allowing for non-uniform bin widths.
range ((float, float)): The lower and upper range of the bins. If not provided, range contains all values in the table. Values outside the range are ignored.
normed (bool): If False, the result will contain the number of samples in each bin. If True, the result is normalized such that the integral over the range is 1.
-
Table.pivot_hist(pivot_column_label, value_column_label, overlay=True, width=6, height=4, **vargs)[source] Draw histograms of each category in a column.
-
Table.plot(column_for_xticks=None, select=None, overlay=True, width=6, height=4, **vargs)[source] Plot line charts for the table.
- Args:
- column_for_xticks (str/array): A column containing x-axis labels
- Kwargs:
- overlay (bool): create a chart with one color per data column;
- if False, each plot will be displayed separately.
- vargs: Additional arguments that get passed into plt.plot.
- See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot for additional arguments that can be passed into vargs.
- Raises:
- ValueError – Every selected column must be numerical.
- Returns:
- Returns a line plot (connected scatter). Each plot is labeled using the values in column_for_xticks and one plot is produced for all other columns in self (or for the columns designated by select).
>>> table = Table().with_columns( ... 'days', make_array(0, 1, 2, 3, 4, 5), ... 'price', make_array(90.5, 90.00, 83.00, 95.50, 82.00, 82.00), ... 'projection', make_array(90.75, 82.00, 82.50, 82.50, 83.00, 82.50)) >>> table days | price | projection 0 | 90.5 | 90.75 1 | 90 | 82 2 | 83 | 82.5 3 | 95.5 | 82.5 4 | 82 | 83 5 | 82 | 82.5 >>> table.plot('days') <line graph with days as x-axis and lines for price and projection> >>> table.plot('days', overlay=False) <line graph with days as x-axis and line for price> <line graph with days as x-axis and line for projection> >>> table.plot('days', 'price') <line graph with days as x-axis and line for price>
-
Table.prob_event(x) Finds the probability of an event x
Parameters: x : float or Iterable
An event represented either as a specific value in the domain or a subset of the domain
Returns: float
Probability of the event
Examples
>>> dist = Table().values([1,2,3,4]).probability([1/4,1/4,1/4,1/4]) >>> dist.prob_event(2) 0.25
>>> dist.prob_event([2,3]) 0.5
>>> dist.prob_event(np.arange(1,5)) 1.0
-
Table.probability(values) Assigns probabilities to domain values.
Parameters: values : List or Array
Values that must correspond to the domain in the same order
Returns: Table
A probability distribution with those probabilities
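An illustrative example (not from the original entry), following the values()/probability() pattern used elsewhere on this page:
>>> Table().values([1, 2, 3]).probability([0.2, 0.3, 0.5])
Value | Probability
1     | 0.2
2     | 0.3
3     | 0.5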
-
Table.probability_function(pfunc) Assigns probabilities to a distribution via a probability function. The probability function is applied to each value of the domain. The domain values must already be in the first column.
Parameters: pfunc : univariate function
Probability function of the distribution
Returns: Table
Table with those probabilities in its second column
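An illustrative sketch (not from the original entry); the displayed rounding may differ:
>>> Table().values([1, 2, 3]).probability_function(lambda x: x / 6)
Value | Probability
1     | 0.166667
2     | 0.333333
3     | 0.5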
-
classmethod Table.read_table(filepath_or_buffer, *args, **vargs)[source] Read a table from a file or web address.
- filepath_or_buffer – string or file handle / StringIO; The string
- could be a URL. Valid URL schemes include http, ftp, s3, and file.
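A hedged usage sketch; the file name and URL below are hypothetical:
>>> prices = Table.read_table('prices.csv')                     # hypothetical local file
>>> prices = Table.read_table('http://example.com/prices.csv')  # hypothetical URL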
-
Table.relabel(column_label, new_label)[source] Changes the label(s) of column(s) specified by
column_label to labels in new_label.
- Args:
column_label – (single str or array of str) The label(s) of columns to be changed to new_label.
new_label – (single str or array of str): The label name(s) of columns to replace column_label.
- Raises:
ValueError – if column_label is not in the table, or if column_label and new_label are not of equal length.
TypeError – if column_label and/or new_label is not a str.
- Returns:
- Original table with
new_label in place of column_label.
>>> table = Table().with_columns( ... 'points', make_array(1, 2, 3), ... 'id', make_array(12345, 123, 5123)) >>> table.relabel('id', 'yolo') points | yolo 1 | 12345 2 | 123 3 | 5123 >>> table.relabel(make_array('points', 'yolo'), ... make_array('red', 'blue')) red | blue 1 | 12345 2 | 123 3 | 5123 >>> table.relabel(make_array('red', 'green', 'blue'), ... make_array('cyan', 'magenta', 'yellow', 'key')) Traceback (most recent call last): ... ValueError: Invalid arguments. column_label and new_label must be of equal length.
-
Table.relabeled(label, new_label)[source] Return a new table with
label specifying column label(s) replaced by the corresponding new_label.
- Args:
label – (str or array of str) The label(s) of columns to be changed.
new_label – (str or array of str): The new label(s) of columns to be changed. Same number of elements as label.
- Raises:
ValueError – if label does not exist in the table, or if label and new_label are not of equal length. Also raised if label and/or new_label are not str.
- Returns:
- New table with
new_label in place of label.
>>> tiles = Table().with_columns('letter', make_array('c', 'd'), ... 'count', make_array(2, 4)) >>> tiles letter | count c | 2 d | 4 >>> tiles.relabeled('count', 'number') letter | number c | 2 d | 4 >>> tiles # original table unmodified letter | count c | 2 d | 4 >>> tiles.relabeled(make_array('letter', 'count'), ... make_array('column1', 'column2')) column1 | column2 c | 2 d | 4 >>> tiles.relabeled(make_array('letter', 'number'), ... make_array('column1', 'column2')) Traceback (most recent call last): ... ValueError: Invalid labels. Column labels must already exist in table in order to be replaced.
-
Table.remove(row_or_row_indices)[source] Removes a row or multiple rows of a table in place.
-
Table.row(index)[source] Return a row.
-
Table.rows Return a view of all rows.
-
Table.sample(n=1) Randomly samples from the distribution
Parameters: n : int
Number of times to sample from the distribution (default: 1)
Returns: float or array
Samples from the distribution
>>> dist = Table().with_columns('Value',make_array(2, 3, 4),'Probability',make_array(0.25, 0.5, 0.25))
>>> dist.sample()
3
>>> dist.sample()
2
>>> dist.sample(10)
array([3, 2, 2, 4, 3, 4, 3, 4, 3, 3])
-
Table.sample_from_distribution(distribution, k, proportions=False)[source] Return a new table with the same number of rows and a new column. The values in the distribution column define a multinomial. They are replaced by sample counts/proportions in the output.
>>> sizes = Table(['size', 'count']).with_rows([ ... ['small', 50], ... ['medium', 100], ... ['big', 50], ... ]) >>> sizes.sample_from_distribution('count', 1000) size | count | count sample small | 50 | 239 medium | 100 | 496 big | 50 | 265 >>> sizes.sample_from_distribution('count', 1000, True) size | count | count sample small | 50 | 0.24 medium | 100 | 0.51 big | 50 | 0.25
-
Table.scatter(column_for_x, select=None, overlay=True, fit_line=False, colors=None, labels=None, sizes=None, width=5, height=5, s=20, **vargs)[source] Creates scatterplots, optionally adding a line of best fit.
- Args:
column_for_x (str): The column to use for the x-axis values and label of the scatter plots.
- Kwargs:
overlay (bool): If True, creates a chart with one color per data column; if False, each plot will be displayed separately.
fit_line (bool): draw a line of best fit for each set of points.
vargs: Additional arguments that get passed into plt.scatter. See http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter for additional arguments that can be passed into vargs. These include: marker and norm, to name a couple.
colors: A column of categories to be used for coloring dots.
labels: A column of text labels to annotate dots.
sizes: A column of values to set the relative areas of dots.
s: Size of dots. If sizes is also provided, then dots will be in the range 0 to 2 * s.
- Raises:
- ValueError – Every column, column_for_x or select, must be numerical
- Returns:
- Scatter plot of values of column_for_x plotted against values for all other columns in self. Each plot uses the values in column_for_x for horizontal positions. One plot is produced for all other columns in self as y (or for the columns designated by select).
>>> table = Table().with_columns( ... 'x', make_array(9, 3, 3, 1), ... 'y', make_array(1, 2, 2, 10), ... 'z', make_array(3, 4, 5, 6)) >>> table x | y | z 9 | 1 | 3 3 | 2 | 4 3 | 2 | 5 1 | 10 | 6 >>> table.scatter('x') <scatterplot of values in y and z on x>
>>> table.scatter('x', overlay=False) <scatterplot of values in y on x> <scatterplot of values in z on x>
>>> table.scatter('x', fit_line=True) <scatterplot of values in y and z on x with lines of best fit>
-
Table.sd() Finds the standard deviation of the distribution
Returns: float
Standard Deviation
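An illustrative example (not from the original entry), reusing the distribution from the cdf example; the variance is 0.5, so the value shown is sqrt(0.5) and the printed precision may differ:
>>> dist = Table().with_columns('Value', make_array(2, 3, 4), 'Probability', make_array(0.25, 0.5, 0.25))
>>> dist.sd()
0.7071067811865476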
-
Table.select(*column_or_columns)[source] Return a table with only the columns in
column_or_columns.
- Args:
column_or_columns: Columns to select from the Table as either column labels (str) or column indices (int).
- Returns:
- A new instance of Table containing only the selected columns. The columns of the new Table are in the order given in column_or_columns.
- Raises:
KeyError if any of column_or_columns are not in the table.
>>> flowers = Table().with_columns( ... 'Number of petals', make_array(8, 34, 5), ... 'Name', make_array('lotus', 'sunflower', 'rose'), ... 'Weight', make_array(10, 5, 6) ... )
>>> flowers Number of petals | Name | Weight 8 | lotus | 10 34 | sunflower | 5 5 | rose | 6
>>> flowers.select('Number of petals', 'Weight') Number of petals | Weight 8 | 10 34 | 5 5 | 6
>>> flowers # original table unchanged Number of petals | Name | Weight 8 | lotus | 10 34 | sunflower | 5 5 | rose | 6
>>> flowers.select(0, 2) Number of petals | Weight 8 | 10 34 | 5 5 | 6
-
Table.set_format(column_or_columns, formatter)[source] Set the format of a column.
-
Table.show(max_rows=0)[source] Display the table.
-
Table.sort(column_or_label, descending=False, distinct=False)[source] Return a Table of rows sorted according to the values in a column.
- Args:
column_or_label: the column whose values are used for sorting.
descending: if True, sorting will be in descending, rather than ascending, order.
distinct: if True, repeated values in column_or_label will be omitted.
- Returns:
- An instance of
Table containing rows sorted based on the values in column_or_label.
>>> marbles = Table().with_columns( ... "Color", make_array("Red", "Green", "Blue", "Red", "Green", "Green"), ... "Shape", make_array("Round", "Rectangular", "Rectangular", "Round", "Rectangular", "Round"), ... "Amount", make_array(4, 6, 12, 7, 9, 2), ... "Price", make_array(1.30, 1.30, 2.00, 1.75, 1.40, 1.00)) >>> marbles Color | Shape | Amount | Price Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.3 Blue | Rectangular | 12 | 2 Red | Round | 7 | 1.75 Green | Rectangular | 9 | 1.4 Green | Round | 2 | 1 >>> marbles.sort("Amount") Color | Shape | Amount | Price Green | Round | 2 | 1 Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.3 Red | Round | 7 | 1.75 Green | Rectangular | 9 | 1.4 Blue | Rectangular | 12 | 2 >>> marbles.sort("Amount", descending = True) Color | Shape | Amount | Price Blue | Rectangular | 12 | 2 Green | Rectangular | 9 | 1.4 Red | Round | 7 | 1.75 Green | Rectangular | 6 | 1.3 Red | Round | 4 | 1.3 Green | Round | 2 | 1 >>> marbles.sort(3) # the Price column Color | Shape | Amount | Price Green | Round | 2 | 1 Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.3 Green | Rectangular | 9 | 1.4 Red | Round | 7 | 1.75 Blue | Rectangular | 12 | 2 >>> marbles.sort(3, distinct = True) Color | Shape | Amount | Price Green | Round | 2 | 1 Red | Round | 4 | 1.3 Green | Rectangular | 9 | 1.4 Red | Round | 7 | 1.75 Blue | Rectangular | 12 | 2
-
Table.split(k)[source] Return a tuple of two tables where the first table contains
k rows randomly sampled and the second contains the remaining rows.
- Args:
k (int): The number of rows randomly sampled into the first table. k must be between 1 and num_rows - 1.
- Raises:
ValueError: k is not between 1 and num_rows - 1.
- Returns:
- A tuple containing two instances of
Table.
>>> jobs = Table().with_columns( ... 'job', make_array('a', 'b', 'c', 'd'), ... 'wage', make_array(10, 20, 15, 8)) >>> jobs job | wage a | 10 b | 20 c | 15 d | 8 >>> sample, rest = jobs.split(3) >>> sample job | wage c | 15 a | 10 b | 20 >>> rest job | wage d | 8
-
Table.stack(key, labels=None)[source] Takes k original columns and returns two columns, with col. 1 of all column names and col. 2 of all associated data.
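An illustrative sketch (not from the original entry), assuming the stacked columns are labeled 'column' and 'value'; the exact labels and layout may differ:
>>> t = Table().with_columns(
...     'letter', make_array('a', 'b'),
...     'count', make_array(9, 3),
...     'points', make_array(1, 2))
>>> t.stack('letter')
letter | column | value
a      | count  | 9
a      | points | 1
b      | count  | 3
b      | points | 2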
-
Table.stats(ops=(<built-in function min>, <built-in function max>, <function median>, <built-in function sum>))[source] Compute statistics for each column and place them in a table.
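An illustrative sketch (not from the original entry), assuming the first column of the result is labeled 'statistic'; the exact layout may differ:
>>> t = Table().with_columns(
...     'count', make_array(9, 3, 3, 1),
...     'points', make_array(1, 2, 2, 10))
>>> t.stats()
statistic | count | points
min       | 1     | 1
max       | 9     | 10
median    | 3     | 2
sum       | 16    | 15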
-
Table.take()[source] Return a new Table with selected rows taken by index.
- Args:
row_indices_or_slice (integer or array of integers): The row index, list of row indices, or slice of row indices to be selected.
- Returns:
- A new instance of Table with selected rows in the order corresponding to row_indices_or_slice.
- Raises:
IndexError, if any of row_indices_or_slice is out of bounds with respect to the column length.
>>> grades = Table().with_columns('letter grade', ... make_array('A+', 'A', 'A-', 'B+', 'B', 'B-'), ... 'gpa', make_array(4, 4, 3.7, 3.3, 3, 2.7)) >>> grades letter grade | gpa A+ | 4 A | 4 A- | 3.7 B+ | 3.3 B | 3 B- | 2.7 >>> grades.take(0) letter grade | gpa A+ | 4 >>> grades.take(-1) letter grade | gpa B- | 2.7 >>> grades.take(make_array(2, 1, 0)) letter grade | gpa A- | 3.7 A | 4 A+ | 4 >>> grades.take[:3] letter grade | gpa A+ | 4 A | 4 A- | 3.7 >>> grades.take(np.arange(0,3)) letter grade | gpa A+ | 4 A | 4 A- | 3.7 >>> grades.take(10) Traceback (most recent call last): ... IndexError: index 10 is out of bounds for axis 0 with size 6
-
Table.toJoint(table, X_column_label=None, Y_column_label=None, probability_column_label=None, reverse=True) Converts a table of probabilities associated with two variables into a JointDistribution object
Parameters: table : Table
You can either pass in a Table directly or call the toJoint() method of that Table. See the examples.
X_column_label (optional) : String
Label for the first variable. Defaults to the same label as that of the first variable of the Table.
Y_column_label (optional) : String
Label for the second variable. Defaults to the same label as that of the second variable of the Table.
probability_column_label (optional) : String
Label for the probability column
reverse (optional) : Boolean
If True, the vertical values will be reversed
Returns: JointDistribution
A JointDistribution object
Examples
>>> dist1 = Table().values([0,1],[2,3]) >>> dist1['Probability'] = make_array(0.1, 0.2, 0.3, 0.4) >>> dist1.toJoint() X=0 X=1 Y=3 0.2 0.4 Y=2 0.1 0.3 >>> dist2 = Table().values("Coin1",['H','T'], "Coin2", ['H','T']) >>> dist2['Probability'] = np.array([0.4*0.6, 0.6*0.6, 0.4*0.4, 0.6*0.4]) >>> dist2.toJoint() Coin1=H Coin1=T Coin2=T 0.36 0.24 Coin2=H 0.24 0.16
-
Table.to_array()[source] Convert the table to a structured NumPy array.
-
Table.to_csv(filename)[source] Creates a CSV file with the provided filename.
The CSV is created in such a way that if we run
table.to_csv('my_table.csv') we can recreate the same table with Table.read_table('my_table.csv').
- Args:
filename (str): The filename of the output CSV file.
- Returns:
- None, outputs a file with name
filename.
>>> jobs = Table().with_columns( ... 'job', make_array('a', 'b', 'c', 'd'), ... 'wage', make_array(10, 20, 15, 8)) >>> jobs job | wage a | 10 b | 20 c | 15 d | 8 >>> jobs.to_csv('my_table.csv') <outputs a file called my_table.csv in the current directory>
-
Table.to_df()[source] Convert the table to a Pandas DataFrame.
-
Table.transition_function(pfunc) Assigns transition probabilities to a Distribution via a probability function. The probability function is applied to each value of the domain. The domain values must already be in the first column.
Parameters: pfunc : variate function
Conditional probability function of the distribution ( P(Y | X))
Returns: Table
Table with those probabilities in its final column
-
Table.transition_probability(values) For a multivariate probability distribution, assigns transition probabilities: ie P(Y | X).
Parameters: values : List or Array
Values that must correspond to the domain in the same order
Returns: Table
A probability distribution with those transition probabilities
-
Table.variance() Finds the variance of the distribution
Returns: float
Variance
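An illustrative example (not from the original entry), reusing the distribution from the cdf example; E[X] = 3, so the variance is 0.25*(2-3)^2 + 0.5*(3-3)^2 + 0.25*(4-3)^2:
>>> dist = Table().with_columns('Value', make_array(2, 3, 4), 'Probability', make_array(0.25, 0.5, 0.25))
>>> dist.variance()
0.5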
-
Table.where(column_or_label, value_or_predicate=None, other=None)[source] Return a new
Table containing rows where value_or_predicate returns True for values in column_or_label.
- Args:
column_or_label: A column of the Table either as a label (str) or an index (int). Can also be an array of booleans; only the rows where the array value is True are kept.
value_or_predicate: If a function, it is applied to every value in column_or_label. Only the rows where value_or_predicate returns True are kept. If a single value, only the rows where the values in column_or_label are equal to value_or_predicate are kept.
other: Optional additional column label for value_or_predicate to make pairwise comparisons. See the examples below for usage. When other is supplied, value_or_predicate must be a callable function.
- Returns:
If value_or_predicate is a function, returns a new Table containing only the rows where value_or_predicate(val) is True for the vals in column_or_label.
If value_or_predicate is a value, returns a new Table containing only the rows where the values in column_or_label are equal to value_or_predicate.
If column_or_label is an array of booleans, returns a new Table containing only the rows where column_or_label is True.
>>> marbles = Table().with_columns( ... "Color", make_array("Red", "Green", "Blue", ... "Red", "Green", "Green"), ... "Shape", make_array("Round", "Rectangular", "Rectangular", ... "Round", "Rectangular", "Round"), ... "Amount", make_array(4, 6, 12, 7, 9, 2), ... "Price", make_array(1.30, 1.20, 2.00, 1.75, 0, 3.00))
>>> marbles Color | Shape | Amount | Price Red | Round | 4 | 1.3 Green | Rectangular | 6 | 1.2 Blue | Rectangular | 12 | 2 Red | Round | 7 | 1.75 Green | Rectangular | 9 | 0 Green | Round | 2 | 3
Use a value to select matching rows
>>> marbles.where("Price", 1.3) Color | Shape | Amount | Price Red | Round | 4 | 1.3
In general, a higher order predicate function such as the functions in
datascience.predicates.are can be used.
>>> from datascience.predicates import are >>> # equivalent to previous example >>> marbles.where("Price", are.equal_to(1.3)) Color | Shape | Amount | Price Red | Round | 4 | 1.3
>>> marbles.where("Price", are.above(1.5)) Color | Shape | Amount | Price Blue | Rectangular | 12 | 2 Red | Round | 7 | 1.75 Green | Round | 2 | 3
Use the optional argument
other to apply predicates that compare two columns.
>>> marbles.where("Price", are.above, "Amount") Color | Shape | Amount | Price Green | Round | 2 | 3
>>> marbles.where("Price", are.equal_to, "Amount") # empty table Color | Shape | Amount | Price
-
Table.with_column(label, values, *rest)[source] Return a new table with an additional or replaced column.
- Args:
label (str): The column label. If an existing label is used, the existing column will be replaced in the new table.
values (single value or sequence): If a single value, every value in the new column is values. If a sequence of values, the new column takes on the values in values.
rest: An alternating list of labels and values describing additional columns. See with_columns for a full description.
- Raises:
ValueError: If label is not a valid column name, i.e. if label is not of type (str), or if values is a list/array that does not have the same length as the number of rows in the table.
- Returns:
- copy of original table with new or replaced column
>>> alphabet = Table().with_column('letter', make_array('c','d')) >>> alphabet = alphabet.with_column('count', make_array(2, 4)) >>> alphabet letter | count c | 2 d | 4 >>> alphabet.with_column('permutes', make_array('a', 'g')) letter | count | permutes c | 2 | a d | 4 | g >>> alphabet letter | count c | 2 d | 4 >>> alphabet.with_column('count', 1) letter | count c | 1 d | 1 >>> alphabet.with_column(1, make_array(1, 2)) Traceback (most recent call last): ... ValueError: The column label must be a string, but a int was given >>> alphabet.with_column('bad_col', make_array(1)) Traceback (most recent call last): ... ValueError: Column length mismatch. New column does not have the same number of rows as table.
-
Table.with_columns(*labels_and_values)[source] Return a table with additional or replaced columns.
- Args:
labels_and_values: An alternating list of labels and values or a list of label-value pairs. If one of the labels is in the existing table, then every value in the corresponding column is set to that value. If a label has only a single value (int), every row of the corresponding column takes on that value.
- Raises:
ValueError: If
- any label in labels_and_values is not a valid column name, i.e. if the label is not of type (str).
- any value in labels_and_values is a list/array and does not have the same length as the number of rows in the table.
AssertionError:
- ‘incorrect columns format’, if passed more than one sequence (iterable) for labels_and_values.
- ‘even length sequence required’, if missing a pair in the label-value pairs.
- Returns:
- Copy of original table with new or replaced columns. Columns added
in order of labels. Equivalent to
with_column(label, value) when passed only one label-value pair.
>>> players = Table().with_columns('player_id', ... make_array(110234, 110235), 'wOBA', make_array(.354, .236)) >>> players player_id | wOBA 110234 | 0.354 110235 | 0.236 >>> players = players.with_columns('salaries', 'N/A', 'season', 2016) >>> players player_id | wOBA | salaries | season 110234 | 0.354 | N/A | 2016 110235 | 0.236 | N/A | 2016 >>> salaries = Table().with_column('salary', ... make_array('$500,000', '$15,500,000')) >>> players.with_columns('salaries', salaries.column('salary'), ... 'years', make_array(6, 1)) player_id | wOBA | salaries | season | years 110234 | 0.354 | $500,000 | 2016 | 6 110235 | 0.236 | $15,500,000 | 2016 | 1 >>> players.with_columns(2, make_array('$600,000', '$20,000,000')) Traceback (most recent call last): ... ValueError: The column label must be a string, but a int was given >>> players.with_columns('salaries', make_array('$600,000')) Traceback (most recent call last): ... ValueError: Column length mismatch. New column does not have the same number of rows as table.
-
Table.with_row(row)[source] Return a table with an additional row.
- Args:
row (sequence): A value for each column.
- Raises:
ValueError: If the row length differs from the column count.
>>> tiles = Table(make_array('letter', 'count', 'points')) >>> tiles.with_row(['c', 2, 3]).with_row(['d', 4, 2]) letter | count | points c | 2 | 3 d | 4 | 2
-
Table.with_rows(rows)[source] Return a table with additional rows.
- Args:
rows (sequence of sequences): Each row has a value per column. If rows is a 2-d array, its shape must be (_, n) for n columns.
- Raises:
ValueError: If a row length differs from the column count.
>>> tiles = Table(make_array('letter', 'count', 'points')) >>> tiles.with_rows(make_array(make_array('c', 2, 3), ... make_array('d', 4, 2))) letter | count | points c | 2 | 3 d | 4 | 2
-
prob140 JointDistribution
-
class prob140.JointDistribution(data=None, index=None, columns=None, dtype=None, copy=False)[source]
-
both_marginals()[source] Finds the marginal distribution of both variables
Returns: JointDistribution Table
Examples
>>> dist1 = Table().values([0,1],[2,3]).probability([0.1, 0.2, 0.3, 0.4]).toJoint() >>> dist1.both_marginals() X=0 X=1 Sum: Marginal of Y Y=3 0.2 0.4 0.6 Y=2 0.1 0.3 0.4 Sum: Marginal of X 0.3 0.7 1.0
-
conditional_dist(label, given='', show_ev=False)[source] Finds the conditional distribution of the variable label given the other variable
Parameters: label : String
The label of the variable whose conditional distribution is found
Returns: JointDistribution Table
Examples
>>> coins = Table().values("Coin1",['H','T'],"Coin2", ['H','T']).probability(np.array([0.24, 0.36, 0.16,0.24])).toJoint() >>> coins.conditional_dist("Coin1","Coin2") Coin1=H Coin1=T Sum Dist. of Coin1 | Coin2=H 0.6 0.4 1.0 Dist. of Coin1 | Coin2=T 0.6 0.4 1.0 Marginal of Coin1 0.6 0.4 1.0 >>> coins.conditional_dist("Coin2","Coin1") Dist. of Coin2 | Coin1=H Dist. of Coin2 | Coin1=T Marginal of Coin2 Coin2=H 0.4 0.4 0.4 Coin2=T 0.6 0.6 0.6 Sum 1.0 1.0 1.0
-
marginal(label)[source] Returns the marginal distribution of label
Parameters: label : String
The label of the variable of which we want to find the marginal distribution
Returns: JointDistribution Table
Examples
>>> dist2 = Table().values("Coin1",['H','T'],"Coin2", ['H','T']).probability(np.array([0.24, 0.36, 0.16, 0.24])).toJoint() >>> dist2.marginal("Coin1") Coin1=H Coin1=T Coin2=T 0.36 0.24 Coin2=H 0.24 0.16 Sum: Marginal of Coin1 0.60 0.40 >>> dist2.marginal("Coin2") Coin1=H Coin1=T Sum: Marginal of Coin2 Coin2=T 0.36 0.24 0.6 Coin2=H 0.24 0.16 0.4
-
marginal_dist(label)[source] Finds the marginal distribution of label and returns it as a single-variable distribution
Parameters: label
The label of the variable of which we want to find the marginal distribution
Returns: Table
Single variable distribution of label
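A hedged usage sketch building on dist2 from the marginal example above; the output is described rather than shown because the exact column labels may differ:
>>> dist2.marginal_dist("Coin1")
<single-variable distribution table with values H, T and probabilities 0.6, 0.4>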