IdaDataFrame

Open an IdaDataFrame object.

class ibmdbpy.frame.IdaDataFrame(idadb, tablename, indexer=None)[source]

An IdaDataFrame object is a reference to a table in a remote instance of dashDB/DB2. IDA stands for In-DataBase Analytics. IdaDataFrame copies the Pandas interface for DataFrame objects to ensure intuitive interaction for end-users.

Examples

>>> idadb = IdaDataBase('DASHDB') # See documentation for IdaDataBase
>>> ida_iris = IdaDataFrame(idadb, 'IRIS')
>>> ida_iris.cov()
                  sepal_length  sepal_width  petal_length  petal_width
sepal_length      0.685694    -0.042434      1.274315     0.516271
sepal_width      -0.042434     0.189979     -0.329656    -0.121639
petal_length      1.274315    -0.329656      3.116278     1.295609
petal_width       0.516271    -0.121639      1.295609     0.581006
__init__(idadb, tablename, indexer=None)[source]

Constructor for IdaDataFrame objects.

Parameters:

idadb : IdaDataBase

IdaDataBase instance which contains the connection to be used.

tablename : str

Name of the table to be opened in the database.

indexer : str, optional

Name of the column that should be used as an index. This is optional. However, if no indexer is given, the order of rows issued by the head and tail functions is not guaranteed. Also, several in-database machine learning algorithms need an indexer as a parameter to be executed.

Notes

Attributes “type”, “dtypes”, “index”, “columns”, “axes”, and “shape” are evaluated in a lazy way to avoid an overhead when creating an IdaDataFrame. Sometimes the index may be too big to be downloaded.

Examples

>>> idadb = IdaDataBase('DASHDB')
>>> ida_iris = IdaDataFrame(idadb, "IRIS")

Attributes

_idadb (IdaDataBase) IdaDataBase object parent of the current instance.
tablename (str) Name of the table self references.
name (str) Full name of the table self references, including schema.
schema (str) Name of the schema the table belongs to.
indexer (str) Name of the column used as an index. “None” if no indexer.
loc (str) Indexer that enables the selection and projection of IdaDataFrame instances. For more information, see the loc class documentation.
internal_state (InternalState) Object used to internally store the state of the IdaDataFrame. It also allows several non-destructive manipulation methods.
type (str) Type of the IdaDataFrame : “Table”, “View”, or “Unknown”.
dtypes (DataFrame) Data type in the database for each column.
index (pandas.core.index) Index containing the row names in this table.
columns (pandas.core.index) Index containing the columns names in this table.
axes (list) List containing columns and index attributes.
shape (Tuple) Number of rows and number of columns.

Methods

DataFrame introspection

internal_state

IdaDataFrame.internal_state[source]

InternalState instances manage the state of an IdaDataFrame instance and allow several non-destructive data manipulation methods, such as the selection, projection, filtering, and aggregation of columns.

indexer

IdaDataFrame.indexer

The indexer attribute refers to the name of a column that should be used to index the table. dashDB is a column-organized database, so row IDs are not deterministic, and the only way to address a particular row is to refer to it by its index. If no indexer is provided, ibmdbpy still works, but a consistent row order is not guaranteed as long as the dataset is not sorted. Also, note that the indexer column is not taken into account in data mining algorithms.
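A minimal sketch of why this matters, using a hypothetical helper (head_query is not part of ibmdbpy) that builds the kind of DB2-style query a head() call could issue. Without an ORDER BY clause the database is free to return rows in any order, so an indexer gives the query a column to sort on:

```python
# Hypothetical helper, for illustration only (not ibmdbpy internals).
def head_query(tablename, nrow=5, indexer=None):
    """Build the kind of query a head() call could issue (DB2-style syntax)."""
    query = "SELECT * FROM %s" % tablename
    if indexer is not None:
        query += ' ORDER BY "%s"' % indexer  # deterministic row order
    return query + " FETCH FIRST %d ROWS ONLY" % nrow

print(head_query("IRIS", indexer="ID"))
# SELECT * FROM IRIS ORDER BY "ID" FETCH FIRST 5 ROWS ONLY
```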

type

IdaDataFrame.type[source]

Type of self: ‘Table’, ‘View’ or ‘Unknown’.

Returns:

str

idaDataFrame type.

Examples

>>> ida_iris.type
'Table'

dtypes

IdaDataFrame.dtypes[source]

Data type in database for each column in self.

Returns:

DataFrame

In-database type for each column.

Examples

>>> ida_iris.dtypes
             TYPENAME
sepal_length   DOUBLE
sepal_width    DOUBLE
petal_length   DOUBLE
petal_width    DOUBLE
species       VARCHAR

index

IdaDataFrame.index[source]

Index containing the row names in self.

Returns:Index

Notes

Because indexes in a database can only be numeric, this attribute is less useful than its Pandas counterpart, but it can still come in handy. Downloading the index can be expensive if the table is large, so the function asks for the user’s approval before downloading an index that has more than 10000 values.

Examples

>>> ida_iris.index
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
    ...
    140, 141, 142, 143, 144, 145, 146, 147, 148, 149],
   dtype='int64', length=150)

columns

IdaDataFrame.columns[source]

Index containing the column names in self.

Returns:Index

Examples

>>> ida_iris.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')

axes

IdaDataFrame.axes[source]

List containing IdaDataFrame.columns and IdaDataFrame.index attributes.

Returns:

list

List containing two indexes (the index and columns attributes).

Examples

>>> ida_iris.axes
[Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
     ...
     140, 141, 142, 143, 144, 145, 146, 147, 148, 149],
    dtype='int64', length=150),
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')]

shape

IdaDataFrame.shape[source]

Tuple containing number of rows and number of columns.

Returns:tuple

Examples

>>> ida_iris.shape
(150, 5)

empty

IdaDataFrame.empty

Boolean that is True if the table is empty (no rows).

Returns:Boolean

__len__

IdaDataFrame.__len__()[source]

Number of records.

Returns:int

Examples

>>> len(idadf)
150

__iter__

IdaDataFrame.__iter__()[source]

Iterate over columns.

DataFrame modification

Selection, Projection

IdaDataFrame.__getitem__(item)[source]

Enable label based projection (selection of columns) in IdaDataFrames.

Enable slice based selection of rows in IdaDataFrames.

Enable row filtering.

The syntax is similar to Pandas.

Notes

The row order is not guaranteed if no indexer is given and the dataset is not sorted.

Examples

>>> idadf['col1'] # return an IdaSeries
>>> idadf[['col1']] # return an IdaDataFrame with one column
>>> idadf[['col1', 'col2', 'col3']] # return an IdaDataFrame with 3 columns
>>> idadf[0:9] # select the first 10 rows
>>> idadf[idadf['col1'] == "test"]
# select rows for which the attribute col1 is equal to "test"

Selection and Projection are also possible using the ibmdbpy.Loc object stored in IdaDataFrame.loc.

class ibmdbpy.indexing.Loc(idadf)[source]

The Loc class is used to select and project IdaDataFrames. It implements a Pandas-like interface.

Filtering

class ibmdbpy.filtering.FilterQuery(columns, tablename, method, value)[source]

FilterQueries are used to represent the filtering of an IdaDataFrame in ibmdbpy. The use of comparison operators, such as <, <=, ==, >=, > on an IdaDataFrame instance produces a FilterQuery instance which acts as a container for the where clause of the corresponding SQL request.

Filtering is possible using a Pandas-like syntax. Applying comparison operators to IdaDataFrames produces a FilterQuery instance which contains the string embedding the corresponding where clause in the “wherestr” attribute.

FilterQuery objects also contain a logic that allows them to be combined, thus allowing complex filtering.

You can combine the following operators: |, &, ^ (OR, AND and XOR)

Notes

It is not possible to filter an IdaDataFrame by using an IdaDataFrame that is opened in a different data source in the database. This is due to the fact that, using a Pandas-like syntax, “idadf[‘petal_width’] < 5” will return a Boolean array that is used to subset the original DataFrame. This is a fundamental restriction of ibmdbpy: we cannot afford to compute and download such an array because we cannot assume that the result will fit into user’s memory. Download time can also be a performance issue.

Examples

>>> idadf[['sepal_length', 'petal_width']] < 5
<ibmdbpy.filtering.FilterQuery at 0xa65ba90>
>>> _.wherestr
'("sepal_length" < 5 AND "petal_width" < 5)'
>>> idadf[idadf[['sepal_length', 'petal_width']] < 5]
<ibmdbpy.frame.IdaDataFrame at 0xa73a860>
>>> _.head() # filtered IdaDataFrame
    sepal_length  sepal_width  petal_length  petal_width     species
0           4.4          2.9           1.4          0.2      setosa
1           4.7          3.2           1.6          0.2      setosa
2           4.9          2.5           4.5          1.7   virginica
3           4.9          2.4           3.3          1.0  versicolor
4           4.6          3.2           1.4          0.2      setosa
>>> idadf[(idadf['sepal_length'] < 5) & (idadf['petal_width'] > 1.5)]
<ibmdbpy.frame.IdaDataFrame at 0xa74b9b0>
>>> _.head()
   sepal_length  sepal_width  petal_length  petal_width    species
0           4.9          2.5           4.5          1.7  virginica
query

Return an SQL query like “SELECT * FROM %s WHERE <WHERESTR>”, where <WHERESTR> is the value of the attribute “wherestr”.
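A minimal, self-contained sketch of this mechanism (WhereClause is a hypothetical class, not ibmdbpy's actual FilterQuery) showing how operator overloading can compose WHERE clauses and produce such a query:

```python
# Illustrative sketch only: a container that holds a WHERE clause string and
# supports the |, & and ^ combinations described above.
class WhereClause:
    def __init__(self, wherestr):
        self.wherestr = wherestr

    def _combine(self, other, op):
        return WhereClause("(%s %s %s)" % (self.wherestr, op, other.wherestr))

    def __and__(self, other):
        return self._combine(other, "AND")

    def __or__(self, other):
        return self._combine(other, "OR")

    def __xor__(self, other):
        return self._combine(other, "XOR")

    def query(self, tablename):
        return "SELECT * FROM %s WHERE %s" % (tablename, self.wherestr)

a = WhereClause('"sepal_length" < 5')
b = WhereClause('"petal_width" > 1.5')
print((a & b).query("IRIS"))
# SELECT * FROM IRIS WHERE ("sepal_length" < 5 AND "petal_width" > 1.5)
```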

IdaDataFrame.__lt__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “<”.

IdaDataFrame.__le__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “<=”.

IdaDataFrame.__eq__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “==”.

IdaDataFrame.__ne__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using ”!=”.

IdaDataFrame.__ge__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “>=”.

IdaDataFrame.__gt__(value)[source]

ibmdbpy.filtering.FilterQuery object when comparing self using “>”.

Feature Engineering

IdaDataFrame.__setitem__(key, item)[source]

Enable the creation and aggregation of columns.

Examples

>>> idadf['new'] = idadf['sepal_length'] * idadf['sepal_width']
# create a new column as the product of two existing columns
>>> idadf['sepal_length'] = idadf['sepal_length'] / idadf['sepal_length'].mean()
# modify an existing column
aggregation.aggregate_idadf(idadf, method, other, swap=False)

Modify internal state variables to represent the aggregation of columns of an IdaDataFrame or IdaSeries in ibmdbpy.

The following arithmetic operators are supported: +, *, /, -, //, %, **.

The syntax is similar to Pandas.

Parameters:

idadf : IdaDataFrame or IdaSeries

IdaDataFrame or IdaSeries on the left (if swap is False).

method : str

Aggregation method to be computed. The following values are admissible: “add”, “mul”, “div”, “sub”, “floordiv”, “mod”, “neg”, “pow”.

other: Number or IdaDataFrame or IdaSeries

Another object that idadf will be aggregated with (on the right if swap is False).

swap : bool, default: False

Internally used to handle cases where the call is made reflexively, that is when the main IdaDataFrame/IdaSeries is not on the left. If swap is True, this also implies that other is not of type IdaDataFrame/IdaSeries.

Returns:

Aggregated IdaDataFrame or IdaSeries

Raises:

ValueError

Aggregation method not supported.

TypeError

Type not supported for aggregation.

Notes

It is not possible to create aggregations between columns that are stored in different dashDB/DB2 tables.

Examples

>>> idairis['SepalLength'] = idairis['SepalLength'] * 2
...
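The role of swap can be sketched with a hypothetical helper (aggregate_expr is illustrative, not ibmdbpy code) that builds the SQL expression for an aggregation and flips the operand order for reflexive calls such as `3 - idadf['col']`:

```python
# Illustrative sketch only: map an aggregation method name to its SQL operator
# and build the expression, swapping operands for reflexive calls.
_SQL_OP = {"add": "+", "sub": "-", "mul": "*", "div": "/"}

def aggregate_expr(column, method, other, swap=False):
    op = _SQL_OP[method]
    left, right = '"%s"' % column, str(other)
    if swap:  # reflexive call: the plain operand is on the left
        left, right = right, left
    return "(%s %s %s)" % (left, op, right)

print(aggregate_expr("sepal_length", "sub", 3))             # ("sepal_length" - 3)
print(aggregate_expr("sepal_length", "sub", 3, swap=True))  # (3 - "sepal_length")
```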
IdaDataFrame.__add__(other)[source]

Perform an addition between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] + 3
IdaDataFrame.__radd__(other)[source]

Enable the reflexivity of the addition operation.

Examples

>>> ida = idadf['sepal_length'] + 3
>>> ida = 3 + idadf['sepal_length']
IdaDataFrame.__div__(other)[source]

Perform a division between self and another IdaDataFrame or number.

When __future__.division is not in effect.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] / 3
IdaDataFrame.__rdiv__(other)[source]

Enable the reflexivity of the division operation.

When __future__.division is not in effect.

Examples

>>> ida = idadf['sepal_length'] / 3
>>> ida = 3 / idadf['sepal_length']
IdaDataFrame.__truediv__(other)[source]

Perform a division between self and another IdaDataFrame or number.

When __future__.division is in effect.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] / 3
IdaDataFrame.__rtruediv__(other)[source]

Enable the reflexivity of the division operation.

When __future__.division is in effect.

Examples

>>> ida = idadf['sepal_length'] / 3
>>> ida = 3 / idadf['sepal_length']
IdaDataFrame.__floordiv__(other)[source]

Perform an integer division between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] // 3
IdaDataFrame.__rfloordiv__(other)[source]

Enable the reflexivity of the integer division operation.

Examples

>>> ida = idadf['sepal_length'] // 3
>>> ida = 3 // idadf['sepal_length']
IdaDataFrame.__mod__(other)[source]

Perform a modulo operation between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] % 3
IdaDataFrame.__rmod__(other)[source]

Enable the reflexivity of the modulo operation.

Examples

>>> ida = idadf['sepal_length'] % 3
>>> ida = 3 % idadf['sepal_length']
IdaDataFrame.__mul__(other)[source]

Perform a multiplication between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] * 3
IdaDataFrame.__rmul__(other)[source]

Enable the reflexivity of the multiplication operation.

Examples

>>> ida = idadf['sepal_length'] * 3
>>> ida = 3 * idadf['sepal_length']
IdaDataFrame.__neg__()[source]

Calculate the negative of all columns in self.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

IdaDataFrame.__rpos__(other)[source]

Calculate the positive of self (unary plus). No operation is required.

IdaDataFrame.__pow__(other)[source]

Perform a power operation between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] ** 3
IdaDataFrame.__rpow__(other)[source]

Enable the reflexivity of the power operation.

Examples

>>> ida = idadf['sepal_length'] ** 3
>>> ida = 3 ** idadf['sepal_length']
IdaDataFrame.__sub__(other)[source]

Perform a subtraction between self and another IdaDataFrame or number.

Notes

Arithmetic operations only make sense if self contains only numeric columns.

Examples

>>> ida = idadf['sepal_length'] - 3
IdaDataFrame.__rsub__(other)[source]

Enable the reflexivity of the subtraction operation.

Examples

>>> ida = idadf['sepal_length'] - 3
>>> ida = 3 - idadf['sepal_length']
IdaDataFrame.__delitem__(item)[source]

Enable non-destructive deletion of columns using a Pandas style syntax. This happens inplace, which means that the current IdaDataFrame is modified.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> idadf.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
>>> del idadf['sepal_length']
>>> idadf.columns
Index(['sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
IdaDataFrame.save_as(*args, **kwds)[source]

Save self as a table in the remote database under the name tablename. If tablename already exists and clear_existing is True, this function overwrites the existing table.

DataBase Features

exists

IdaDataFrame.exists()[source]

Convenience function delegated from IdaDataBase.

Check if the data still exists in the database.

is_view / is_table

IdaDataFrame.is_view()[source]

Convenience function delegated from IdaDataBase.

Check if the IdaDataFrame corresponds to a view in the database.

IdaDataFrame.is_table()[source]

Convenience function delegated from IdaDataBase.

Check if the IdaDataFrame corresponds to a table in the database.

get_primary_key

IdaDataFrame.get_primary_key()[source]

Get the name of the primary key of self, if there is one. Otherwise, this function returns 0. This function may be deprecated in future versions because it is not very useful.

ida_query

IdaDataFrame.ida_query(query, silent=False, first_row_only=False, autocommit=False)[source]

Convenience function delegated from IdaDataBase.

Prepare, execute and format the result of a query in a data frame or in a tuple. See the IdaDataBase.ida_query documentation.

ida_scalar_query

IdaDataFrame.ida_scalar_query(query, silent=False, autocommit=False)[source]

Convenience function delegated from IdaDataBase.

Prepare and execute a query and return only the first element as a string. See the IdaDataBase.ida_scalar_query documentation.
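The prepare-and-execute pattern behind ida_scalar_query can be sketched with a local sqlite3 connection standing in for the remote dashDB/DB2 one (scalar_query is a hypothetical helper, not the ibmdbpy implementation):

```python
import sqlite3

# Local analogy: execute a query and return only the first element of the
# first row, as a scalar.
def scalar_query(connection, query):
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        row = cursor.fetchone()
        return row[0] if row is not None else None
    finally:
        cursor.close()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IRIS (sepal_length REAL)")
conn.executemany("INSERT INTO IRIS VALUES (?)", [(5.1,), (4.9,), (4.7,)])
print(scalar_query(conn, "SELECT COUNT(*) FROM IRIS"))  # 3
```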

Data Exploration

tail

IdaDataFrame.tail(*args, **kwds)[source]

Return the last n rows of the instance, where n is set to 5 by default.

Parameters:

nrow : int > 0

The number of rows to be included in the result.

sort : bool, default: True

If set to True and no indexer is set, the data is sorted by the first numeric column or, if no numeric column is available, by the first column of the dataframe. If set to False and no indexer is set, the row order is not guaranteed and can vary with each execution. For big tables, this option might save query processing time.

Returns:

DataFrame

The index corresponds to the row numbers and the columns are all columns of self.

Examples

>>> ida_iris.tail()
     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

pivot_table

IdaDataFrame.pivot_table(*args, **kwds)[source]

Compute an aggregation function over all rows of each column that is specified as a value on the dataset. The result is grouped by the columns defined in “columns”.

Parameters:

values : str or list of str, optional

List of columns on which “aggfunc” is computed.

columns : str or list of str, optional

List of columns that are used as an index and by which the dataframe is grouped.

max_entries: int, default=1000

The maximum number of cells to be part of the output. By default, set to 1000.

sort : str, optional

Admissible values are: “alpha” and “factors”.
  • If “alpha”, the index of the output is sorted in alphabetical order.
  • If “factors”, the index of the output is sorted by increasing number of distinct values.

By default, the index is sorted in the same order that is specified in the “columns” argument.

factor_threshold: int, default: 20

Number of distinct values above which a categorical column should no longer be considered categorical and under which a numerical column should no longer be considered numerical.

interactive: bool

If True, the user is asked whether to display the output, given its size.

aggfunc: str

Aggregation function to be computed on each column specified in the argument “values”. Admissible values are: “count”, “sum”, “avg”. This entry is not case-sensitive.

Returns:

Pandas Series with Multi-index (columns)

Examples

>>> val = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
>>> ida_iris.pivot_table(values= val, aggfunc="avg")
              species
sepal_length  setosa        5.006
              versicolor    5.936
              virginica     6.588
sepal_width   setosa        3.428
              versicolor    2.770
              virginica     2.974
petal_length  setosa        1.462
              versicolor    4.260
              virginica     5.552
petal_width   setosa        0.246
              versicolor    1.326
              virginica     2.026
dtype: float64

sort

IdaDataFrame.sort(*args, **kwds)[source]

Sort the IdaDataFrame row wise or column wise.

Parameters:

columns : str or list of str

Columns that should be used to sort the rows in the IdaDataFrame. If columns is set to None and axis to 0, then the IdaDataFrame columns are sorted in lexicographical order.

axis : int (0/1)

Axis that is sorted. 0 for sorting row wise, 1 for sorting column wise.

ascending : bool, default: True

Sorting order, True : ascending, False : descending

inplace : bool, default: False

If False, the function creates a modified copy of the current dataframe. If True, the function modifies the current dataframe in place.

Raises:

ValueError

  • When sorting by column (columns is not None), the axis value must be 0 (rows).
  • A column does not belong to self.
  • The axis argument has a value other than 0 or 1.

Notes

If columns is set to None and axis to 0, this undoes all sorting of the IdaDataFrame and restores the original row order of the dashDB/DB2 database.

No actual changes are made in dashDB/DB2; only the querying changes. Everything is registered in an InternalState object. Changes can be observed by using the head and tail functions.

Descriptive Statistics

describe

IdaDataFrame.describe(*args, **kwds)[source]

A basic statistical summary about current IdaDataFrame. If at least one numerical column exists, the summary includes:

  • The count of non-missing values for each numerical column.
  • The mean for each numerical column.
  • The standard deviation for each numerical column.
  • The minimum and maximum for each numerical column.
  • A list of percentiles set by the user (default : the quartiles).
Parameters:

idadf : IdaDataFrame

percentiles : float or list of floats, default: [0.25, 0.50, 0.75]

Percentiles to be computed on numerical columns. All values in percentiles must be > 0 and < 1.

Returns:

summary: DataFrame, where

  • Index is the name of the computed values.
  • Columns are either numerical or categorical columns of self.
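The per-column summary can be sketched with the standard library (summarize is a hypothetical helper using a simplified nearest-rank percentile; ibmdbpy computes these values in-database):

```python
import statistics

# Illustrative sketch of the statistics describe() computes per numeric column.
def summarize(values, percentiles=(0.25, 0.50, 0.75)):
    values = sorted(values)
    n = len(values)
    def pct(q):  # simple nearest-rank percentile, for illustration only
        return values[min(n - 1, int(q * n))]
    summary = {
        "count": n,
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": values[0],
        "max": values[-1],
    }
    summary.update({q: pct(q) for q in percentiles})
    return summary

print(summarize([4.9, 5.1, 4.7, 4.6, 5.0]))
```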

cov (covariance)

IdaDataFrame.cov(*args, **kwds)[source]

Compute the covariance matrix, composed of covariance coefficients between all pairs of columns in self.

Returns:

covariance matrix: DataFrame

The axes are the columns of self and the values are the covariance coefficients.

corr (correlation)

IdaDataFrame.corr(*args, **kwds)[source]

Compute the correlation matrix, composed of correlation coefficients between all pairs of columns in self.

Parameters:

method : str, default: pearson

Method to be used to compute the correlation. By default, compute the pearson correlation coefficient. The Spearman rank correlation is also available. Admissible values are: “pearson”, “spearman”.

Returns:

correlation matrix: DataFrame

The axes are the columns of self and the values are the correlation coefficients.

Notes

For the Spearman rank correlation, the ordinal rank of columns is computed. For performance reasons this is easier to compute than the fractional rank traditionally computed for the Spearman rank correlation method. This strategy has the property that the sum of the ranking numbers is the same as under ordinal ranking. We then apply the pearson correlation coefficient method to these ranks.
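A minimal sketch of this strategy in plain Python (illustrative only; ibmdbpy performs the computation in-database): replace each column by its ordinal ranks, then apply the Pearson formula to the ranks.

```python
# Ordinal ranking: ties are broken by position, so the sum of the ranking
# numbers matches that of ordinal ranking, as described above.
def ordinal_ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

x = [5.1, 4.9, 4.7, 4.6, 5.0]
y = [3.5, 3.0, 3.2, 3.1, 3.6]
print(pearson(ordinal_ranks(x), ordinal_ranks(y)))  # 0.6
```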

quantile

IdaDataFrame.quantile(*args, **kwds)[source]

Compute quantiles over the rows for each numeric column.

Parameters:

q : float or array-like, default 0.5 (50% quantile)

0 <= q <= 1, the quantile(s) to compute

Returns:

quantiles: Series or DataFrame

If q is an array, the function returns a DataFrame in which the index is q, the columns are the columns of self, and the values are the quantiles. If q is a float, a Series is returned where the index consists of the columns of self and the values are the quantiles.

mad (mean absolute deviation)

IdaDataFrame.mad(*args, **kwds)[source]

Compute the mean absolute deviation for all numeric columns of self.

Returns:

mad: Series

The index consists of the columns of self and the values are the mean absolute deviation.

min (minimum)

IdaDataFrame.min(*args, **kwds)[source]

Compute the minimum value for all numeric columns of self.

Returns:

min: Series

The index consists of the columns of self and the values are the minimum.

max (maximum)

IdaDataFrame.max(*args, **kwds)[source]

Compute the maximum value for all numeric columns of self.

Returns:

max: Series.

The index consists of the columns of self and the values are the maximum.

count

IdaDataFrame.count(*args, **kwds)[source]

Compute the count of non-missing values for all columns of self.

Returns:

count: Series.

The index consists of the columns of self and the values are the number of non-missing values.

count_distinct

IdaDataFrame.count_distinct(*args, **kwds)[source]

Compute the count of distinct values for all numeric columns of self.

Returns:

distinct count: Series

The index consists of the columns of self and values are the number of distinct values.

std (standard deviation)

IdaDataFrame.std(*args, **kwds)[source]

Compute the standard deviation for all numeric columns of self.

Returns:

std: Series

The index consists of the columns of self and the values are the standard deviation.

var (variance)

IdaDataFrame.var(*args, **kwds)[source]

Compute the variance for all numeric columns of self.

Returns:

var: Series

The index consists of the columns of self and the values are the variance.

mean

IdaDataFrame.mean(*args, **kwds)[source]

Compute the mean for each numeric column of self.

Returns:

mean: Series

The index consists of the columns of self and the values are the mean.

sum

IdaDataFrame.sum(*args, **kwds)[source]

Compute the sum of values for all numeric columns of self.

Returns:

sum: Series

The index consists of the columns of self and the values are the sum.

median

IdaDataFrame.median(*args, **kwds)[source]

Compute the median for all numeric columns of self.

Returns:

median: Series

The index consists of the columns of self and the values are the median.

Import as DataFrame

as_dataframe

IdaDataFrame.as_dataframe(*args, **kwds)[source]

Download and return an in-memory representation of the dataset as a Pandas DataFrame.

Returns:

DataFrame

Columns and records are the same as in self.

Examples

>>> iris = ida_iris.as_dataframe()
>>> iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Connection Management

commit

IdaDataFrame.commit()[source]

Commit operations in the database.

Notes

All changes that are made in the database after the last commit, including those in the child IdaDataFrames, are committed.

If the environment variable ‘VERBOSE’ is set to True, the commit operations are notified in the console.

rollback

IdaDataFrame.rollback()[source]

Rollback operations in the database.

Notes

All changes that are made in the database after the last commit, including those in the child IdaDataFrames, are discarded.

Private Methods

_clone

IdaDataFrame._clone()[source]

Clone the current object.

_clone_as_serie

IdaDataFrame._clone_as_serie(column)[source]

Clone the current object as an IdaSeries and select one of its columns.

_get_type

IdaDataFrame._get_type()[source]

Type of the IdaDataFrame : “Table”, “View” or “Unknown”.

_get_columns

IdaDataFrame._get_columns()[source]

Index containing a list of the columns in self.

_get_all_columns_in_table

IdaDataFrame._get_all_columns_in_table()[source]

Get all columns that exist in the physical table.

_get_index

IdaDataFrame._get_index(force=False)[source]

Index containing a list of the row names in self.

_get_shape

IdaDataFrame._get_shape()[source]

Tuple containing the number of rows and the number of columns in self.

_get_columns_dtypes

IdaDataFrame._get_columns_dtypes()[source]

DataFrame containing the column names and database types in self.

_reset_attributes

IdaDataFrame._reset_attributes(attributes)[source]

Delete an attribute of self to force it to be refreshed at the next call.

_table_def

IdaDataFrame._table_def(factor_threshold=None)[source]

Classify the columns of the IdaDataFrame into 4 classes: CATEGORICAL, STRING, NUMERIC or NONE, based on the database data type and a user-defined threshold “factor_threshold”:

  • CATEGORICAL columns that have more distinct values than factor_threshold are considered STRING.
  • NUMERIC columns that have factor_threshold or fewer distinct values are considered CATEGORICAL.
Returns:

DataFrame

  • Index is the columns of self.
  • Column “FACTORS” contains the number of distinct values.
  • Column “VALTYPE” contains the resulting class.

Examples

>>> ida_iris._table_def()
             TYPENAME FACTORS      VALTYPE
sepal_length   DOUBLE      35      NUMERIC
sepal_width    DOUBLE      23      NUMERIC
petal_length   DOUBLE      43      NUMERIC
petal_width    DOUBLE      22      NUMERIC
species       VARCHAR       3  CATEGORICAL
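A simplified sketch of the classification rule (classify is a hypothetical helper and omits the NONE class; the actual method inspects the database catalog):

```python
# Data types considered numerical, as listed for _get_numerical_columns.
NUMERIC_TYPES = {"SMALLINT", "INTEGER", "BIGINT", "REAL", "DOUBLE",
                 "FLOAT", "DECIMAL", "NUMERIC"}

def classify(typename, factors, factor_threshold=20):
    """Classify a column given its database type and distinct-value count."""
    if typename in NUMERIC_TYPES:
        # Numeric columns with few distinct values become CATEGORICAL.
        return "NUMERIC" if factors > factor_threshold else "CATEGORICAL"
    # Categorical columns with many distinct values become STRING.
    return "STRING" if factors > factor_threshold else "CATEGORICAL"

print(classify("DOUBLE", 35))    # NUMERIC
print(classify("VARCHAR", 3))    # CATEGORICAL
print(classify("VARCHAR", 100))  # STRING
```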

_get_numerical_columns

IdaDataFrame._get_numerical_columns()[source]

Get the columns of self that are considered as numerical. Their data type in the database determines whether these columns are numerical. The following data types are considered numerical:

‘SMALLINT’, ‘INTEGER’, ‘BIGINT’, ‘REAL’, ‘DOUBLE’, ‘FLOAT’, ‘DECIMAL’, ‘NUMERIC’.
Returns:

list

List of numerical column names.

Examples

>>> ida_iris._get_numerical_columns()
['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

_get_categorical_columns

IdaDataFrame._get_categorical_columns()[source]

Get the columns of self that are considered as categorical. Their data type in the database determines whether these columns are categorical. The following data types are considered categorical:

“VARCHAR”, “CHARACTER”, “VARGRAPHIC”, “GRAPHIC”, “CLOB”.
Returns:

list

List of categorical column names.

Examples

>>> ida_iris._get_categorical_columns()
['species']

_prepare_and_execute

IdaDataFrame._prepare_and_execute(query, autocommit=True, silent=False)[source]

Prepare and execute a query.

Parameters:

query : str

Query to be executed.

autocommit : bool

If set to True, the changes may be committed automatically after execution (see _autocommit).

_autocommit

IdaDataFrame._autocommit()[source]

Autocommit the connection. If the environment variable ‘AUTOCOMMIT’ is set to True, the function commits the changes.

Notes

If you commit, all changes that are made in the database after the last commit, including those in the child IdaDataFrames, are commited.

If the environment variable ‘VERBOSE’ is set to True, the autocommit operations are notified in the console.
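The environment-variable switch can be sketched with a local sqlite3 connection standing in for the remote one (autocommit here is a hypothetical stand-in for _autocommit):

```python
import os
import sqlite3

# Illustrative sketch: commit only when the AUTOCOMMIT environment variable
# is set to True, and notify in the console when VERBOSE is set to True.
def autocommit(connection):
    if os.environ.get("AUTOCOMMIT") == "True":
        if os.environ.get("VERBOSE") == "True":
            print("<< COMMIT >>")
        connection.commit()

os.environ["AUTOCOMMIT"] = "True"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (x INTEGER)")
conn.execute("INSERT INTO T VALUES (1)")
autocommit(conn)
conn.rollback()  # the committed row survives the rollback
print(conn.execute("SELECT COUNT(*) FROM T").fetchone()[0])  # 1
```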

_combine_check

IdaDataFrame._combine_check(other)[source]

Check if self and other refer to the same table and if all columns in self and other are numeric. This sanity check is used before performing aggregation operations between IdaDataFrame/IdaSeries.

These functions are defined in ibmdbpy.statistics but apply to IdaDataFrames.

_numeric_stats

ibmdbpy.statistics._numeric_stats(idadf, stat, columns)[source]

Compute various stats from one or several numerical columns of an IdaDataFrame.

Parameters:

idadf : IdaDataFrame

Data source.

stat : str

Name of the statistic to be computed.

columns : str or list of str

Name of the columns that belong to the IdaDataFrame.

Returns:

Tuple

One value for each column.

Notes

Currently, the following functions are supported: count, mean, median, std, var, min, max, sum. The result is returned as a tuple. This is only available for numerical columns.

_get_percentiles

ibmdbpy.statistics._get_percentiles(idadf, percentiles, columns)[source]

Return percentiles over all entries of a column or list of columns in the IdaDataFrame.

Parameters:

idadf : IdaDataFrame

percentiles: Float or list of floats.

All values in percentiles must be > 0 and < 1

columns: String or list of string

Name of columns belonging to the IdaDataFrame.

Returns:

DataFrame

_categorical_stats

ibmdbpy.statistics._categorical_stats(idadf, stat, columns)[source]

Compute various stats from one or several categorical columns of the IdaDataFrame. This is not implemented yet.

Parameters:

idadf : IdaDataFrame

stat : str

Name of the statistic function to be computed.

columns : str or list of str

Name of columns belonging to the IdaDataFrame.

Returns:

Tuple.

_get_number_of_nas

ibmdbpy.statistics._get_number_of_nas(idadf, columns)[source]

Return the count of missing values for a list of columns in the IdaDataFrame.

Parameters:

idadf : IdaDataFrame

columns : str or list

One column as a string or a list of columns in the idaDataFrame.

Returns:

Tuple

_count_level

ibmdbpy.statistics._count_level(idadf, columnlist=None)[source]

Count distinct levels across a list of columns of an IdaDataFrame grouped by themselves.

Parameters:

columnlist : list

List of column names that exist in the IdaDataFrame. By default, these are all columns in IdaDataFrame.

Returns:

Tuple

Notes

The function assumes the following:
  • The columns given as a parameter exist in the IdaDataFrame.
  • The parameter columnlist is an optional list.
  • Columns are referenced by their own names (character strings).

_count_level_groupby

ibmdbpy.statistics._count_level_groupby(idadf, columnlist=None)[source]

Count distinct levels across a list of columns in the IdaDataFrame grouped by themselves. This is used to get the dimension of the resulting cross table.

Parameters:

columnlist : list

List of column names existing in the IdaDataFrame. By default, these are the columns of self.

Returns:

Tuple

Notes

The function assumes the following:
  • The columns given as a parameter exist in the IdaDataFrame.
  • The parameter columnlist is optional and is a list.
  • Columns are referenced by their own names (character strings).

_factors_count

ibmdbpy.statistics._factors_count(idadf, columnlist, valuelist=None)[source]

Count non-missing values for all columns in a list (valuelist) over the IdaDataFrame, grouped by a list of columns (columnlist).

Parameters:

columnlist : list

List of column names that exist in self.

valuelist : list

List of column names that exist in self.

Returns:

DataFrame

_factors_sum

ibmdbpy.statistics._factors_sum(idadf, columnlist, valuelist)[source]

Compute the arithmetic sum for all columns in a list (valuelist) over the IdaDataFrame, grouped by a list of columns (columnlist).

Parameters:

columnlist : list

List of column names that exist in self.

valuelist : list

List of column names that exist in self.

Returns:

DataFrame

_factors_avg

ibmdbpy.statistics._factors_avg(idadf, columnlist, valuelist)[source]

Compute the arithmetic average for all columns in a list (valuelist) over the IdaDataFrame grouped by a list of columns (columnlist).

Parameters:

columnlist : list

List of column names that exist in self.

valuelist : list

List of column names that exist in self.

Returns:

DataFrame