IdaDataFrame¶
Open an IdaDataFrame Object.¶
-
class
ibmdbpy.frame.
IdaDataFrame
(idadb, tablename, indexer=None)[source]¶ An IdaDataFrame object is a reference to a table in a remote instance of dashDB/DB2. IDA stands for In-DataBase Analytics. IdaDataFrame copies the Pandas interface for DataFrame objects to ensure intuitive interaction for end-users.
Examples
>>> idadb = IdaDataBase('DASHDB') # See documentation for IdaDataBase
>>> ida_iris = IdaDataFrame(idadb, 'IRIS')
>>> ida_iris.cov()
              sepal_length  sepal_width  petal_length  petal_width
sepal_length      0.685694    -0.042434      1.274315     0.516271
sepal_width      -0.042434     0.189979     -0.329656    -0.121639
petal_length      1.274315    -0.329656      3.116278     1.295609
petal_width       0.516271    -0.121639      1.295609     0.581006
-
__init__
(idadb, tablename, indexer=None)[source]¶ Constructor for IdaDataFrame objects.
Parameters: idadb : IdaDataBase
IdaDataBase instance which contains the connection to be used.
tablename : str
Name of the table to be opened in the database.
indexer : str, optional
Name of the column that should be used as an index. This is optional. However, if no indexer is given, the order of rows issued by the head and tail functions is not guaranteed. Also, several in-database machine learning algorithms need an indexer as a parameter to be executed.
Notes
Attributes “type”, “dtypes”, “index”, “columns”, “axes”, and “shape” are evaluated lazily to avoid overhead when creating an IdaDataFrame. Note that the index may be too big to be downloaded for very large tables.
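The lazy evaluation described in the note can be sketched with Python's functools.cached_property: the attribute is computed on first access only and then served from a cache. The LazyFrame class and its fake query counter below are illustrative only, not ibmdbpy internals.

```python
from functools import cached_property

class LazyFrame:
    """Illustrative sketch: attributes computed on first access, then cached."""
    def __init__(self, tablename):
        self.tablename = tablename
        self.queries_run = 0  # counts how often the database would be hit

    @cached_property
    def shape(self):
        # In ibmdbpy this would issue a COUNT query against the table;
        # here we fake the result to keep the sketch self-contained.
        self.queries_run += 1
        return (150, 5)

frame = LazyFrame("IRIS")
assert frame.queries_run == 0   # nothing computed at construction time
_ = frame.shape
_ = frame.shape                 # second access is served from the cache
assert frame.queries_run == 1
```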
Examples
>>> idadb = IdaDataBase('DASHDB')
>>> ida_iris = IdaDataFrame(idadb, "IRIS")
Attributes
_idadb (IdaDataBase) : IdaDataBase object parent of the current instance.
tablename (str) : Name of the table self references.
name (str) : Full name of the table self references, including schema.
schema (str) : Name of the schema the table belongs to.
indexer (str) : Name of the column used as an index. “None” if no indexer.
loc (Loc) : Indexer that enables the selection and projection of IdaDataFrame instances. For more information, see the loc class documentation.
internal_state (InternalState) : Object used to internally store the state of the IdaDataFrame. It also allows several non-destructive manipulation methods.
type (str) : Type of the IdaDataFrame: “Table”, “View”, or “Unknown”.
dtypes (DataFrame) : Data type in the database for each column.
index (pandas.core.index) : Index containing the row names in this table.
columns (pandas.core.index) : Index containing the column names in this table.
axes (list) : List containing the columns and index attributes.
shape (Tuple) : Number of rows and number of columns.
-
Methods
DataFrame introspection¶
internal_state¶
indexer¶
-
IdaDataFrame.
indexer
¶ The indexer attribute refers to the name of a column that should be used to index the table. Because dashDB is a column-based database, row IDs are not deterministic, so the only way to address a particular row is to refer to it by its index. If no indexer is provided, ibmdbpy still works, but a correct row order is not guaranteed unless the dataset is sorted. Also, note that the indexer column is not taken into account in data mining algorithms.
type¶
dtypes¶
index¶
-
IdaDataFrame.
index
[source]¶ Index containing the row names in self.
Returns: Index
Notes
Because indexes in a database can be only numeric, this attribute is less informative than for a Pandas DataFrame, but it can still be useful. Retrieving the index can fail if the table is too large. The function asks for the user’s approval before downloading an index that has more than 10000 values.
Examples
>>> ida_iris.index
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            140, 141, 142, 143, 144, 145, 146, 147, 148, 149],
           dtype='int64', length=150)
columns¶
axes¶
-
IdaDataFrame.
axes
[source]¶ List containing IdaDataFrame.columns and IdaDataFrame.index attributes.
Returns: list
List containing two indexes (the index and columns attributes).
Examples
>>> ida_iris.axes
[Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
             ...
             140, 141, 142, 143, 144, 145, 146, 147, 148, 149],
            dtype='int64', length=150),
 Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
        'species'],
       dtype='object')]
shape¶
DataFrame modification¶
Selection, Projection¶
-
IdaDataFrame.
__getitem__
(item)[source]¶ Enable label based projection (selection of columns) in IdaDataFrames.
Enable slice based selection of rows in IdaDataFrames.
Enable row filtering.
The syntax is similar to Pandas.
Notes
The row order is not guaranteed if no indexer is given and the dataset is not sorted.
Examples
>>> idadf['col1'] # return an IdaSeries
>>> idadf[['col1']] # return an IdaDataFrame with one column
>>> idadf[['col1', 'col2', 'col3']] # return an IdaDataFrame with 3 columns
>>> idadf[0:9] # Select the 10 first rows
>>> idadf[idadf['col1'] == "test"] # select rows for which attribute col1 is equal to "test"
Selection and projection are also possible using the ibmdbpy.Loc object stored in IdaDataFrame.loc.
Filtering¶
-
class
ibmdbpy.filtering.
FilterQuery
(columns, tablename, method, value)[source]¶ FilterQueries are used to represent the filtering of an IdaDataFrame in ibmdbpy. The use of comparison operators, such as <, <=, ==, >=, > on an IdaDataFrame instance produces a FilterQuery instance which acts as a container for the where clause of the corresponding SQL request.
Filtering is possible using a Pandas-like syntax. Applying comparison operators to IdaDataFrames produces a FilterQuery instance which contains the string embedding the corresponding where clause in the “wherestr” attribute.
FilterQuery objects also contain a logic that allows them to be combined, thus allowing complex filtering.
You can combine the following operators: |, & and ^ (OR, AND, and XOR).
Notes
It is not possible to filter an IdaDataFrame by using an IdaDataFrame that is opened in a different data source in the database. This is due to the fact that, using a Pandas-like syntax, “idadf[‘petal_width’] < 5” returns a Boolean array that is used to subset the original DataFrame. This is a fundamental restriction of ibmdbpy: we cannot afford to compute and download such an array because we cannot assume that the result will fit into the user’s memory. Download time can also be a performance issue.
Examples
>>> idadf[['sepal_length', 'petal_width']] < 5
<ibmdbpy.filtering.FilterQuery at 0xa65ba90>
>>> _.wherestr
'("sepal_length" < 5 AND "petal_width" < 5)'
>>> idadf[idadf[['sepal_length', 'petal_width']] < 5]
<ibmdbpy.frame.IdaDataFrame at 0xa73a860>
>>> _.head() # filtered IdaDataFrame
   sepal_length  sepal_width  petal_length  petal_width     species
0           4.4          2.9           1.4          0.2      setosa
1           4.7          3.2           1.6          0.2      setosa
2           4.9          2.5           4.5          1.7   virginica
3           4.9          2.4           3.3          1.0  versicolor
4           4.6          3.2           1.4          0.2      setosa
>>> idadf[(idadf['sepal_length'] < 5) & (idadf['petal_width'] > 1.5)]
<ibmdbpy.frame.IdaDataFrame at 0xa74b9b0>
>>> _.head()
   sepal_length  sepal_width  petal_length  petal_width    species
0           4.9          2.5           4.5          1.7  virginica
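The combination logic described above can be sketched in plain Python. The toy FilterQuery below mirrors the wherestr attribute and the |, & operators from the description, but it is illustrative only, not the actual ibmdbpy implementation; the query method, which here takes the table name explicitly, is likewise a stand-in.

```python
# Minimal sketch of how FilterQuery-style objects combine WHERE clauses.
class FilterQuery:
    def __init__(self, wherestr):
        self.wherestr = wherestr  # the WHERE clause fragment this object represents

    def _combine(self, other, keyword):
        # Combining two filters wraps both clauses in parentheses
        # and joins them with the SQL logical keyword.
        return FilterQuery("(%s %s %s)" % (self.wherestr, keyword, other.wherestr))

    def __and__(self, other):
        return self._combine(other, "AND")

    def __or__(self, other):
        return self._combine(other, "OR")

    def query(self, tablename):
        # Build the full SELECT statement for a given table.
        return "SELECT * FROM %s WHERE %s" % (tablename, self.wherestr)

a = FilterQuery('"sepal_length" < 5')
b = FilterQuery('"petal_width" > 1.5')
print((a & b).wherestr)       # ("sepal_length" < 5 AND "petal_width" > 1.5)
print(a.query("IRIS"))        # SELECT * FROM IRIS WHERE "sepal_length" < 5
```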
-
query
¶ Return an SQL query like “SELECT * FROM %s WHERE <WHERESTR>”, where <WHERESTR> is the value of the attribute “wherestr”.
-
-
IdaDataFrame.
__lt__
(value)[source]¶ Return an ibmdbpy.filtering.FilterQuery object when comparing self using “<”.
-
IdaDataFrame.
__le__
(value)[source]¶ Return an ibmdbpy.filtering.FilterQuery object when comparing self using “<=”.
-
IdaDataFrame.
__eq__
(value)[source]¶ Return an ibmdbpy.filtering.FilterQuery object when comparing self using “==”.
-
IdaDataFrame.
__ne__
(value)[source]¶ Return an ibmdbpy.filtering.FilterQuery object when comparing self using “!=”.
Feature Engineering¶
-
IdaDataFrame.
__setitem__
(key, item)[source]¶ Enable the creation and aggregation of columns.
Examples
>>> idadf['new'] = idadf['sepal_length'] * idadf['sepal_width'] # create a new column as the product of two existing columns
>>> idadf['sepal_length'] = idadf['sepal_length'] / idadf['sepal_length'].mean() # modify an existing column
-
aggregation.
aggregate_idadf
(idadf, method, other, swap=False)¶ Modify internal state variables to represent the aggregation of columns of an IdaDataFrame or IdaSeries in ibmdbpy.
The following arithmetic operators are supported: +, *, /, -, //, %, **.
The syntax is similar to Pandas.
Parameters: idadf : IdaDataFrame or IdaSeries
IdaDataFrame or IdaSeries on the left (if swap is False).
method : str
Aggregation method to compute. The following values are admissible: “add”, “mul”, “div”, “sub”, “floordiv”, “mod”, “neg”, “pow”.
other: Number or IdaDataFrame or IdaSeries
Another object that idadf will be aggregated with (on the right if swap is False).
swap : bool, default: False
Internally used to handle cases where the call is made reflexively, that is when the main IdaDataFrame/IdaSeries is not on the left. If swap is True, this also implies that other is not of type IdaDataFrame/IdaSeries.
Returns: Aggregated IdaDataFrame or IdaSeries
Raises: ValueError
Aggregation method not supported.
TypeError
Type not supported for aggregation.
Notes
It is not possible to create aggregations between columns that are stored in different dashDB/DB2 tables.
Examples
>>> idairis['SepalLength'] = idairis['SepalLength'] * 2 ...
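A minimal sketch of why the swap flag is needed: Python dispatches to a reflected method (for example __rsub__) when the left operand is a plain number, and the SQL expression must then be built with the operands in swapped order. The Col class below is illustrative, not an ibmdbpy internal.

```python
class Col:
    """Illustrative column wrapper that builds an SQL expression string."""
    def __init__(self, expr):
        self.expr = expr

    def _aggregate(self, other, op, swap=False):
        other_expr = other.expr if isinstance(other, Col) else repr(other)
        if swap:
            # Reflected call: the plain number was the LEFT operand,
            # so it must appear first in the generated expression.
            return Col("(%s %s %s)" % (other_expr, op, self.expr))
        return Col("(%s %s %s)" % (self.expr, op, other_expr))

    def __sub__(self, other):
        return self._aggregate(other, "-")

    def __rsub__(self, other):
        return self._aggregate(other, "-", swap=True)

print((Col('"sepal_length"') - 3).expr)  # ("sepal_length" - 3)
print((3 - Col('"sepal_length"')).expr)  # (3 - "sepal_length")
```

Without the swap, the non-commutative subtraction 3 - idadf['sepal_length'] would silently produce the wrong expression.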
-
IdaDataFrame.
__add__
(other)[source]¶ Perform an addition between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] + 3
-
IdaDataFrame.
__radd__
(other)[source]¶ Enable the reflexivity of the addition operation.
Examples
>>> ida = idadf['sepal_length'] + 3
>>> ida = 3 + idadf['sepal_length']
-
IdaDataFrame.
__div__
(other)[source]¶ Perform a division between self and another IdaDataFrame or number.
When __future__.division is not in effect.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] / 3
-
IdaDataFrame.
__rdiv__
(other)[source]¶ Enable the reflexivity of the division operation.
When __future__.division is not in effect.
Examples
>>> ida = idadf['sepal_length'] / 3
>>> ida = 3 / idadf['sepal_length']
-
IdaDataFrame.
__truediv__
(other)[source]¶ Perform a division between self and another IdaDataFrame or number.
When __future__.division is in effect.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] / 3
-
IdaDataFrame.
__rtruediv__
(other)[source]¶ Enable the reflexivity of the division operation.
When __future__.division is in effect.
Examples
>>> ida = idadf['sepal_length'] / 3
>>> ida = 3 / idadf['sepal_length']
-
IdaDataFrame.
__floordiv__
(other)[source]¶ Perform an integer division between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] // 3
-
IdaDataFrame.
__rfloordiv__
(other)[source]¶ Enable the reflexivity of the integer division operation.
Examples
>>> ida = idadf['sepal_length'] // 3
>>> ida = 3 // idadf['sepal_length']
-
IdaDataFrame.
__mod__
(other)[source]¶ Perform a modulo operation between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] % 3
-
IdaDataFrame.
__rmod__
(other)[source]¶ Enable the reflexivity of the modulo operation.
Examples
>>> ida = idadf['sepal_length'] % 3
>>> ida = 3 % idadf['sepal_length']
-
IdaDataFrame.
__mul__
(other)[source]¶ Perform a multiplication between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] * 3
-
IdaDataFrame.
__rmul__
(other)[source]¶ Enable the reflexivity of the multiplication operation.
Examples
>>> ida = idadf['sepal_length'] * 3
>>> ida = 3 * idadf['sepal_length']
-
IdaDataFrame.
__neg__
()[source]¶ Calculate the negative of all columns in self.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
-
IdaDataFrame.
__pow__
(other)[source]¶ Perform a power operation between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] ** 3
-
IdaDataFrame.
__rpow__
(other)[source]¶ Enable the reflexivity of the power operation.
Examples
>>> ida = idadf['sepal_length'] ** 3
>>> ida = 3 ** idadf['sepal_length']
-
IdaDataFrame.
__sub__
(other)[source]¶ Perform a subtraction between self and another IdaDataFrame or number.
Notes
Arithmetic operations only make sense if self contains only numeric columns.
Examples
>>> ida = idadf['sepal_length'] - 3
-
IdaDataFrame.
__rsub__
(other)[source]¶ Enable the reflexivity of the subtraction operation.
Examples
>>> ida = idadf['sepal_length'] - 3
>>> ida = 3 - idadf['sepal_length']
-
IdaDataFrame.
__delitem__
(item)[source]¶ Enable non-destructive deletion of columns using a Pandas style syntax. This happens inplace, which means that the current IdaDataFrame is modified.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> idadf.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'], dtype='object')
>>> del idadf['sepal_length']
>>> idadf.columns
Index(['sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')
DataBase Features¶
exists¶
is_view / is_table¶
get_primary_key¶
ida_query¶
Data Exploration¶
head¶
-
IdaDataFrame.
head
(*args, **kwds)[source]¶ Print the first n rows of the instance; n is set to 5 by default.
Parameters: nrow : int > 0
Number of rows to be included in the result.
sort : bool, default: True
If set to True and no indexer is set, the data is sorted by the first numeric column or, if no numeric column is available, by the first column of the dataframe. If set to False and no indexer is set, the row order is not guaranteed and can vary with each execution. For big tables, this option might save query processing time.
Returns: DataFrame or Series
The index is the corresponding row number and the columns are all columns of self. If the IdaDataFrame has only one column, it returns a Series.
Examples
>>> ida_iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
tail¶
-
IdaDataFrame.
tail
(*args, **kwds)[source]¶ Print the last n rows of the instance; n is set to 5 by default.
Parameters: nrow : int > 0
The number of rows to be included in the result.
sort : bool, default: True
If set to True and no indexer is set, the data is sorted by the first numeric column or, if no numeric column is available, by the first column of the dataframe. If set to False and no indexer is set, the row order is not guaranteed and can vary with each execution. For big tables, this option might save query processing time.
Returns: DataFrame
The index is the corresponding row number and the columns are all columns of self.
Examples
>>> ida_iris.tail()
     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica
pivot_table¶
-
IdaDataFrame.
pivot_table
(*args, **kwds)[source]¶ Compute an aggregation function over all rows of each column that is specified as a value on the dataset. The result is grouped by the columns defined in “columns”.
Parameters: values : str or list of str, optional
List of columns on which “aggfunc” is computed.
columns : str or list of str, optional
List of columns that is used as an index and by which the dataframe is grouped.
max_entries: int, default=1000
The maximum number of cells to be part of the output. By default, set to 1000.
sort: str, optional
- Admissible values are: “alpha” and “factors”.
- If “alpha”, the index of the output is sorted according to the alphabetical order.
- If “factors”, the index of the output will be sorted according to increasing number of the distinct values.
By default, the index will be sorted in the same order that is specified in “columns” argument.
factor_threshold: int, default: 20
Number of distinct values above which a categorical column should no longer be considered categorical, and at or below which a numerical column should no longer be considered numerical.
interactive: bool
If True, the user is asked whether to display the output, given its size.
aggfunc: str
Aggregation function to be computed on each column specified in the argument “values”. Admissible values are: “count”, “sum”, “avg”. This entry is not case-sensitive.
Returns: Pandas Series with Multi-index (columns)
Examples
>>> val = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
>>> ida_iris.pivot_table(values=val, aggfunc="avg")
              species
sepal_length  setosa        5.006
              versicolor    5.936
              virginica     6.588
sepal_width   setosa        3.428
              versicolor    2.770
              virginica     2.974
petal_length  setosa        1.462
              versicolor    4.260
              virginica     5.552
petal_width   setosa        0.246
              versicolor    1.326
              virginica     2.026
dtype: float64
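For comparison, the same kind of aggregation can be reproduced in plain pandas on downloaded data. The miniature table below is illustrative; this is not how ibmdbpy computes the result, which runs in-database.

```python
import pandas as pd

# Small stand-in for the IRIS table; the values are illustrative only.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.5, 6.7],
})

# Rough pandas equivalent of pivot_table(values=["sepal_length"], aggfunc="avg"):
# average each value column, grouped by the categorical column(s).
result = df.groupby("species")["sepal_length"].mean()
print(result)   # per-species averages: setosa ~5.1, virginica ~6.6
```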
sort¶
-
IdaDataFrame.
sort
(*args, **kwds)[source]¶ Sort the IdaDataFrame row wise or column wise.
Parameters: columns : str or list of str
Columns that should be used to sort the rows in the IdaDataFrame. If columns is set to None and axis to 0, then the IdaDataFrame columns are sorted in lexicographical order.
axis : int (0/1)
Axis that is sorted. 0 for sorting row wise, 1 for sorting column wise.
ascending : bool, default: True
Sorting order, True : ascending, False : descending
inplace : bool, default: False
If False, the function creates a modified copy of the current dataframe. If True, the function modifies the current dataframe in place.
Raises: ValueError
- When sorting by column (column not None), the axis value must be 0 (rows).
- A column does not belong to self.
- The axis argument has a value other than 0 or 1.
Notes
If columns is set to None and axis to 0, this undoes all sorting of the IdaDataFrame and restores the original row order from the dashDB/DB2 database.
No actual changes are made in dashDB/DB2; only the querying changes. Everything is registered in an InternalState object. Changes can be observed by using the head and tail functions.
Descriptive Statistics¶
describe¶
-
IdaDataFrame.
describe
(*args, **kwds)[source]¶ Build a basic statistical summary of the current IdaDataFrame. If at least one numerical column exists, the summary includes:
- The count of non-missing values for each numerical column.
- The mean for each numerical column.
- The standard deviation for each numerical column.
- The minimum and maximum for each numerical column.
- A list of percentiles set by the user (default : the quartiles).
Parameters: idadf : IdaDataFrame
percentiles : float or list of floats, default: [0.25, 0.50, 0.75]
Percentiles to be computed on numerical columns. All values in percentiles must be > 0 and < 1.
Returns: summary: DataFrame, where
- Index is the name of the computed values.
- Columns are either numerical or categorical columns of self.
cov (covariance)¶
corr (correlation)¶
-
IdaDataFrame.
corr
(*args, **kwds)[source]¶ Compute the correlation matrix, composed of correlation coefficients between all pairs of columns in self.
Parameters: method : str, default: pearson
Method to be used to compute the correlation. By default, the Pearson correlation coefficient is computed. The Spearman rank correlation is also available. Admissible values are: “pearson”, “spearman”.
Returns: correlation matrix: DataFrame
The axes are the columns of self and the values are the correlation coefficients.
Notes
For the Spearman rank correlation, the ordinal rank of the columns is computed. For performance reasons, this is easier to compute than the fractional rank traditionally used for the Spearman rank correlation method. This strategy has the property that the sum of the ranking numbers is the same as under ordinal ranking. We then apply the Pearson correlation coefficient method to these ranks.
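The rank-then-Pearson strategy described above can be sketched with pandas, where rank(method='first') produces the ordinal rank (ties broken by order of appearance). This is an illustrative reproduction in memory, not the in-database SQL that ibmdbpy generates.

```python
import pandas as pd

# Two tie-free columns; with no ties the ordinal and fractional ranks agree.
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0])

rx = x.rank(method="first")   # ordinal ranks: 1, 2, 3, 4
ry = y.rank(method="first")   # ordinal ranks: 2, 1, 4, 3

# Spearman-like coefficient: Pearson correlation applied to the ranks.
spearman_like = rx.corr(ry, method="pearson")
print(spearman_like)          # 0.6
```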
quantile¶
-
IdaDataFrame.
quantile
(*args, **kwds)[source]¶ Compute row wise quantiles for each numeric column.
Parameters: q : float or array-like, default 0.5 (50% quantile)
0 <= q <= 1, the quantile(s) to compute
Returns: quantiles: Series or DataFrame
If q is an array, the function returns a DataFrame in which the index is q. The columns are the columns of self, and the values are the quantiles. If q is a float, a Series is returned where the index is the columns of self and the values are the quantiles.
mad (mean absolute deviation)¶
min (minimum)¶
max (maximum)¶
count¶
count_distinct¶
std (standard deviation)¶
var (variance)¶
mean¶
sum¶
Import as DataFrame¶
as_dataframe¶
-
IdaDataFrame.
as_dataframe
(*args, **kwds)[source]¶ Download and return an in-memory representation of the dataset as a Pandas DataFrame.
Returns: DataFrame
Columns and records are the same as in self.
Examples
>>> iris = ida_iris.as_dataframe()
>>> iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Private Methods¶
_clone_as_serie¶
_get_type¶
_get_all_columns_in_table¶
_get_index¶
_get_shape¶
_get_columns_dtypes¶
_reset_attributes¶
_table_def¶
-
IdaDataFrame.
_table_def
(factor_threshold=None)[source]¶ Classify the columns of the IdaDataFrame into 4 classes: CATEGORICAL, STRING, NUMERIC or NONE, using the database data type and a user threshold “factor_threshold”:
- CATEGORICAL columns with more distinct values than factor_threshold are considered STRING.
- NUMERIC columns with a number of distinct values smaller than or equal to factor_threshold are considered CATEGORICAL.
Returns: DataFrame
- Index is the columns of self.
- Column “FACTORS” contains the number of distinct values.
- Column “VALTYPE” contains the resulting class.
Examples
>>> ida_iris._table_def()
             TYPENAME  FACTORS      VALTYPE
sepal_length   DOUBLE       35      NUMERIC
sepal_width    DOUBLE       23      NUMERIC
petal_length   DOUBLE       43      NUMERIC
petal_width    DOUBLE       22      NUMERIC
species       VARCHAR        3  CATEGORICAL
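The classification rule can be sketched as follows. The type lists are taken from the _get_numerical_columns and _get_categorical_columns documentation below; the function name and structure are illustrative, not the ibmdbpy implementation.

```python
# Database types treated as numerical and as string/categorical.
NUMERIC_TYPES = {"SMALLINT", "INTEGER", "BIGINT", "REAL",
                 "DOUBLE", "FLOAT", "DECIMAL", "NUMERIC"}
STRING_TYPES = {"VARCHAR", "CHARACTER", "VARGRAPHIC", "GRAPHIC", "CLOB"}

def classify(typename, factors, factor_threshold=20):
    """Return the VALTYPE class for a column, given its database type
    and its number of distinct values (FACTORS)."""
    if typename in STRING_TYPES:
        # Too many distinct values: treat the column as free text.
        return "STRING" if factors > factor_threshold else "CATEGORICAL"
    if typename in NUMERIC_TYPES:
        # Few distinct values: treat the numeric column as categorical.
        return "CATEGORICAL" if factors <= factor_threshold else "NUMERIC"
    return "NONE"

print(classify("DOUBLE", 35))    # NUMERIC
print(classify("VARCHAR", 3))    # CATEGORICAL
print(classify("DOUBLE", 10))    # CATEGORICAL
```

With the default threshold of 20, this reproduces the example output above: the four DOUBLE columns have more than 20 distinct values and stay NUMERIC, while species (VARCHAR, 3 factors) is CATEGORICAL.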
_get_numerical_columns¶
-
IdaDataFrame.
_get_numerical_columns
()[source]¶ Get the columns of self that are considered numerical. Their data type in the database determines whether these columns are numerical. The following data types are considered numerical:
'SMALLINT', 'INTEGER', 'BIGINT', 'REAL', 'DOUBLE', 'FLOAT', 'DECIMAL', 'NUMERIC'
Returns: list
List of numerical column names.
Examples
>>> ida_iris._get_numerical_columns()
['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
_get_categorical_columns¶
-
IdaDataFrame.
_get_categorical_columns
()[source]¶ Get the columns of self that are considered categorical. Their data type in the database determines whether these columns are categorical. The following data types are considered categorical:
'VARCHAR', 'CHARACTER', 'VARGRAPHIC', 'GRAPHIC', 'CLOB'
Returns: list
List of categorical column names.
Examples
>>> ida_iris._get_categorical_columns()
['species']
_prepare_and_execute¶
_autocommit¶
-
IdaDataFrame.
_autocommit
()[source]¶ Autocommit the connection. If the environment variable ‘AUTOCOMMIT’ is set to True, the function commits the changes.
Notes
If you commit, all changes that are made in the database after the last commit, including those in the child IdaDataFrames, are committed.
If the environment variable ‘VERBOSE’ is set to True, the autocommit operations are notified in the console.
_combine_check¶
-
IdaDataFrame.
_combine_check
(other)[source]¶ Check if self and other refer to the same table and if all columns in self and other are numeric. This sanity check is used before performing aggregation operations between IdaDataFrame/IdaSeries.
These functions are defined in ibmdbpy.statistics but apply to IdaDataFrames.
_numeric_stats¶
-
ibmdbpy.statistics.
_numeric_stats
(idadf, stat, columns)[source]¶ Compute various stats from one or several numerical columns of an IdaDataFrame.
Parameters: idadf : IdaDataFrame
Data source.
stat : str
Name of the statistic to be computed.
columns : str or list of str
Name of the columns that belong to the IdaDataFrame.
Returns: Tuple
One value for each column.
Notes
Currently, the following functions are supported: count, mean, median, std, var, min, max, sum. The function returns a tuple with one value per column and is only available for numerical columns.
_get_percentiles¶
-
ibmdbpy.statistics.
_get_percentiles
(idadf, percentiles, columns)[source]¶ Return percentiles over all entries of a column or list of columns in the IdaDataFrame.
Parameters: idadf : IdaDataFrame
percentiles : float or list of floats
All values in percentiles must be > 0 and < 1.
columns : str or list of str
Name of columns belonging to the IdaDataFrame.
Returns: DataFrame
_categorical_stats¶
-
ibmdbpy.statistics.
_categorical_stats
(idadf, stat, columns)[source]¶ Compute various statistics from one or several categorical columns of the IdaDataFrame. This is not yet implemented.
Parameters: idadf : IdaDataFrame
stat : str
Name of the statistic function to be computed.
columns : str or list of str
Name of columns belonging to the IdaDataFrame.
Returns: Tuple.
_get_number_of_nas¶
_count_level¶
-
ibmdbpy.statistics.
_count_level
(idadf, columnlist=None)[source]¶ Count distinct levels across a list of columns of an IdaDataFrame grouped by themselves.
Parameters: columnlist : list
List of column names that exist in the IdaDataFrame. By default, these are all columns in IdaDataFrame.
Returns: Tuple
Notes
- The function assumes the following:
- The columns given as parameters exist in the IdaDataFrame.
- The parameter columnlist is an optional list.
- Columns are referenced by their own name (character string).
_count_level_groupby¶
-
ibmdbpy.statistics.
_count_level_groupby
(idadf, columnlist=None)[source]¶ Count distinct levels across a list of columns in the IdaDataFrame grouped by themselves. This is used to get the dimension of the resulting cross table.
Parameters: columnlist : list
List of column names existing in the IdaDataFrame. By default, these are the columns of self.
Returns: Tuple
Notes
- The function assumes the following:
- The columns given as parameters exist in the IdaDataFrame.
- The parameter columnlist is optional and is a list.
- Columns are referenced by their own name (character string).
_factors_count¶
-
ibmdbpy.statistics.
_factors_count
(idadf, columnlist, valuelist=None)[source]¶ Count non-missing values for all columns in a list (valuelist) over the IdaDataFrame grouped by a list of columns (columnlist).
Parameters: columnlist : list
List of column names that exist in self.
valuelist : list
List of column names that exist in self.
Returns: DataFrame
_factors_sum¶
-
ibmdbpy.statistics.
_factors_sum
(idadf, columnlist, valuelist)[source]¶ Compute the arithmetic sum for all columns in a list (valuelist) over the IdaDataFrame grouped by a list of columns (columnlist).
Parameters: columnlist : list
List of column names that exist in self.
valuelist : list
List of column names that exist in self.
Returns: DataFrame
_factors_avg¶
-
ibmdbpy.statistics.
_factors_avg
(idadf, columnlist, valuelist)[source]¶ Compute the arithmetic average for all columns in a list (valuelist) over the IdaDataFrame grouped by a list of columns (columnlist).
Parameters: columnlist : list
List of column names that exist in self.
valuelist : list
List of column names that exist in self.
Returns: DataFrame