Feature Selection¶
Ibmdbpy includes a range of functions for advanced data analysis, such as estimating the relevance of attributes for making a prediction. This is important for improving the quality of classifiers and in-database algorithms. Hereafter we provide the documentation for each of the implemented functions.
The implemented functions are: Pearson correlation, Spearman rank correlation, t-statistics, Chi-squared statistics and the Gini index, as well as several entropy-based measures such as information gain, gain ratio and symmetric uncertainty. We also provide a wrapper to discretize continuous attributes.
Pearson correlation¶
The Pearson correlation coefficient was introduced by Karl Pearson in 1886. It is a measure of the linear correlation between two random variables X and Y. Its values range from -1 to 1. The value 0 indicates no linear correlation between X and Y, -1 indicates a total negative correlation and +1 a total positive correlation. It is defined on real-valued variables.
The main drawback of the Pearson correlation coefficient is that it can only detect linear correlations, whereas real-world correlations are not necessarily of a linear nature. It is also a parametric measure, since it assumes that the distribution of each attribute can be described by a Gaussian distribution.
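For illustration, the definition above can be sketched in plain Python. The helper name pearson_r is ours, not part of ibmdbpy; the in-database function computes the same quantity via SQL:

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Covariance term and the two standard-deviation terms (unnormalised)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfectly linear relationship yields a coefficient of (approximately) 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0
```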

ibmdbpy.feature_selection.pearson(self, *args, **kwds)¶
Compute the Pearson correlation coefficients between a set of features and a set of targets in an IdaDataFrame. Provides more granularity than IdaDataFrame.corr.
Parameters: idadf : IdaDataFrame
target : str or list of str, optional
A column or list of columns to be used as targets. By default, all columns are considered.
features : str or list of str, optional
A column or list of columns to be used as features. By default, all columns are considered.
ignore_indexer : bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.DataFrame, or Pandas.Series if only one target
Notes
Input columns used as target and features should be numerical.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> pearson(idadf)
Spearman rank correlation¶
Spearman rank correlation, also called grade correlation, is a non-parametric measure of statistical dependence. It assesses how well the relationship between two random variables X and Y can be described by a monotonic function. It corresponds to the Pearson correlation applied to the ranks of the two real random variables X and Y.
The Spearman rank correlation is interesting because it is not limited to measuring linear relationships and can be applied to discrete ordinal values. Note however that it still cannot be applied to categorical attributes, such as non-numerical values.
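As a sketch of the idea (helper names are ours; ties are ignored for brevity), Spearman's rho on untied data can be computed from the rank differences with the classic formula 1 - 6·Σd²/(n(n²-1)):

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman rank correlation for untied data."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# A monotonic but non-linear relationship still gives rho = 1
print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))  # 1.0
```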

ibmdbpy.feature_selection.spearman(self, *args, **kwds)¶
Compute the Spearman rho correlation coefficients between a set of features and a set of targets in an IdaDataFrame.
Parameters: idadf : IdaDataFrame
target : str or list of str, optional
A column or list of columns to be used as targets. By default, all columns are considered.
features : str or list of str, optional
A column or list of columns to be used as features. By default, all columns are considered.
ignore_indexer : bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.DataFrame, or Pandas.Series if only one target
Notes
Input columns used as target and features should be numerical. This function is a wrapper around pearson. The scalability of this approach is limited; it should not be used on high-dimensional data.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> spearman(idadf)
T-statistics¶
The t-test is a statistical test often used to determine whether the means of two samples are significantly different from each other, taking into account both the difference between the means and the variability of the samples. The t-test has been used extensively for feature ranking, especially in the field of microarray analysis.
T-statistics, as implemented here, requires the target attribute to be categorical (nominal or discrete numerical) and the other attributes to be numerical, so that the mean can be computed.
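As a point of reference, here is the standard pooled-variance two-sample t-statistic in plain Python. This is a sketch of the textbook formula only; the function below implements the "modified" variant cited in its Notes, and the helper name is ours:

```python
from math import sqrt

def t_statistic(a, b):
    """Two-sample t-statistic with pooled variance (standard textbook form)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Unbiased sample variances of each group
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    # Pooled standard deviation
    sp = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / (sp * sqrt(1 / na + 1 / nb))

# Well-separated samples give a large-magnitude t value
print(t_statistic([5.1, 4.9, 5.0], [6.0, 6.2, 6.1]))
```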

ibmdbpy.feature_selection.ttest(idadf, target=None, features=None, ignore_indexer=True)¶
Compute the t-statistics of a set of features against a set of target attributes.
Parameters: idadf : IdaDataFrame
target : str or list of str, optional
A column or list of columns against which the t-statistics will be computed. By default, all columns are considered.
features : str or list of str, optional
A column or list of columns for which the t-statistics will be computed against each target attribute. By default, all columns except non-numerical columns are considered.
ignore_indexer : bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.DataFrame, or Pandas.Series if only one target
Raises: TypeError
If the features argument or the data set does not contain any numerical features.
Notes
This implements the “modified” t-test as defined in the paper “A Modified T-test Feature Selection Method and Its Application on the HapMap Genotype Data” (Zhou et al.).
The target columns should be categorical, while the feature columns should be numerical.
The scalability of this approach is limited; it should not be used on high-dimensional data.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> ttest(idadf, "CLASS")
Chi-squared statistics¶
The Chi-squared method evaluates each feature individually by measuring its Chi-squared statistic with respect to the classes of another feature. We can calculate the Chi-squared value for each pair of attributes; a larger Chi-squared value typically means a larger interdependence between the two features. Since the measure applies to categorical attributes, numerical attributes first need to be discretized into several intervals.
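The statistic can be sketched in plain Python from a contingency table of observed co-occurrence counts (the helper name is ours): each cell is compared with the count expected under independence.

```python
def chi_squared(table):
    """Chi-squared statistic for a 2-D contingency table (list of rows)."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the independence assumption
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Independent attributes give 0; perfect co-occurrence gives a large value
print(chi_squared([[10, 10], [10, 10]]))  # 0.0
print(chi_squared([[30, 0], [0, 30]]))    # 60.0
```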

ibmdbpy.feature_selection.chisquared(self, *args, **kwds)¶
Compute the Chi-squared statistics between a set of features and a set of targets in an IdaDataFrame.
Parameters: idadf : IdaDataFrame
target : str or list of str, optional
A column or list of columns to be used as targets. By default, all columns are considered.
features : str or list of str, optional
A column or list of columns to be used as features. By default, all columns are considered.
ignore_indexer : bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.DataFrame, or Pandas.Series if only one target
Notes
Input columns used as target and features should be categorical, otherwise this measure does not make much sense.
Chi-squared as defined in “A Comparative Study on Feature Selection and Classification Methods Using Gene Expression Profiles and Proteomic Patterns” (GIW02F006).
The scalability of this approach is limited; it should not be used on high-dimensional data.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> chisquared(idadf)
Gini index¶
The Gini index, also known as the Gini coefficient or Gini ratio, is commonly used in decision tree learning to decide which attribute to split the current node on for an efficient tree construction. It was developed by Corrado Gini in 1912. The Gini index is a measure of statistical dispersion and can be interpreted as a measure of impurity of an attribute.
The Gini index values range from 0 to 1. The greater the value, the more evenly the classes of an attribute are distributed. An attribute with a smaller Gini index value is typically easier to predict a priori, because one or several of its classes are more frequent than the others.
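The impurity interpretation can be sketched from per-class counts as 1 - Σp², where p runs over the class frequencies (the helper name is ours):

```python
def gini_index(counts):
    """Gini impurity of a class distribution, given per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini_index([25, 25]))  # evenly distributed: 0.5, the maximum for 2 classes
print(gini_index([49, 1]))   # one dominant class: close to 0
```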

ibmdbpy.feature_selection.gini(self, *args, **kwds)¶
Compute the Gini coefficients for a set of features in an IdaDataFrame.
Parameters: idadf : IdaDataFrame
features : str or list of str, optional
A column or list of columns to be used as features. By default, all columns are considered.
ignore_indexer : bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.Series
Notes
Input columns should be categorical, otherwise this measure does not make much sense.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> gini(idadf)
More interestingly, we can measure how well knowing the value of a particular attribute X improves the average Gini index of an attribute Y, whose samples are partitioned with respect to the classes of X. This is defined as the conditional Gini index.
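A minimal sketch of this weighted average, with names of our own choosing: each class of X contributes the Gini impurity of Y within that partition, weighted by the partition size.

```python
def gini_index(counts):
    """Gini impurity of a class distribution, given per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def conditional_gini(partitions):
    """Average Gini index of Y after partitioning by the classes of X.
    `partitions` maps each class of X to the class counts of Y within it."""
    total = sum(sum(c) for c in partitions.values())
    return sum(sum(c) / total * gini_index(c) for c in partitions.values())

# If X perfectly separates the classes of Y, the conditional Gini drops to 0
print(conditional_gini({"x1": [10, 0], "x2": [0, 10]}))  # 0.0
```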

ibmdbpy.feature_selection.gini_pairwise(self, *args, **kwds)¶
Compute the conditional Gini coefficients between a set of features and a set of targets in an IdaDataFrame.
Parameters: idadf : IdaDataFrame
target : str or list of str, optional
A column or list of columns to be used as targets. By default, all columns are considered.
features : str or list of str, optional
A column or list of columns to be used as features. By default, all columns are considered.
ignore_indexer : bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.DataFrame, or Pandas.Series if only one target
Notes
Input columns used as target and features should be categorical, otherwise this measure does not make much sense.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> gini_pairwise(idadf)
Entropy¶
Entropy is an important concept of information theory. It was introduced by Claude Shannon in 1948 and corresponds to the expected quantity of information contained in a stream of information. Entropy can be understood as a measure of the uncertainty of a random variable.
Intuitively, an attribute with a higher entropy is more difficult to predict a priori than an attribute with a lower entropy. Various correlation measures are based on the information-theoretical concept of entropy, such as information gain, gain ratio and symmetric uncertainty. We discuss these measures in the next sections.
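Shannon entropy of a categorical attribute can be sketched as H = -Σ p·log₂(p) over the empirical class frequencies (the helper name is ours):

```python
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A uniform two-class attribute has maximal uncertainty: 1 bit.
# A constant attribute has zero entropy: it is trivially predictable.
print(entropy(["a", "b", "a", "b"]))  # 1.0
```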

ibmdbpy.feature_selection.entropy(self, *args, **kwds)¶
Compute the entropy for a set of features in an IdaDataFrame.
Parameters: idadf: IdaDataFrame
target: str or list of str, optional
A column or list of columns to be used as features. By default, all columns are considered.
mode: “normal” or “raw”
Experimental
execute: bool, default: True
Experimental. Execute the request or return the corresponding SQL query.
ignore_indexer: bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.Series
Notes
Input columns should be categorical, otherwise this measure does not make much sense.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> entropy(idadf)

ibmdbpy.feature_selection.entropy_stats(idadf, target=None, mode=u'normal', execute=True, ignore_indexer=True)¶
Similar to ibmdbpy.feature_selection.entropy, but uses DB2 statistics to speed up the computation. Returns an approximate value. Experimental.
Parameters: idadf: IdaDataFrame
target: str or list of str, optional
A column or list of columns to be used as features. By default, all columns are considered.
mode: “normal” or “raw”
Experimental
execute: bool, default: True
Experimental. Execute the request or return the corresponding SQL query.
ignore_indexer: bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.Series
Notes
Input columns should be categorical, otherwise this measure does not make much sense.
Cannot handle columns that do not physically exist in the database, since no statistics are available for them.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> entropy_stats(idadf)
Information gain¶
In information theory, information gain is often used as a synonym for mutual information. It is a measure of the mutual dependence between two variables X and Y and quantifies the amount of information shared by the two variables.
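One common way to sketch mutual information for categorical data is through the identity I(X; Y) = H(X) + H(Y) - H(X, Y), reusing the entropy definition from the previous section (helper names are ours):

```python
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(x, y):
    """Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

x = ["a", "a", "b", "b"]
print(info_gain(x, ["p", "p", "q", "q"]))  # Y determined by X: 1.0 bit
print(info_gain(x, ["p", "q", "p", "q"]))  # independent: 0.0 bits
```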

ibmdbpy.feature_selection.info_gain(self, *args, **kwds)¶
Compute the information gain / mutual information coefficients between a set of features and a set of targets in an IdaDataFrame.
Parameters: idadf : IdaDataFrame
target : str or list of str, optional
A column or list of columns to be used as targets. By default, all columns are considered.
features : str or list of str, optional
A column or list of columns to be used as features. By default, all columns are considered.
ignore_indexer : bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.DataFrame, or Pandas.Series if only one target
Notes
Input columns used as target and features should be categorical, otherwise this measure does not make much sense.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> info_gain(idadf)
Gain ratio¶
The information gain ratio is a variant of the mutual information. It can be seen as a normalization of the mutual information values to the range from 0 to 1. It is the ratio of the information gain to the entropy of the target attribute. By doing so, it also reduces the bias toward attributes with many values. Several versions of the definition of this measure exist; in particular, there are an asymmetric and a symmetric version.
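The asymmetric version described above can be sketched as follows, reusing the entropy and mutual information helpers from the previous sections (names are ours; this is one of the several definitions, not necessarily the exact in-database formula):

```python
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(x, y):
    """Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def gain_ratio(feature, target):
    """Asymmetric gain ratio: information gain divided by the entropy
    of the target attribute, as described above."""
    return info_gain(feature, target) / entropy(target)

# A feature that fully determines the target reaches the maximum of 1
print(gain_ratio(["a", "a", "b", "b"], ["p", "p", "q", "q"]))  # 1.0
```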

ibmdbpy.feature_selection.gain_ratio(self, *args, **kwds)¶
Compute the gain ratio coefficients between a set of features and a set of targets in an IdaDataFrame.
Parameters: idadf : IdaDataFrame
target : str or list of str, optional
A column or list of columns to be used as targets. By default, all columns are considered.
features : str or list of str, optional
A column or list of columns to be used as features. By default, all columns are considered.
symmetry : bool, default: True
If True, compute the symmetric gain ratio as defined by [Lopez de Mantaras 1991]; otherwise, the asymmetric gain ratio.
ignore_indexer : bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.DataFrame, or Pandas.Series if only one target
Notes
Input columns used as target and features should be categorical, otherwise this measure does not make much sense.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> gain_ratio(idadf)
Symmetric uncertainty¶
Symmetric uncertainty is a pairwise dependence measure originally defined by Witten and Frank. It compensates for the bias of information gain and offers a variant of the symmetric gain ratio normalized to the range [0, 1], where the value 1 indicates that knowledge of either variable completely predicts the other and the value 0 indicates that X and Y are independent. As the name implies, it is a symmetric measure.
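Symmetric uncertainty is commonly given as SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)); a sketch reusing the helpers from the previous sections (names are ours):

```python
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_gain(x, y):
    """Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def su(x, y):
    """Symmetric uncertainty: SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y))."""
    return 2 * info_gain(x, y) / (entropy(x) + entropy(y))

x = ["a", "a", "b", "b"]
print(su(x, ["p", "p", "q", "q"]))  # perfectly dependent: 1.0
print(su(x, ["p", "q", "p", "q"]))  # independent: 0.0
```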

ibmdbpy.feature_selection.su(self, *args, **kwds)¶
Compute the symmetric uncertainty coefficients between a set of features and a set of targets in an IdaDataFrame.
Parameters: idadf : IdaDataFrame
target : str or list of str, optional
A column or list of columns to be used as targets. By default, all columns are considered.
features : str or list of str, optional
A column or list of columns to be used as features. By default, all columns are considered.
ignore_indexer : bool, default: True
By default, the column declared as indexer in idadf is ignored.
Returns: Pandas.DataFrame, or Pandas.Series if only one target
Notes
Input columns used as target and features should be categorical, otherwise this measure does not make much sense.
Examples
>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> su(idadf)
Discretization¶
Since most correlation measures require the attributes to be discretized first, we provide a wrapper for an in-database discretization method.

ibmdbpy.feature_selection.discretize(*args, **kwds)¶
Discretize a set of numerical columns of an IdaDataFrame and return an IdaDataFrame opened on the discretized version of the dataset.
Parameters: idadf : IdaDataFrame
columns : str or list of str, optional
A column or list of columns to be discretized.
disc : “ef”, “em”, “ew” or “ewn”, default: “em”
Discretization method to be used:
 ef: discretization bins of equal frequency
 em: discretization bins of minimal entropy
 ew: discretization bins of equal width
 ewn: discretization bins of equal width with human-friendly limits
target : str
Target column against which the discretization will be done. Relevant only for “em” discretization.
bins: int, optional
Number of bins. Not relevant for “em” discretization.
outtable: str, optional
The name of the output table where the discretized values are stored. If this parameter is not specified, it is generated automatically. If the parameter corresponds to an existing table in the database, it is replaced.
clear_existing: bool, default: False
If set to True, an existing table with the same name will be replaced.
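To illustrate the two simplest strategies, here is a plain-Python sketch of the “ew” (equal-width) and “ef” (equal-frequency) binning ideas. The helper names are ours, and this is only a local approximation of what the in-database procedure does; ties and empty ranges are not handled:

```python
def equal_width_bins(values, bins):
    """Assign each value to one of `bins` equal-width intervals ("ew" idea).
    Assumes min(values) < max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin
    return [min(int((v - lo) / width), bins - 1) for v in values]

def equal_frequency_bins(values, bins):
    """Assign values to bins holding roughly equal numbers of points ("ef" idea)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for pos, i in enumerate(order):
        labels[i] = pos * bins // len(values)
    return labels

print(equal_width_bins([0.0, 1.0, 5.0, 9.0, 10.0], 2))      # [0, 0, 1, 1, 1]
print(equal_frequency_bins([0.0, 1.0, 5.0, 9.0, 10.0], 2))  # [0, 0, 0, 1, 1]
```

Note how the two methods disagree on the value 5.0: it falls in the lower half of the range but the upper half of the sorted order is needed to balance the bin counts.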