Feature Selection

Ibmdbpy includes a range of functions for advanced data analysis, such as estimating the relevance of attributes for making a prediction. This is important for improving the quality of classifiers and in-database algorithms. Hereafter we provide the documentation for each of the developed functions.

The implemented functions are: Pearson correlation, Spearman rank correlation, T-statistics, Chi-squared statistics and the Gini index, as well as several entropy-based measures such as information gain, gain ratio and symmetric uncertainty. We also provide a wrapper to discretize continuous attributes.

Pearson correlation

The Pearson correlation coefficient was introduced by Karl Pearson in 1886. It is a measure of the linear correlation between two random variables X and Y. Its values range from -1 to 1: the value 0 indicates no linear correlation between X and Y, the value -1 indicates a total negative correlation, and +1 indicates a total positive correlation. It is defined on real-valued variables.

The main drawback of the Pearson correlation coefficient is that it can only detect linear correlations. It is a parametric measure, since it assumes that the distribution of each attribute can be described by a Gaussian distribution. In the real world, however, correlations are not necessarily linear.
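
As a purely local illustration of the definition (not of the in-database implementation), the coefficient can be computed on two small pandas Series; the data below is made up for the example.

# Pearson coefficient of two numerical columns: cov(X, Y) / (std(X) * std(Y)).
# Local sketch of the definition only, not the in-database computation.
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])

pearson_r = x.cov(y) / (x.std() * y.std())   # equivalent to x.corr(y)
print(pearson_r)                             # close to +1: strong positive linear correlation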

ibmdbpy.feature_selection.pearson(self, *args, **kwds)[source]

Compute the Pearson correlation coefficients between a set of features and a set of targets in an IdaDataFrame. Provides more granularity than IdaDataFrame.corr.

Parameters:

idadf : IdaDataFrame

target : str or list of str, optional

A column or list of columns to be used as targets. Per default, consider all columns.

features : str or list of str, optional

A column or list of columns to be used as features. Per default, consider all columns.

ignore_indexer : bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.DataFrame or Pandas.Series if only one target

Notes

Input columns as target and features should be numerical.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> pearson(idadf)

Spearman rank correlation

Spearman rank correlation, also called grade correlation, is a non-parametric measure of statistical dependence. It assesses how well the relationship between two random variables X and Y can be described using a monotonic function. It corresponds to the Pearson correlation applied to the ranks of two real random variables X and Y.

The Spearman rank correlation is interesting because it is not limited to measuring linear relationships and can be applied to discrete ordinal values. Note, however, that it still cannot be applied to categorical attributes with non-numerical values.
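
The following local sketch (made-up data, not the in-database implementation) illustrates that Spearman's rho is simply the Pearson correlation of the ranks:

# Spearman's rho as the Pearson correlation of the ranks of X and Y.
import pandas as pd

x = pd.Series([10, 20, 30, 40, 1000])     # monotone but clearly non-linear relation
y = pd.Series([1, 4, 9, 16, 25])

rho = x.rank().corr(y.rank())              # Pearson correlation applied to the ranks
print(rho)                                 # 1.0: the relation is perfectly monotone
print(x.corr(y, method="spearman"))        # same value with pandas' built-in method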

ibmdbpy.feature_selection.spearman(self, *args, **kwds)[source]

Compute the Spearman rho correlation coefficients between a set of features and a set of targets in an IdaDataFrame.

Parameters:

idadf : IdaDataFrame

target : str or list of str, optional

A column or list of columns to be used as targets. Per default, consider all columns.

features : str or list of str, optional

A column or list of columns to be used as features. Per default, consider all columns.

ignore_indexer : bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.DataFrame or Pandas.Series if only one target

Notes

Input columns used as target and features should be numerical. This function is a wrapper around pearson. The scalability of this approach is limited; it should not be used on high-dimensional data.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> spearman(idadf)

T-statistics

The T-test is a statistical test often used to determine whether the means of two samples are significantly different from each other, taking into account the difference between the means and the variability of the samples. The t-test has been extensively used for feature ranking, especially in the field of microarray analysis.

The t-statistic, as implemented here, requires the target attribute to be categorical (nominal or discrete numerical) and the other attributes to be numerical, so that their means can be computed.
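
For intuition only, the following local sketch computes the classical two-sample t-statistic of a numerical feature against a binary class attribute on made-up data; note that the in-database function implements the modified variant cited in the notes below, not this classical form.

# Classical two-sample t-statistic of one numerical feature X against a
# binary class attribute (local illustration on made-up data only).
import pandas as pd

df = pd.DataFrame({"CLASS": ["a", "a", "a", "b", "b", "b"],
                   "X":     [1.0, 1.2, 0.9, 3.1, 2.8, 3.3]})

g = df.groupby("CLASS")["X"]
mean, var, n = g.mean(), g.var(), g.count()

t = (mean["a"] - mean["b"]) / ((var["a"] / n["a"] + var["b"] / n["b"]) ** 0.5)
print(t)   # large |t|: the class means differ strongly relative to the variability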

ibmdbpy.feature_selection.ttest(idadf, target=None, features=None, ignore_indexer=True)[source]

Compute the t-statistics values of a set of features against a set of target attributes.

Parameters:

idadf : IdaDataFrame

target : str or list of str, optional

A column or list of columns against which the t-statistics values will be computed. Per default, consider all columns.

features : str or list of str, optional

A column or list of columns for which the t-statistics values will be computed against each target attribute. Per default, consider all columns, except non-numerical columns.

ignore_indexer : bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.DataFrame or Pandas.Series if only one target

Raises:

TypeError

Raised if the features argument or the data set does not contain any numerical features.

Notes

This implements the “modified” t-test as defined in the paper “A Modified T-test Feature Selection Method and Its Application on the HapMap Genotype Data” (Zhou et al.).

The target columns should be categorical, while the feature columns should be numerical.

The scalability of this approach is limited; it should not be used on high-dimensional data.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> ttest(idadf,"CLASS")

Chi-Squared statistics

The Chi-Squared method evaluates each feature individually by measuring its Chi-squared statistic with respect to the classes of another feature. We can calculate the Chi-squared value for each pair of attributes; a larger Chi-squared value typically means a stronger inter-dependence between the two features. Since the measure applies to categorical attributes, numerical attributes first need to be discretized into several intervals.
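
As a local illustration of the underlying statistic (computed with pandas and SciPy on made-up data, not with the in-database implementation):

# Chi-squared statistic between two categorical columns, computed from their
# contingency table (local illustration only).
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"COLOR": ["red", "red", "blue", "blue", "red", "blue"],
                   "SIZE":  ["S",   "S",   "L",    "L",    "S",   "L"]})

table = pd.crosstab(df["COLOR"], df["SIZE"])       # observed frequencies
chi2, p, dof, expected = chi2_contingency(table)
print(chi2)   # a larger value indicates a stronger dependence between the columns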

ibmdbpy.feature_selection.chisquared(self, *args, **kwds)[source]

Compute the Chi-squared statistics between a set of features and a set of targets in an IdaDataFrame.

Parameters:

idadf : IdaDataFrame

target : str or list of str, optional

A column or list of columns to be used as targets. Per default, consider all columns.

features : str or list of str, optional

A column or list of columns to be used as features. Per default, consider all columns.

ignore_indexer : bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.DataFrame or Pandas.Series if only one target

Notes

Input columns as target and features should be categorical, otherwise this measure does not make much sense.

Chi-squared as defined in “A Comparative Study on Feature Selection and Classification Methods Using Gene Expression Profiles and Proteomic Patterns” (GIW02F006).

The scalability of this approach is limited; it should not be used on high-dimensional data.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> chisquared(idadf)

Gini index

The Gini index, also known as the Gini coefficient or Gini ratio, is a measure commonly used in decision trees to decide which attribute is best for splitting the current node during decision tree construction. It was developed by Corrado Gini in 1912. The Gini index is a measure of statistical dispersion and can be interpreted as a measure of impurity of an attribute.

The Gini index values range from 0 to 1. The greater the value, the more evenly the classes of an attribute are distributed. A smaller Gini index value typically means that the attribute is easier to predict a priori, because one or several classes are more frequent than others.
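
The following local sketch (made-up data) illustrates the Gini index of a single categorical column, computed as one minus the sum of the squared class frequencies:

# Gini index of one categorical column: 1 - sum(p_i^2), where p_i is the
# relative frequency of class i (local illustration only).
import pandas as pd

def gini_index(column):
    p = column.value_counts(normalize=True)
    return float(1.0 - (p ** 2).sum())

balanced = pd.Series(["a", "b", "a", "b"])    # classes evenly distributed
skewed   = pd.Series(["a", "a", "a", "b"])    # one dominant class

print(gini_index(balanced))   # 0.5   -> harder to predict a priori
print(gini_index(skewed))     # 0.375 -> easier to predict a priori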

ibmdbpy.feature_selection.gini(self, *args, **kwds)[source]

Compute the Gini coefficients for a set of features in an IdaDataFrame.

Parameters:

idadf : IdaDataFrame

features : str or list of str, optional

A column or list of columns to be used as features. Per default, consider all columns.

ignore_indexer : bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.Series

Notes

Input column should be categorical, otherwise this measure does not make much sense.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> gini(idadf)

More interestingly, we can measure how much knowing the value of a particular attribute X improves the average Gini index of an attribute Y, computed over the partitions of the samples induced by the classes of X. This is defined as the conditional Gini index measure.
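
A local sketch of this idea on made-up data (not the in-database implementation) weights the Gini index of Y within each partition induced by X by the size of that partition:

# Conditional Gini index of Y given X: average Gini index of Y over the
# partitions induced by the classes of X, weighted by partition size.
import pandas as pd

def gini_index(column):
    p = column.value_counts(normalize=True)
    return float(1.0 - (p ** 2).sum())

def conditional_gini(df, x, y):
    weights = df[x].value_counts(normalize=True)     # relative size of each partition
    per_class = df.groupby(x)[y].apply(gini_index)   # Gini index of Y within each partition
    return float((weights * per_class).sum())

df = pd.DataFrame({"X": ["a", "a", "b", "b"],
                   "Y": ["u", "u", "v", "v"]})
print(conditional_gini(df, "X", "Y"))   # 0.0: knowing X fully determines Y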

ibmdbpy.feature_selection.gini_pairwise(self, *args, **kwds)[source]

Compute the conditional Gini coefficients between a set of features and a set of targets in an IdaDataFrame.

Parameters:

idadf : IdaDataFrame

target : str or list of str, optional

A column or list of columns to be used as targets. Per default, consider all columns.

features : str or list of str, optional

A column or list of columns to be used as features. Per default, consider all columns.

ignore_indexer : bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.DataFrame or Pandas.Series if only one target

Notes

Input columns as target and features should be categorical, otherwise this measure does not make much sense.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> gini_pairwise(idadf)

Entropy

Entropy is an important concept in information theory. It was introduced by Claude Shannon in 1948 and corresponds to the expected quantity of information contained in a flow of information. Entropy can be understood as a measure of the uncertainty of a random variable.

Intuitively, an attribute with a higher entropy will be more difficult to predict a priori than an attribute with a lower entropy. Various correlation measures are based on the information-theoretical concept of entropy, such as information gain, gain ratio and symmetric uncertainty. We discuss these measures in the next sections.
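
As a local illustration of the definition on made-up data (not the in-database implementation):

# Shannon entropy of one categorical column: H(X) = -sum(p_i * log2(p_i)),
# where p_i is the relative frequency of class i (local illustration only).
import numpy as np
import pandas as pd

def shannon_entropy(column):
    p = column.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

uniform = pd.Series(["a", "b", "c", "d"])   # maximally uncertain
skewed  = pd.Series(["a", "a", "a", "b"])   # mostly predictable

print(shannon_entropy(uniform))   # 2.0 bits
print(shannon_entropy(skewed))    # ~0.81 bits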

ibmdbpy.feature_selection.entropy(self, *args, **kwds)[source]

Compute the entropy for a set of features in an IdaDataFrame.

Parameters:

idadf: IdaDataFrame

target: str or list of str, optional

A column or list of columns for which the entropy is computed. Per default, consider all columns.

mode: “normal” or “raw”

Experimental

execute: bool, default:True

Experimental. Execute the request or return the corresponding SQL query

ignore_indexer: bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.Series

Notes

Input column should be categorical, otherwise this measure does not make much sense.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> entropy(idadf)

ibmdbpy.feature_selection.entropy_stats(idadf, target=None, mode=u'normal', execute=True, ignore_indexer=True)[source]

Similar to ibmdbpy.feature_selection.entropy, but uses DB2 statistics to speed up the computation. Returns an approximate value. Experimental.

Parameters:

idadf: IdaDataFrame

target: str or list of str, optional

A column or list of columns for which the entropy is computed. Per default, consider all columns.

mode: “normal” or “raw”

Experimental

execute: bool, default:True

Experimental. Execute the request or return the corresponding SQL query

ignore_indexer: bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.Series

Notes

Input column should be categorical, otherwise this measure does not make much sense.

Cannot handle columns that do not physically exist in the database, since no statistics are available for them.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> entropy_stats(idadf)

Information gain

In information theory, information gain is often used as a synonym for mutual information. It is a measure of the mutual dependence between two variables X and Y and quantifies the amount of information that is shared by the two variables.
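
A local sketch on made-up data (not the in-database implementation), using the identity I(X; Y) = H(X) + H(Y) - H(X, Y):

# Information gain / mutual information between two categorical columns.
import numpy as np
import pandas as pd

def shannon_entropy(column):
    p = column.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def info_gain_local(df, x, y):
    joint = df[x].astype(str) + "/" + df[y].astype(str)   # encodes the joint distribution of (X, Y)
    return shannon_entropy(df[x]) + shannon_entropy(df[y]) - shannon_entropy(joint)

df = pd.DataFrame({"X": ["a", "a", "b", "b"],
                   "Y": ["u", "u", "v", "v"]})
print(info_gain_local(df, "X", "Y"))   # 1.0 bit: X fully determines Y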

ibmdbpy.feature_selection.info_gain(self, *args, **kwds)[source]

Compute the information gain / mutual information coefficients between a set of features and a set of targets in an IdaDataFrame.

Parameters:

idadf : IdaDataFrame

target : str or list of str, optional

A column or list of columns to be used as targets. Per default, consider all columns.

features : str or list of str, optional

A column or list of columns to be used as features. Per default, consider all columns.

ignore_indexer : bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.DataFrame or Pandas.Series if only one target

Notes

Input columns as target and features should be categorical, otherwise this measure does not make much sense.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> info_gain(idadf)

Gain ratio

The information gain ratio is a variant of the mutual information. It can be seen as a normalization of the mutual information to values between 0 and 1: it is the ratio of the information gain to the entropy of the target attribute. This normalization also reduces the bias toward attributes with many values. Several versions of this measure have been defined; in particular, there is an asymmetric and a symmetric version.
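
The following local sketch on made-up data illustrates the asymmetric variant, the information gain normalized by the entropy of the target attribute (the symmetric variant uses a different normalization and is not shown here):

# Asymmetric gain ratio: information gain divided by the entropy of the target.
import numpy as np
import pandas as pd

def H(column):                                      # Shannon entropy in bits
    p = column.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

df = pd.DataFrame({"X": ["a", "a", "b", "b"],
                   "Y": ["u", "v", "v", "v"]})

info_gain = H(df["X"]) + H(df["Y"]) - H(df["X"] + "/" + df["Y"])
gain_ratio_xy = info_gain / H(df["Y"])              # normalized by the target entropy
print(gain_ratio_xy)                                # value between 0 and 1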

ibmdbpy.feature_selection.gain_ratio(self, *args, **kwds)[source]

Compute the gain ratio coefficients between a set of features and a set of targets in an IdaDataFrame.

Parameters:

idadf : IdaDataFrame

target : str or list of str, optional

A column or list of columns to be used as targets. Per default, consider all columns.

features : str or list of str, optional

A column or list of columns to be used as features. Per default, consider all columns.

symmetry : bool, default: True

If True, compute the symmetric gain ratio as defined by [Lopez de Mantaras 1991]. Otherwise, the asymmetric gain ratio.

ignore_indexer : bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.DataFrame or Pandas.Series if only one target

Notes

Input columns as target and features should be categorical, otherwise this measure does not make much sense.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> gain_ratio(idadf)

Symmetric uncertainty

Symmetric uncertainty is a pairwise dependence measure originally defined by Witten and Frank. It compensates for the bias of information gain and offers an alternative to the symmetric gain ratio, normalized within the range [0, 1]: the value 1 indicates that knowledge of either variable completely predicts the value of the other, and the value 0 indicates that X and Y are independent. As its name suggests, it is a symmetric measure.
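
A local sketch on made-up data, using SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)):

# Symmetric uncertainty between two categorical columns.
import numpy as np
import pandas as pd

def H(column):                                      # Shannon entropy in bits
    p = column.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

df = pd.DataFrame({"X": ["a", "a", "b", "b"],
                   "Y": ["u", "u", "v", "v"]})

info_gain = H(df["X"]) + H(df["Y"]) - H(df["X"] + "/" + df["Y"])
su_xy = 2.0 * info_gain / (H(df["X"]) + H(df["Y"]))
print(su_xy)                                        # 1.0: either column fully predicts the other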

ibmdbpy.feature_selection.su(self, *args, **kwds)[source]

Compute the symmetric uncertainty coefficients between a set of features and a set of targets in an IdaDataFrame.

Parameters:

idadf : IdaDataFrame

target : str or list of str, optional

A column or list of columns to be used as targets. Per default, consider all columns.

features : str or list of str, optional

A column or list of columns to be used as features. Per default, consider all columns.

ignore_indexer : bool, default: True

Per default, ignore the column declared as indexer in idadf

Returns:

Pandas.DataFrame or Pandas.Series if only one target

Notes

Input columns as target and features should be categorical, otherwise this measure does not make much sense.

Examples

>>> idadf = IdaDataFrame(idadb, "IRIS")
>>> su(idadf)

Discretization

Since most correlation measures require the attributes to be discretized first, we provide a wrapper for an in-database discretization method.
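
Purely as a local illustration of what binning does (the in-database discretize wrapper documented below performs the equivalent work inside the database), equal-width and equal-frequency binning can be sketched with pandas on made-up data:

# Equal-width vs. equal-frequency binning of a numerical column (local sketch).
import pandas as pd

values = pd.Series([1.0, 2.0, 2.5, 3.0, 10.0, 11.0, 12.0, 50.0])

equal_width = pd.cut(values, bins=4, labels=False)    # bins of equal width
equal_freq  = pd.qcut(values, q=4, labels=False)      # bins of equal frequency
print(equal_width.tolist())
print(equal_freq.tolist())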

ibmdbpy.feature_selection.discretize(*args, **kwds)[source]

Discretize a set of numerical columns of an IdaDataFrame and return an IdaDataFrame opened on the discretized version of the data set.

Parameters:

idadf : IdaDataFrame

columns : str or list of str, optional

A column or list of columns to be discretized

disc : “ef”, “em”, “ew” or “ewn”, default: “em”

Discretization method to be used

  • ef: Discretization bins of equal frequency
  • em: Discretization bins of minimal entropy
  • ew: Discretization bins of equal width
  • ewn: Discretization bins of equal width with human-friendly limits

target : str

Target column against which the discretization will be done. Relevant only for “em” discretization.

bins: int, optional

Number of bins. Not relevant for “em” discretization.

outtable: str, optional

The name of the output table where the discretized values are stored. If this parameter is not specified, it is generated automatically. If the parameter corresponds to an existing table in the database, it is replaced.

clear_existing: bool, default: False

If set to True, a table will be replaced when a table with the same name already exists in the database.