About the Author

You can contact Prajwal Kailas at prajwal967@gmail.com.
Student at the National Institute of Technology Karnataka
Bachelor's in Computer Science and Engineering.

Once more into the fray
Into the last good fight I’ll ever know
Live and die on this day
Live and die on this day.
— Grey

Package for Ensemble of Machine Learning Models

The motivation was to incorporate the various techniques described into a single Python package. The current version of the package can perform weighted averaging and can build stacking and blending models for binary classification; the other ensembling techniques will be incorporated in the future. The package supports data encoding: the data the user wishes to classify can be encoded in many different ways, such as label encoding (categorical encoding), one hot encoding, sum encoding, polynomial encoding, backward difference encoding, Helmert encoding, and hashing encoding. The encoded data is then used to train the base models. The package provides the user with the ability to build a number of base models, such as Gradient Boosting (XGBoost), Multi Layer Perceptron, Random Forest, Decision Tree, Linear Regression, and Logistic Regression. The user can build any or all of the base models with default parameter values, change the parameter values, or provide a list of parameter values to perform hyper parameter optimisation (using Hyperopt and grid search) and identify the optimum parameter values. Once the desired parameter values have been obtained, the respective base models are trained. The trained models are then used to obtain predictions on the cross validation data, and these predictions are used to construct a data frame that is used to train the stacking and blending models and to perform weighted averaging.

Once the base models have been trained, the user can select which ensembling technique to use. The data frame of predictions is used to perform weighted averaging, and the same data frame is used to train the stacking model; for training the blending model, the data frame of predictions is appended to the cross validation data. The algorithm or classifier used to train an ensemble model can be any of the algorithms/classifiers provided for training the base models. For testing the stacking and blending models, a test set can be held out and used to examine how well these ensemble models perform: whether they overfit, whether they underfit, and whether they provide better performance than the base models.

Installation

  • pip install ensembles

  • import ensembles
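
A minimal end-to-end sketch of the workflow follows. The dataset, file name, and target column are assumptions for illustration; each function used here is documented in the sections below.

import pandas as pd
import ensembles

#Hypothetical binary classification dataset loaded into a pandas DataFrame
df = pd.read_csv('credit.csv')   #assumed file, with an assumed binary target column 'default'

#Encode the data, hold out a test set, and choose the evaluation metric
test_data = ensembles.data_import(df, 'default', encode = 'label', split = True,
                                  stratify = True, split_size = 0.3)
ensembles.metric_set('roc_auc_score')

#Base models: gradient boosting and logistic regression (illustrative parameter choices)
param_gb = ensembles.parameter_set_gradient_boosting(eval_metric = ['auc'],
                                                     objective = ['binary:logistic'])
param_lr = ensembles.parameter_set_logistic_regression()
ensembles.train_base_models(['gradient_boosting', 'logistic_regression'],
                            [param_gb, param_lr])

#Ensembles: stack with a decision tree and take a weighted average of the base models
param_dt = ensembles.parameter_set_decision_tree()
weights = ensembles.assign_weights()
ensembles.train_ensemble_models(stack_model_list = ['decision_tree'],
                                stack_parameters_list = [param_dt],
                                perform_weighted_average = True,
                                weights_list = weights)

#Evaluate the base and ensemble models on the held-out test set
ensembles.test_models(test_data)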

Documentation

Number Of Layers (MLP)

def set_no_of_layers (number)

Setting the global variable no_of_layers. This variable will contain the number of layers that will be used to construct the multi layer perceptron model.

Parameters:

  • number[Integer]: The number of layers to be used in the multi layer perceptron model. A usage example follows the source code below.

Python Source Code
def set_no_of_layers(number):

    global no_of_layers

    no_of_layers = number
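
A usage sketch (assuming the function is exposed at the package's top level, as in the installation step above). Because parameter_set_multi_layer_perceptron reads the global no_of_layers, this should be called before setting the multi layer perceptron parameters.

import ensembles

#Three layers: two hidden layers plus the output layer
ensembles.set_no_of_layers(3)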

Data Import

def data_import (data, label_output, encode = None, split = True, stratify = True, split_size = 0.3)

This function is used for providing the input data. A usage example follows the source code below.

Parameters:

  • data[Pandas DataFrame]: The dataset, loaded into a pandas DataFrame, needs to be passed as this parameter. (Binary classification datasets only).

  • label_output[string]: The column name which contains the output needs to be passed as the parameter.

  • encode[string, optional (default = None) ("label", "binary", "hashing", "backward_difference", "helmert", "sum", "polynomial")]: The encoding that needs to be performed on the categorical columns of the data. If None, only label encoding of the string columns is performed. For more information on encoding techniques (categorical encoding) click here.

  • split[bool, optional (default = True)]: If True, the data is split to hold out a test set; if False, no test set is held out.

  • stratify[bool, optional (default = True)]: Stratified split (True) or random split (False) of the data.

  • split_size[float, optional (default = 0.3)]: The fraction of the data held out as the test set.

Returns:

  • Test dataset, if split = True.

Python Source Code
def data_import(data, label_output, encode = None, split = True, stratify = True, split_size = 0.3):

    global Data
    Data = data

    #Data = Data.dropna()
    #Data = data.fillna(data.mean())
    #Data = data.interpolate()
    #Data = data.fillna(data.median())
    #(interpolate) methods = {'linear', 'time', 'index', 'values', 'nearest', 'zero',
    #'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline', 'piecewise_polynomial',\
    #'from_derivatives', 'pchip', 'akima'}

    #Reading the data, into a Data Frame.
    global target_label
    target_label = label_output

    #Selecting the columns of string data type
    names = data.select_dtypes(include = ['object'])

    #Converting string categorical variables to integer categorical variables.
    label_encode(names.columns.tolist())

    if(target_label in names):

        columns = names.drop([target_label],axis=1).columns.tolist()

    else:

        columns = names.columns.tolist()

    #Data will be encoded to the form that the user enters
    encoding = {'binary':binary_encode,'hashing':hashing_encode,'backward_difference'
               :backward_difference_encode,'helmert':helmert_encode,'polynomial':
               polynomial_encode,'sum':sum_encode,'label':label_encode}

    if(encode != None):

        #Once the above encoding techniques has been selected by the user,
        #the appropriate encoding function is called
        encoding[encode](columns)


    #This function initializes the dataframes that will be used later in the program
    #data_initialize()

    #Splitting the data into train and test sets, according to user preference
    if(split == True):

        test_data = data_split(stratify,split_size)
        return test_data
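
A usage sketch; the DataFrame, file name, and target column name are illustrative assumptions.

import pandas as pd
import ensembles

df = pd.read_csv('data.csv')   #assumed file with a binary target column named 'target'

#Hashing-encode the categorical columns and hold out 30% of the rows as a stratified test set
test_data = ensembles.data_import(df, 'target', encode = 'hashing',
                                  split = True, stratify = True, split_size = 0.3)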

Evaluation Metrics

def metric_set (metric)

Setting the metric that will be used for validating the models and measuring model performance. A usage example follows the source code below.
The function sets two global variables: metric_score (the scikit-learn metric function) and metric_grid_search (the corresponding scoring parameter for grid search).

Parameters:

  • metric[string ("roc_auc_score", "average_precision_score", "f1_score", "log_loss", "accuracy_score", "mean_absolute_error", "mean_squared_error", "r2_score")]: One of the above metrics needs to be selected. This metric will be used to measure the performance of the models (base and ensemble models).

Python Source Code
def metric_set(metric):

    global metric_score
    global metric_grid_search
    metric_functions = {'roc_auc_score' : [roc_auc_score, 'roc_auc'], 'average_precision_score' :
                        [average_precision_score, 'average_precision'], 'f1_score' : [f1_score, 'f1'],
                        'log_loss' : [log_loss, 'log_loss'], 'accuracy_score' : [accuracy_score, 'accuracy'],
                        'mean_absolute_error' : [mean_absolute_error,'mean_absolute_error'],
                        'mean_squared_error':[mean_squared_error, 'mean_squared_error'],
                        'r2_score' : [r2_score, 'r2']
                        }

    metric_score = metric_functions[metric][0]
    metric_grid_search = metric_functions[metric][1]
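
For example, to evaluate all models (base and ensemble) by ROC AUC:

import ensembles

ensembles.metric_set('roc_auc_score')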

Gradient Boosting

def parameter_set_gradient_boosting (hyper_parameter_optimisation = False, eval_metric = None, booster = ['gbtree'], silent = [0], eta = [0.3], gamma = [0], max_depth = [6], min_child_weight = [1], max_delta_step = [0], subsample = [1], colsample_bytree = [1], colsample_bylevel = [1], lambda_xgb = [1], alpha = [0], tree_method = ['auto'], sketch_eps = [0.03], scale_pos_weight = [0], lambda_bias = [0], objective = ['reg:linear'], base_score = [0.5], num_class = None)

Setting the parameters for training the gradient boosting model. Every parameter value has to be of list type. The list can contain one single value (hyper_parameter_optimisation = False) or multiple values (hyper_parameter_optimisation = True). A usage example follows the source code below.

Parameters:

  • hyper_parameter_optimisation[bool, optional (default = False)]: If False, hyper parameter optimisation will not be performed. If True, hyper parameter optimisation will be performed across the search space, that is, the multiple values entered for the parameters of the gradient boosting model, using Hyperopt.

  • Documentation of the Gradient Boosting model (XGBoost) can be obtained [here].

Returns:

  • Dictionary containing the respective parameter names and values.

Python Source Code
#Defining the parameters for the XGBoost (Gradient Boosting) Algorithm.
def parameter_set_gradient_boosting(hyper_parameter_optimisation = False, eval_metric = None, booster = ['gbtree'],\
                                    silent = [0], eta = [0.3], gamma = [0], max_depth = [6],\
                                    min_child_weight = [1], max_delta_step = [0], subsample = [1],\
                                    colsample_bytree = [1], colsample_bylevel = [1], lambda_xgb = [1], alpha = [0],\
                                    tree_method = ['auto'], sketch_eps = [0.03], scale_pos_weight = [0],\
                                    lambda_bias = [0], objective = ['reg:linear'], base_score = [0.5],\
                                    num_class = None):

    parameter_gradient_boosting = {}
    #This variable will be used to check if the user wants to perform hyper parameter optimisation.
    parameter_gradient_boosting['hyper_parameter_optimisation'] = hyper_parameter_optimisation

    #Setting objective and seed
    parameter_gradient_boosting['objective'] = objective[0]
    parameter_gradient_boosting['seed'] = 0

    if(num_class != None):

        parameter_gradient_boosting['num_class'] = num_class

    #If hyper parameter optimisation is False, we unlist the default values and/or the values that the user enters
    #in the form of a list. Values have to be entered by the user in the form of a list; for hyper parameter
    #optimisation = False, these values will be unlisted below.
    #Ex : booster = ['gbtree'] (default value) becomes booster = 'gbtree'
    #This is done because, for training, the model does not accept list type values
    if(parameter_gradient_boosting['hyper_parameter_optimisation'] == False):

        #Setting the parameters for the Booster, list values are unlisted (E.x - booster[0])
        parameter_gradient_boosting['booster'] = booster[0]
        parameter_gradient_boosting['eval_metric'] = eval_metric[0]
        parameter_gradient_boosting['eta'] = eta[0]
        parameter_gradient_boosting['gamma'] = gamma[0]
        parameter_gradient_boosting['max_depth'] = max_depth[0]
        parameter_gradient_boosting['min_child_weight'] = min_child_weight[0]
        parameter_gradient_boosting['max_delta_step'] = max_delta_step[0]
        parameter_gradient_boosting['subsample'] = subsample[0]
        parameter_gradient_boosting['colsample_bytree'] = colsample_bytree[0]
        parameter_gradient_boosting['colsample_bylevel'] = colsample_bylevel[0]
        parameter_gradient_boosting['base_score'] = base_score[0]
        parameter_gradient_boosting['lambda_bias'] = lambda_bias[0]
        parameter_gradient_boosting['alpha'] = alpha[0]
        parameter_gradient_boosting['tree_method'] = tree_method[0]
        parameter_gradient_boosting['sketch_eps'] = sketch_eps[0]
        parameter_gradient_boosting['scale_pos_weigth'] = scale_pos_weight[0]
        parameter_gradient_boosting['lambda'] = lambda_xgb[0]

    else:

        #Setting parameters for the Booster which will be optimized later using hyperopt.
        #The user can enter a list of values that he wants to optimize
        parameter_gradient_boosting['booster'] = booster
        parameter_gradient_boosting['eval_metric'] = eval_metric
        parameter_gradient_boosting['eta'] = eta
        parameter_gradient_boosting['gamma'] = gamma
        parameter_gradient_boosting['max_depth'] = max_depth
        parameter_gradient_boosting['min_child_weight'] = min_child_weight
        parameter_gradient_boosting['max_delta_step'] = max_delta_step
        parameter_gradient_boosting['subsample'] = subsample
        parameter_gradient_boosting['colsample_bytree'] = colsample_bytree
        parameter_gradient_boosting['colsample_bylevel'] = colsample_bylevel
        parameter_gradient_boosting['base_score'] = base_score
        parameter_gradient_boosting['lambda_bias'] = lambda_bias
        parameter_gradient_boosting['alpha'] = alpha
        parameter_gradient_boosting['tree_method'] = tree_method
        parameter_gradient_boosting['sketch_eps'] = sketch_eps
        parameter_gradient_boosting['scale_pos_weigth'] = scale_pos_weight
        parameter_gradient_boosting['lambda'] = lambda_xgb

    return parameter_gradient_boosting
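
A usage sketch for hyper parameter optimisation: each parameter is passed as a list, and supplying more than one value defines the search space explored by Hyperopt. The values below are illustrative; note that eval_metric is also expected as a list (the source above indexes it with eval_metric[0]).

import ensembles

#Search over several learning rates, tree depths and subsample ratios; other parameters keep their defaults
param_gb = ensembles.parameter_set_gradient_boosting(
    hyper_parameter_optimisation = True,
    eval_metric = ['auc'],
    objective = ['binary:logistic'],
    eta = [0.05, 0.1, 0.3],
    max_depth = [4, 6, 8],
    subsample = [0.8, 1])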

Multi Layer Perceptron

def parameter_set_multi_layer_perceptron (hyper_parameter_optimisation = False, optimizer = ['rmsprop'], init_layer = [], dim_layer = [], activation = [], dropout = [], weights = [], W_regularizer = [], b_regularizer = [], activity_regularizer = [], W_constraint = [], b_constraint = [], bias = [])

Setting the parameters for training the multi layer perceptron model. Every parameter value has to be of list type, and a list of lists needs to be passed as each parameter value: the parameter values for each layer are enclosed in a list, and the per-layer lists are enclosed in an outer list. A usage example follows the source code below.
Example
Layer 1: dim_layer = 32
Layer 2: dim_layer = 64
This is passed as dim_layer = [[32],[64]]
The list for each layer can contain no value (default values will be used), one single value (hyper_parameter_optimisation = False) or multiple values (hyper_parameter_optimisation = True).

Parameters:

  • hyper_parameter_optimisation[bool, optional (default = False)]: If False, hyper parameter optimisation will not be performed. If True, hyper parameter optimisation will be performed across the search space, that is, the multiple values entered for the parameters of the Multi Layer Perceptron model, using Hyperopt.

  • Documentation of the Multi Layer Perceptron model can be obtained [here].

Returns:

  • Dictionary containing the respective parameter names and values.

Python Source Code
def parameter_set_multi_layer_perceptron(hyper_parameter_optimisation = False, optimizer = ['rmsprop'], \
                                         init_layer = [], dim_layer = [], \
                                         activation = [], dropout = [], weights = [], \
                                         W_regularizer = [], b_regularizer = [], \
                                         activity_regularizer = [], W_constraint = [], \
                                         b_constraint = [], bias = []):

    global no_of_layers

    parameter_multi_layer_perceptron = {}
    #This variable will be used to check if the user wants to perform hyper parameter optimisation.
    parameter_multi_layer_perceptron['hyper_parameter_optimisation'] = hyper_parameter_optimisation

    if (parameter_multi_layer_perceptron['hyper_parameter_optimisation'] == 'Default'):

        x = 1

    else:

        x = no_of_layers


    if (init_layer == []):

        init_layer = [['glorot_uniform']] * x

    if (dim_layer == []):

        dim_layer = [[32]] * x

    if (activation == []):

        activation = [['sigmoid']] * x

    if (dropout == []):

        dropout = [[0]] * x

    if (weights == []):

        weights = [[None]] * x

    if (W_regularizer == []):

        W_regularizer = [[None]] * x

    if (b_regularizer == []):

        b_regularizer = [[None]] * x

    if (activity_regularizer == []):

        activity_regularizer = [[None]] * x

    if (W_constraint == []):

        W_constraint = [[None]] * x

    if (b_constraint == []):

        b_constraint = [[None]] * x

    if (bias == []):

        bias= [[True]] * x



    for i in range(no_of_layers):

        if(parameter_multi_layer_perceptron['hyper_parameter_optimisation'] == False):

            parameter_multi_layer_perceptron['dim_layer'+str(i)] = dim_layer[i][0]
            parameter_multi_layer_perceptron['activation_layer'+str(i)] = activation[i][0]
            parameter_multi_layer_perceptron['init_layer'+str(i)] = init_layer[i][0]
            parameter_multi_layer_perceptron['dropout'+str(i)] = dropout[i][0]
            parameter_multi_layer_perceptron['weights'+str(i)] = weights[i][0]
            parameter_multi_layer_perceptron['W_regularizer'+str(i)] = W_regularizer[i][0]
            parameter_multi_layer_perceptron['b_regularizer'+str(i)] = b_regularizer[i][0]
            parameter_multi_layer_perceptron['activity_regularizer'+str(i)] = activity_regularizer[i][0]
            parameter_multi_layer_perceptron['W_constraint'+str(i)] = W_constraint[i][0]
            parameter_multi_layer_perceptron['b_constraint'+str(i)] = b_constraint[i][0]
            parameter_multi_layer_perceptron['bias'+str(i)] = bias[i][0]
            parameter_multi_layer_perceptron['optimizer'] = optimizer[0]
            parameter_multi_layer_perceptron['dim_layer'+str(no_of_layers-1)] = 1
            parameter_multi_layer_perceptron['dropout'+str(no_of_layers-1)] = 0

        else:

            parameter_multi_layer_perceptron['dim_layer'+str(i)] = dim_layer[i]
            parameter_multi_layer_perceptron['activation_layer'+str(i)] = activation[i]
            parameter_multi_layer_perceptron['init_layer'+str(i)] = init_layer[i]
            parameter_multi_layer_perceptron['dropout'+str(i)] = dropout[i]
            parameter_multi_layer_perceptron['weights'+str(i)] = weights[i]
            parameter_multi_layer_perceptron['W_regularizer'+str(i)] = W_regularizer[i]
            parameter_multi_layer_perceptron['b_regularizer'+str(i)] = b_regularizer[i]
            parameter_multi_layer_perceptron['activity_regularizer'+str(i)] = activity_regularizer[i]
            parameter_multi_layer_perceptron['W_constraint'+str(i)] = W_constraint[i]
            parameter_multi_layer_perceptron['b_constraint'+str(i)] = b_constraint[i]
            parameter_multi_layer_perceptron['bias'+str(i)] = bias[i]
            parameter_multi_layer_perceptron['optimizer'] = optimizer
            parameter_multi_layer_perceptron['dim_layer'+str(no_of_layers-1)] = [1]
            parameter_multi_layer_perceptron['dropout'+str(no_of_layers-1)] = [0]

    return parameter_multi_layer_perceptron
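
A usage sketch for a three-layer network; set_no_of_layers must be called first, since the function reads the global no_of_layers. The layer sizes, activations and dropout rates below are illustrative.

import ensembles

ensembles.set_no_of_layers(3)

#One inner list per layer; the source above forces the final layer to dim_layer = 1 and dropout = 0
param_mlp = ensembles.parameter_set_multi_layer_perceptron(
    dim_layer = [[64], [32], [1]],
    activation = [['relu'], ['relu'], ['sigmoid']],
    dropout = [[0.5], [0.25], [0]],
    optimizer = ['rmsprop'])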

Decision Tree

def parameter_set_decision_tree (criterion = ['gini'], splitter = ['best'], max_depth = [None], min_samples_split = [2], min_samples_leaf = [1], min_weight_fraction_leaf = [0.0], max_features = [None], random_state = [None], max_leaf_nodes = [None], class_weight = [None], presort = [False])

Setting the parameters for training the decision tree model. Every parameter value has to be of list type. The list can contain one single value (hyper parameter optimisation will not be performed) or multiple values (hyper parameter optimisation will be performed using grid search). A usage example follows the source code below.

Parameters:

  • Documentation of the Decision Tree model can be obtained [here]

Returns:

  • Dictionary containing the respective parameter names and values.

Python Source Code
def parameter_set_decision_tree(criterion = ['gini'], splitter = ['best'], max_depth = [None],\
                                min_samples_split = [2], min_samples_leaf = [1], min_weight_fraction_leaf = [0.0],\
                                max_features = [None], random_state = [None], max_leaf_nodes = [None],\
                                class_weight = [None], presort = [False]):

    parameters_decision_tree = {}
    parameters_decision_tree['criterion'] = criterion
    parameters_decision_tree['splitter'] = splitter
    parameters_decision_tree['max_depth'] = max_depth
    parameters_decision_tree['min_samples_split'] = min_samples_split
    parameters_decision_tree['min_samples_leaf'] = min_samples_leaf
    parameters_decision_tree['min_weight_fraction_leaf'] = min_weight_fraction_leaf
    parameters_decision_tree['max_features'] = max_features
    parameters_decision_tree['random_state'] = random_state
    parameters_decision_tree['max_leaf_nodes'] = max_leaf_nodes
    parameters_decision_tree['class_weight'] = class_weight
    parameters_decision_tree['presort'] = presort

    return parameters_decision_tree
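
A usage sketch: passing multiple values for a parameter defines the grid that grid search will explore. The values below are illustrative.

import ensembles

#Search over the split criterion, tree depth and minimum leaf size
param_dt = ensembles.parameter_set_decision_tree(
    criterion = ['gini', 'entropy'],
    max_depth = [3, 5, None],
    min_samples_leaf = [1, 5, 10])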

Random Forest

def parameter_set_random_forest (n_estimators = [10], criterion = ['gini'], max_depth = [None], min_samples_split = [2], min_samples_leaf = [1], min_weight_fraction_leaf = [0.0], max_features = ['auto'], max_leaf_nodes = [None], bootstrap = [True], oob_score = [False], random_state = [None], verbose = [0],warm_start = [False], class_weight = [None])

Setting the parameters for training the random forest model. Every parameter value has to be of list type. The list can contain one single value (hyper parameter optimisation will not be performed) or multiple values (hyper parameter optimisation will be performed using grid search). A usage example follows the source code below.

Parameters:

  • Documentation of the Random Forest model can be obtained [here]

Returns:

  • Dictionary containing the respective parameter names and values.

Python Source Code
#Parameters for random forest. To perform hyper parameter optimisation a list of multiple elements can be entered
#and the optimal value in that list will be picked using grid search
def parameter_set_random_forest(n_estimators = [10], criterion = ['gini'], max_depth = [None],\
                                min_samples_split = [2], min_samples_leaf = [1], min_weight_fraction_leaf = [0.0],\
                                max_features = ['auto'], max_leaf_nodes = [None], bootstrap = [True],\
                                oob_score = [False], random_state = [None], verbose = [0],warm_start = [False],\
                                class_weight = [None]):

    parameters_random_forest = {}
    parameters_random_forest['criterion'] = criterion
    parameters_random_forest['n_estimators'] = n_estimators
    parameters_random_forest['max_depth'] = max_depth
    parameters_random_forest['min_samples_split'] = min_samples_split
    parameters_random_forest['min_samples_leaf'] = min_samples_leaf
    parameters_random_forest['min_weight_fraction_leaf'] = min_weight_fraction_leaf
    parameters_random_forest['max_features'] = max_features
    parameters_random_forest['random_state'] = random_state
    parameters_random_forest['max_leaf_nodes'] = max_leaf_nodes
    parameters_random_forest['class_weight'] = class_weight
    parameters_random_forest['bootstrap'] = bootstrap
    parameters_random_forest['oob_score'] = oob_score
    parameters_random_forest['warm_start'] = warm_start

    return parameters_random_forest
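
A usage sketch with an illustrative grid over the forest size, the number of features considered at each split, and the tree depth.

import ensembles

param_rf = ensembles.parameter_set_random_forest(
    n_estimators = [100, 300],
    max_features = ['sqrt', 'log2'],
    max_depth = [5, None])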

Linear Regression

def parameter_set_linear_regression (fit_intercept = [True], normalize = [False], copy_X = [True])

Setting the parameters for training the linear regression model. Every parameter value has to be of list type. The list can contain one single value (hyper parameter optimisation will not be performed) or multiple values (hyper parameter optimisation will be performed using grid search).

Parameters:

  • Documentation of the Linear Regression model can be obtained [here]

Returns:

  • Dictionary containing the respective parameter names and values.

Python Source Code
#Parameters for linear regression. To perform hyper parameter optimisation a list of multiple elements can be entered
#and the optimal value in that list will be picked using grid search
def parameter_set_linear_regression(fit_intercept = [True], normalize = [False], copy_X = [True]):

    parameters_linear_regression = {}
    parameters_linear_regression['fit_intercept'] = fit_intercept
    parameters_linear_regression['normalize'] = normalize

    return parameters_linear_regression

Logistic Regression

def parameter_set_logistic_regression (penalty = ['l2'], dual = [False], tol = [0.0001], C = [1.0], fit_intercept = [True], intercept_scaling = [1], class_weight = [None], random_state = [None], solver = ['liblinear'], max_iter = [100], multi_class = ['ovr'], verbose = [0], warm_start = [False])

Setting the parameters for training the logistic regression model. Every parameter value has to be of list type. The list can contain one single value (hyper parameter optimisation will not be performed) or multiple values (hyper parameter optimisation will be performed using grid search). A usage example follows the source code below.

Parameters:

  • Documentation of the Logistic Regression model can be obtained [here]

Returns:

  • Dictionary containing the respective parameter names and values.

Python Source Code
#Parameters for logistic regression. To perform hyper parameter optimisation a list of multiple elements can be entered
#And the optimal value in that list will be picked using grid search
def parameter_set_logistic_regression(penalty = ['l2'], dual = [False], tol = [0.0001], C = [1.0],\
                                      fit_intercept = [True], intercept_scaling = [1], class_weight = [None],\
                                      random_state = [None], solver = ['liblinear'], max_iter = [100],\
                                      multi_class = ['ovr'], verbose = [0], warm_start = [False]):

    parameters_logistic_regression = {}
    parameters_logistic_regression['penalty'] = penalty
    parameters_logistic_regression['dual'] = dual
    parameters_logistic_regression['tol'] = tol
    parameters_logistic_regression['C'] = C
    parameters_logistic_regression['fit_intercept'] = fit_intercept
    parameters_logistic_regression['intercept_scaling'] = intercept_scaling
    parameters_logistic_regression['class_weight'] = class_weight
    parameters_logistic_regression['solver'] = solver
    parameters_logistic_regression['max_iter'] = max_iter
    parameters_logistic_regression['multi_class'] = multi_class
    parameters_logistic_regression['warm_start'] = warm_start

    return parameters_logistic_regression
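
A usage sketch with an illustrative grid over the regularisation strength and penalty; a logistic regression configured this way can be used as a base model or later as a stacking/blending model.

import ensembles

param_lr = ensembles.parameter_set_logistic_regression(
    penalty = ['l1', 'l2'],
    C = [0.01, 0.1, 1.0, 10.0])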

Training Base Models

def train_base_models (model_list, parameters_list, save_models = False)

The function trains the base models that the user has specified.
It also obtains the predictions of the base models, which are then used to construct a dataset for stacking and/or blending.
The models are trained in parallel, and their predictions are also computed in parallel, using joblib.
A usage example follows the source code below.

Parameters:

  • model_list[list ("gradient_boosting", "decision_tree", "random_forest", "linear_regression", "logistic_regression")]: A non-empty list containing the names of the base models that have to be trained. Base models can be repeated, that is, the same base model can be trained with different parameters.

  • parameters_list[list]: The parameter dictionaries returned by the respective parameter set functions described above, entered in the same order as the model names in model_list.

  • save_models[bool, optional (default = False)]: If True, all the base models will be saved in .pkl files using joblib. To retrieve the base models and perform further operations with them, call the get_base_models() function (ensembles.get_base_models()); it returns all the base model objects.

Python Source Code
#This function calls the respective training and predict functions of the base models.
def train_base_models(model_list, parameters_list, save_models = False):

    print('\nTRAINING BASE MODELS\n')

    #Stratified split of the data into a training set and a cross-validation set
    train, cross_val = train_test_split(Data, test_size = 0.5, stratify = Data[target_label],random_state = 0)

    #Training the base models, and calculating the chosen metric score on the cross validation data.
    #Selecting the data (Training Data & Cross Validation Data)
    train_Y = train[target_label]
    train_X = train.drop([target_label],axis=1)

    cross_val_Y = cross_val[target_label]
    cross_val_X = cross_val.drop([target_label],axis=1)

    #The list of base models the user wants to train.
    global base_model_list
    base_model_list = model_list


    #No of base models that user wants to train
    global no_of_base_models
    no_of_base_models = len(base_model_list)


    #We get the list of base model training functions and predict functions. The elements of the two lists are
    #tuples that have (base model training function,model parameters), (base model predict functions) respectively
    [train_base_model_list,predict_base_model_list] = construct_model_parameter_list(base_model_list,\
                                                                                     parameters_list)


    #Training the base models in parallel; the resulting models are stored and will be used for cross validation.
    models = (Parallel(n_jobs = -1)(delayed(function)(train_X, train_Y, model_parameter)\
                                                   for function, model_parameter in train_base_model_list))

    if(save_models == True):

        save_base_models(models)


    #A list with elements as tuples containing (base model predict function, and its respective model object) is
    #returned. This list is used in the next step in the predict_base_models function, the list will be used in
    #joblibs parallel module/function to compute the predictions and metric scores of the base models
    #Appended in the following manner so it can be used in joblib's parallel module/function
    global base_model_predict_function_list
    base_model_predict_function_list = construct_model_predict_function_list(base_model_list, models,\
                                                                        predict_base_model_list)
    predict_base_models(cross_val_X, cross_val_Y,mode = 'train')
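
A usage sketch, assuming data_import and metric_set have already been called and the parameter dictionaries are built as in the earlier examples. Model names and parameter dictionaries are matched by position.

import ensembles

param_gb = ensembles.parameter_set_gradient_boosting(eval_metric = ['auc'],
                                                     objective = ['binary:logistic'])
param_dt = ensembles.parameter_set_decision_tree(max_depth = [5])
param_lr = ensembles.parameter_set_logistic_regression()

ensembles.train_base_models(['gradient_boosting', 'decision_tree', 'logistic_regression'],
                            [param_gb, param_dt, param_lr],
                            save_models = True)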

Weighted Average

def assign_weights (weights = 'default',hyper_parameter_optimisation = False)

The function needs to be called if a weighted average is going to be performed; it sets the weights used for the weighted average of the base models. A usage example follows the source code below.

Parameters:

  • weights[list or string, optional (default = 'default')]: The value 'default' assigns default weights: if hyper_parameter_optimisation = True, a weight is chosen for each model from the range 0-9 by hyper parameter optimisation; if hyper_parameter_optimisation = False, equal weights of 1 are assigned to all the models. Alternatively, the weights to be assigned to the models can be passed manually, or a list of candidate weights for each model (a nested list) can be passed, in which case the optimum weight for each model is chosen from its list using Hyperopt.

  • hyper_parameter_optimisation[bool, optional (default = False)]: Needs to be True when multiple candidate weights are passed per model (nested list); False when weights are assigned manually or equal weights are used.

Returns:

  • List containing the weights that will be used for performing the weighted average.

Python Source Code
#The user can either use the default weights or provide their own list of values.
def assign_weights(weights = 'default',hyper_parameter_optimisation = False):

    weight_list = list()

    #The last element of the weight_list indicates whether hyper parameter optimisation needs to be performed
    if(hyper_parameter_optimisation == True):

        if(weights == 'default'):

            weight_list = [range(10)] * no_of_base_models
            weight_list.append(True)

        else:

            weight_list = weights
            weight_list.append(True)

    else:

        if(weights == 'default'):

            weight_list = [1] * no_of_base_models
            weight_list.append(False)

        else:

            weight_list = weights
            weight_list.append(False)

    return weight_list
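
Two illustrative ways to call the function, assuming three base models have been trained: fixed manual weights, or a nested list of candidate weights to be optimised with Hyperopt.

import ensembles

#Manual weights, one per base model, in the order the base models were trained
weights = ensembles.assign_weights(weights = [2, 1, 3])

#Candidate weights per model; the best combination is chosen by Hyperopt
weights = ensembles.assign_weights(weights = [[1, 2, 3], [1, 5], [1, 2]],
                                   hyper_parameter_optimisation = True)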

Training Ensemble Models

def train_ensemble_models (stack_model_list = [], stack_parameters_list = [], blend_model_list = [], blend_parameters_list = [], perform_weighted_average = False, weights_list = None, save_models = False)

The function needs to be called for training the ensemble models. The models are trained in parallel, and their predictions are also computed in parallel, using joblib.
A usage example follows the source code below.

Parameters:

  • stack_model_list[list, optional ("gradient_boosting", "decision_tree", "random_forest", "linear_regression", "logistic_regression")]: A list containing the names of the stacking models that have to be trained. Stacking models can be repeated, that is, the same stacking model can be trained with different parameters. Pass an empty list when stacking is not to be performed.

  • stack_parameters_list[list]: The parameter dictionaries returned by the respective parameter set functions described above, entered in the same order as the model names in stack_model_list.

  • blend_model_list[list, optional ("gradient_boosting", "decision_tree", "random_forest", "linear_regression", "logistic_regression")]: A list containing the names of the blending models that have to be trained. Blending models can be repeated, that is, the same blending model can be trained with different parameters. Pass an empty list when blending is not to be performed.

  • blend_parameters_list[list]: The parameter dictionaries returned by the respective parameter set functions described above, entered in the same order as the model names in blend_model_list.

  • perform_weighted_average[bool, optional (default = False)]: Whether to perform a weighted average of the base models.

  • weights_list[list]: The list of weights returned by the assign_weights() function.

Python Source Code
#Training the second level models in parallel
def train_ensemble_models(stack_model_list = [], stack_parameters_list = [], blend_model_list = [],\
                              blend_parameters_list = [], perform_weighted_average = False, weights_list = None,
                          save_models = False):

    print('\nTRAINING ENSEMBLE MODELS\n')

    global no_of_ensemble_models

    #This list will contain the names of the models/algorithms that have been used as second level models
    #This list will be used later in the testing phase for identifying which model belongs to which ensemble
    #(stacking or blending), hence the use of dictionaries as elements of the list
    #Analogous to the base_model_list
    global ensmeble_model_list
    ensmeble_model_list = list()

    train_stack_model_list = list()
    predict_stack_model_list = list()
    train_blend_model_list = list()
    predict_blend_model_list = list()

    #The list will be used to train the ensemble models, while using joblib's parallel
    train_second_level_models = list()

    #Stacking will not be done if user does not enter the list of models he wants to use for stacking
    if(stack_model_list != []):

        #Appending a dictionary whose key is 'Stacking' and whose values/elements are the names of the
        #models/algorithms used for the stacking procedure; this is done so that it is easy
        #to identify the models belonging to the stacking ensemble
        ensmeble_model_list.append({'Stacking' : stack_model_list})

        #We get the list of stacked model training functions and predict functions. The elements of the two
        #lists are tuples that have(base model training function,model parameters,train_stack function),
        #(base model predict functions,predict_stack function) respectively
        [train_stack_model_list,predict_stack_model_list] = construct_model_parameter_list(stack_model_list,\
                                                                                           stack_parameters_list,
                                                                                           stack=True)

    #Blending will not be done if user does not enter the list of models he wants to use for blending
    if(blend_model_list != []):

        #Appending a dictionary whose key is 'Blending' and whose values/elements are the names of the
        #models/algorithms used for the blending procedure; this is done so that it is easy
        #to identify the models belonging to the blending ensemble
        ensmeble_model_list.append({'Blending' : blend_model_list})

        #We get the list of blending model training functions and predict functions. The elements of the two
        #lists are tuples that have(base model training function,model parameters,train_blend function),
        #(base model predict functions,predict_blend function) respectively
        [train_blend_model_list,predict_blend_model_list] = construct_model_parameter_list(blend_model_list,\
                                                                                           blend_parameters_list,\
                                                                                           blend=True)

    #The new list contains either the stacked models or blending models or both or remain empty depending on what
    #the user has decided to use
    train_second_level_models = train_stack_model_list + train_blend_model_list

    #If the user wants to perform a weighted average, a tuple containing (hyper parameter optimisation = True/False,
    #the list of weights, either default or entered by the user, and the function that performs the weighted average)
    #will be created. This tuple will be appended to the list above.
    #weights_list[-1] is the element of the list that indicates whether hyper parameter optimisation needs to be
    #performed
    if(perform_weighted_average == True):

        train_weighted_average_list = (weights_list[-1], weights_list, weighted_average)
        train_second_level_models.append(train_weighted_average_list)


    no_of_ensemble_models = len(train_second_level_models)

    #If weighted average is performed, the last element of models will contain the metric score and weighted average
    #predictions, and not a model object. So we use the last element in different ways compared to the other model
    #objects

    #Training the ensemble models in parallel
    models = Parallel(n_jobs = -1)(delayed(function)(stack_X, stack_Y, model, model_parameter)\
                                        for model, model_parameter, function in train_second_level_models)


    #A list with elements as tuples containing((base model predict function,predict_stack or predict_blend functions)
    #,and its respective base model object) is returned. This list is used in the next step in the
    #predict_ensemble_models function, the list will be used in
    #joblibs parallel module/function to compute the predictions and metric score of the ensemble models
    #Appended in the following manner so it can be used in joblib's parallel module/function
    #Analogous to base_model_predict_function_list
    global ensmeble_model_predict_function_list
    ensmeble_model_predict_function_list = construct_model_predict_function_list(stack_model_list + blend_model_list,\
                                                                                 models, predict_stack_model_list
                                                                                 + predict_blend_model_list)

    #If a weighted average needs to be performed we append ((None (which indicates the testing phase), the
    #weighted average function), and the weights). Appended in this manner so it can be used in joblib's
    #parallel module/function
    if(perform_weighted_average == True):

        weight = models[-1][-1]
        print('Weighted Average')
        print('Weight',weight)
        print('Metric Score',models[-1][0])
        ensmeble_model_list.append({'Weighted Average' : [str(weight)]})
        ensmeble_model_predict_function_list.append(((None,weighted_average),weight))

    if(save_models == True and perform_weighted_average == True):

        del models[-1]
        no_of_ensemble_models = no_of_ensemble_models - 1
        save_ensemble_models(models)

    elif(save_models == True and perform_weighted_average == False):

        save_ensemble_models(models)
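
A usage sketch, assuming the base models have already been trained; stacking with a logistic regression model, blending with a gradient boosting model, and a weighted average are all requested here for illustration.

import ensembles

param_stack = ensembles.parameter_set_logistic_regression()
param_blend = ensembles.parameter_set_gradient_boosting(eval_metric = ['auc'],
                                                        objective = ['binary:logistic'])
weights = ensembles.assign_weights()   #equal weights of 1 for all base models

ensembles.train_ensemble_models(stack_model_list = ['logistic_regression'],
                                stack_parameters_list = [param_stack],
                                blend_model_list = ['gradient_boosting'],
                                blend_parameters_list = [param_blend],
                                perform_weighted_average = True,
                                weights_list = weights)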

Testing

def test_models (test_data)

The function needs to be called to measure the performance of the models (the base models and the ensemble models) on the test dataset.

Parameters:

  • test_data[Pandas DataFrame]: The held-out test dataset as a pandas DataFrame, for example the DataFrame returned by data_import when split = True. (Binary classification datasets only.)

Python Source Code
def test_models(test_data):

    print('\nTESTING PHASE\n')

    #Evaluating the base and ensemble models by calculating the chosen metric on the test data.
    #Selecting the data (Test Data)
    test_Y = test_data[target_label]
    test_X = test_data.drop([target_label],axis=1)

    predict_base_models(test_X,test_Y,mode='test')
    predict_ensemble_models(test_stack_X,test_stack_Y)
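
A usage sketch, evaluating all trained models on the test set that data_import returned at the start of the workflow.

import ensembles

#test_data is the DataFrame returned by ensembles.data_import(..., split = True)
ensembles.test_models(test_data)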