speedml package

Submodules

speedml.base module

class speedml.base.Base

Bases: object

static data_n()

Updates train_n and test_n numeric datasets (used for model data creation) based on numeric datatypes from train and test datasets.

speedml.feature module

Speedml Feature component with methods that work on dataset features or the feature engineering workflow. Contact author https://twitter.com/manavsehgal. Code, docs and demos https://speedml.com.

class speedml.feature.Feature

Bases: speedml.base.Base

add(a, num)

Update numeric feature a by adding num to each of its values.

concat(new, a, sep, b)

Create new text feature by concatenating a and b text feature values, using sep separator.

density(a)

Create a new feature, named after feature a with the suffix ‘_density’, holding the density (value_counts) of each unique value in a. Specify a as a string for a single feature or as a list of strings for multiple features.
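As a rough illustration of the density idea, here is a sketch on a plain Python list. It assumes that "density" means the occurrence count of each value, as value_counts suggests; the helper name and the sample data are hypothetical and this is not Speedml's own implementation.

```python
from collections import Counter


def density(values):
    """Return, for each item, how many times that item occurs in values."""
    counts = Counter(values)  # plays the role of value_counts
    return [counts[v] for v in values]


# Hypothetical sample column
tickets = ["A1", "B2", "A1", "C3", "A1"]
ticket_density = density(tickets)
print(ticket_density)  # [3, 1, 3, 1, 3]
```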

diff(new, a, b)

Create new numeric feature by subtracting a - b feature values.

divide(new, a, b)

Create new numeric feature by dividing a / b feature values. Replace division-by-zero with zero values.
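The division-by-zero rule can be sketched on plain lists as follows. The helper name and sample values are hypothetical; the only behavior taken from the description above is that a zero divisor yields zero.

```python
def safe_divide(a_values, b_values):
    """Divide element-wise, returning 0.0 wherever the divisor is zero."""
    return [a / b if b != 0 else 0.0 for a, b in zip(a_values, b_values)]


# Hypothetical sample columns
fare = [10.0, 7.5, 0.0]
family_size = [2, 0, 4]
fare_per_person = safe_divide(fare, family_size)
print(fare_per_person)  # [5.0, 0.0, 0.0]
```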

drop(features)[source]

Drop one or more list of strings naming features from train and test datasets.

extract(a, regex, new=None)[source]

Match regex regular expression with a text feature values to update a feature with matching text if new = None. Otherwise create new feature based on matching text.
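A minimal sketch of regex-based extraction on a list of strings, using the standard re module. It assumes the pattern contains exactly one capture group and that non-matching values become None; both are assumptions for illustration, and the helper name and sample names are hypothetical.

```python
import re


def extract_matches(values, regex):
    """Return the first capture group of regex for each value, or None."""
    pattern = re.compile(regex)
    result = []
    for value in values:
        match = pattern.search(value)
        result.append(match.group(1) if match else None)
    return result


# Hypothetical sample column: pull the title that precedes a period
names = ["Braund, Mr. Owen", "Heikkinen, Miss. Laina", "Unknown"]
titles = extract_matches(names, r" ([A-Za-z]+)\.")
print(titles)  # ['Mr', 'Miss', None]
```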

fillna(a, new)

Fill empty or null values in feature a with the new string value.

impute()

Replace empty values in the entire dataframe with median value for numerical features and most common values for text features.
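The median and most-common-value rule can be sketched for a single column as below. The sketch represents empty values as None and treats a column as numeric when all present values are int or float; these conventions, the helper name, and the sample data are assumptions. A column with no present values at all is not handled here.

```python
from collections import Counter
from statistics import median


def impute_column(values):
    """Fill None with the median (numeric) or the most common value (text)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = median(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical sample columns
ages = impute_column([22, None, 38, 26])
ports = impute_column(["S", "C", None, "S"])
print(ages)   # [22, 26, 38, 26]
print(ports)  # ['S', 'C', 'S', 'S']
```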

labels(features)

Generate numerical labels to replace text values in each of the categorical features named in the features list.

list_len(new, a)

Create new numeric feature based on the length (item count) of feature a, whose values are list objects.

mapping(a, data)

Convert values of categorical feature a using the data dictionary. Use this when the number of categories is limited; otherwise use labels.
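The difference between an explicit mapping and generated labels can be sketched as follows. The helper names and sample data are hypothetical. Assigning codes in sorted order mirrors how a typical label encoder behaves and is an assumption about labels, not something this page states.

```python
def apply_mapping(values, data):
    """Replace each value using the data dictionary; leave unknown values as-is."""
    return [data.get(v, v) for v in values]


def make_labels(values):
    """Assign an integer code to each distinct value, in sorted order."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


# Hypothetical sample columns
sex = apply_mapping(["male", "female", "female"], {"male": 0, "female": 1})
embarked = make_labels(["S", "C", "Q", "S"])
print(sex)       # [0, 1, 1]
print(embarked)  # [2, 0, 1, 2]
```

An explicit dictionary keeps control over which code each category receives, which is practical only for a handful of categories; generated labels scale to many categories but the codes carry no chosen meaning.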

outliers(a, lower=None, upper=None)

Fix outliers in feature a that fall below the lower percentile, above the upper percentile, or both.
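One common way to "fix" outliers is to clip them to a percentile boundary. The sketch below assumes that interpretation, which this page does not confirm, and uses a hand-written linear-interpolation percentile so that it needs only the standard library. All names and data are hypothetical.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)


def clip_outliers(values, lower=None, upper=None):
    """Clip values below the lower and/or above the upper percentile."""
    low_limit = percentile(values, lower) if lower is not None else None
    high_limit = percentile(values, upper) if upper is not None else None
    clipped = []
    for v in values:
        if low_limit is not None and v < low_limit:
            v = low_limit
        if high_limit is not None and v > high_limit:
            v = high_limit
        clipped.append(v)
    return clipped


# Hypothetical sample column with one extreme value
fares = [5, 7, 8, 9, 10, 11, 12, 13, 14, 500]
clipped = clip_outliers(fares, upper=90)
print(clipped)  # the 500 is pulled down to roughly 62.6; other values are unchanged
```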

product(new, a, b)

Create new numeric feature by multiplying a * b feature values.

replace(a, match, new)

In feature a, replace values matching the match string or list of strings with the new string.

round(new, a, precision)

Create new numeric feature by rounding a feature value to precision decimal places.

sum(new, a, b)

Create new numeric feature by adding a + b feature values.

word_count(new, a)

Create new numeric feature based on length or word count from a feature containing free-form text.

speedml.model module

Speedml Model component with methods that work on sklearn models workflow. Contact author https://twitter.com/manavsehgal. Code, docs and demos https://speedml.com.

class speedml.model.Model

Bases: speedml.base.Base

data()

Prepare model input data: Base.train_y as a Series, and the Base.train_X and Base.test_X datasets as matrices.

evaluate()

Model evaluation across multiple classifiers based on accuracy of predictions.

ranks()

Returns DataFrame of model ranking sorted by Accuracy.

speedml.plot module

Speedml Plot component with methods that work on plots or the Exploratory Data Analysis (EDA) workflow. Contact author https://twitter.com/manavsehgal. Code, docs and demos https://speedml.com.

class speedml.plot.Plot

Bases: speedml.base.Base

bar(x, y)

Bar plot x across y feature values.

continuous(y)

Plot continuous features (numeric) using scatter plot. Use this to determine outliers within continuous features.

correlate()

Plot correlation matrix heatmap for numerical features of the training dataset. Use this plot to understand if certain features are duplicates, are of low importance, or are possibly of high importance for our model.

crosstab(x, y)

Return a dataframe cross-tabulating values from feature x and y.

distribute()

Plot distribution histograms for all numeric features. This helps to see how far each distribution is skewed from normal and to quickly identify outliers relative to the rest of the dataset.

importance()

Plot importance of features based on ExtraTreesClassifier.

model_ranks()

Plot ranking among accuracy offered by various models based on our datasets.

ordinal(y)

Plot ordinal features (categorical numeric) using a violin plot against the target feature. Use this to determine outliers within ordinal features spread across associated target feature values.

strip(x, y)

Strip plot of x across y feature values.

xgb_importance()

Plot importance of features based on XGBoost.

speedml.util module

Speedml utility methods. Contact author https://twitter.com/manavsehgal. Code, docs and demos https://speedml.com.

class speedml.util.DataFrameImputer

Bases: sklearn.base.TransformerMixin

fit(X, y=None)

Uses X dataset to fill empty values for numeric features with the median value, otherwise fills most common value for text features.

transform(X, y=None)

Applies the self.fill rule defined in the fit method to X.

speedml.xgb module

Speedml Xgb component with methods that work on XGBoost model workflow. Contact author https://twitter.com/manavsehgal. Code, docs and demos https://speedml.com.

class speedml.xgb.Xgb

Bases: speedml.base.Base

classifier()

Creates the XGBoost Classifier with Base.xgb_params dictionary of model hyper-parameters.

cv(grid_params)

Calculate the Cross-Validation (CV) score for XGBoost model based on grid_params parameters. Sets xgb.cv_results variable to the resulting dataframe.

feature_selection()

Returns threshold and accuracy for n number of features.

fit()

Sets Base.xgb_model with trained XGBoost model.

hyper(select_params, fixed_params)

Tune XGBoost hyper-parameters by selecting from permutations of values from the select_params dictionary. Remaining parameters with single values are specified by the fixed_params dictionary. Returns a dataframe with ranking of select_params items.
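To show how candidate parameter sets could be built from the two dictionaries, here is a sketch that assumes "permutations" means every combination of the listed values (a Cartesian product). The helper name and the parameter values are hypothetical, and no model is trained or ranked here.

```python
from itertools import product


def param_grid(select_params, fixed_params):
    """Yield one parameter dict per combination of select_params values."""
    names = list(select_params)
    for combo in product(*(select_params[name] for name in names)):
        params = dict(fixed_params)      # start from the single-valued settings
        params.update(zip(names, combo))  # overlay one candidate combination
        yield params


# Hypothetical search space
select_params = {"max_depth": [3, 5], "learning_rate": [0.1, 0.3]}
fixed_params = {"n_estimators": 100}
grid = list(param_grid(select_params, fixed_params))
print(len(grid))  # 4
print(grid[0])    # {'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1}
```

The number of candidates is the product of the list lengths in select_params, so keeping most parameters in fixed_params keeps the search small.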

params(params)

Sets Base.xgb_params to params dictionary.

predict()

Sets xgb.predictions with predictions from the XGBoost model.

sample_accuracy()

Calculate the accuracy of an XGBoost model based on number of correct labels in prediction.
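Accuracy as the share of correct labels can be sketched as below. The helper name and sample labels are hypothetical, and returning a fraction between 0 and 1 (rather than a percentage) is an assumption.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: three of four predictions are correct
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(acc)  # 0.75
```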

Module contents

Speedml is a Python package to speed start machine learning projects. Contact author https://twitter.com/manavsehgal. Code, docs and demos https://speedml.com.

class speedml.Speedml(train, test, target, uid=None)

Bases: speedml.base.Base

configure(option=None, value=None)

Configure Speedml defaults with option configuration parameter, value setting. When method is called without parameters it simply returns the current config dictionary, otherwise returns the updated configuration.

eda()

Performs speed exploratory data analysis (EDA) on the current state of datasets. Returns metrics and recommendations as a dataframe. Progressively hides metrics as they achieve workflow completion goals or meet the configured defaults and thresholds.

info()

Runs DataFrame.info() on both Train and Test datasets.

save_results(columns, file_path)

Saves the columns dictionary as a DataFrame to the CSV file at file_path.

shape()

Print shape (samples, features) of train, test datasets and number of numerical features in each dataset.

slug()