# infpy Package

Infpy is a Python package I have put together that implements some of the algorithms I (John Reid) have used in my research. In particular, it has a Gaussian process package that is largely based on the excellent book Gaussian Processes for Machine Learning by Rasmussen and Williams.

The Gaussian process package is the only infpy package that is extensively documented so far, but you are welcome to try out the others. The Gaussian process package has the following features:

• noisy data is easily modelled
• many different kernels are supported out of the box allowing many models to be tested
• kernel composition (point-wise sum and product) is intuitive permitting rapid model evaluation
• maximum likelihood estimation of hyper-parameters facilitates model comparison
• numpy integration allows easy interoperability with other Python scientific toolkits
• high quality matplotlib plots are easy to create
• best of both worlds: ease of using an interpreted language, but all performance-critical linear algebra is performed in compiled code

infpy.__init__.version_string()[source]

Return the release and svn revision as a string.

## bootstrap Module

Code to handle bootstrap analyses.

infpy.bootstrap.bootstrap_p_value(bootstrap_stats, stat_value)[source]

Calculate the p-value for the statistic’s value given the bootstrap values.

infpy.bootstrap.calculate_bootstrap_statistics(samples, statistic)[source]

Calculate the bootstrap statistics for the samples.

infpy.bootstrap.generate_bootstrap_samples(num_samples, test_universe, test_set_sizes)[source]

Yield samples that match the sizes given in test_set_sizes
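The three functions above suggest a resample, recompute, compare workflow. The following is a minimal sketch of that idea in plain Python, not infpy's implementation: the helper names `resample`, `bootstrap_statistics` and `p_value` are hypothetical, and the one-sided definition of the p-value (the fraction of bootstrap statistics at least as large as the observed value) is an assumption.

```python
import random


def resample(universe, size, rng):
    """Draw one bootstrap sample of the given size, with replacement."""
    return [rng.choice(universe) for _ in range(size)]


def bootstrap_statistics(samples, statistic):
    """Apply the statistic to every bootstrap sample."""
    return [statistic(sample) for sample in samples]


def p_value(bootstrap_stats, stat_value):
    """Fraction of bootstrap statistics >= stat_value (assumed definition)."""
    hits = sum(1 for s in bootstrap_stats if s >= stat_value)
    return hits / float(len(bootstrap_stats))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
universe = list(range(100))
samples = [resample(universe, 20, rng) for _ in range(1000)]
stats = bootstrap_statistics(samples, lambda xs: sum(xs) / float(len(xs)))
p = p_value(stats, 60.0)  # how unusual is an observed mean of 60?
```

A two-sided test or a different tie-handling rule would change `p_value` only; the sampling and statistic steps stay the same.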

## convergence_test Module

Code to implement convergence tests (primarily for sequences of log likelihoods).

class infpy.convergence_test.LlConvergenceTest(eps=1e-08, should_increase=True, use_absolute_difference=False)[source]

Bases: object

Tests for convergence of a series of log likelihoods.

LLs = None

The log likelihoods.

eps = None

The tolerance used by the convergence test (1e-08 by default).

should_increase = None

If true a warning is printed if the log likelihoods don’t always increase.

use_absolute_difference = None

If true, the absolute difference is used for the convergence test; otherwise any decrease is viewed as convergence.
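As an illustration of how these attributes might interact, here is a small sketch of a convergence test over a series of log likelihoods. The class name `ConvergenceSketch`, its call interface, and the exact stopping rule are assumptions made for this example; they are not taken from the infpy source.

```python
class ConvergenceSketch(object):
    """Hypothetical stand-in for a log likelihood convergence test."""

    def __init__(self, eps=1e-8, use_absolute_difference=False):
        self.eps = eps
        self.use_absolute_difference = use_absolute_difference
        self.LLs = []  # the log likelihoods seen so far

    def __call__(self, LL):
        """Record LL and return True once the series looks converged."""
        self.LLs.append(LL)
        if len(self.LLs) < 2:
            return False  # need two values before comparing
        diff = self.LLs[-1] - self.LLs[-2]
        if self.use_absolute_difference:
            return abs(diff) < self.eps
        # Assumed reading: an increase smaller than eps, or any decrease, stops.
        return diff < self.eps


test = ConvergenceSketch(eps=1e-3)
results = [test(ll) for ll in [-10.0, -5.0, -4.0, -3.9995]]
```

In an iterative fit the object would be called once per iteration and the loop would stop at the first `True`.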

infpy.convergence_test.check_LL_increased(last_LL, LL, tag='', tolerance=0.01, raise_error=True)[source]

Takes 2 numpy arrays of components of a log likelihood and assumes the total LL is the sum of each array. Compares the 2 totals and, if the new one is smaller than the last by at least tolerance, raises an error.
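The check described above can be sketched in a few lines. This is an illustrative version only: `check_increased` is a hypothetical name, plain lists stand in for numpy arrays, and the choice of `ValueError` and of returning a boolean are assumptions.

```python
def check_increased(last_LL, LL, tolerance=0.01, raise_error=True):
    """Return True unless the summed LL dropped by at least tolerance."""
    last_total = sum(last_LL)  # total LL is assumed to be the sum of components
    total = sum(LL)
    decreased = total <= last_total - tolerance
    if decreased and raise_error:
        raise ValueError(
            "log likelihood decreased: %f -> %f" % (last_total, total))
    return not decreased


ok = check_increased([1.0, 2.0], [1.5, 2.0])                    # total went up
bad = check_increased([1.0, 2.0], [0.5, 2.0], raise_error=False)  # total fell
```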

## distribution Module

class infpy.distribution.Distribution[source]

Bases: object

Base class for distributions

mean()[source]

Mean of the distribution

plot(start, end, num_steps=100)[source]
sample_from()[source]

Samples one value from the distribution

supports(x)[source]

Returns whether x is in the distribution’s support

variance()[source]

Variance of the distribution

http://en.wikipedia.org/wiki/Gamma_distribution

Mean is k * theta. Support is [0, inf).

dlog_pdf_dx(x)[source]
log_pdf(x)[source]
mean()[source]

Mean of the distribution

sample_from()[source]

Samples one value from the distribution

supports(x)[source]

Returns whether x is in the distribution’s support

variance()[source]

Variance of the distribution
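The stated mean of k * theta and support of [0, inf) for a Gamma distribution can be checked empirically. The sketch below uses the standard library's `random.gammavariate` (shape, then scale) rather than infpy; the values of `k` and `theta` and the sample size are arbitrary choices for illustration.

```python
import random

k, theta = 2.0, 3.0  # shape and scale, chosen arbitrarily
rng = random.Random(42)

# random.gammavariate(alpha, beta) takes the shape first, then the scale.
draws = [rng.gammavariate(k, theta) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
expected_mean = k * theta            # mean of a Gamma(k, theta) distribution
all_in_support = all(d >= 0.0 for d in draws)
```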

class infpy.distribution.LogNormalDistribution(mu=1.0, sigma=1.0)[source]

http://en.wikipedia.org/wiki/Log_normal_distribution

dlog_pdf_dx(x)[source]
log_pdf(x)[source]
mean()[source]

Mean of the distribution

sample_from()[source]

Samples one value from the distribution

supports(x)[source]

Returns whether x is in the distribution’s support

variance()[source]

Variance of the distribution

class infpy.distribution.NormalDistribution(mu=1.0, sigma=1.0)[source]

http://en.wikipedia.org/wiki/Normal_distribution

dlog_pdf_dx(x)[source]
log_pdf(x)[source]
mean()[source]

Mean of the distribution

sample_from()[source]

Samples one value from the distribution

variance()[source]

Variance of the distribution
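To make the roles of `log_pdf` and `dlog_pdf_dx` concrete, here is a sketch of both quantities for a normal distribution. The function names are hypothetical, and treating `sigma` as the standard deviation (rather than the variance) is an assumption; the infpy class may parameterise it differently.

```python
import math


def normal_log_pdf(x, mu=1.0, sigma=1.0):
    """Log density of N(mu, sigma^2) at x, with sigma the standard deviation."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_dlog_pdf_dx(x, mu=1.0, sigma=1.0):
    """Derivative of the log density with respect to x."""
    return -(x - mu) / (sigma * sigma)


at_mode = normal_log_pdf(1.0)               # density peaks at x == mu
slope = normal_dlog_pdf_dx(2.0, mu=0.0, sigma=2.0)
```

The derivative is zero at the mean and negative to its right, which is what a gradient-based optimiser over x would rely on.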

infpy.distribution.plot_distribution(d, start=0.01, stop=10.0, resolution=0.1)[source]

Displays a plot of the pdf of d

## roc Module

Code to implement ROC point/curve calculation and plotting.

class infpy.roc.RocCalculator(tp=0, fp=0, tn=0, fn=0)[source]

Bases: object

Calculates specificities and sensitivities from counts of true and false positives and negatives.

Source: wikipedia - Fawcett (2004)

accuracy()[source]

(TP+TN)/(TP+TN+FP+FN)

static always_predict_false()[source]

A RocCalculator for a predictor that always predicts False

static always_predict_true()[source]

A RocCalculator for a predictor that always predicts True

average_performance()[source]

(sensitivity()+positive_predictive_value())/2

correlation_coefficient()[source]

(TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)); see Burset & Guigo

distance(other)[source]

Measure of distance between points.

false_positive_rate()[source]

FP/(TN+FP)

fn = None

Number of false negatives.

fp = None

Number of false positives.

fpr()

FP/(TN+FP)

hit_rate()

TP/(TP+FN)

negative_predictive_value()[source]

TN/(TN+FN)

normalise(rhs)[source]

Normalise this RocCalculator so that tp+tn+fp+fn=1.

performance_coefficient()[source]

TP/(TP+FN+FP) see: Pevzner & Sve

positive_predictive_value()[source]

TP/(TP+FP)

precision()

TP/(TP+FP)

recall()

TP/(TP+FN)

sensitivity()[source]

TP/(TP+FN)

specificity()[source]

TN/(TN+FP)

tn = None

Number of true negatives.

total_negative[source]

The total number of negative test cases.

total_positive[source]

The total number of positive test cases.

tp = None

Number of true positives.

tpr()

TP/(TP+FN)

true_positive_rate()

TP/(TP+FN)
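The formulas listed for RocCalculator can be evaluated directly from the four counts. The sketch below is a standalone illustration, not the class itself: `roc_stats` is a hypothetical helper, the counts are made up, and no guarding against zero denominators is attempted.

```python
import math


def roc_stats(tp, fp, tn, fn):
    """Evaluate a few of the documented formulas from raw counts."""
    return {
        "sensitivity": tp / float(tp + fn),            # TP/(TP+FN)
        "specificity": tn / float(tn + fp),            # TN/(TN+FP)
        "ppv": tp / float(tp + fp),                    # TP/(TP+FP)
        "npv": tn / float(tn + fn),                    # TN/(TN+FN)
        "accuracy": (tp + tn) / float(tp + tn + fp + fn),
        "correlation": (tp * tn - fn * fp) / math.sqrt(
            float((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))),
    }


stats = roc_stats(tp=40, fp=10, tn=45, fn=5)
```

Sensitivity, recall, hit rate and true positive rate all share the same formula, which is why the class exposes several names for it.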

infpy.roc.all_rocs_from_thresholds(positive_thresholds, negative_thresholds, negative_first=True)[source]

Takes 2 sorted lists (smallest to largest): one list is of the thresholds required to classify the positive examples as positive and the other list is of the thresholds required to classify the negative examples as positive.

Returns: Yields all the ROC points. Note that they are returned in the opposite order to some of the other methods in this module.

infpy.roc.area_under_curve(rocs, include_0_0=True, include_1_1=True)[source]

Parameters:
• rocs – The ROC points.
• include_0_0 – True to include extra point for origin of ROC curve.
• include_1_1 – True to include extra point at (1,1) in ROC curve.

Returns: The area under the ROC curve given by the ROC points.

infpy.roc.auc(rocpoints)[source]

Calculate the area under the ROC points.
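One common way to compute the area under a set of ROC points is the trapezoid rule over (false positive rate, true positive rate) pairs. The sketch below assumes that approach and represents each point as a plain `(fpr, tpr)` tuple; the name `area_under_points` is hypothetical and infpy may integrate differently.

```python
def area_under_points(points, include_0_0=True, include_1_1=True):
    """Trapezoid-rule area under (fpr, tpr) points, optionally padded."""
    pts = sorted(points)
    if include_0_0:
        pts = [(0.0, 0.0)] + pts  # origin of the ROC curve
    if include_1_1:
        pts = pts + [(1.0, 1.0)]  # top-right corner of the ROC curve
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


chance = area_under_points([(0.5, 0.5)])   # point on the diagonal
perfect = area_under_points([(0.0, 1.0)])  # perfect classifier
```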

infpy.roc.auc50(positive_thresholds, negative_thresholds, num_negative=50, num_points=32)[source]

Calculate the AUC50 as in Gribskov & Robinson: ‘Use of ROC analysis to evaluate sequence pattern matching’

infpy.roc.auc50_from_rocpoints(rocpoints, max_fp=50)[source]

Calculate the AUC50 as in Gribskov & Robinson: ‘Use of ROC analysis to evaluate sequence pattern matching’

infpy.roc.auc50_wrong(positive_thresholds, negative_thresholds, num_negative=50, num_points=32)[source]

Calculate the AUC50 as in Gribskov & Robinson ‘Use of ROC analysis to evaluate sequence pattern matching’

infpy.roc.bisect_rocs(rocpoints, predicate, start=0, end=None)[source]

Return the index into rocpoints for first rocpoint with predicate(rocpoint) is True and start <= index < end. Assumes rocpoints are sorted w.r.t. predicate.
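Finding the first item that satisfies a predicate in a sequence sorted with respect to that predicate is a binary search. The sketch below shows the general technique on plain values; `bisect_first_true` is a hypothetical name, and returning `end` when no item matches is an assumption.

```python
def bisect_first_true(items, predicate, start=0, end=None):
    """Index of the first item in items[start:end] where predicate holds.

    Assumes predicate is False for a prefix and True for the rest.
    Returns end if predicate is never True.
    """
    if end is None:
        end = len(items)
    lo, hi = start, end
    while lo < hi:
        mid = (lo + hi) // 2
        if predicate(items[mid]):
            hi = mid        # first True is at mid or earlier
        else:
            lo = mid + 1    # first True is after mid
    return lo


index = bisect_first_true([1, 3, 5, 7, 9], lambda v: v >= 5)
```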

infpy.roc.count_threshold_classifications(thresholds, value)[source]

Take a list of thresholds (in sorted order) and count how many would be classified positive and negative at the given value.

Returns: (num_positive, num_negative).

infpy.roc.create_rocs_from_thresholds(positive_thresholds, negative_thresholds, num_points=32)[source]

Takes 2 sorted lists: one list is of the thresholds required to classify the positive examples as positive and the other list is of the thresholds required to classify the negative examples as positive.

Returns: A list of tuples (ROC point, threshold).

infpy.roc.generate_roc_points(rocs, sort_negative_first=True)[source]

Generate ROC points but sort negatives before positives at same threshold if asked to. This gives a step-function like ROC curve rather than a smoothed curve.

infpy.roc.get_new_roc_parameter(rocs, for_specificity=True)[source]

Takes a sequence of (parameter, roc) tuples and returns a new parameter that should be tested next.

It chooses this parameter by sorting the sequence and taking the mid-point between the parameters with the largest absolute difference between their specificities or sensitivities (depending on for_specificity parameter).
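The selection rule described above can be sketched as follows. For simplicity the example works on `(parameter, value)` pairs where `value` is already the specificity or sensitivity as a float, rather than on ROC objects; that simplification and the name `next_parameter` are assumptions.

```python
def next_parameter(param_values):
    """Mid-point between the adjacent parameters whose values differ most."""
    ordered = sorted(param_values)  # sort by parameter
    best_gap, best_mid = -1.0, None
    for (p0, v0), (p1, v1) in zip(ordered, ordered[1:]):
        gap = abs(v1 - v0)
        if gap > best_gap:
            best_gap, best_mid = gap, (p0 + p1) / 2.0
    return best_mid


# Specificity changes most between parameters 1.0 and 2.0.
tested = [(0.0, 1.0), (1.0, 0.9), (2.0, 0.2), (3.0, 0.0)]
candidate = next_parameter(tested)
```

Repeating this step fills in the region of the curve that is currently sampled most coarsely.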

infpy.roc.label_plot()[source]

Label the x and y axes of a ROC plot.

infpy.roc.label_precision_recall()[source]

Label the x and y axes of a precision-recall plot.

infpy.roc.label_precision_versus_recall()[source]

Label the x and y axes of a precision versus recall plot.

infpy.roc.make_roc_from_threshold_fn(positive_thresholds, negative_thresholds)[source]

Returns: A function that calculates a ROC point given a threshold.

infpy.roc.pick_roc_thresholds(roc_for_threshold_fn, min_threshold, max_threshold, num_points=32)[source]

Tries to pick thresholds to give a smooth ROC curve.

Returns: A list of (roc point, threshold) tuples.

infpy.roc.picked_rocs_from_thresholds(positive_thresholds, negative_thresholds, num_points=32)[source]

Takes 2 sorted lists: one list is of the thresholds required to classify the positive examples as positive and the other list is of the thresholds required to classify the negative examples as positive.

Returns: A list of ROC points.

infpy.roc.plot_precision_recall(roc_thresholds, recall_plot_kwds={}, precision_plot_kwds={}, plot_fn=None)[source]

Plots a precision-recall curve for the given ROCs.

Parameters:
• roc_thresholds – A sequence of tuples (ROC, threshold).
• recall_plot_kwds – Passed to the pylab.plot call for the recall.
• precision_plot_kwds – Passed to the pylab.plot call for the precision.
• plot_fn – Function used to plot. Use pylab.semilogx for log scale threshold axis.

Returns: The result of 2 pylab.plot calls as a tuple (recall, precision).

infpy.roc.plot_precision_versus_recall(rocs, **plot_kwds)[source]

Plots precision versus recall for the ROCs in rocs. Adds points at (0,1) and (1,0).

Parameters:
• rocs – A sequence of ROCs.
• plot_kwds – All extra keyword arguments are passed to the pylab.plot call.

Returns: The result of the pylab.plot call.

infpy.roc.plot_random_classifier(**kwargs)[source]

Draw a random classifier on a ROC plot. Black dashed line by default.

infpy.roc.plot_roc_points(rocs, **plot_kwds)[source]

Plots TPR versus FPR for the ROCs in rocs. Adds points at (0,0) and (1,1).

Parameters:
• rocs – A sequence of ROCs.
• plot_kwds – All extra keyword arguments are passed to the pylab.plot call.

Returns: The result of the pylab.plot call.

infpy.roc.plot_rocpoint(rocpoint, **plotargs)[source]

Plot a single rocpoint. Typically used to indicate where the last point for the AUC50 calculation is.

infpy.roc.plot_rocpoints(rocpoints, fillargs=None, **plot_kwds)[source]

Plots TPR versus FPR for the ROCs in rocpoints.

Parameters:
• rocpoints – A sequence of ROCs.
• plot_kwds – All extra keyword arguments are passed to the pylab.plot call.

Returns: The result of the pylab.plot call.

infpy.roc.resize_negative_examples(positive_thresholds, negative_thresholds, num_negative=50)[source]

Reduce the positive and negative thresholds such that there are just 50 (or num_negative) negative examples. The positive thresholds are trimmed accordingly.

infpy.roc.restrict_false_positives(rocpoints, max_fp=50)[source]

Yield the ROC points while the number of false positives is less than max_fp.

infpy.roc.roc_for_threshold(positive_thresholds, negative_thresholds, value)[source]

Take lists of positive and negative thresholds (in sorted order) and calculate a ROC point for the given value.
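Counting classifications against sorted thresholds and turning the counts into a ROC point can be done with the standard `bisect` module. The sketch below is one possible reading, not infpy's code: the helper names are hypothetical, the point is returned as a plain dict, and it assumes an example is classified positive when its threshold is at least `value`.

```python
import bisect


def count_classifications(thresholds, value):
    """Return (num_positive, num_negative) for sorted thresholds.

    Assumed convention: an example is positive if its threshold >= value.
    """
    num_negative = bisect.bisect_left(thresholds, value)
    return len(thresholds) - num_negative, num_negative


def roc_point(positive_thresholds, negative_thresholds, value):
    """Confusion counts at the given value, as a dict."""
    tp, fn = count_classifications(positive_thresholds, value)
    fp, tn = count_classifications(negative_thresholds, value)
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


point = roc_point([0.4, 0.6, 0.9], [0.1, 0.3, 0.7], 0.5)
```

Because both lists are sorted, each count costs a binary search rather than a full scan, which matters when many candidate values are evaluated.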

infpy.roc.rocs_from_thresholds(positive_thresholds, negative_thresholds, num_points=32)[source]

Takes 2 sorted lists: one list is of the thresholds required to classify the positive examples as positive and the other list is of the thresholds required to classify the negative examples as positive.

Returns: A list of ROC points.

infpy.roc.update_roc(roc, truth_prediction_iterable)[source]

For each (truth, prediction) pair in the iterable, update the ROC calculator.

## utils Module

infpy.utils.array_is_close(A, B, eps=0.001)[source]

Checks if two numpy arrays are close

Check that the approximation to the gradient of the function matches the supplied gradient.

f is a function; fprime is a function describing the gradient of f.

The gradient will be approximated by expansion of f around x and compared with the value of fprime at x
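A gradient check of this kind is usually done with finite differences. The sketch below uses a central difference for a scalar function; the name `gradient_is_close`, the step size `h` and the tolerance `tol` are all assumptions, and the library's own approximation may differ.

```python
def gradient_is_close(f, fprime, x, h=1e-6, tol=1e-4):
    """Compare a central-difference estimate of f'(x) with fprime(x)."""
    approx = (f(x + h) - f(x - h)) / (2.0 * h)
    return abs(approx - fprime(x)) <= tol


def cube(x):
    return x ** 3


right = gradient_is_close(cube, lambda x: 3.0 * x ** 2, 2.0)  # correct gradient
wrong = gradient_is_close(cube, lambda x: 2.0 * x, 2.0)       # deliberate bug
```

A check like this is cheap to run in tests and catches sign errors or missing terms in hand-derived gradients.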

infpy.utils.check_is_close(f1, f2, tol=1e-06, strong_check=True)[source]
infpy.utils.check_is_close_2(left, right, tol=0.0001, strong_or_weak=True)[source]
infpy.utils.check_matrix_is_close(A, B, message, eps=0.001)[source]

Raises error and prints message if matrices are not close

infpy.utils.index_filter(predicate, iterable)[source]
infpy.utils.k_fold_cross_validation(X, K, randomise=False)[source]

Generates K (training, validation) pairs from the items in X.

The validation iterables are a partition of X, and each validation iterable is of length len(X)/K. Each training iterable is the complement (within X) of the validation iterable, and so each training iterable is of length (K-1)*len(X)/K.

For example:

    X = list(range(97))
    for training, validation in k_fold_cross_validation(X, K=7):
        for x in X: assert (x in training) ^ (x in validation), x
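For readers without infpy installed, the partitioning behaviour described above can be reproduced with a short generator. This is a sketch under the assumption that item i goes to validation fold i % K; the name `k_fold` is hypothetical, and the randomise option is not covered.

```python
def k_fold(X, K):
    """Yield K (training, validation) pairs that partition X."""
    for k in range(K):
        training = [x for i, x in enumerate(X) if i % K != k]
        validation = [x for i, x in enumerate(X) if i % K == k]
        yield training, validation


X = list(range(97))
folds = list(k_fold(X, K=7))
for training, validation in folds:
    for x in X:
        # every item is in exactly one of the two lists
        assert (x in training) ^ (x in validation), x
```

When len(X) is not a multiple of K, as here, the fold sizes differ by at most one.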

infpy.utils.lu_inv(L)[source]
infpy.utils.matrix_from_function(f, shape, dtype, symmetric=False)[source]
infpy.utils.matrix_is_close(A, B, eps=0.001)[source]

Simple test that A and B differ by at most eps in any position

infpy.utils.norm2(x)[source]

Calculates the L2 norm of the vector

infpy.utils.plot_gaussian(mu, sigma, *args, **kwds)[source]

Plot a gaussian with given mu and sigma (first 2 dimensions only)

infpy.utils.plot_gaussian_test()[source]
infpy.utils.plot_line(start, end, *arguments, **keywords)[source]

Plot a line from start to end

infpy.utils.zero_mean_unity_variance(y)[source]

Scales the data to make the variance 1 and the mean 0

Returns (scaled, revert) where:
• scaled: the scaled and shifted data
• revert: a unary function that converts back to the original data
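The (scaled, revert) return shape can be illustrated with a pure-Python sketch. The name `standardise` is hypothetical, lists stand in for numpy arrays, and the use of the population variance (dividing by n) is an assumption.

```python
import math


def standardise(y):
    """Return (scaled, revert): zero-mean, unit-variance data and its inverse."""
    n = float(len(y))
    mean = sum(y) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in y) / n)  # population std
    scaled = [(v - mean) / std for v in y]

    def revert(values):
        """Map scaled values back to the original units."""
        return [v * std + mean for v in values]

    return scaled, revert


data = [1.0, 2.0, 3.0, 4.0]
scaled, revert = standardise(data)
restored = revert(scaled)
```

Returning `revert` as a closure keeps the mean and standard deviation with the transformation, so predictions made on the scaled data can be mapped back later.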