Installation

If you prefer to work on the source, clone the repository

git clone https://github.com/eth-cscs/abcpy.git

Make sure all requirements are installed

cd abcpy
pip3 install -r requirements.txt

To create a package and install it do

make package
pip3 install build/dist/abcpy-0.1-py3-none-any.whl

Note that ABCpy requires Python 3.
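
A quick way to verify the installation is to import the package from the Python 3 interpreter you intend to use; if the command below exits without an error, ABCpy is installed correctly:

python3 -c "import abcpy"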

Getting Started

Here we show how to use ABCpy to infer the parameters of a model, given some observed data. As a simple example we consider a Gaussian model, where we want to model the height of grown-up humans given the following set of measurements (the observations, or observed data).

y_obs = [160.82499176, 167.24266737, 185.71695756, 153.7045709, 163.40568812, 140.70658699, 169.59102084, 172.81041696, 187.38782738, 179.66358934, 176.63417241, 189.16082803, 181.98288443, 170.18565017, 183.78493886, 166.58387299, 161.9521899, 155.69213073, 156.17867343, 144.51580379, 170.29847515, 197.96767899, 153.36646527, 162.22710198, 158.70012047, 178.53470703, 170.77697743, 164.31392633, 165.88595994, 177.38083686, 146.67058471763457, 179.41946565658628, 238.02751620619537, 206.22458790620766, 220.89530574344568, 221.04082532837026, 142.25301427453394, 261.37656571434275, 171.63761180867033, 210.28121820385866, 237.29130237612236, 175.75558340169619, 224.54340549862235, 197.42448680731226, 165.88273684581381, 166.55094082844519, 229.54308602661584, 222.99844054358519, 185.30223966014586, 152.69149367593846, 206.94372818527413, 256.35498655339154, 165.43140916577741, 250.19273595481803, 148.87781549665536, 223.05547559193792, 230.03418198709608, 146.13611923127021, 138.24716809523139, 179.26755740864527, 141.21704876815426, 170.89587081800852, 222.96391329259626, 188.27229523693822, 202.67075179617672, 211.75963110985992, 217.45423324370509]

Now, we want to model the height of humans by a Gaussian model, which has two parameters: the mean, denoted by \(\mu\), and the standard deviation, denoted by \(\sigma\). The goal is to use ABC to infer these as yet unknown parameters from the information contained in the observed data.

A prerequisite for ABC is that we provide certain prior knowledge about the parameters we want to infer. In our case this is quite simple: we know from experience that the average height should be somewhere between 150 cm and 200 cm, and that the standard deviation is around 5 to 25.

from abcpy.distributions import Uniform
prior = Uniform([150, 5],[200, 25], seed=1)

from abcpy.models import Gaussian
model = Gaussian(prior, seed=1)

Further, we need a means to quantify how close our observation is to synthetic data (generated by the model). Often the real and synthetic observations cannot be compared directly in a reasonable or efficient way. Thus, summary statistics are used to extract relevant properties from the observations, with the idea that these statistics are then compared instead.

from abcpy.statistics import Identity
statistics_calculator = Identity(degree = 2, cross = False)

As a distance we choose the LogReg distance here. Note that in ABCpy distance functions operate not on the observations, but on summary statistics.

from abcpy.distances import LogReg
distance_calculator = LogReg(statistics_calculator)

We can now set up an inference scheme. Let us choose PMCABC as our inference algorithm of choice. As a prerequisite it requires a perturbation kernel and a backend, which we define in the following:

import numpy as np
from abcpy.distributions import MultiStudentT
mean, cov, df = np.array([.0, .0]), np.eye(2), 3.
kernel = MultiStudentT(mean, cov, df, seed=1)

from abcpy.backends import BackendDummy as Backend
backend = Backend()

We instantiate a PMCABC object and pass the kernel and backend objects to the constructor:

from abcpy.inferences import PMCABC
sampler = PMCABC(model, distance_calculator, kernel, backend, seed=1)

Finally, we need to parametrize and start the actual sampling:

T, n_sample, n_samples_per_param = 3, 250, 10
eps_arr = np.array([.75])
epsilon_percentile = 10
journal = sampler.sample(y_obs, T, eps_arr, n_sample, n_samples_per_param, epsilon_percentile)

With this the inference process is done, and the results (the sampled parameters and their weights) are stored in the journal object. See Post Analysis for further information on extracting results.

The code currently uses the dummy backend BackendDummy, which does not parallelize the execution of the inference schemes but is very handy for quick prototyping and testing. To execute the code you only need to run

python3 gaussian.py

The full source can be found in examples/backends/dummy/pmcabc_gaussian.py.
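
For orientation, the snippets above assemble into a single script along the following lines. This is only a sketch based on the code shown in this section; the shipped example file may differ in its details, and the observation list is abbreviated here:

import numpy as np

from abcpy.distributions import Uniform, MultiStudentT
from abcpy.models import Gaussian
from abcpy.statistics import Identity
from abcpy.distances import LogReg
from abcpy.backends import BackendDummy as Backend
from abcpy.inferences import PMCABC

# Observed heights (truncated; use the full list shown above).
y_obs = [160.82499176, 167.24266737, 185.71695756]

# Prior, model, summary statistics and distance, as defined above.
prior = Uniform([150, 5], [200, 25], seed=1)
model = Gaussian(prior, seed=1)
statistics_calculator = Identity(degree=2, cross=False)
distance_calculator = LogReg(statistics_calculator)

# Perturbation kernel and sequential dummy backend.
mean, cov, df = np.array([.0, .0]), np.eye(2), 3.
kernel = MultiStudentT(mean, cov, df, seed=1)
backend = Backend()

# Run PMCABC and inspect the result.
sampler = PMCABC(model, distance_calculator, kernel, backend, seed=1)
T, n_sample, n_samples_per_param = 3, 250, 10
eps_arr = np.array([.75])
epsilon_percentile = 10
journal = sampler.sample(y_obs, T, eps_arr, n_sample, n_samples_per_param, epsilon_percentile)
print(journal.posterior_mean())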

Post Analysis

The output when sampling from an inference scheme is a Journal (abcpy.output.Journal), which holds all the necessary results and convenient methods for the post analysis.

For example, one can easily access the sampled parameters and corresponding weights using:

print(journal.parameters)
print(journal.weights)

For the post analysis basic functions are provided:

print(journal.posterior_mean())
print(journal.posterior_cov())

Also, to ensure reproducibility, every journal stores the parameters of the algorithm that created it:

print(journal.configuration)

Finally, a journal can easily be saved to and loaded from disk:

from abcpy.output import Journal
journal.save("experiments.jnl")
new_journal = Journal.fromFile("experiments.jnl")

Using the Spark Backend

To run ABCpy in parallel using Apache Spark, one only needs to use the provided Spark backend. Considering the example from above, the statements for the backend have to be changed to

import pyspark
sc = pyspark.SparkContext()
from abcpy.backends import BackendSpark as Backend
backend = Backend(sc, parallelism=4)

In words, a Spark context has to be created and passed to the Spark backend. Additionally, the level of parallelism can be provided, which roughly defines into how many blocks the work is split up. It corresponds to the parallelism of an RDD in Apache Spark terminology. A good value is usually a small multiple of the total number of available cores.
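
As an illustration of this rule of thumb, when running Spark on a single machine one could derive the parallelism from the local core count. The factor of 2 below is our own assumption, not a value prescribed by ABCpy or Spark:

import multiprocessing

# Rule of thumb (assumption): a small multiple of the available cores,
# so that every worker has several blocks of work to process.
total_cores = multiprocessing.cpu_count()
backend = Backend(sc, parallelism=2 * total_cores)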

The standard way to run the script on Spark is via the spark-submit command:

PYSPARK_PYTHON=python3 spark-submit gaussian.py

Often Spark installations use Python 2 by default. To make Spark use the required Python 3 interpreter, the PYSPARK_PYTHON environment variable can be set.

The adapted Python code can be found in examples/backend/apache_spark/gaussian.py.

Note that in order to run jobs in parallel you need to have Apache Spark installed on the system in question. Details on the installation can be found on the official homepage. Further, keep in mind that the ABCpy library has to be properly installed on the cluster, such that it is available to the Python interpreters on the master and the worker nodes.

Implementing a new Model

Often one wants to use one of the provided inference schemes on a new model, which is not part of ABCpy. We now go through the details of such a scenario using the Gaussian model to exemplify the mechanics.

Every model has to conform to the API specified by the abstract base class abcpy.models.Model. Thus, making a new model compatible with ABCpy essentially boils down to implementing the following methods:

class Model(metaclass = ABCMeta):
    def __init__(self, prior, seed = None): 
    def set_parameters(self, theta):
    def sample_from_prior(self):
    def simulate(self, k):
    def get_parameters(self):

In the following we go through a few of the required methods, explain what is expected, and show how it would be implemented for the Gaussian model.

As a general note, one can say that it is always a good idea to consult the reference for implementation details. For the constructor, the reference states the following:

Model.__init__(prior, seed=None)[source]

The constructor must be overwritten by a sub-class to initialize the model with a given prior.

The standard behaviour is that concrete model parameters are sampled from the provided prior. However, it is also possible for the constructor to accept optional (!) model parameters. In the latter case, the model should be initialized with the provided parameters instead of sampling from the prior.

Parameters:
  • prior (abcpy.distributions.Distribution) – A prior distribution
  • seed (int, optional) – Optional initial seed for the random number generator that can be used in the model. The default value is generated randomly.

Consequently, we would implement a simple version of a Gaussian model as follows:

import numpy as np
from abcpy.models import Model

class Gaussian(Model):
    def __init__(self, prior, seed=None):
        self.prior = prior
        # Initialize the model parameters by sampling from the prior.
        self.sample_from_prior()
        # Random number generator used later by simulate().
        self.rng = np.random.RandomState(seed)
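
The reference above also allows the constructor to accept optional model parameters. A sketch of such a variant, where the mu and sigma keyword arguments are our own illustrative addition and not part of the shipped model, could look like this:

class Gaussian(Model):
    def __init__(self, prior, mu=None, sigma=None, seed=None):
        self.prior = prior
        self.rng = np.random.RandomState(seed)
        if mu is None or sigma is None:
            # No parameters given: initialize by sampling from the prior.
            self.sample_from_prior()
        else:
            # Optional parameters given: use them instead of the prior.
            self.set_parameters([mu, sigma])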

Here we actually initialize the model parameters by calling abcpy.models.Model.sample_from_prior, which is another function that must be implemented. Its requirements are quite simple:

Model.sample_from_prior()[source]

To be overwritten by any sub-class: should resample the model parameters from the prior distribution.

    def sample_from_prior(self):
        sample = self.prior.sample(1).reshape(-1)
        self.set_parameters(sample)

Let us have a look at the details on implementing abcpy.models.Model.set_parameters:

Model.set_parameters(theta)[source]

This method properly sets the parameters of the model and must be overwritten by a sub-class.

Notes

Make sure to test whether the provided parameters are compatible with the model. Return true if the parameters are accepted by the model and false otherwise. This behavior is expected e.g. by the inference schemes.

Parameters:theta – An array-like structure containing the p parameters of the model, where theta[0] is the first and theta[p-1] is the last parameter.
Returns:TRUE if model accepts the provided parameters, FALSE otherwise
Return type:boolean

For a Gaussian model a simple implementation would look like the following:

    def set_parameters(self, theta):
        theta = np.array(theta)

        # The Gaussian model has exactly two parameters: mu and sigma.
        if theta.shape[0] != 2: return False
        # The standard deviation must be strictly positive.
        if theta[1] <= 0: return False

        self.mu = theta[0]
        self.sigma = theta[1]
        return True

Note that abcpy.models.Model.set_parameters is expected to return a boolean depending on whether the provided parameters are suitable for the model. Thus, we added a few checks to the method.
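
As a quick, purely illustrative sanity check (reusing the prior defined in Getting Started), the method behaves as follows:

model = Gaussian(prior, seed=1)
print(model.set_parameters([170.0, 15.0]))  # True: two parameters, sigma > 0
print(model.set_parameters([170.0, -3.0]))  # False: sigma must be positive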

For the remaining methods that must be implemented, namely abcpy.models.Model.get_parameters and abcpy.models.Model.simulate, we proceed in exactly the same way. This leads to an implementation that might look like the following:

    def get_parameters(self):
        return np.array([self.mu, self.sigma])


    def simulate(self, k):
        # Draw k independent samples from N(mu, sigma) and return them as a list.
        return list((self.rng.normal(self.mu, self.sigma, k)).reshape(-1))

Our model now conforms to ABCpy, and we can start inferring parameters in the same way (see Getting Started) as we would with the shipped models. The complete example code can be found here.
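
For instance, reusing the prior, distance, kernel, backend and sampling parameters from Getting Started, a minimal sketch of running PMCABC with our new Gaussian class might look like this:

# Plug the custom model into the pipeline from Getting Started.
model = Gaussian(prior, seed=1)
sampler = PMCABC(model, distance_calculator, kernel, backend, seed=1)
journal = sampler.sample(y_obs, T, eps_arr, n_sample, n_samples_per_param, epsilon_percentile)
print(journal.posterior_mean())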