Reproducibility, provenance and project management

Reproducibility of experiments is one of the foundation stones of science. A related concept is provenance: the ability to trace a given scientific result, such as a figure in an article, back through all the analysis steps (verifying the correctness of each) to the original raw data and the experimental protocol used to obtain it.

In computational, simulation- or numerical analysis-based science, reproduction of previous experiments, and establishing the provenance of results, ought to be easy, given that computers are deterministic, not suffering from the problems of inter-subject and trial-to-trial variability that make reproduction of biological experiments more challenging.

In practice, however, it is not easy, perhaps due to the complexity of our code and our computing environments, and the difficulty of capturing every essential piece of information needed to reproduce a computational experiment using existing tools such as spreadsheets, version control systems and paper notebooks.

What needs to be recorded?

To ensure reproducibility of a computational experiment we need to record:

  • the code that was run
  • any parameter files and command line options
  • the platform on which the code was run

For an individual researcher trying to keep track of a research project with many hundreds or thousands of simulations and/or analyses, it is also useful to record the following:

  • the reason for which the simulation/analysis was run
  • a summary of the outcome of the simulation/analysis

Recording the code might mean storing a copy of the executable, or the source code (including that of any libraries used), the compiler used (including its version) and the compilation procedure (e.g. the Makefile). For interpreted code, it might mean recording the version of the interpreter (and any options used in compiling it), as well as storing a copy of the main script and of any external modules or packages that are included or imported into it.

For projects using version control, “storing a copy of the code” may be replaced with “recording the URL of the repository and the revision number”.
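
As a rough illustration (this is not Sumatra itself, and the use of NumPy and Git here is just an assumption for the example), much of this information can be gathered programmatically in a few lines of Python:

    import sys
    import subprocess

    import numpy  # stand-in for whatever packages the script actually imports

    code_provenance = {
        # interpreter version, including build details
        "python": sys.version,
        # versions of the imported packages (just numpy in this example)
        "dependencies": {"numpy": numpy.__version__},
        # repository URL and revision, assuming the code is under Git version control
        "repository": subprocess.check_output(
            ["git", "config", "--get", "remote.origin.url"], text=True).strip(),
        "revision": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
    }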

The platform includes the processor architecture(s), the operating system(s), the number of processors (for distributed simulations), etc.
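
A minimal sketch of capturing this, using only the Python standard library:

    import os
    import platform
    import socket

    platform_info = {
        "architecture": platform.machine(),  # e.g. "x86_64"
        "system": platform.system(),         # e.g. "Linux"
        "release": platform.release(),
        "hostname": socket.gethostname(),
        "n_processors": os.cpu_count(),      # for MPI runs, the communicator size would be recorded instead
    }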

Tools for recording provenance information

The traditional way of recording the information necessary to reproduce an experiment is to note down all the details in a paper notebook, together with copies or print-outs of any results. More modern approaches may replace or augment the paper notebook with a spreadsheet or other hand-rolled database, but these still require all relevant information to be entered by hand.

In other areas of science, particularly in applied science laboratories with high-throughput, highly-standardised procedures, electronic lab notebooks and laboratory information management systems (LIMS) are in widespread use, but none of these tools seem to be well suited for tracking simulation experiments.

Challenges for tracking computational experiments

In developing a tool for tracking simulation experiments, something like an electronic lab notebook for computational science, there are a number of challenges:

  • different researchers have very different ways of working: command line, GUI, batch jobs (e.g. in supercomputer environments), or any combination of these for different components (simulation, analysis, graphing, etc.) and different phases of a project.
  • some projects are essentially solo endeavours, others collaborative projects, possibly distributed geographically.
  • as much as possible should be recorded automatically. If it is left to the researcher to record critical details there is a risk that some details will be missed or left out, particularly under pressure of deadlines.

The solution we propose is to develop a core library, implemented as a Python package, sumatra, and then to develop a series of interfaces that build on top of this: a command-line interface, a web interface and a graphical interface (a sketch of a typical command-line session follows the list below). Each of these interfaces will enable:

  • launching simulations/analyses with automated recording of provenance information;
  • managing a computational project: browsing, viewing, deleting simulations/analyses.
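
For example, a session with the command-line interface might look something like the following (a sketch only; command and option names may differ between Sumatra versions):

    $ smt init MyProject
    $ smt configure --executable=python --main=main.py
    $ smt run --reason="test the effect of a smaller time step" default.param
    $ smt comment "intriguing results, worth checking with a longer run"
    $ smt list --long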

Alternatively, modellers can use the sumatra package directly in their own code, to enable provenance recording, then simply launch experiments in their usual way.
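
In outline (a sketch only: names such as load_project, new_record and find_new_data follow the Sumatra documentation, but the API may change as the package evolves), such a script could look something like this:

    import sys
    import time
    from sumatra.projects import load_project
    from sumatra.parameters import build_parameters

    def main(parameters):
        """Placeholder for the actual simulation or analysis."""
        pass

    parameters = build_parameters(sys.argv[1])  # parameter file given on the command line
    project = load_project()                    # reads the project in the current directory
    record = project.new_record(parameters=parameters,
                                main_file=__file__,
                                reason="test the effect of a smaller time step")

    start_time = time.time()
    main(parameters)                            # the computation itself
    record.duration = time.time() - start_time

    # associate any newly created output files with the record, then store it
    record.output_data = record.datastore.find_new_data(record.timestamp)
    project.add_record(record)
    project.save()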

The core sumatra package needs to:

  • interact with version control systems, such as Subversion, Git, Mercurial, or Bazaar;
  • support launching serial, distributed (via MPI) or batch computations;
  • link to data generated by the computation, whether stored in files or databases;
  • support any and all command-line-drivable simulation or analysis programs;
  • support both local and networked storage of information;
  • be extensible, so that components can easily be added for new version control systems, etc.;
  • be very easy to use, otherwise it will only be used by the very conscientious.

Further resources

For further background, see the following article:

Davison A.P. (2012) Automated capture of experiment context for easier reproducibility in computational research. Computing in Science and Engineering 14: 48-56 [preprint]

For more detail on how Sumatra is implemented, see:

Davison A.P., Mattioni M., Samarkanov D. and Teleńczuk B. (2014) Sumatra: A Toolkit for Reproducible Research. In: Implementing Reproducible Research, edited by V. Stodden, F. Leisch and R.D. Peng, Chapman & Hall/CRC: Boca Raton, Florida, pp. 57-79. [PDF]

You might also be interested in watching a talk given at the workshop “Reproducible Research: Tools and Strategies for Scientific Computing” in Vancouver, Canada, in July 2011. [video with slides (Silverlight required)] [video only (YouTube)] [slides]