About Leaf

Leaf is a Python tool for the design and management of bioinformatic protocols, also known as data flows. We call them "protocols" to stress the importance of producing data flows that make the design, execution, maintenance, sharing and reproduction of data analysis processes as efficient as possible. Leaf was developed in a bioinformatics environment, but it can be used in any data analysis project. Leaf is mainly implemented in Python, which can be easily interfaced with other languages such as R.

Features

  • Thin, lightweight, code-independent Abstraction Layer.
  • Data flow graph embedded directly into Python source code.
  • Automatic creation and management of variables associated with node outputs.
  • Automatic persistent storage and retrieval of node outputs.
  • Session persistence (e.g. run half of the project, reboot the machine, then automatically resume from the last processed node).
  • Lazy builds (avoid running nodes that are not necessary for the build of a requested resource).
  • Enforcement of code version consistency between nodes (e.g. automatically reprocess all nodes deriving from node A if node A is found to have changed).
  • Automatic statistics on time and space requirements.
  • Automatic publishing (producing hypertext with a visual representation of the protocol, processing statistics, links to node outputs, node documentation and source code).
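
The lazy-build and code-version features listed above can be illustrated with a small, self-contained sketch. This is not Leaf's implementation and does not use the pyleaf API: the names `GRAPH`, `store`, `calls`, `code_version`, `version_tag` and `request` are all hypothetical, and an in-memory dictionary stands in for persistent storage. It only shows the general idea of running just the nodes a requested resource depends on, and of rerunning a node when its own code or the code of an ancestor changes.

```python
import hashlib


# Hypothetical toy nodes (illustration only, not part of Leaf).
def load():
    return [3, 1, 2]


def sort_data(data):
    return sorted(data)


def total(data):
    return sum(data)


# Hypothetical graph description: node name -> (function, input node names).
GRAPH = {
    "load": (load, []),
    "sort_data": (sort_data, ["load"]),
    "total": (total, ["load"]),
}

store = {}  # node name -> (version tag, output); stands in for on-disk storage
calls = []  # records which nodes actually ran, for illustration


def code_version(func):
    """Fingerprint a function's compiled body and constants."""
    code = func.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()


def version_tag(name, graph):
    """Combine a node's own code version with those of all its ancestors."""
    func, inputs = graph[name]
    parts = [code_version(func)] + [version_tag(i, graph) for i in inputs]
    return hashlib.sha256("".join(parts).encode()).hexdigest()


def request(name, graph):
    """Return a node's output, running only missing or stale nodes."""
    tag = version_tag(name, graph)
    if name in store and store[name][0] == tag:
        return store[name][1]  # stored output is still valid: reuse it
    func, inputs = graph[name]
    args = [request(i, graph) for i in inputs]
    calls.append(name)
    store[name] = (tag, func(*args))
    return store[name][1]
```

In this sketch, calling `request("total", GRAPH)` runs only `load` and `total`, never `sort_data`. A second identical request runs nothing. Replacing the `load` function in `GRAPH` with a different one changes the version tag of every node derived from it, so those nodes are recomputed on the next request.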

Leaf and Python code

A data flow can be seen as a graph, with nodes representing data processing routines and edges representing input/output connections between them. Custom data analysis is typically written in languages such as Python or R (Python can run R code through the Rpy library), and that code implements the data flow. For example, each node can be implemented as a Python function, and another Python function can implement the high-level script that runs the data flow by calling the appropriate routines in sequence and collecting their results.
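
As a minimal illustration of this plain-Python style (before any Leaf support is involved), the sketch below implements three nodes as ordinary functions and a high-level function that runs them in sequence. The function names and the hard-coded data are invented for the example.

```python
def load_values():
    """Node: produce the primary resource (hard-coded here for illustration)."""
    return [4.0, 8.0, 6.0]


def normalize(values):
    """Node: rescale the values so that the largest becomes 1.0."""
    peak = max(values)
    return [v / peak for v in values]


def summarize(normalized):
    """Node: reduce the normalized values to their mean."""
    return sum(normalized) / len(normalized)


def run_flow():
    """High-level script: call each node in sequence and collect the results."""
    values = load_values()
    normalized = normalize(values)
    mean = summarize(normalized)
    return {"values": values, "normalized": normalized, "mean": mean}
```

Here the edges of the graph are implicit in the order of the calls inside `run_flow`: the output of `load_values` feeds `normalize`, whose output feeds `summarize`.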

With Leaf you define a graph-based data flow (the protocol) using LGL (the Leaf Graph Language). The protocol is written directly into Python source code as a string of characters that visually represents a graph. It can then be bound to existing Python code (graph node labels match Python function names) using pyleaf, the Python library implementing Leaf support. At any moment the original code can be run either directly or through the mediation and support of Leaf. In other words, the abstraction layer (AL) between Leaf and Python code is thin and optional, allowing for maximum flexibility and independence between the code and the AL. The custom analysis source code can have (and usually has) no reference to the Leaf system, and it only needs to follow a few simple conventions in order to be correctly driven by pyleaf.
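
The idea of binding node labels to function names can be sketched without Leaf. The example below does not use LGL syntax or the pyleaf API: the dictionary `PROTOCOL` and the helper `run_protocol` are hypothetical stand-ins, and the analysis functions are invented for illustration. It assumes Python 3.9 or later for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter


# Ordinary analysis code: it contains no reference to any workflow layer.
def read_counts():
    return {"geneA": 10, "geneB": 30}


def scale(counts):
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


def top_gene(fractions):
    return max(fractions, key=fractions.get)


# Hypothetical protocol description (NOT LGL syntax):
# node label -> labels of the nodes whose outputs it consumes.
PROTOCOL = {
    "read_counts": [],
    "scale": ["read_counts"],
    "top_gene": ["scale"],
}


def run_protocol(protocol, namespace):
    """Bind each node label to the same-named function and run in dependency order."""
    outputs = {}
    for label in TopologicalSorter(protocol).static_order():
        func = namespace[label]  # the label must match a function name
        outputs[label] = func(*(outputs[dep] for dep in protocol[label]))
    return outputs
```

The `namespace` argument is any mapping from names to functions, for instance `globals()` in a script. Because the functions know nothing about the protocol, the same analysis can still be run directly as `top_gene(scale(read_counts()))`.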

Using the Leaf system

A data analysis with Leaf runs on two different layers (see figure below). The User layer includes the user's Python code and Python shell interactions. The Leaf layer includes the Leaf Python libraries, which accept the user's requests. The user writes custom Python code and defines a high-level protocol describing the dependencies between its functions, then asks Leaf to apply the protocol to some given primary resources (initial data). Leaf runs the appropriate code according to the protocol and stores all derived resources for future use. The user can request any of them from the Python shell or from within other Python scripts, identifying them directly by the names used in the protocol (no variable has to be created manually to store the data). Leaf can also export a PDF visualization of the protocol or publish an automatically generated, full hypertext report that includes the protocol, node documentation and source code, links to output files and various statistics about the project (such as time and space requirements).


About the Authors

Leaf is designed and developed by Francesco Napolitano at the NeuRoNe lab, Dept. of Computer Science, University of Salerno, Italy. The project is partly funded by the Italian Association for Cancer Research (AIRC) through the University G. D'Annunzio, Chieti.