Concepts

Leaf is a software tool that formalizes a number of concepts and ideas developed while performing real-world data analysis tasks. This section presents and explains the main ones.

Protocols

Protocol is the name that we give to data flows in the context of bioinformatic data analysis, as a parallel to the well-established notion of protocol in Biology and Medicine. In this way we want to stress the importance of producing bioinformatic data analysis processes that are as efficient, reproducible and communicable as possible. We believe that not only should the general analytic approach be documented (which is currently done), but every single instance of a data analysis process could and should be guided and documented by a protocol. The good news is that, with automatic tools like Leaf, protocols add not only to scientific rigor, but also to efficiency and productivity.

Terminology

Some terms used in the context of Leaf have a specific meaning or are borrowed from scientific areas other than Biology. The following table reports the most important ones.

Graph: A graph is a data structure defined by a set of nodes and a set of edges between them. It can be seen as the formal definition of a network. If edges are allowed to have a direction (for example "from A to B but not from B to A"), the graph is said to be directed. If the graph has no cycles (an example of a cycle is "from A to B, from B to C and from C to A"), it is said to be acyclic.
Leaf protocol: A DAG (Directed Acyclic Graph) representing a data analysis flow, including node details (source code and documentation). Nodes represent data processing units (implemented as Python functions) and edges represent input/output connections between nodes.
Node types: Nodes with no inputs are called roots; nodes with no outputs are called leaves (this is the origin of the name Leaf); nodes producing files on disk are called file nodes (they are [F]-flagged in LGL).
Resource: Any input or output of a node. Resources can be primary (created or imported by the protocol's roots), derived raw (outputs of non-leaf nodes) or derived final (outputs of leaf nodes).
LGL: Leaf Graph Language. It is a formal language whose aim is to describe a graph in a readable way.
lglc: The LGL compiler. It transforms LGL source code into Dot source code. It is used internally by pyleaf.
pyleaf: The Leaf Python library. It is a set of Python classes implementing the engine behind the Leaf system and the interface between it and the user.
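
The terminology above can be made concrete with a small sketch. The following is not pyleaf code: it is a plain-Python illustration, built on a hypothetical four-node protocol (load, clean, stats, plot), of how roots, leaves and the three kinds of resource follow from the edges of a directed acyclic graph.

```python
# Illustrative only: a toy DAG stored as a dict, not the pyleaf data model.
# Node names are hypothetical. Each key maps to the nodes consuming its output.
edges = {
    "load": ["clean"],
    "clean": ["stats", "plot"],
    "stats": [],
    "plot": [],
}


def roots(graph):
    """Nodes with no inputs: no other node lists them as a successor."""
    targets = {dst for dsts in graph.values() for dst in dsts}
    return sorted(node for node in graph if node not in targets)


def leaves(graph):
    """Nodes with no outputs: their successor list is empty."""
    return sorted(node for node, dsts in graph.items() if not dsts)


def is_acyclic(graph):
    """Kahn's algorithm: the graph is acyclic if every node can be removed
    once all of its predecessors have been removed."""
    indegree = {node: 0 for node in graph}
    for dsts in graph.values():
        for dst in dsts:
            indegree[dst] += 1
    ready = [node for node, deg in indegree.items() if deg == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for dst in graph[node]:
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return removed == len(graph)


def resource_kind(graph, node):
    """Classify the output of a node using the definitions given above."""
    if node in roots(graph):
        return "primary"
    if node in leaves(graph):
        return "derived final"
    return "derived raw"
```

In this toy protocol, load is the only root, stats and plot are the leaves, and the output of clean is a derived raw resource because another node still consumes it.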


Intended audience

Leaf users are assumed to have programming skills. The purpose of Leaf is not to make data analysis tools available to a wider range of people. Rather, Leaf is meant to provide a more efficient tool to people who already have skills in custom data analysis (or who are willing to learn them). In particular, Leaf is designed for Python/R programmers. Python is a general-purpose, high-level scripting language and can be very easily interfaced with R, which is one of the best languages in the context of data analysis. We believe that a Python+R approach is a highly efficient scientific programming framework. Python programmers can learn and use Leaf immediately. R programmers will need to learn the basics of Python and how to call R code through the Rpy library.

Differences between Leaf and other data flow managers

Leaf is based on a thin and optional abstraction layer. Common data flow managers tend to have a tight dependence between the flow design interface and the source code, with the only way out being an import/export process. In Leaf there is no such dependence: the design layer and the source code can at any moment be used together or as separate entities. The user can choose at any moment to run pieces of code directly or to let them be driven by the data flow design. Protocol-aware code can be mixed with protocol-unaware code. For example, a Python function can call a Leaf protocol object (pt below) to obtain a resource, then use it as input to another node through a direct function call:
node2(pt.provide(node1))
This is possible because node2 is actually a Python function. The example is meant to point out a range of possibilities that happen to be very useful in the code development and experimentation phases.
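
To make the idea concrete, here is a minimal sketch of how such mixing can work. ToyProtocol is a hypothetical stand-in written for this example, not the pyleaf class, and its constructor and caching behaviour are assumptions; only the pt.provide(node) call shape is taken from the example above.

```python
# Hypothetical stand-in for a protocol object; NOT the pyleaf implementation.
class ToyProtocol:
    def __init__(self, inputs):
        # inputs maps each node (a plain function) to the list of nodes
        # whose outputs it consumes, in argument order.
        self.inputs = inputs
        self.cache = {}

    def provide(self, node):
        """Return the resource produced by node, running its upstream
        nodes first and reusing results computed earlier."""
        if node not in self.cache:
            args = [self.provide(dep) for dep in self.inputs.get(node, [])]
            self.cache[node] = node(*args)
        return self.cache[node]


# Nodes are ordinary Python functions (names and bodies are illustrative).
def node1():
    return [3, 1, 2]


def node2(values):
    return sorted(values)


pt = ToyProtocol({node1: [], node2: [node1]})

# Protocol-driven: the protocol resolves the inputs of node2 by itself.
via_protocol = pt.provide(node2)

# Mixed: ask the protocol for one resource, then call node2 directly.
direct = node2(pt.provide(node1))
```

Both paths yield the same sorted list. The second one bypasses the protocol for the final step, which is convenient while node2 is still being developed or debugged.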

Best practice concepts

Any data analysis pipeline can be seen as a hierarchy of software routines. The protocol is a high-level script that runs lower-level functions, which can in turn call other functions. Each function that is worth mentioning when communicating the data analysis methodology should be present as a node in the protocol. Other functions are considered low-level details or auxiliary to other functions and are "hidden" in the source code. We usually test new functions by using Leaf to provide data for their inputs, and we "promote" them to nodes (that is, we include them in the LGL protocol) once their development becomes stable.

Integrated Development Environments for Leaf

Any environment that is good for Python is good for Leaf, including any text editor + command line framework. We use Emacs for a number of reasons, including the fact that it supports both R and Python at the same time with basically no context switch. Unfortunately Emacs is hard to learn, so be prepared to face a steep learning curve or use any alternative solution you prefer.

Leaf design choices

We chose to build Leaf with a text-based graph design tool and a Python interface. Of course we could have chosen a visual graph designer and any other programming language: these remain open options. Leaf is a two-layer system, and the two layers are designed to be independent (that is, replaceable) on purpose. However, we opted for the LGL solution for the following reasons:
  • The LGL compiler (as opposed to any visual framework) is extremely simple and minimal from a software point of view. For this reason it can also be readily ported to any computer architecture supporting a shell and a C++ compiler.
  • LGL is designed for programmers (see "Intended audience"). As programmers, we appreciate being able to avoid context switching between a visual environment and a text editor, particularly because the design of the data flow is a very small part of the entire development process.
  • LGL can be readily used in any command line based context. Many computing facilities offer ssh connections to submit jobs through a simple shell interface. LGL can be used even in such minimal environments.
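
To give a sense of how little machinery a text-to-Dot translation needs, here is a rough sketch in Python. It is not lglc and it does not parse LGL: the input is a hypothetical list of edges, and to_dot is a name made up for this example. Only the output format, Dot source code for a directed graph, reflects what lglc is described as producing.

```python
def to_dot(edge_list, name="protocol"):
    """Render (source, target) pairs as Dot source for a directed graph."""
    lines = ["digraph %s {" % name]
    for src, dst in edge_list:
        # Quote node names so that any label is a valid Dot identifier.
        lines.append('    "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical edges of a small protocol.
dot_source = to_dot([("load", "clean"), ("clean", "stats")])
print(dot_source)
```

The resulting text can be saved to a file and rendered with any tool that reads Dot, which works just as well over a plain ssh session as on a desktop.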
The reasons that led us to choose Python for the interface layer are largely the same ones that make Python a good programming language in general. Some of the features that we particularly appreciate are:
  • Python is a very high level language: few instructions perform complex tasks. This is ideal for scientific programming.
  • Python is an interpreted language. It is easy to interact with Python code and variables through a shell.
  • Nonetheless, Python offers the features of a full-fledged programming language (such as OOP).
  • Python has a number of mathematical libraries to directly perform data analysis.
  • For every data analysis task that is not well supported in Python, the Rpy library can be used to easily call R routines.