..
   Copyright 2014 Novartis Institutes for Biomedical Research

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

.. _recipes-label:

Recipes
=======


The model, together with the persistence layer, is designed to make writing
sequences of steps for RNA-Seq data processing simple, and to make switching
to an alternative step (e.g., another aligner or differential expression method)
trivial. It is also designed so that coexisting variants are something a user does not
have to worry about (unless inclined to; the system is open).

.. note::

   The general principles to remember are limited to:

   - Steps require assets to run (and optionally parameters)

   - Assets are made of two groups: *source* and *target* elements

   - Parameters are optional

   A step bundled with its source and target assets and its parameters represents a *task*.
   Steps and assets are explained below.

   >>> source = SomeStep.Assets.Source( inputentity )
   >>> target = SomeStep.Assets.Target( outputentity )
   >>> assets = SomeStep.Assets(source, target)
   >>> step = SomeStep( pathtoexecutable )
   >>> step.run( assets )

Steps and assets
----------------

Concept
^^^^^^^

All steps in the process are connected through intermediate data files, which we call assets.
Bioinformatics tools are almost always designed to operate on files, with pipes used occasionally.

A *step* can be represented as the step itself with input files (the *source* assets) and output files (the *target* assets).

.. note-stepassets-begin

.. graphviz::

   digraph ExampleAssets {
     "src1"->"step";
     "src1"[label="Some data", shape=invhouse, style=filled, fillcolor=lightgrey];
     "src2"->"step";
     "src2"[label="Other data", shape=invhouse, style=filled, fillcolor=lightgrey];
     "step"[label="Step"];
     "step"->"tgt1";
     "tgt1"[label="Product", shape=box];
   }

.. note-stepassets-end

This group of nodes (here 4 nodes: 2 sources, 1 step, 1 target) can also be collapsed into a single *step* node,
and a workflow can then be represented as the graph connecting these summary nodes.

For example, the initial steps for aligner-based RNA-Seq can be represented as:

.. note-sa-example-high-begin

.. graphviz::

   digraph HighLevel {
     subgraph cluster_3 {
       label="High-level";
       style=filled;
       color=lightgrey;
       node [style=filled,color=white];
       "Index"->"Align"->"Count";
     }
   }

.. note-sa-example-high-end

The Step-and-Assets graph will then look like this:

.. note-sa-example-begin

.. graphviz::

   digraph StepAssets {
     subgraph cluster_0 {
       label="Index";
       color=lightgrey;
       "ReferenceGenome"->"Indexer";
       "ReferenceGenome"[label="Reference Genome", shape=invhouse, colorscheme=set36, style=filled, fillcolor=1];
       "Indexer"->"IndexedGenomeS";
       "Indexer"[colorscheme=set36, style=filled, fillcolor=2];
       "IndexedGenomeS"[label="Indexed Genome", shape=box, colorscheme=set36, style=filled, fillcolor=2, penwidth=2];
     }
     subgraph cluster_1 {
       label="Align";
       color=lightgrey;
       "IndexedGenomeT"->"Aligner";
       "IndexedGenomeT"[label="Indexed Genome", shape=box, colorscheme=set36, style=filled, fillcolor=2, penwidth=2];
       "Reads1" -> "Aligner";
       "Reads1"[label="Reads 1", shape=invhouse, colorscheme=set36, style=filled, fillcolor=5];
       "Reads2" -> "Aligner";
       "Reads2"[label="Reads 2", shape=invhouse, colorscheme=set36, style=filled, fillcolor=5];
       "Aligner" -> "AlignedReadsS";
       "Aligner"[colorscheme=set36, style=filled, fillcolor=2];
       "AlignedReadsS"[label="Aligned Reads", shape=box, colorscheme=set36, style=filled, fillcolor=2, penwidth=2];
     }
     "IndexedGenomeS"->"IndexedGenomeT"[style="dotted",arrowhead="none"];
     subgraph cluster_2 {
       label="Count";
       color=lightgrey;
       "AlignedReadsT"->"Counter";
       "AlignedReadsT"[label="Aligned Reads", shape=box, colorscheme=set36, style=filled, fillcolor=2, penwidth=2];
       "ReferenceAnnotation" -> "Counter";
       "ReferenceAnnotation"[label="Reference Annotation", shape=invhouse, colorscheme=set36, style=filled, fillcolor=5];
       "Counter" -> "DigitalGeneExpression";
       "Counter"[colorscheme=set36, style=filled, fillcolor=2];
       "DigitalGeneExpression"[label="Digital Gene Expression", shape=box, colorscheme=set36, style=filled, fillcolor=2, penwidth=2];
     }
     "AlignedReadsS"->"AlignedReadsT"[style="dotted",arrowhead="none"];
   }

.. note-sa-example-end

The nodes `Indexed Genome` and `Aligned Reads` are duplicated
for clarity in the grouping of nodes,
but each is a single entity saved on disk, produced by one step and used by the next.
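
The collapsing of a Step-and-Assets graph into a high-level graph can be sketched
in plain Python. This is a toy model only, assuming a dictionary layout and a helper
function :func:`collapse` that are invented for this illustration; it does not use or
describe the package's own classes. Two steps are connected whenever a target asset
of one is a source asset of the other.

.. code-block:: python

   # Toy model (hypothetical, not the railroadtracks API): each step lists
   # the assets it reads (sources) and the assets it writes (targets).
   steps = {
       "Index": {"sources": ["Reference Genome"],
                 "targets": ["Indexed Genome"]},
       "Align": {"sources": ["Indexed Genome", "Reads 1", "Reads 2"],
                 "targets": ["Aligned Reads"]},
       "Count": {"sources": ["Aligned Reads", "Reference Annotation"],
                 "targets": ["Digital Gene Expression"]},
   }

   def collapse(steps):
       """Return the high-level edges (producer, consumer) between steps."""
       edges = []
       for producer, p_assets in steps.items():
           for consumer, c_assets in steps.items():
               if producer == consumer:
                   continue
               # Connected when a target of one step is a source of the other.
               if set(p_assets["targets"]) & set(c_assets["sources"]):
                   edges.append((producer, consumer))
       return edges

   high_level = collapse(steps)
   print(high_level)

With the three steps above, the edges found are Index to Align and Align to Count,
which matches the high-level graph.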

When using the framework, the easiest way to think about it is to start from
a step (a child class
of :class:`StepAbstract`) and look for its attribute :attr:`Assets`.
That attribute is a class representing the step's assets, and itself
contains two class attributes, :attr:`Source`
and :attr:`Target` (classes as well).

For example with the class modeling the aligner "STAR":

.. code-block:: python

   from railroadtracks.rnaseq import StarIndex, StarAlign

   # assets for the indexing of references (indexed for the alignment)
   StarIndex.Assets
   # assets for aligning reads against indexed references
   StarAlign.Assets

The assets are divided into two sets, `Source` and `Target`, for the files the step uses and the files
the step produces, respectively. :attr:`Source` and :attr:`Target` are themselves classes.

.. code-block:: python

   from railroadtracks import rnaseq
   from railroadtracks.rnaseq import StarIndex

   # assets for the indexing of references (indexed for the alignment)
   fastaref = rnaseq.SavedFASTA('/path/to/referencegenome.fasta')
   indexedref = rnaseq.FilePattern('/path/to/indexprefix')
   StarIndex.Assets(StarIndex.Assets.Source(fastaref),
                    StarIndex.Assets.Target(indexedref))

The resulting code is somewhat verbose, but with an IDE or an advanced interactive shell such as
`ipython`, autocompletion makes writing such statements relatively painless and mostly intuitive.
One can just start from the step considered (:class:`StarIndex` in the example above) and everything can be derived
from it.
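
The nesting that makes this discovery possible can be illustrated with a stand-alone
toy class. The names :class:`ToyStep` and :attr:`fields` are hypothetical and exist only
for this sketch; the real step classes are richer and may be organized differently.

.. code-block:: python

   class ToyStep:
       """Hypothetical step, standing in for a child class of StepAbstract."""

       class Assets:
           class Source:
               fields = ("reference",)

               def __init__(self, reference):
                   self.reference = reference

           class Target:
               fields = ("index",)

               def __init__(self, index):
                   self.index = index

           def __init__(self, source, target):
               self.source = source
               self.target = target

   # Starting from the step class, everything is reachable by attribute
   # access, which is what autocompletion relies on.
   source = ToyStep.Assets.Source("/path/to/referencegenome.fasta")
   target = ToyStep.Assets.Target("/path/to/indexprefix")
   assets = ToyStep.Assets(source, target)
   print(ToyStep.Assets.Source.fields, ToyStep.Assets.Target.fields)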


Unspecified assets
^^^^^^^^^^^^^^^^^^

In a traditional approach, for example a sequence of commands in `bash`, file names
must be specified at each step.

We are proposing an alternative with which a full *recipe* can be written without having to take care of file names for
derived data (which can become a burden when processing the same data in many alternative ways).
Only the original data files, such as the reference genome or the experimental data
from the samples and sequencing, are specified; the other file names are generated automatically.

Objects inheriting from :class:`AssetSet` are expected to have a method :meth:`AssetSet.createundefined` that
creates an undefined set of assets, and this can be used in recipes (see example below).

.. literalinclude:: ../src/test/test_core.py
   :language: python
   :start-after: # -- createundefined-begin
   :end-before: # -- createundefined-end


Notes about the design
^^^^^^^^^^^^^^^^^^^^^^

Data analysis is often a REPL activity, and we are keeping this in mind. The writing
of a recipe tries to provide:

- Autocompletion-based discovery. Documentation is rarely read cover to cover
  (congratulations for making it this far though), and dynamic scientists often proceed by
  trial and error with software. The package tries to provide benevolent support when doing so.

- Early failure whenever possible (that is, before long computations have been performed).

- The ability to write a full sequence of steps and run it unattended (tasks to be performed are stored, and executed when the user wants them to be).
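
The last two points can be sketched with a small stand-alone example. The class
:class:`ToyTaskQueue` and the function :func:`count_lines` are hypothetical and are not
part of the package; the sketch also simplifies matters by only accepting source files
that already exist, whereas a real recipe also has to handle sources produced by earlier tasks.

.. code-block:: python

   import os
   import tempfile

   class ToyTaskQueue:
       """Hypothetical queue: validate tasks when declared, run them later."""

       def __init__(self):
           self._tasks = []

       def add(self, func, source_paths):
           # Fail early: check the inputs when the task is declared,
           # not after hours of computation.
           missing = [p for p in source_paths if not os.path.exists(p)]
           if missing:
               raise FileNotFoundError(
                   "Missing source file(s): %s" % ", ".join(missing))
           self._tasks.append((func, tuple(source_paths)))

       def run_all(self):
           # Run the stored tasks in the order they were declared.
           results = [func(*paths) for func, paths in self._tasks]
           self._tasks = []
           return results

   def count_lines(path):
       with open(path) as fh:
           return sum(1 for _ in fh)

   queue = ToyTaskQueue()
   with tempfile.TemporaryDirectory() as tmpdir:
       reads = os.path.join(tmpdir, "reads.txt")
       with open(reads, "w") as fh:
           fh.write("ACGT\nTTGA\n")
       queue.add(count_lines, [reads])
       try:
           queue.add(count_lines, [os.path.join(tmpdir, "missing.txt")])
       except FileNotFoundError as err:
           print("rejected early:", err)
       # Nothing has been computed so far; the stored task runs now.
       results = queue.run_all()
   print(results)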


Setup
-----


.. literalinclude:: ../src/test/test_recipe.py
   :language: python
   :start-after: # -- recipe-init-begin
   :end-before: # -- recipe-init-end


The package is also able to generate a small dataset based on a phage:

.. literalinclude:: ../src/test/test_recipe.py
   :language: python
   :start-after: # -- recipe-data-begin
   :end-before: # -- recipe-data-end


Simple recipe
-------------

.. literalinclude:: ../src/test/test_recipe.py
   :language: python
   :start-after: # -- recipesimple-test-begin
   :end-before: # -- recipesimple-test-end


The variable `wd` contains the directory with all intermediate data and the final results,
and `db_dn` is the database file.

.. note::

   If you clean up after yourself, but want to run the next recipe in this documentation,
   the setup step will have to be run again.


Loops, nested loops, and many variants
--------------------------------------

.. literalinclude:: ../src/test/test_recipe.py
   :language: python
   :start-after: # -- recipeloop-test-begin
   :end-before: # -- recipeloop-test-end


Docstrings
----------

.. automodule:: railroadtracks.easy
   :members:
   :special-members: __init__


Troubleshooting
---------------

The standard Python module :mod:`logging` is used for logging; see its documentation for details.
For example, a very simple way to activate logging at the level *DEBUG* on *stdout*:

.. code-block:: python

   import logging
   import sys
   import railroadtracks
   # send log records to stdout, at the DEBUG level and above
   logging.basicConfig(stream=sys.stdout)
   logger = railroadtracks.logger
   logger.setLevel(logging.DEBUG)
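
For unattended runs it can be more convenient to keep the messages in a file.
The sketch below relies only on the standard :mod:`logging` module; the logger name
``"railroadtracks"`` and the file name ``recipe.log`` are assumptions made for this
example, and the logger object exposed by the package can be used in its place.

.. code-block:: python

   import logging
   import os
   import tempfile

   # Assumption: the package logger is registered under this name.
   logger = logging.getLogger("railroadtracks")
   logger.setLevel(logging.DEBUG)

   # Write the log to a file in a fresh temporary directory.
   log_path = os.path.join(tempfile.mkdtemp(), "recipe.log")
   handler = logging.FileHandler(log_path)
   handler.setFormatter(
       logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
   logger.addHandler(handler)

   logger.debug("logging is configured")
   handler.flush()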