.. _section-parsing:

Parsing
=======

.. module:: ngs_plumbing.parsing
   :synopsis: NGS data in files.

Bioinformatics, and parsing and storing collections of (flat) files... we
deserve better than that.

Yet another set of FASTQ / FASTA parsers
----------------------------------------

Yes, this is just another parser. The twist here is the attempt to change the
way the data is approached by establishing the following guiding principles:

- Do not care about the file format, but about the information to be
  extracted.
- Forget the sacrosanct atomic step `(flat) file -> tool -> (flat) file` and
  consider the chaining of data streams (of objects). Evaluation is lazy:
  what you need will be fetched when you need it.

The first principle is implemented in the function :func:`parsing.open`:

.. code-block:: python

   # We want to parse out data
   from ngs_plumbing import parsing

   filename = 'foo.fastq'
   # Open the file, guessing the file format from the extension
   sequence_data = parsing.open(filename)

.. note::

   :py:obj:`filename` could have been a FASTA file, compressed with `gzip`
   or `bzip2`, or a SAM/BAM file (although this is a little less tested at
   the moment), and it would have worked with the exact same call:

   .. code-block:: python

      filename = 'foo.fastq.gz'
      sequence_data = parsing.open(filename)

The resulting object :py:obj:`sequence_data` can be iterated through, and
data extracted.

.. image:: _static/pipeline_reads.png
   :width: 70%

For example, if we want the sequence and its quality:

.. code-block:: python

   # Iterate through sequence+quality entries
   seq_stream = sequence_data.iter_seqqual()

The iterator will return objects with attributes :attr:`sequence` and
:attr:`quality`. No matter how big the sequencing file is, the memory usage
remains low. This leads to the second principle: chaining and lazy evaluation
with iterators. The next entry in the iterator can be retrieved with
:func:`next`:
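The lazy-iteration idea can be sketched with a plain Python generator. This
is an illustration of the principle only, not the actual ngs_plumbing
implementation; the `Entry` namedtuple and the `iter_seqqual` function below
are invented for the example:

.. code-block:: python

   import io
   from collections import namedtuple

   # Illustrative stand-in for the entries yielded by ngs_plumbing
   Entry = namedtuple('Entry', ('header', 'sequence', 'quality'))

   def iter_seqqual(fh):
       # Yield one FASTQ record (4 lines) at a time; nothing is read
       # from the file until the caller asks for the next entry.
       while True:
           header = fh.readline().rstrip()
           if not header:
               return
           sequence = fh.readline().rstrip()
           fh.readline()  # skip the '+' separator line
           quality = fh.readline().rstrip()
           yield Entry(header, sequence, quality)

   fastq = io.BytesIO(b'@read1\nACGT\n+\nIIII\n'
                      b'@read2\nTTGA\n+\nHHHH\n')
   seq_stream = iter_seqqual(fastq)
   entry = next(seq_stream)
   # entry.sequence is b'ACGT', entry.quality is b'IIII'

Because `iter_seqqual` is a generator, memory usage stays constant no matter
how many records the file contains.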
.. code-block:: python

   entry = next(seq_stream)
   # length of the read
   len(entry.sequence)

.. note::

   Performance is reasonable for my needs: a 0.430 GB FASTQ file can be
   iterated through in 4 seconds on my development laptop (Core i7,
   Python 3.3). That is slower than a practical lower bound, which is
   counting the lines in the file with:

   .. code-block:: bash

      wc -l

   This runs in 0.13 s (after repeated access, so with the file buffered by
   the system). Using a `benchmark on Biostar`_, we calculate that we are
   only 6.7 times slower than the fastest C implementation (seqtk), which is
   not bad (~100 MB / second) considering that this is pure Python (with all
   the benefits of being usable from Python). The low-level interface in
   :mod:`biopython` (:class:`Bio.SeqIO.QualityIO.FastqGeneralIterator`)
   achieves similar performance to ours on that benchmark (I could not get
   the current biopython to install on Python 3.3, and so could not verify
   it). If speed is a bottleneck, get in touch.

.. _benchmark on Biostar: http://www.biostars.org/p/10353/

Pulling entries into a pipeline
-------------------------------

A common operation is to filter reads according to quality. We can just
implement a filter function (returning `True` if the read passes the filter,
`False` otherwise):

.. code-block:: python

   # quality bytes are ASCII codes (33-126), so the threshold is an
   # average ASCII code
   AVG_QUAL = 60

   def has_avgqual(x):
       # x.quality is bytes; summing bytes sums their ASCII codes
       quality = x.quality
       return (sum(quality) / len(quality)) > AVG_QUAL

   filter(has_avgqual, seq_stream)

Filtering on read lengths (useful on Ion Torrent data, for example) is
similar:

.. code-block:: python

   LENGTH = 150

   def is_long(x):
       return len(x.sequence) > LENGTH

   filter(is_long, seq_stream)

Another possible operation is to look for the barcode 'ATACAGAC', located
33 bases after the beginning of the read (yes, molecular biology tricks make
such, or even weirder, situations occur).
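The quality filter can be exercised without a real FASTQ file. In this
self-contained sketch, the `Entry` namedtuple and the two sample reads are
invented for the illustration; the useful fact is that in Python 3 iterating
over bytes yields integers, so `'I'` contributes 73 and `'#'` contributes 35
to the sum:

.. code-block:: python

   from collections import namedtuple

   # Illustrative stand-in for the entries yielded by ngs_plumbing
   Entry = namedtuple('Entry', ('sequence', 'quality'))

   AVG_QUAL = 60  # threshold on the average raw ASCII code

   def has_avgqual(x):
       # iterating over bytes yields integers in Python 3
       quality = x.quality
       return (sum(quality) / len(quality)) > AVG_QUAL

   reads = iter([Entry(b'ACGT', b'IIII'),    # average ASCII code 73
                 Entry(b'TTGA', b'####')])   # average ASCII code 35
   kept = list(filter(has_avgqual, reads))
   # only the first read passes the filter

In a real pipeline one would of course not call :func:`list`, which pulls
the whole stream; it is only used here to look at the result.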
.. code-block:: python

   def has_barcode(entry):
       # entry.sequence is bytes, so the prefix must be bytes too
       return entry.sequence.startswith(b'ATACAGAC', 33)

Now we can write the combined filter as:

.. code-block:: python

   barcoded_stream = filter(has_barcode,
                            filter(has_avgqual, seq_stream))

The evaluation is lazy: querying for the first good barcoded read will only
unwind the stream of sequences (checking the average quality and the presence
of a barcode) until the first such read is found. No intermediate flat files,
and the ability to quickly check interactively whether anything appears to be
coming out of it:

.. code-block:: python

   good_read = next(barcoded_stream)

.. note::

   When performance matters, combining simple filters into one function
   rather than chaining filters might make a difference.

Now let's imagine for a second that an alignment algorithm for short reads
is, or can be, accessed as a Python module (we shall call it
`groovyaligner`). This module could have a function `align()` that takes
Python objects with attributes :attr:`sequence` and :attr:`quality`:

.. code-block:: python

   import groovyaligner as ga

   mapping_stream = map(ga.align, barcoded_stream)

Now `mapping_stream` would be an iterator of alignments. Again, no flat file
involved, no parsing, and actually no computation done until the program
starts pulling elements one by one through the chain of iterators.

.. note::

   There is actually a "groovyaligner" in the works, to be included in this
   package.

This lets one inspect the final outcome of chained iterators without
completely computing all intermediate steps, something we believe to be
really valuable when working interactively:

.. code-block:: python

   from itertools import islice

   # an iterator over the first 10 mappings
   first_mappings = islice(mapping_stream, 0, 10)

Paired-end data
---------------

Paired-end data are often stored in two files. The iterators work just fine
with them.
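The laziness of such a chain can be verified with a stand-in aligner that
records when it is actually called. Everything below (`fake_align`, the toy
reads, the simplified barcode test) is invented for this illustration and is
not part of ngs_plumbing:

.. code-block:: python

   calls = []

   def fake_align(seq):
       # stand-in for groovyaligner.align(): record that work happened
       calls.append(seq)
       return ('aligned', seq)

   def reads():
       yield b'ATACAGACAAAA'
       yield b'CCCCCCCCCCCC'
       yield b'ATACAGACTTTT'

   def has_barcode(seq):
       # simplified: barcode at the start instead of at offset 33
       return seq.startswith(b'ATACAGAC')

   mapping_stream = map(fake_align, filter(has_barcode, reads()))
   assert calls == []          # nothing has been computed yet
   first = next(mapping_stream)
   assert len(calls) == 1      # only the first barcoded read was aligned

In Python 3, :func:`map` and :func:`filter` both return iterators, so
building the chain costs nothing; work only happens as elements are pulled.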
.. code-block:: python

   # We want to parse out data
   from ngs_plumbing import parsing

   filename_1 = 'foo_s_1_1.fastq'
   filename_2 = 'foo_s_1_2.fastq'
   # Open the files, guessing the file format from the extension
   sequence_pairs = zip(parsing.open(filename_1),
                        parsing.open(filename_2))

.. warning::

   With Python 2, :func:`zip` does not return an iterator but builds the
   list of *all* pairs at once. Use :func:`itertools.izip` instead.

.. note::

   Depending on the sample preparation, one might want to change the
   orientation of one of the reads in the pair, with a helper such as:

   .. code-block:: python

      def revcomp(x):
          # reverse complement of x.sequence
          ...

Getting the next pair is then:

.. code-block:: python

   pair = next(sequence_pairs)

Formats
-------

The handling of file formats is restricted to NGS data, and oriented toward
what we consider the most common operations with such data. For a complete
set of parsers covering practically all existing formats in bioinformatics,
there is :mod:`biopython`.

.. toctree::
   :maxdepth: 2

   fastq
   fasta