FASTQ files
===========

.. module:: ngs_plumbing.fastq
   :synopsis: FASTQ files.

FASTQ files are like FASTA files, but with added information
about the quality of each base in the sequence (sequence
and quality are separated by a line with only a ``+``).

.. code-block:: python

   from ngs_plumbing import fastq
   fn = 'myreads.fq'
   fq = fastq.FastqFile(fn)

   entry = next(fq)
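
To make the four-line layout described above concrete, the sketch below
parses it using only the standard library. It is not the implementation
behind ``FastqFile``: the function name ``iter_fastq_records`` is
hypothetical, and the sketch assumes that the sequence and the quality
string each fit on a single line.

.. code-block:: python

   import io

   def iter_fastq_records(handle):
       """Yield (header, sequence, quality) tuples from a text handle.

       Assumption: every record spans exactly four lines.
       """
       while True:
           header = handle.readline().rstrip('\r\n')
           if not header:
               return  # end of file (or a trailing empty line)
           sequence = handle.readline().rstrip('\r\n')
           separator = handle.readline().rstrip('\r\n')
           quality = handle.readline().rstrip('\r\n')
           if not header.startswith('@') or not separator.startswith('+'):
               raise ValueError('malformed FASTQ record: %r' % header)
           if len(sequence) != len(quality):
               raise ValueError('sequence and quality differ in length')
           # Drop the leading '@' from the header line.
           yield header[1:], sequence, quality

   example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n"
   records = list(iter_fastq_records(io.StringIO(example)))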

.. note::

   A text-based format is not very efficient for larger files.
   Rather than propose a new standard (see below), we give tools to create
   something that will work for your own use.

   .. image:: http://imgs.xkcd.com/comics/standards.png

   FASTQ files generally contain a large number of rows, each with a
   potentially different length, and we have to look for an end-of-line
   character for each one of them. In other words, during parsing every
   single character in the file is compared to LF, CR, or a combination
   of both. A binary file would make things significantly faster
   (benchmark below).

   One possible binary layout keeps the text as ASCII and prefixes each
   entry with three sizes:

   - size (number of characters) of the header
   - size of the DNA sequence
   - size of the quality string

   .. code-block:: python

      import struct

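      # fh is assumed to be a file opened in binary mode ('rb'),
      # positioned at the start of an entry.
      s_sizes = 'III'
      s_head, s_sequence, s_quality = struct.unpack(
          s_sizes, fh.read(struct.calcsize(s_sizes)))
      raw = fh.read(s_head + s_sequence + s_quality)
      # Split the payload into its three fields using the sizes just read.
      s_fields = '%ds%ds%ds' % (s_head, s_sequence, s_quality)
      head, sequence, quality = struct.unpack(s_fields, raw)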
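
The size-prefixed layout outlined in the note can be tried end to end with
the sketch below, which writes one entry to an in-memory buffer and reads
it back. The names ``write_record`` and ``read_record`` are hypothetical
and not part of ``ngs_plumbing``. The fixed little-endian format ``'<III'``
is an assumption made here so that files stay portable across machines; the
snippet in the note uses the native ``'III'``.

.. code-block:: python

   import io
   import struct

   # Assumption: three little-endian unsigned 32-bit sizes precede each entry.
   SIZES = struct.Struct('<III')

   def write_record(fh, head, sequence, quality):
       """Write one entry (three bytes objects) with its size prefix."""
       fh.write(SIZES.pack(len(head), len(sequence), len(quality)))
       fh.write(head + sequence + quality)

   def read_record(fh):
       """Read one entry, or return None at the end of the stream."""
       prefix = fh.read(SIZES.size)
       if not prefix:
           return None
       if len(prefix) < SIZES.size:
           raise ValueError('truncated size prefix')
       s_head, s_sequence, s_quality = SIZES.unpack(prefix)
       raw = fh.read(s_head + s_sequence + s_quality)
       if len(raw) < s_head + s_sequence + s_quality:
           raise ValueError('truncated entry')
       # No end-of-line search: the sizes say exactly where each field ends.
       head = raw[:s_head]
       sequence = raw[s_head:s_head + s_sequence]
       quality = raw[s_head + s_sequence:]
       return head, sequence, quality

   buffer = io.BytesIO()
   write_record(buffer, b'read1', b'ACGT', b'IIII')
   buffer.seek(0)
   entry = read_record(buffer)

Reading an entry this way takes two ``read`` calls of known size, whatever
the length of the sequence.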


Paired FASTQ files
------------------

When sequencing short fragments, several technologies make it possible
to sequence both ends of a fragment, or the same ends (in the 3'/5' sense)
of a double-stranded fragment. This is called paired-end sequencing
(the most common) or mate-pair sequencing.

When the sequences for both ends of the fragments are stored in two
different files, the files can be read together, one pair of entries at a time:

.. code-block:: python

   from ngs_plumbing import fastq
   fn_1 = 'myreads_s_1_1.fq'
   fn_2 = 'myreads_s_1_2.fq'
   fq_pair = fastq.FastqFilePair(fn_1, fn_2)

   entry_1, entry_2 = next(fq_pair)
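
Reading two files in lockstep only makes sense if they stay in sync. The
sketch below shows one way to check this on plain Python iterables, and is
not a description of what ``FastqFilePair`` does internally. The names
``base_name`` and ``iter_pairs`` are hypothetical, entries are assumed to
be ``(header, sequence, quality)`` tuples, and mates are assumed to be
named ``<name>/1`` and ``<name>/2``, which is a common convention but not
a universal one.

.. code-block:: python

   from itertools import zip_longest

   _MISSING = object()  # sentinel marking an exhausted input

   def base_name(header):
       """Strip a trailing '/1' or '/2' (assumed mate suffix)."""
       if header.endswith('/1') or header.endswith('/2'):
           return header[:-2]
       return header

   def iter_pairs(entries_1, entries_2):
       """Yield (entry_1, entry_2), checking that the inputs stay in sync."""
       for entry_1, entry_2 in zip_longest(entries_1, entries_2,
                                           fillvalue=_MISSING):
           if entry_1 is _MISSING or entry_2 is _MISSING:
               raise ValueError('inputs hold different numbers of entries')
           if base_name(entry_1[0]) != base_name(entry_2[0]):
               raise ValueError('mismatched pair: %r / %r'
                                % (entry_1[0], entry_2[0]))
           yield entry_1, entry_2

   reads_1 = [('frag1/1', 'ACGT', 'IIII')]
   reads_2 = [('frag1/2', 'TTGA', 'IIII')]
   pairs = list(iter_pairs(reads_1, reads_2))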