DNA
===

.. :py:module:: ngs_plumbing.dna

Encoding with fewer bits
------------------------

DNA sequences mostly use an alphabet of 4, or sometimes 5, letters: 
A, C, G, T, sometimes N. Representing them as ASCII (or worse Unicode)
strings of characters can be seen as a waste of space since those
strings are designed to accomodate much larger alphabets. A way to optimize
space usage is to encode DNA as 2-bit or 3-bit sequences for 4-letter
and 5-letter alphabets respectively. We refer to this as bit-packing.

.. testsetup:: *

   from ngs_plumbing.dna import PackedDNABytes
   string = b'ATGCAAAATTTTCCCCGGGG'
   packed = PackedDNABytes(string)

.. doctest::

   >>> from ngs_plumbing.dna import PackedDNABytes
   >>> string = b'ATGCAAAATTTTCCCCGGGG'
   >>> packed = PackedDNABytes(string)
   >>> print(packed)
   >>> print(len(packed))

.. note::

   The concept of bit-packit is used internally by a number of high-performance programs
   operating on DNA sequences, and is at the basis of the `.2bit` format. Regarding the
   later, our mapping `2 bits -> DNA` differs because we wanted complementary DNA bases
   to also be complementary at the bit level (a logical not transforming a sequence to
   its complementary).

 
The resulting instance `packed` is roughly little
more than a vector of bytes. Using the
Python subsetting operator `[` will return packed subsequences.

To subset the DNA bytes the convenience method
:meth:`packed.unpack_slice()` can be used:

.. doctest::

   >>> s = packed.unpack_slice(1, 5)   
   >>> print(s)

.. doctest::

   >>> print(dna_vec.bit_encoding)

The complement, reverse, or reverse-complement of the bitpacked sequence can be obtained
without having to unpack the DNA.


.. autoclass:: ngs_plumbing.dna.PackedDNABytes
   :members:

Performances
^^^^^^^^^^^^

Encoding DNA in a 2-bit fashion reduces the space usage by 4 when compared
to an ASCII encoding (2 bits vs 8 bits). A 4 Mb bacterial genome would now
fit into 1 Mb, and would also be much faster to read and write to disk.
 
.. code-block:: python

   from ngs_plumbing import dna
   # create a dummy genome
   random_ecoli = dna.randomgenome(long(4E6))
   
   vec = dna.PackedDNABytes(random_ecoli)
   # save it onto disk
   f = file('myrndcoli.dat', 'w')
   vec.tofile(f)
   f.close()

   # read it back
   vec2 = dna.PackedDNABytes()
   f = file('myrndcoli.dat', 'r')
   vec2.fromfile(f, len(vec))
   f.close()

That concept is simplemented in a simple format FASTAB (see module :mod:`ngs_plumbing.fasta`).

Reading up all DNA strings can also be a performance bottleneck, and
there are convenience methods for handling Python I/O streams.