DNA === .. :py:module:: ngs_plumbing.dna Encoding with fewer bits ------------------------ DNA sequences mostly use an alphabet of 4, or sometimes 5, letters: A, C, G, T, sometimes N. Representing them as ASCII (or worse Unicode) strings of characters can be seen as a waste of space since those strings are designed to accomodate much larger alphabets. A way to optimize space usage is to encode DNA as 2-bit or 3-bit sequences for 4-letter and 5-letter alphabets respectively. We refer to this as bit-packing. .. testsetup:: * from ngs_plumbing.dna import PackedDNABytes string = b'ATGCAAAATTTTCCCCGGGG' packed = PackedDNABytes(string) .. doctest:: >>> from ngs_plumbing.dna import PackedDNABytes >>> string = b'ATGCAAAATTTTCCCCGGGG' >>> packed = PackedDNABytes(string) >>> print(packed) >>> print(len(packed)) .. note:: The concept of bit-packit is used internally by a number of high-performance programs operating on DNA sequences, and is at the basis of the `.2bit` format. Regarding the later, our mapping `2 bits -> DNA` differs because we wanted complementary DNA bases to also be complementary at the bit level (a logical not transforming a sequence to its complementary). The resulting instance `packed` is roughly little more than a vector of bytes. Using the Python subsetting operator `[` will return packed subsequences. To subset the DNA bytes the convenience method :meth:`packed.unpack_slice()` can be used: .. doctest:: >>> s = packed.unpack_slice(1, 5) >>> print(s) .. doctest:: >>> print(dna_vec.bit_encoding) The complement, reverse, or reverse-complement of the bitpacked sequence can be obtained without having to unpack the DNA. .. autoclass:: ngs_plumbing.dna.PackedDNABytes :members: Performances ^^^^^^^^^^^^ Encoding DNA in a 2-bit fashion reduces the space usage by 4 when compared to an ASCII encoding (2 bits vs 8 bits). A 4 Mb bacterial genome would now fit into 1 Mb, and would also be much faster to read and write to disk. .. code-block:: python from ngs_plumbing import dna # create a dummy genome random_ecoli = dna.randomgenome(long(4E6)) vec = dna.PackedDNABytes(random_ecoli) # save it onto disk f = file('myrndcoli.dat', 'w') vec.tofile(f) f.close() # read it back vec2 = dna.PackedDNABytes() f = file('myrndcoli.dat', 'r') vec2.fromfile(f, len(vec)) f.close() That concept is simplemented in a simple format FASTAB (see module :mod:`ngs_plumbing.fasta`). Reading up all DNA strings can also be a performance bottleneck, and there are convenience methods for handling Python I/O streams.