DNA

Encoding with fewer bits

DNA sequences mostly use an alphabet of 4, or sometimes 5, letters: A, C, G, T, sometimes N. Representing them as ASCII (or worse Unicode) strings of characters can be seen as a waste of space since those strings are designed to accomodate much larger alphabets. A way to optimize space usage is to encode DNA as 2-bit or 3-bit sequences for 4-letter and 5-letter alphabets respectively. We refer to this as bit-packing.

>>> from ngs_plumbing.dna import PackedDNABytes
>>> string = b'ATGCAAAATTTTCCCCGGGG'
>>> packed = PackedDNABytes(string)
>>> print(packed)
>>> print(len(packed))

Note

The concept of bit-packit is used internally by a number of high-performance programs operating on DNA sequences, and is at the basis of the .2bit format. Regarding the later, our mapping 2 bits -> DNA differs because we wanted complementary DNA bases to also be complementary at the bit level (a logical not transforming a sequence to its complementary).

The resulting instance packed is roughly little more than a vector of bytes. Using the Python subsetting operator [ will return packed subsequences.

To subset the DNA bytes the convenience method packed.unpack_slice() can be used:

>>> s = packed.unpack_slice(1, 5)
>>> print(s)
>>> print(dna_vec.bit_encoding)

The complement, reverse, or reverse-complement of the bitpacked sequence can be obtained without having to unpack the DNA.

class ngs_plumbing.dna.PackedDNABytes(string='', bits=2, dopack=True)[source]

Representation of a DNA string as an array of bytes

bit_encoding

Number of bits used for encoding

complemented()[source]

Return the complement, still packed.

reversecomplemented()[source]

Return the reverse complement, still packed.

reversed()[source]

Return the reverse, still packed.

unpack_slice(i=None, j=None)[source]

Get the unpacked a subsequence as bytes (zero-offset indexing, like other Python sequences)

Performances

Encoding DNA in a 2-bit fashion reduces the space usage by 4 when compared to an ASCII encoding (2 bits vs 8 bits). A 4 Mb bacterial genome would now fit into 1 Mb, and would also be much faster to read and write to disk.

from ngs_plumbing import dna
# create a dummy genome
random_ecoli = dna.randomgenome(long(4E6))

vec = dna.PackedDNABytes(random_ecoli)
# save it onto disk
f = file('myrndcoli.dat', 'w')
vec.tofile(f)
f.close()

# read it back
vec2 = dna.PackedDNABytes()
f = file('myrndcoli.dat', 'r')
vec2.fromfile(f, len(vec))
f.close()

That concept is simplemented in a simple format FASTAB (see module ngs_plumbing.fasta).

Reading up all DNA strings can also be a performance bottleneck, and there are convenience methods for handling Python I/O streams.

Table Of Contents

Previous topic

Installation

Next topic

DNA with quality

This Page