DNA¶

Encoding with fewer bits¶

DNA sequences mostly use an alphabet of 4, or sometimes 5, letters: A, C, G, T, sometimes N. Representing them as ASCII (or worse Unicode) strings of characters can be seen as a waste of space since those strings are designed to accomodate much larger alphabets. A way to optimize space usage is to encode DNA as 2-bit or 3-bit sequences for 4-letter and 5-letter alphabets respectively. We refer to this as bit-packing.

>>> from ngs_plumbing.dna import PackedDNABytes
>>> string = b'ATGCAAAATTTTCCCCGGGG'
>>> packed = PackedDNABytes(string)
>>> print(packed)
>>> print(len(packed))

Note

The concept of bit-packit is used internally by a number of high-performance programs operating on DNA sequences, and is at the basis of the .2bit format. Regarding the later, our mapping 2 bits -> DNA differs because we wanted complementary DNA bases to also be complementary at the bit level (a logical not transforming a sequence to its complementary).

The resulting instance packed is roughly little more than a vector of bytes. Using the Python subsetting operator [ will return packed subsequences.

To subset the DNA bytes the convenience method packed.unpack_slice() can be used:

>>> s = packed.unpack_slice(1, 5)
>>> print(s)

>>> print(dna_vec.bit_encoding)

The complement, reverse, or reverse-complement of the bitpacked sequence can be obtained without having to unpack the DNA.

class ngs_plumbing.dna.PackedDNABytes(string='', bits=2, dopack=True)[source]¶

Representation of a DNA string as an array of bytes

bit_encoding¶: Number of bits used for encoding

complemented()[source]¶: Return the complement, still packed.

reversecomplemented()[source]¶: Return the reverse complement, still packed.

reversed()[source]¶: Return the reverse, still packed.

unpack_slice(i=None, j=None)[source]¶: Get the unpacked a subsequence as bytes (zero-offset indexing, like other Python sequences)

Performances¶

Encoding DNA in a 2-bit fashion reduces the space usage by 4 when compared to an ASCII encoding (2 bits vs 8 bits). A 4 Mb bacterial genome would now fit into 1 Mb, and would also be much faster to read and write to disk.

from ngs_plumbing import dna
# create a dummy genome
random_ecoli = dna.randomgenome(long(4E6))

vec = dna.PackedDNABytes(random_ecoli)
# save it onto disk
f = file('myrndcoli.dat', 'w')
vec.tofile(f)
f.close()

# read it back
vec2 = dna.PackedDNABytes()
f = file('myrndcoli.dat', 'r')
vec2.fromfile(f, len(vec))
f.close()

That concept is simplemented in a simple format FASTAB (see module ngs_plumbing.fasta).

Reading up all DNA strings can also be a performance bottleneck, and there are convenience methods for handling Python I/O streams.

DNA¶

Encoding with fewer bits¶

Performances¶

Table Of Contents

Previous topic

Next topic

This Page

Navigation

DNA¶

Encoding with fewer bits¶

Performances¶

Table Of Contents

Previous topic

Next topic

This Page

Quick search

Navigation