FASTA files

FASTA files have been around for a long time and remain in use today. Their success might have to do with their relatively simple structure and the fact that they are plain text (ASCII) files:

• A header line (starting with >)
• An arbitrary number of lines for the sequence
• Repeat the above if necessary
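The three rules above can be sketched as a short generator. This is only an illustration of the format, not the parser shipped in ngs_plumbing; the function name `parse_fasta` and the sample records are made up for the example.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) tuples from an iterable of text lines.

    Hypothetical helper for illustration; not part of ngs_plumbing.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            # Sequence lines are concatenated; blank lines are skipped.
            chunks.append(line)
    # Emit the last entry once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Made-up sample data: two entries, the first spanning two sequence lines.
example = io.StringIO(">seq1 first\nACGT\nACGT\n>seq2\nTTGA\n")
records = list(parse_fasta(example))
```

Because the function is a generator, entries are produced one at a time and a large file never has to be held in memory as a whole.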

Countless FASTA parsers have been implemented, but given how simple it is to write one, we offer one here as well so that a third-party package is not required for this alone.

from ngs_plumbing import fasta
fn = 'mygenome.fa'
fa = fasta.FastaFile(fn)

for entry in fa:
    pass  # process each entry here

What we add here is a twist: a way to handle binary FASTA, with the associated benefits of smaller storage space, shorter loading times, and shorter access times to retrieve a specific entry.

from ngs_plumbing import fasta
fn_a = 'mygenome.fa'
fn_b = 'mygenome.fab'
fasta.FastabFile.from_fastafile(fn_a, fn_b)

fb = fasta.FastabFile(fn_b)


Iterating through the file can be achieved with:

for entry in fb:
    pass  # process each entry here

Note

The sequence is 2-bit encoded, and the function ngs_plumbing.dna.bytes_frombit2bytes() should be used to obtain the DNA.
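To give an idea of what 2-bit encoding means, here is a minimal sketch that packs four bases into each byte. It does not reproduce the layout used by ngs_plumbing: the mapping A=0, C=1, G=2, T=3, the bit order within a byte, and the function names `pack_2bit` and `unpack_2bit` are all assumptions made for the example. Ambiguous bases such as N are not handled.

```python
BASES = "ACGT"  # assumed mapping A=0, C=1, G=2, T=3; the library may differ


def pack_2bit(seq):
    """Pack a DNA string into bytes, four bases per byte (hypothetical helper)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        code = BASES.index(base)  # raises ValueError on anything but A, C, G, T
        # The first base of each group of four goes into the two highest bits.
        out[i // 4] |= code << (6 - 2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed, length):
    """Recover the DNA string; the length is needed to drop padding bits."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )


packed = pack_2bit("ACGTAC")       # 6 bases fit in 2 bytes
restored = unpack_2bit(packed, 6)  # round trip back to the original string
```

Storing two bits per base instead of one byte per character is where the roughly fourfold reduction in storage comes from.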
