FASTQ files

FASTQ files are like FASTA files, but each entry also carries a quality score for every base in the sequence. Within an entry, the sequence and the quality string are separated by a line that starts with a + (and usually contains nothing else).

from ngs_plumbing import fastq
fn = 'myreads.fq'
fq = fastq.FastqFile(fn)

# FastqFile is an iterator: each call to next() returns one entry
entry = next(fq)
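
To make the four-line layout described above concrete, here is a minimal sketch in plain Python (it does not use the ngs_plumbing API) that splits one record and decodes its quality string. The sample read, the helper name parse_record, and the Phred+33 offset are assumptions for illustration; some older Illumina files use an offset of 64 instead.

```python
# A single FASTQ record: header, sequence, '+' separator, quality string.
# The read below is made up for illustration.
record = "@read_1\nACGTN\n+\nII5!#\n"


def parse_record(text, offset=33):
    """Split a four-line FASTQ record and decode its quality scores.

    Hypothetical helper: assumes exactly one record, no line wrapping,
    and Phred scores encoded as ASCII characters with the given offset.
    """
    header, sequence, separator, quality = text.splitlines()
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # One integer score per base
    scores = [ord(c) - offset for c in quality]
    return header[1:], sequence, scores


name, sequence, scores = parse_record(record)
```

The length check reflects the one structural guarantee of the format: there is exactly one quality character per base.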

Note

A text-based format is not very efficient for larger files. Rather than proposing a new standard (see below), we provide tools to build something that works for your own use.

http://imgs.xkcd.com/comics/standards.png

FASTQ files generally contain a large number of rows, each of potentially different length, so the parser has to look for an end-of-line character in every one of them. In other words, during parsing every single character in the file is compared to LF, CR, or a combination of both. A binary file would make things significantly faster (benchmark below).

A possible binary layout:

  • strings stored as ASCII

  • for a given entry, three sizes stored ahead of the data:
    • size (number of characters) of the header
    • size of the DNA sequence
    • size of the quality string

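import struct

# Each entry starts with three unsigned integers: the sizes of the
# header, the DNA sequence and the quality string.
s_sizes = 'III'
# fh is assumed to be a file opened in binary mode ('rb')
s_head, s_sequence, s_quality = struct.unpack(s_sizes, fh.read(struct.calcsize(s_sizes)))
raw = fh.read(s_head + s_sequence + s_quality)
# Split the raw bytes into the three fields, using the sizes just read
head, sequence, quality = struct.unpack('%ds%ds%ds' % (s_head, s_sequence, s_quality), raw)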
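
The reading code above presumes a file already written in that layout. A minimal round-trip sketch could look like the following. It assumes little-endian unsigned 32-bit sizes and ASCII-encoded fields; write_entry and read_entry are hypothetical helper names, not part of ngs_plumbing, and an in-memory buffer stands in for a real binary file.

```python
import io
import struct

SIZES = struct.Struct("<III")  # assumed: three little-endian uint32 sizes


def write_entry(fh, head, sequence, quality):
    """Write one entry as three sizes followed by the three ASCII fields."""
    fields = [s.encode("ascii") for s in (head, sequence, quality)]
    fh.write(SIZES.pack(*(len(f) for f in fields)))
    fh.write(b"".join(fields))


def read_entry(fh):
    """Read one entry back; return None at end of file."""
    prefix = fh.read(SIZES.size)
    if not prefix:
        return None
    if len(prefix) < SIZES.size:
        raise ValueError("truncated size prefix")
    s_head, s_sequence, s_quality = SIZES.unpack(prefix)
    raw = fh.read(s_head + s_sequence + s_quality)
    if len(raw) != s_head + s_sequence + s_quality:
        raise ValueError("truncated entry")
    # No end-of-line scanning: the sizes say exactly where each field ends
    head = raw[:s_head]
    sequence = raw[s_head:s_head + s_sequence]
    quality = raw[s_head + s_sequence:]
    return tuple(f.decode("ascii") for f in (head, sequence, quality))


# Round trip through an in-memory buffer (made-up reads).
buf = io.BytesIO()
write_entry(buf, "read_1", "ACGT", "IIII")
write_entry(buf, "read_2", "GG", "#!")
buf.seek(0)
entries = []
while True:
    entry = read_entry(buf)
    if entry is None:
        break
    entries.append(entry)
```

Fixing the byte order in the format string (here "<") keeps the files portable across machines, which the native 'III' format does not guarantee.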

Paired FASTQ files

When sequencing short fragments, a number of technologies can sequence both ends of a fragment, or the same ends (in the 3’/5’ sense) of a double-stranded fragment. This is called paired-end sequencing (the most common) or mate-pair sequencing.

When the sequences for the two ends of the fragments are stored in 2 different files, both files can be read together:

from ngs_plumbing import fastq
fn_1 = 'myreads_s_1_1.fq'
fn_2 = 'myreads_s_1_2.fq'
fq_pair = fastq.FastqFilePair(fn_1, fn_2)

entry_1, entry_2 = next(fq_pair)
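
What iterating over a pair amounts to can be sketched in plain Python, independently of ngs_plumbing: read both files in lockstep and yield one entry from each. The helper names (read_fastq, read_pairs), the in-memory sample data, and the /1 and /2 read-name suffixes are assumptions for illustration, not the library's implementation.

```python
import io
from itertools import zip_longest


def read_fastq(fh):
    """Yield (name, sequence, quality) from an unwrapped four-line FASTQ stream.

    Stops at end of file or at the first empty line.
    """
    while True:
        header = fh.readline().rstrip("\r\n")
        if not header:
            return
        sequence = fh.readline().rstrip("\r\n")
        fh.readline()  # skip the '+' separator line
        quality = fh.readline().rstrip("\r\n")
        yield header[1:], sequence, quality


def read_pairs(fh_1, fh_2):
    """Yield matching entries from two streams; fail if they get out of step."""
    for e1, e2 in zip_longest(read_fastq(fh_1), read_fastq(fh_2)):
        if e1 is None or e2 is None:
            raise ValueError("files contain different numbers of entries")
        # Assumed naming convention: mates share a name up to a /1 or /2 suffix
        if e1[0].rsplit("/", 1)[0] != e2[0].rsplit("/", 1)[0]:
            raise ValueError("mismatched pair: %s vs %s" % (e1[0], e2[0]))
        yield e1, e2


# In-memory stand-ins for the two files (made-up reads).
fq_1 = io.StringIO("@frag_1/1\nACGT\n+\nIIII\n@frag_2/1\nTTGA\n+\nII#I\n")
fq_2 = io.StringIO("@frag_1/2\nGGCA\n+\nIII!\n@frag_2/2\nCCAT\n+\n#III\n")
pairs = list(read_pairs(fq_1, fq_2))
```

Using zip_longest rather than zip means a file that ends early raises an error instead of silently dropping the unmatched reads.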
