FASTQ files
===========

.. module:: ngs_plumbing.fastq
   :synopsis: FASTQ files.

FASTQ files are like FASTA files, but with added information about the
quality of each base in the sequence (sequence and quality are separated
by a line containing only a ``+``).

.. code-block:: python

   from ngs_plumbing import fastq
   fn = 'myreads.fq'
   fq = fastq.FastqFile(fn)
   entry = next(fq)

.. note::

   A text-based format is not very efficient for larger files.
   Rather than propose a new standard (see below), we give tools
   to create something that will work for your own use.

   .. image:: http://imgs.xkcd.com/comics/standards.png

FASTQ files generally contain a large number of lines, each potentially of a
different length, and the parser has to look for an end-of-line character for
each one of them. In other words, during parsing every single character in the
file is compared to LF, CR, or a combination of both.

A binary file would make things significantly faster (benchmark below).

- ASCII
- for a given entry:

  - size (number of characters) of the header
  - size of the DNA sequence
  - size of the quality string

.. code-block:: python

   import struct

   # Three unsigned ints: sizes of the header, sequence and quality strings.
   s_sizes = 'III'
   # `fh` is assumed to be a file opened in binary mode ('rb').
   s_head, s_sequence, s_quality = struct.unpack(s_sizes, fh.read(struct.calcsize(s_sizes)))
   raw = fh.read(s_head + s_sequence + s_quality)
   # Split the payload back into its three byte strings.
   head, sequence, quality = struct.unpack('%ds%ds%ds' % (s_head, s_sequence, s_quality), raw)

Paired FASTQ files
------------------

When sequencing short fragments, a number of technologies can sequence
both ends of a fragment, or the same ends (in the 3'/5' sense) of a
double-stranded fragment. This is called paired-end sequencing (the most
common) or mate-pair sequencing.

When the sequences for the two ends of the fragments are in two different
files, the files can be read together as a pair:

.. code-block:: python

   from ngs_plumbing import fastq
   fn_1 = 'myreads_s_1_1.fq'
   fn_2 = 'myreads_s_1_2.fq'
   fq_pair = fastq.FastqFilePair(fn_1, fn_2)
   entry_1, entry_2 = next(fq_pair)
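
To make the idea of reading two files in lock-step concrete, here is a
minimal plain-Python sketch. It is not how ``FastqFilePair`` is implemented:
the helper names ``read_fastq`` and ``read_pairs`` are hypothetical, the reads
are made up, and the parser assumes the common layout of exactly four lines
per entry (no wrapped sequences).

.. code-block:: python

   import io
   from itertools import zip_longest


   def read_fastq(handle):
       """Yield (header, sequence, quality) from a 4-lines-per-entry FASTQ stream."""
       while True:
           line = handle.readline()
           if not line:
               return  # end of file
           header = line.rstrip('\r\n')
           sequence = handle.readline().rstrip('\r\n')
           separator = handle.readline().rstrip('\r\n')
           quality = handle.readline().rstrip('\r\n')
           if not header.startswith('@') or not separator.startswith('+'):
               raise ValueError('Malformed FASTQ entry: %r' % header)
           if len(sequence) != len(quality):
               raise ValueError('Sequence/quality length mismatch: %r' % header)
           yield header[1:], sequence, quality


   def read_pairs(handle_1, handle_2):
       """Yield (entry_1, entry_2); fail if one stream ends before the other."""
       missing = object()
       zipped = zip_longest(read_fastq(handle_1), read_fastq(handle_2),
                            fillvalue=missing)
       for entry_1, entry_2 in zipped:
           if entry_1 is missing or entry_2 is missing:
               raise ValueError('Paired files have different numbers of entries')
           yield entry_1, entry_2


   # In-memory stand-ins for two paired files (made-up reads).
   reads_1 = io.StringIO('@frag1/1\nACGT\n+\nIIII\n@frag2/1\nGGCA\n+\nIIHH\n')
   reads_2 = io.StringIO('@frag1/2\nTTGA\n+\nHHII\n@frag2/2\nCCAT\n+\nIIII\n')
   pairs = list(read_pairs(reads_1, reads_2))

Using ``zip_longest`` with a sentinel rather than plain ``zip`` means that a
truncated file raises an error instead of silently dropping unpaired reads.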