.. _section-parsing:

Parsing
=======

.. module:: ngs_plumbing.parsing
   :synopsis: NGS data in files.

Bioinformatics, and parsing and storing collections of (flat) files... we
deserve better than that.

Yet another set of FASTQ / FASTA parsers
----------------------------------------

Yes, this is just another parser. The twist here is the attempt to change the
way the data is approached by establishing the following guiding principles:

- Do not care about the file format, but about the information to be
  extracted.
- Forget the sacrosanct atomic step `(flat) file -> tool -> (flat) file` and
  consider the chaining of data streams (of objects). Evaluation is lazy:
  what you need will be fetched when you need it.

The first principle is implemented in the function :func:`parsing.open`:

.. code-block:: python

   # We want to parse out data
   from ngs_plumbing import parsing

   filename = 'foo.fastq'
   # Open the file, guessing the file format from the extension
   sequence_data = parsing.open(filename)

.. note::

   :py:obj:`filename` could have been a FASTA file, compressed with `gzip`
   or `bzip2`, or a SAM/BAM file (although this is a little less tested at
   the moment), and it would have worked with the exact same call:

   .. code-block:: python

      filename = 'foo.fastq.gz'
      sequence_data = parsing.open(filename)

The resulting object :py:obj:`sequence_data` can be iterated through, and
data extracted.

.. image:: _static/pipeline_reads.png
   :width: 70%

For example, if we want the sequence and its quality:

.. code-block:: python

   # Iterate through sequence+quality entries
   seq_stream = sequence_data.iter_seqqual()

The iterator will return objects with attributes :attr:`sequence` and
:attr:`quality`. No matter how big the sequencing file is, the memory usage
remains low. This leads to the second principle: chaining and lazy evaluation
with iterators. The next entry in the iterator can be retrieved with
:func:`next`:
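The lazy-iteration idea can be sketched with a plain Python generator. This
is an illustration of the principle only, not the actual ngs_plumbing
implementation; the `Entry` namedtuple and the `iter_seqqual` function below
are invented for the example:

.. code-block:: python

   import io
   from collections import namedtuple

   # Illustrative stand-in for the entries yielded by ngs_plumbing
   Entry = namedtuple('Entry', ('header', 'sequence', 'quality'))

   def iter_seqqual(fh):
       # Yield one FASTQ record (4 lines) at a time; nothing is read
       # from the file until the caller asks for the next entry.
       while True:
           header = fh.readline().rstrip()
           if not header:
               return
           sequence = fh.readline().rstrip()
           fh.readline()  # skip the '+' separator line
           quality = fh.readline().rstrip()
           yield Entry(header, sequence, quality)

   fastq = io.BytesIO(b'@read1\nACGT\n+\nIIII\n'
                      b'@read2\nTTGA\n+\nHHHH\n')
   seq_stream = iter_seqqual(fastq)
   entry = next(seq_stream)
   # entry.sequence is b'ACGT', entry.quality is b'IIII'

Because `iter_seqqual` is a generator, memory usage stays constant no matter
how many records the file contains.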
.. code-block:: python

   entry = next(seq_stream)
   # length of the read
   len(entry.sequence)

.. note::

   Performance is reasonable for my needs: a 0.430 GB FASTQ file can be
   iterated through in 4 seconds on my development laptop (Core i7,
   Python 3.3). That is slower than a practical lower bound, which is
   counting the lines in the file with:

   .. code-block:: bash

      wc -l

   This runs in 0.13 s (after repeated access, so with the file buffered by
   the system). Using a `benchmark on Biostar`_, we calculate that we are
   only 6.7 times slower than the fastest C implementation (seqtk), which is
   not bad (~100 MB / second) considering that this is pure Python (with all
   the benefits of being usable from Python). The low-level interface in
   :mod:`biopython` (:class:`Bio.SeqIO.QualityIO.FastqGeneralIterator`)
   achieves similar performance to ours on that benchmark (I could not get
   the current biopython to install on Python 3.3, and so could not verify
   it). If speed is a bottleneck, get in touch.

.. _benchmark on Biostar: http://www.biostars.org/p/10353/

Pulling entries into a pipeline
-------------------------------

A common operation is to filter reads according to quality. We can just
implement a filter function (returning `True` if the read passes the filter,
`False` otherwise):

.. code-block:: python

   # quality bytes are ASCII codes (33-126), so the threshold is an
   # average ASCII code
   AVG_QUAL = 60

   def has_avgqual(x):
       # x.quality is bytes; summing bytes sums their ASCII codes
       quality = x.quality
       return (sum(quality) / len(quality)) > AVG_QUAL

   filter(has_avgqual, seq_stream)

Filtering on read lengths (useful on Ion Torrent data, for example) is
similar:

.. code-block:: python

   LENGTH = 150

   def is_long(x):
       return len(x.sequence) > LENGTH

   filter(is_long, seq_stream)

Another possible operation is to look for the barcode 'ATACAGAC', located
33 bases after the beginning of the read (yes, molecular biology tricks make
such, or even weirder, situations occur).
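The quality filter can be exercised without a real FASTQ file. In this
self-contained sketch, the `Entry` namedtuple and the two sample reads are
invented for the illustration; the useful fact is that in Python 3 iterating
over bytes yields integers, so `'I'` contributes 73 and `'#'` contributes 35
to the sum:

.. code-block:: python

   from collections import namedtuple

   # Illustrative stand-in for the entries yielded by ngs_plumbing
   Entry = namedtuple('Entry', ('sequence', 'quality'))

   AVG_QUAL = 60  # threshold on the average raw ASCII code

   def has_avgqual(x):
       # iterating over bytes yields integers in Python 3
       quality = x.quality
       return (sum(quality) / len(quality)) > AVG_QUAL

   reads = iter([Entry(b'ACGT', b'IIII'),    # average ASCII code 73
                 Entry(b'TTGA', b'####')])   # average ASCII code 35
   kept = list(filter(has_avgqual, reads))
   # only the first read passes the filter

In a real pipeline one would of course not call :func:`list`, which pulls
the whole stream; it is only used here to look at the result.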
.. code-block:: python

   def has_barcode(entry):
       # entry.sequence is bytes, so the prefix must be bytes too
       return entry.sequence.startswith(b'ATACAGAC', 33)

Now we can write the combined filter as:

.. code-block:: python

   barcoded_stream = filter(has_barcode,
                            filter(has_avgqual, seq_stream))

The evaluation is lazy: querying for the first good barcoded read will only
unwind the stream of sequences (checking the average quality and the presence
of a barcode) until the first such read is found. No intermediate flat files,
and the ability to quickly check interactively whether anything appears to be
coming out of it:

.. code-block:: python

   good_read = next(barcoded_stream)

.. note::

   When performance matters, combining simple filters into one function
   rather than chaining filters might make a difference.

Now let's imagine for a second that an alignment algorithm for short reads
is, or can be, accessed as a Python module (we shall call it
`groovyaligner`). This module could have a function `align()` that takes
Python objects with attributes :attr:`sequence` and :attr:`quality`:

.. code-block:: python

   import groovyaligner as ga

   mapping_stream = map(ga.align, barcoded_stream)

Now `mapping_stream` would be an iterator of alignments. Again, no flat file
involved, no parsing, and actually no computation done until the program
starts pulling elements one by one through the chain of iterators.

.. note::

   There is actually a "groovyaligner" in the works, to be included in this
   package.

This lets one inspect the final outcome of chained iterators without
completely computing all intermediate steps, something we believe to be
really valuable when working interactively:

.. code-block:: python

   from itertools import islice

   # an iterator over the first 10 mappings
   first_mappings = islice(mapping_stream, 0, 10)

Paired-end data
---------------

Paired-end data are often stored in two files. The iterators work just fine
with them.
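The laziness of such a chain can be verified with a stand-in aligner that
records when it is actually called. Everything below (`fake_align`, the toy
reads, the simplified barcode test) is invented for this illustration and is
not part of ngs_plumbing:

.. code-block:: python

   calls = []

   def fake_align(seq):
       # stand-in for groovyaligner.align(): record that work happened
       calls.append(seq)
       return ('aligned', seq)

   def reads():
       yield b'ATACAGACAAAA'
       yield b'CCCCCCCCCCCC'
       yield b'ATACAGACTTTT'

   def has_barcode(seq):
       # simplified: barcode at the start instead of at offset 33
       return seq.startswith(b'ATACAGAC')

   mapping_stream = map(fake_align, filter(has_barcode, reads()))
   assert calls == []          # nothing has been computed yet
   first = next(mapping_stream)
   assert len(calls) == 1      # only the first barcoded read was aligned

In Python 3, :func:`map` and :func:`filter` both return iterators, so
building the chain costs nothing; work only happens as elements are pulled.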
.. code-block:: python

   # We want to parse out data
   from ngs_plumbing import parsing

   filename_1 = 'foo_s_1_1.fastq'
   filename_2 = 'foo_s_1_2.fastq'
   # Open the files, guessing the file format from the extension
   sequence_pairs = zip(parsing.open(filename_1),
                        parsing.open(filename_2))

.. warning::

   With Python 2, :func:`zip` does not return an iterator but builds the
   list of *all* pairs at once. Use :func:`itertools.izip` instead.

.. note::

   Depending on the sample preparation, one might want to change the
   orientation of one of the reads in the pair, with a helper such as:

   .. code-block:: python

      def revcomp(x):
          # reverse complement of x.sequence
          ...

Getting the next pair is then:

.. code-block:: python

   pair = next(sequence_pairs)

Formats
-------

The handling of file formats is restricted to NGS data, and oriented toward
what we consider the most common operations with such data. For a complete
set of parsers covering practically all existing formats in bioinformatics,
there is :mod:`biopython`.

.. toctree::
   :maxdepth: 2

   fastq
   fasta