Working with 1D intervals

In genomics, DNA is most often thought of as a one-dimensional structure: a sequence of bases (with the extra twist from microbiologists that the sequence can be circular).

Features on DNA are identified by their “geographical” location, that is, a pair of coordinates: a beginning and an end.

The idea of the module is to handle anything that follows the interval protocol, that is, anything with a begin and an end. The Interval class is a minimal implementation of such an object.
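As a rough illustration of what following the protocol means, the sketch below applies one helper to two unrelated objects that both expose begin and end. The names Span, Gene and length are hypothetical and are not part of the module; treating end as exclusive is also an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for illustration; not the module's Interval class.
Span = namedtuple("Span", ["begin", "end"])


class Gene:
    """Hypothetical feature class; it only needs begin and end attributes."""

    def __init__(self, name, begin, end):
        self.name = name
        self.begin = begin
        self.end = end


def length(iv):
    """Length of anything exposing begin and end (end assumed exclusive)."""
    return iv.end - iv.begin


# Both objects work, although they share no base class.
features = [Span(3, 10), Gene("geneA", 1, 7)]
lengths = [length(f) for f in features]
```

Nothing here checks types: the helper relies only on the two attributes being present, which is the point of a protocol.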

The second important concept in the module is that there are iterables of intervals, preferably ordered on their begin. The IntervalList is an implementation of such a structure as a Python list of intervals.

In-place sorting happens simply with:

>>> from ngs_plumbing.intervals import Interval, IntervalList
>>> itl = IntervalList(Interval(x, y) for x, y in ((3,10),(1,7),(12,16)))
>>> itl.sort()

Note

The fuss about sorting has a reason: for several operations, working on intervals sorted on their begin coordinate reduces the complexity to O(n) (to which the complexity of sorting should be added). Readers familiar with samtools will remember that there is a sort command.
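To make the complexity argument concrete, here is a minimal sketch of one such operation: testing whether any two intervals overlap. Once the intervals are sorted on begin, a single pass that remembers the furthest end seen so far is enough. Span and has_overlap are hypothetical names, and half-open coordinates are assumed.

```python
from collections import namedtuple
from operator import attrgetter

# Hypothetical stand-in for illustration; not the module's Interval class.
Span = namedtuple("Span", ["begin", "end"])


def has_overlap(intervals):
    """Return True if any two intervals overlap (half-open coordinates assumed)."""
    ordered = sorted(intervals, key=attrgetter("begin"))  # O(n log n)
    furthest_end = None
    for iv in ordered:  # single O(n) pass over the sorted intervals
        if furthest_end is not None and iv.begin < furthest_end:
            return True
        if furthest_end is None or iv.end > furthest_end:
            furthest_end = iv.end
    return False


overlapping = has_overlap([Span(3, 10), Span(1, 7), Span(12, 16)])
disjoint = has_overlap([Span(5, 8), Span(1, 3)])
```

Without the sort, the same question would require comparing every pair of intervals.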

Union of intervals in a list

Collapsing the intervals in an IntervalList means reducing each group of overlapping intervals to its outer coordinates (think of it as the union of the regions defined in the list):

>>> from ngs_plumbing.intervals import Interval, IntervalList
>>> itl = IntervalList(Interval(x, y) for x, y in ((3,10),(1,7),(12,16)))
>>> itl.sort() # collapse assumes sorting
>>> cf = IntervalList.collapse_iter(itl)
>>> cf = IntervalList(cf)
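The idea behind collapsing can be sketched independently of the module. The generator below is not the implementation of collapse_iter; it is a minimal illustration that assumes the input is sorted on begin and that touching intervals are merged as well, neither of which is confirmed here for the module itself. Span and collapse_sorted are hypothetical names.

```python
from collections import namedtuple

# Hypothetical stand-in for illustration; not the module's Interval class.
Span = namedtuple("Span", ["begin", "end"])


def collapse_sorted(intervals):
    """Yield merged spans from intervals already sorted on begin."""
    current = None
    for iv in intervals:
        if current is None:
            current = Span(iv.begin, iv.end)
        elif iv.begin <= current.end:
            # Overlapping (or touching): extend the current span if needed.
            current = Span(current.begin, max(current.end, iv.end))
        else:
            # Gap found: the current span is complete.
            yield current
            current = Span(iv.begin, iv.end)
    if current is not None:
        yield current


spans = sorted([Span(3, 10), Span(1, 7), Span(12, 16)])
merged = list(collapse_sorted(spans))
# merged == [Span(begin=1, end=10), Span(begin=12, end=16)]
```

With the three intervals used throughout this page, (1, 7) and (3, 10) merge into (1, 10), while (12, 16) is left untouched.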

Depth: how many times a base is covered by an interval

samtools (exposed to Python through the pysam package) has pileup. Again, we have here a more generic interface (a “protocol” in Python lingo) that will take anything that has a begin and an end.

>>> from ngs_plumbing.intervals import Interval, IntervalList
>>> itl = IntervalList(Interval(x, y) for x, y in ((3,10),(1,7),(12,16)))
>>> itl.sort() # depthfilter assumes sorting
>>> df = IntervalList.depthfilter_iter(itl, 2)
>>> itl_f = IntervalList(df)
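Depth itself can be computed with a sweep over interval endpoints. The sketch below reports the regions covered by at least a given number of intervals. Whether depthfilter_iter returns such regions or filters the original intervals is not stated on this page, so this is an illustration of the concept only; Span and regions_with_depth are hypothetical names, and half-open coordinates are assumed.

```python
from collections import namedtuple

# Hypothetical stand-in for illustration; not the module's Interval class.
Span = namedtuple("Span", ["begin", "end"])


def regions_with_depth(intervals, min_depth):
    """Yield regions covered by at least min_depth intervals."""
    # +1 at each begin, -1 at each end. Sorting the tuples places an end
    # before a begin at the same position, so touching intervals do not
    # count as overlapping.
    events = sorted(
        [(iv.begin, 1) for iv in intervals] + [(iv.end, -1) for iv in intervals]
    )
    depth = 0
    region_begin = None
    for position, delta in events:
        depth += delta
        if depth >= min_depth and region_begin is None:
            region_begin = position  # depth just reached the threshold
        elif depth < min_depth and region_begin is not None:
            yield Span(region_begin, position)  # depth dropped below it
            region_begin = None


spans = [Span(3, 10), Span(1, 7), Span(12, 16)]
deep = list(regions_with_depth(spans, 2))
# deep == [Span(begin=3, end=7)]
```

For the example intervals, only the stretch from 3 to 7 is covered twice, by (1, 7) and (3, 10).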
