API: Handling Genomic Intervals

GenomicInterval

The GenomicInterval is effectively pyDNase’s way of storing a BED interval. There are three mandatory fields when creating a new GenomicInterval:

>>> import pyDNase
>>> interval = pyDNase.GenomicInterval("chr1",100,200)
>>> print interval
chr1        100     200     Unnamed1        0.0     +

class pyDNase.GenomicInterval(chrom, start, stop, label=0, score=0, strand='+')

Basic object which describes a region of the genome

__init__(chrom, start, stop, label=0, score=0, strand='+')

Initialization routine

Args:

chrom (str): the chromosome

start (int): the start of the interval

stop (int): the end of the interval

Kwargs:

label: The name of the interval (will be given an automatic name if none entered)

score (float): the score of the interval (default: 0)

strand (str): the strand the interval is on (default: “+”)

You might be wondering why this by itself is helpful. It isn’t, until you consider that you can collect multiple GenomicInterval instances in a GenomicIntervalSet.

GenomicIntervalSet

Often, one may be interested in querying cut information for large numbers of regions in the genome (all the DHSs, for example). We provide a basic way to organise BED files using a GenomicIntervalSet object.

>>> import pyDNase
>>> regions = pyDNase.GenomicIntervalSet("pyDNase/test/data/example.bed")
>>> print len(regions)  # How many regions are in the BED file?
1
>>> print regions
chr6        170863142       170863532       0       0.0     +

Iterating over or indexing the GenomicIntervalSet object returns GenomicInterval objects in their order of creation (so the order of the BED file, if the set was loaded from one). You can also sort by any attribute of the GenomicInterval. For example, to iterate by score:

>>> for i in sorted(regions,key=lambda x: x.score):
...     print i
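The same sorted pattern extends to other keys. The sketch below runs without pyDNase by using a hypothetical Interval namedtuple as a stand-in for GenomicInterval. Only the score attribute is confirmed by the example above; the other field names are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for GenomicInterval. Field names other than
# `score` are assumptions made for this sketch.
Interval = namedtuple("Interval", ["chrom", "start", "stop", "score"])

regions = [
    Interval("chr6", 170863142, 170863532, 0.0),
    Interval("chr10", 100000000, 200000000, 10.0),
    Interval("chr1", 100, 200, 5.0),
]

# Highest score first
by_score_desc = sorted(regions, key=lambda x: x.score, reverse=True)

# Widest interval first
by_width = sorted(regions, key=lambda x: x.stop - x.start, reverse=True)

# Chromosome name, then start position (plain string ordering, so
# "chr10" sorts before "chr6")
by_position = sorted(regions, key=lambda x: (x.chrom, x.start))
```

Because sorted returns a new list, the original order of the set is left untouched.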

The key point is that, as well as querying the BAMHandler for cuts using a string, we can also query it using a GenomicInterval object:

>>> reads = pyDNase.BAMHandler("pyDNase/test/data/example.bam")
>>> reads[regions[0]]                                                   #Note: I've truncated this output
{'+': array([1,0,0,0,1,11,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,2, ...]),
 '-': array([0,1,0,0,1,0 ,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,5,0,0, ...])}

For example, one could use this to efficiently calculate the total number of cuts in a DNase-seq dataset that fall within the intervals of a BED file:

>>> readcount = 0
>>> for interval in regions:
...     readcount += reads[interval]["+"].sum() + reads[interval]["-"].sum()
>>> print readcount
3119
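If you want the totals per strand rather than one combined number, the loop can be wrapped in a small helper. This is a minimal sketch, not part of pyDNase: count_cuts is a hypothetical function, and the stand-in data uses a plain dict of lists in place of a BAMHandler so that it runs anywhere. It assumes only what the example above shows, namely that indexing the reads object with an interval returns a dict with "+" and "-" entries.

```python
def count_cuts(reads, regions):
    """Return the total number of cuts on each strand across all regions.

    `reads` is anything that can be indexed by an interval and returns a
    dict with "+" and "-" sequences of per-base cut counts.
    """
    totals = {"+": 0, "-": 0}
    for interval in regions:
        cuts = reads[interval]  # look each interval up once
        for strand in totals:
            # builtin sum() works on lists and on numpy arrays alike
            totals[strand] += int(sum(cuts[strand]))
    return totals


# Stand-in data (hypothetical): region names mapped to per-base cut counts
fake_reads = {
    "regionA": {"+": [1, 0, 0, 2], "-": [0, 1, 0, 0]},
    "regionB": {"+": [0, 0, 3, 0], "-": [5, 0, 0, 1]},
}
totals = count_cuts(fake_reads, ["regionA", "regionB"])
readcount = totals["+"] + totals["-"]
```

With a real BAMHandler and GenomicIntervalSet, the call would presumably be count_cuts(reads, regions).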

We have overloaded the + operator, so you can directly add other GenomicIntervalSet or GenomicInterval objects, and you can delete intervals using the del keyword:

>>> print regions
chr6        170863142       170863532       0       0.0     +
>>> regions += pyDNase.GenomicInterval("chr10","100000000","200000000", "0", 10, "-")
>>> print regions
chr6        170863142       170863532       0       0.0     +
chr10       100000000       200000000       0       10.0    -
>>> del regions[0]
>>> print regions
chr10       100000000       200000000       0       10.0    -
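For readers curious how a container can support += and del in this way, here is a minimal sketch using Python’s special methods. IntervalSet is a hypothetical class written for this illustration and is not pyDNase’s implementation; plain tuples stand in for GenomicInterval objects.

```python
class IntervalSet(object):
    """Hypothetical minimal container; not pyDNase's implementation."""

    def __init__(self, intervals=None):
        self.intervals = list(intervals or [])

    def __iadd__(self, other):
        # Accept either another set or a single interval
        if isinstance(other, IntervalSet):
            self.intervals.extend(other.intervals)
        else:
            self.intervals.append(other)
        return self

    def __delitem__(self, index):
        # Called for `del regions[index]`
        del self.intervals[index]

    def __getitem__(self, index):
        return self.intervals[index]

    def __len__(self):
        return len(self.intervals)


regions = IntervalSet([("chr6", 170863142, 170863532)])
regions += ("chr10", 100000000, 200000000)    # add a single interval
regions += IntervalSet([("chr1", 100, 200)])  # add a whole set
del regions[0]                                # remove the chr6 interval
```

The sketch defines \_\_iadd\_\_ because the example above uses +=. A class that defines only \_\_add\_\_ also supports +=, since Python falls back to it.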
class pyDNase.GenomicIntervalSet(filename=None)

Container class which stores and allows manipulation of large numbers of GenomicInterval objects. Essentially a way of storing and sorting BED files.

__init__(filename=None)

Initialises the GenomicIntervalSet. You can optionally specify a BED file path to load the intervals from.

Kwargs:
filename (str): the path to a BED file to initialize the intervals with

If no filename is provided, the set will be empty.

loadBEDFile(filename)

Adds all the intervals in a BED file to this GenomicIntervalSet. We’re quite naughty here and allow some non-standard BED formats (along with the official one):

chrom chromStart chromEnd
chrom chromStart chromEnd strand
chrom chromStart chromEnd name score strand

Any whitespace (tabs or spaces) will be considered separators, so spaces in names cause a problem!

Note

If you don’t supply a strand, we infer that it’s +ve.
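To make the three layouts concrete, here is a minimal sketch of how a line in each format could be parsed. parse_bed_line is a hypothetical helper written for this illustration and is not pyDNase’s loader. In particular, raising ValueError for other column counts is an assumption of the sketch, not documented behaviour.

```python
def parse_bed_line(line, default_label=0):
    """Parse one BED line in a 3-, 4-, or 6-column layout.

    Returns (chrom, start, stop, label, score, strand).
    """
    fields = line.split()  # any run of tabs or spaces separates fields
    if len(fields) == 3:
        chrom, start, stop = fields
        label, score, strand = default_label, 0.0, "+"
    elif len(fields) == 4:
        chrom, start, stop, strand = fields
        label, score = default_label, 0.0
    elif len(fields) == 6:
        chrom, start, stop, label, score, strand = fields
    else:
        # Assumption of this sketch: reject anything else
        raise ValueError("unsupported column count: %d" % len(fields))
    return chrom, int(start), int(stop), label, float(score), strand


three = parse_bed_line("chr6\t170863142\t170863532")
four = parse_bed_line("chr1 100 200 -")
six = parse_bed_line("chr10\t100000000\t200000000\tpeak1\t10\t-")

# A name containing a space splits into two fields, giving 7 columns
try:
    parse_bed_line("chr1\t100\t200\tmy peak\t0\t+")
    name_with_space_ok = True
except ValueError:
    name_with_space_ok = False
```

The last case shows why spaces in names are a problem: once every run of whitespace is a separator, "my peak" can no longer be told apart from two separate columns.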

Args:
filename: the path to a BED file to load
Raises:
IOError

resizeRegions(toSize)

Resizes all GenomicIntervals to a specific size

Args:
toSize: an int of the size to resize all intervals to