API: Handling Genomic Intervals
GenomicInterval
The GenomicInterval is effectively pyDNase’s way of storing a BED interval. There are three mandatory fields when creating a new GenomicInterval:
>>> import pyDNase
>>> interval = pyDNase.GenomicInterval("chr1",100,200)
>>> print interval
chr1 100 200 Unnamed1 0.0 +
class pyDNase.GenomicInterval(chrom, start, stop, label=0, score=0, strand='+')
Basic object which describes a region of the genome.
__init__(chrom, start, stop, label=0, score=0, strand='+')
Initialization routine.
- Args:
chrom (str): the chromosome
start (int): the start of the interval
stop (int): the end of the interval
- Kwargs:
label: The name of the interval (will be given an automatic name if none entered)
score (float): the score of the interval (default: 0)
strand (str): the strand the interval is on (default: “+”)
You might be wondering why this by itself is helpful. It isn’t, until you consider that you can use collections of multiple GenomicInterval instances in a GenomicIntervalSet.
GenomicIntervalSet
Often, one may be interested in querying cut information for large numbers of regions in the genome (all the DHSs, for example). We provide a basic way to organise BED files using a GenomicIntervalSet object.
>>> import pyDNase
>>> regions = pyDNase.GenomicIntervalSet("pyDNase/test/data/example.bed")
>>> print len(regions) # How many regions are in the BED file?
1
>>> print regions
chr6 170863142 170863532 0 0.0 +
Iterating over or indexing a GenomicIntervalSet returns GenomicInterval objects in their order of creation (for a set loaded from a BED file, this is the order of the file). You can also sort by any other attribute of a GenomicInterval. For example, to iterate by score:
>>> for i in sorted(regions, key=lambda x: x.score):
...     print i
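The same sorting pattern works for other keys and orderings. The following is a minimal sketch that runs without pyDNase: `Interval` is a hypothetical stand-in built from a namedtuple, and of its field names only `score` is taken from the example above; `chrom`, `start` and `stop` are assumptions made for illustration.

```python
from collections import namedtuple
from operator import attrgetter

# Hypothetical stand-in for a GenomicInterval. Only the `score` attribute is
# confirmed by the documentation above; the other field names are assumed.
Interval = namedtuple("Interval", ["chrom", "start", "stop", "score"])

intervals = [
    Interval("chr1", 100, 200, 5.0),
    Interval("chr2", 50, 80, 1.5),
    Interval("chr1", 300, 450, 9.0),
]

# Ascending by score; attrgetter("score") is equivalent to lambda x: x.score.
by_score = sorted(intervals, key=attrgetter("score"))

# Descending by score.
by_score_desc = sorted(intervals, key=attrgetter("score"), reverse=True)

# Ascending by interval length, computed from the assumed start/stop fields.
by_length = sorted(intervals, key=lambda iv: iv.stop - iv.start)
```

Because `sorted` returns a new list, the original collection keeps its creation order.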
The key here is that, as well as querying the BAMHandler for cuts using a string, we can also query it using a GenomicInterval object:
>>> reads = pyDNase.BAMHandler("pyDNase/test/data/example.bam")
>>> reads[regions[0]] #Note: I've truncated this output
{'+': array([1,0,0,0,1,11,1,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,2, ...]),
'-': array([0,1,0,0,1,0 ,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,5,0,0, ...])}
For example, one could use this to count the total number of cuts in a DNase-seq dataset that fall within the intervals of a BED file:
>>> readcount = 0
>>> for interval in regions:
...     readcount += reads[interval]["+"].sum() + reads[interval]["-"].sum()
>>> print readcount
3119
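The counting loop above can be wrapped in a small helper. This sketch runs without pyDNase or a BAM file: `total_cuts` is a hypothetical function, and `fake_cuts` is made-up data that only mimics the shape of a query result (a mapping with "+" and "-" per-base cut counts). Plain lists are used here, so the built-in `sum` replaces the `.sum()` method of the arrays shown above.

```python
def total_cuts(cut_lookup, intervals):
    """Sum the cuts on both strands over every interval.

    `cut_lookup` is anything that can be indexed by an interval and returns
    a mapping with "+" and "-" sequences of per-base cut counts.
    """
    total = 0
    for interval in intervals:
        cuts = cut_lookup[interval]
        total += sum(cuts["+"]) + sum(cuts["-"])
    return total


# Made-up data keyed by (chrom, start, stop) tuples, for illustration only.
fake_cuts = {
    ("chr1", 0, 5): {"+": [1, 0, 0, 2, 0], "-": [0, 1, 0, 0, 0]},
    ("chr2", 10, 13): {"+": [0, 0, 3], "-": [4, 0, 0]},
}

total = total_cuts(fake_cuts, sorted(fake_cuts))
```

Overlapping intervals would be counted more than once by this approach, so it is best suited to BED files whose regions do not overlap.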
We have overloaded the + operator so that you can directly add other GenomicIntervalSet or GenomicInterval objects, and you can delete intervals using the del keyword:
>>> print regions
chr6 170863142 170863532 0 0.0 +
>>> regions += pyDNase.GenomicInterval("chr10","100000000","200000000", "0", 10, "-")
>>> print regions
chr6 170863142 170863532 0 0.0 +
chr10 100000000 200000000 0 10.0 -
>>> del regions[0]
>>> print regions
chr10 100000000 200000000 0 10.0 -
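For readers curious how a container can support `+=` and `del` in this way, here is a minimal sketch using Python's special methods. `Interval` and `IntervalSet` are hypothetical stand-ins written for this illustration; this is not pyDNase's actual implementation, and the attribute names are assumptions.

```python
class Interval(object):
    """Hypothetical stand-in for a genomic interval (illustration only)."""

    def __init__(self, chrom, start, stop, score=0.0, strand="+"):
        self.chrom = chrom
        self.start = int(start)
        self.stop = int(stop)
        self.score = float(score)
        self.strand = strand


class IntervalSet(object):
    """Hypothetical stand-in for a set of intervals (illustration only)."""

    def __init__(self, intervals=None):
        # Copy the input so that two sets never share the same list.
        self._intervals = list(intervals) if intervals is not None else []

    def __len__(self):
        return len(self._intervals)

    def __iter__(self):
        return iter(self._intervals)

    def __getitem__(self, index):
        return self._intervals[index]

    def __delitem__(self, index):
        # Backs the statement `del regions[0]`.
        del self._intervals[index]

    def __add__(self, other):
        # Backs `regions + other`. `regions += other` falls back to this
        # method because no __iadd__ is defined.
        if isinstance(other, IntervalSet):
            extra = list(other)
        elif isinstance(other, Interval):
            extra = [other]
        else:
            return NotImplemented
        return IntervalSet(self._intervals + extra)


regions = IntervalSet([Interval("chr6", 170863142, 170863532)])
regions += Interval("chr10", 100000000, 200000000, score=10, strand="-")
size_after_add = len(regions)
del regions[0]
remaining_chrom = regions[0].chrom
```

Returning `NotImplemented` for unsupported operands lets Python raise a `TypeError` instead of silently accepting the wrong type.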
class pyDNase.GenomicIntervalSet(filename=None)
Container class which stores and allows manipulation of large numbers of GenomicInterval objects. Essentially a way of storing and sorting BED files.
__init__(filename=None)
Initializes the GenomicIntervalSet. You can also specify a BED file path to load the intervals from.
- Kwargs:
- filename (str): the path to a BED file to initialize the intervals with
If no filename is provided, the set will be empty.
loadBEDFile(filename)
Adds all the intervals in a BED file to this GenomicIntervalSet. We’re quite naughty here and allow some non-standard BED formats (along with the official one):
chrom chromStart chromEnd
chrom chromStart chromEnd strand
chrom chromStart chromEnd name score strand
Any whitespace (tabs or spaces) will be considered a separator, so spaces in names cause a problem!
Note: If you don’t supply a strand, we infer that it’s +ve.
- Args:
- filename: the path to a BED file to load
- Raises:
- IOError
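The three accepted column layouts and the whitespace rule can be illustrated with a short parser. This is a sketch under stated assumptions, not the code pyDNase uses: `parse_bed_line` is a hypothetical helper, the returned dictionary keys are invented for the example, and the defaults for a missing name and score are guesses.

```python
def parse_bed_line(line):
    """Parse one BED line in a 3-, 4- or 6-column layout.

    Hypothetical helper for illustration; the defaults for a missing name
    and score are assumptions, while the default strand of "+" follows the
    note above.
    """
    fields = line.split()  # splits on any run of tabs or spaces
    name, score, strand = None, 0.0, "+"
    if len(fields) == 3:
        chrom, start, stop = fields
    elif len(fields) == 4:
        chrom, start, stop, strand = fields
    elif len(fields) == 6:
        chrom, start, stop, name, score, strand = fields
    else:
        raise ValueError("unsupported number of BED columns: %d" % len(fields))
    return {
        "chrom": chrom,
        "start": int(start),
        "stop": int(stop),
        "name": name,
        "score": float(score),
        "strand": strand,
    }


three = parse_bed_line("chr6\t170863142\t170863532")
four = parse_bed_line("chr6 170863142 170863532 -")
six = parse_bed_line("chr10 100 200 peak1 10 -")

# A name containing a space yields seven fields, so the line is rejected.
try:
    parse_bed_line("chr1 100 200 my peak 0 +")
    space_in_name_rejected = False
except ValueError:
    space_in_name_rejected = True
```

The last case shows why names containing spaces are a problem when every run of whitespace is treated as a separator: the name is split into two fields and the column count no longer matches any supported layout.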
resizeRegions(toSize)
Resizes all GenomicIntervals to a specific size.
- Args:
- toSize: an int of the size to resize all intervals to
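The description above does not say which point of the interval stays fixed during resizing. As one possible reading, the sketch below assumes that each interval is re-centred on its midpoint and given a total length of `toSize`. `resize_interval` is a hypothetical helper, not part of pyDNase, and the library's actual behaviour should be checked before relying on this interpretation.

```python
def resize_interval(start, stop, to_size):
    """Return (new_start, new_stop) of length `to_size`, centred on the
    midpoint of the original interval.

    Hypothetical helper: centring on the midpoint is an assumption, not
    documented behaviour of pyDNase.
    """
    midpoint = (start + stop) // 2
    new_start = midpoint - to_size // 2
    new_stop = new_start + to_size
    return new_start, new_stop


shrunk = resize_interval(100, 200, 50)
grown = resize_interval(100, 200, 300)
```

Note that growing an interval near the start of a chromosome can produce a negative start coordinate under this scheme, so a real implementation may need to clamp the result.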