metaseq._genomic_signal.BamSignal

class metaseq._genomic_signal.BamSignal(fn)[source]

Bases: metaseq._genomic_signal.IntervalSignal

Class for operating on BAM files.

Methods

array(features[, processes, chunksize, ragged]) Creates an MxN NumPy array of genomic signal for the region defined by
genome() “genome” dictionary ready for pybedtools, based on the BAM header.
local_count(*args, **kwargs) The count of genomic signal (typcially BED features) found within an interval.
local_coverage(features, *args, **kwargs) Returns a binned vector of coverage.
mapped_read_count([force]) Counts total reads in a BAM file.

Methods

__init__(fn) Class for operating on BAM files.
array(features[, processes, chunksize, ragged]) Creates an MxN NumPy array of genomic signal for the region defined by
genome() “genome” dictionary ready for pybedtools, based on the BAM header.
local_count(*args, **kwargs) The count of genomic signal (typcially BED features) found within an interval.
local_coverage(features, *args, **kwargs) Returns a binned vector of coverage.
mapped_read_count([force]) Counts total reads in a BAM file.
__init__(fn)[source]

Class for operating on BAM files.

array(features, processes=None, chunksize=1, ragged=False, **kwargs)

Creates an MxN NumPy array of genomic signal for the region defined by each feature in features, where M=len(features) and N=(bins or feature length)

Parameters:

features : iterable of interval-like objects

An iterable of interval-like objects; see docstring for local_coverage method for more details.

processes : int or None

If not None, then create the array in parallel, giving each process chunks of length chunksize to work on.

chunksize : int

features will be split into chunksize pieces, and each piece will be given to a different process. The optimum value is dependent on the size of the features and the underlying data set, but chunksize=100 is a good place to start.

ragged : bool

If False (default), then return a 2-D NumPy array. This requires all rows to have the same number of columns, which you get when supplying bins or if all features are of uniform length. If True, then return a list of 1-D NumPy arrays

Notes

Additional keyword args are passed to local_coverage() which performs the work for each feature; see that method for more details.

genome()[source]

“genome” dictionary ready for pybedtools, based on the BAM header.

local_count(*args, **kwargs)

The count of genomic signal (typcially BED features) found within an interval.

Usually this only makes sense for BED or BAM (not bigWig) files.

Parameters:
  • feature – pybedtools.Interval object
  • stranded – If stranded=True, then only counts signal on the same strand as feature.
local_coverage(features, *args, **kwargs)

Returns a binned vector of coverage.

Computes a 1D vector of coverage at the coordinates for each feature in features, extending each read by fragmentsize bp.

Some arguments cannot be used for bigWig files due to the structure of these files. The parameters docstring below indicates whether or not an argument can be used with bigWig files.

Depending on the arguments provided, this method can return a vector containing values from a single feature or from concatenated features.

An example of the flexibility afforded by the latter case:

features can be a 3-tuple of pybedtools.Intervals representing (TSS + 1kb upstream, gene, TTS + 1kb downstream) and bins can be [100, 1000, 100]. This will return a vector of length 1200 containing the three genomic intervals binned into 100, 1000, and 100 bins respectively. Note that is up to the caller to construct the right axes labels in the final plot!
Parameters:

features : str, interval-like object, or list

Can be a single interval or an iterable yielding intervals.

Interval-like objects must have chrom, start, and stop attributes, and optionally a strand attribute. One exception to this that if features is a single string, it can be of the form “chrom:start-stop” or “chrom:start-stop[strand]”.

If features is a single interval, then return a 1-D array for that interval.

If features is an iterable of intervals, then return a 1-D array that is a concatenation of signal for these intervals.

Available for bigWig.

bins : None, int, list

If bins is None, then each value in the returned array will correspond to one bp in the genome.

If features is a single Interval, then bins is an integer or None.

If features is an iterable of Intervals, bins is an iterable of integers of the same length as features.

Available for bigWig.

fragment_size : None or int

If not None, then each item from the genomic signal (e.g., reads from a BAM file) will be extended fragment_size bp in the 3’ direction. Higher fragment sizes will result in smoother signal. Not available for bigWig.

shift_width : int

Each item from the genomic signal (e.g., reads from a BAM file) will be shifted shift_width bp in the 3’ direction. This can be useful for reconstructing a ChIP-seq profile, using the shift width determined from the peak-caller (e.g., modeled d in MACS). Not available for bigWig.

read_strand : None or str

If read_strand is one of “+” or “-”, then only items from the genomic signal (e.g., reads from a BAM file) on that strand will be considered and reads on the opposite strand ignored. Useful for plotting genomic signal for stranded libraries. Not available for bigWig.

stranded : bool

If True, then the profile will be reversed for features whose strand attribute is “-”.

use_score : bool

If True, then each bin will contain the sum of the score attribute of genomic features in that bin instead of the number of genomic features falling within each bin. Not available for bigWig.

accumulate : bool

If False, then only record that there was something there, rather than acumulating reads. This is useful for making matrices with called peaks. Available for bigWig.

preserve_total : bool

If True, re-scales the returned value so that each binned row’s total is equal to the sum of the original, un-binned data. The units of the returned array will be in “total per bin”. This is useful for, e.g., counting reads in features. If preserve_total is False, then the returned array will have units of “density”; this is more generally useful and is the default behavior. Available for bigWig, but not when using method=”ucsc_summarize”.

method : str; one of [ “summarize” | “get_as_array” | “ucsc_summarize” ]

Only used for bigWig. The method specifies how data are extracted from the bigWig file. “summarize” is the default. It’s quite fast, but may yield slightly different results when compared to running this same function on the BAM file from which the bigWig was created.

“summarize” uses bx-python. The values returned will not be exactly the same as the values returned when local_coverage is called on a BAM, BED, or bigBed file, but they will be close. This method is quite fast, and is the default when bins is not None.

“get_as_array” uses bx-python, but does a separate binning step. This can be slower than the other two methods, but the results are exactly the same as those from a BAM, BED, or bigBed file. This method is always used if bins=None.

“ucsc_summarize” is an alternative version of “summarize”. It uses the UCSC program bigWigSummary, which must already installed and on your path.

processes : int or None

The feature can be split across multiple processes.

Returns:

1-d NumPy array

Notes

If a feature has a “-” strand attribute, then the resulting profile will be relative to a minus-strand feature. That is, the resulting profile will be reversed.

Returns arrays x and y. x is in genomic coordinates, and y is the coverage at each of those coordinates after extending fragments.

The total number of reads is guaranteed to be the same no matter how it’s binned.

(with ideas from http://www-huber.embl.de/users/anders/HTSeq/doc/tss.html)

mapped_read_count(force=False)[source]

Counts total reads in a BAM file.

If a file self.bam + ‘.scale’ exists, then just read the first line of that file that doesn’t start with a “#”. If such a file doesn’t exist, then it will be created with the number of reads as the first and only line in the file.

The result is also stored in self._readcount so that the time-consuming part only runs once; use force=True to force re-count.

Parameters:

force : bool

If True, then force a re-count; otherwise use cached data if available.

This Page