metaseq.results_table.ResultsTable

class metaseq.results_table.ResultsTable(data, db=None, import_kwargs=None)

Bases: object

Wrapper around a pandas.DataFrame that adds additional functionality.

The underlying pandas.DataFrame is always available with the data attribute.

Any attributes not explicitly in this class will be looked for in the underlying pandas.DataFrame.
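The attribute fallback described above is commonly implemented with `__getattr__`. The following is a minimal sketch of that general pattern, not metaseq's actual code: the class name `TableWrapper` and its `describe_wrapper` method are hypothetical, and a plain dict stands in for the pandas.DataFrame so the example runs without third-party packages.

```python
class TableWrapper(object):
    """Hypothetical wrapper illustrating attribute delegation."""

    def __init__(self, data):
        # The wrapped object is always reachable as .data
        self.data = data

    def __getattr__(self, name):
        # __getattr__ is only called when normal lookup fails, so
        # attributes defined on the wrapper itself take precedence.
        if name == "data":
            # Avoid infinite recursion if .data has not been set yet.
            raise AttributeError(name)
        return getattr(self.data, name)

    def describe_wrapper(self):
        return "wrapper around %s" % type(self.data).__name__


# A dict stands in for the DataFrame here (assumption for illustration).
wrapped = TableWrapper({"geneA": 1.5, "geneB": 0.2})
print(wrapped.describe_wrapper())  # defined on the wrapper
print(sorted(wrapped.keys()))      # not on the wrapper, so forwarded to .data
```

With this pattern, a method name that exists on both the wrapper and the wrapped object resolves to the wrapper's version.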

Parameters:

data : string or pandas.DataFrame

If string, assumes it’s a filename and calls pandas.read_table(data, **import_kwargs).

db : string or gffutils.FeatureDB

Optional database that can be used to generate features

import_kwargs : dict

These arguments will be passed to pandas.read_table() if data is a filename.

Methods

`TSS`([upstream, downstream])	Creates a BED/GFF file of the 5’ end of each feature represented in the table and returns the resulting pybedtools.BedTool object.
`TTS`([upstream, downstream])	Creates a BED/GFF file of the 3’ end of each feature represented in the table and returns the resulting pybedtools.BedTool object.
`align_with`(other)	Align the dataframe’s index with another.
`attach_db`(db)	Attach a gffutils.FeatureDB for access to features.
`copy`()
`features`([ignore_unknown])	Generator of features.
`five_prime`([upstream, downstream])	Creates a BED/GFF file of the 5’ end of each feature represented in the table and returns the resulting pybedtools.BedTool object.
`genes_in_common`(other)	Convenience method for getting the genes found in both dataframes.
`genes_with_peak`(peaks[, transform_func, ...])	Returns a boolean index of genes that have a peak nearby.
`radviz`(column_names[, transforms])	Radviz plot.
`reindex_to`(x[, attribute])	Returns a copy that only has rows corresponding to feature names in x.
`scatter`(x, y[, xfunc, yfunc, xscale, ...])	Do-it-all method for making annotated scatterplots.
`strip_unknown_features`()	Remove features not found in the gffutils.FeatureDB.
`three_prime`([upstream, downstream])	Creates a BED/GFF file of the 3’ end of each feature represented in the table and returns the resulting pybedtools.BedTool object.
`update`(dataframe)	Updates the current data with a new dataframe.

__init__(data, db=None, import_kwargs=None)

TSS(upstream=1, downstream=0)

Creates a BED/GFF file of the 5’ end of each feature represented in the table and returns the resulting pybedtools.BedTool object. Needs an attached database.

Parameters:

upstream, downstream : int

Number of base pairs to include upstream and downstream.

TTS(upstream=0, downstream=1)

Creates a BED/GFF file of the 3’ end of each feature represented in the table and returns the resulting pybedtools.BedTool object. Needs an attached database.

Parameters:

upstream, downstream : int

Number of basepairs up and downstream to include
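Because "upstream" and "downstream" depend on strand, the window around a 5’ or 3’ end is computed differently for plus- and minus-strand features. The sketch below shows one plausible way to do this. It is an assumption for illustration, not metaseq's implementation: the function name `end_window` is hypothetical, and coordinates are assumed to be 0-based and half-open (as in BED).

```python
def end_window(start, stop, strand, end="five_prime", upstream=1, downstream=0):
    """Return (start, stop) of a window around one end of a feature.

    Hypothetical helper; assumes 0-based, half-open coordinates.
    """
    if end not in ("five_prime", "three_prime"):
        raise ValueError("end must be 'five_prime' or 'three_prime'")

    # On the plus strand the 5' end is the smaller coordinate;
    # on the minus strand it is the larger one (and vice versa for 3').
    at_low_coord = (end == "five_prime") == (strand != "-")
    anchor = start if at_low_coord else stop

    # "Upstream" extends toward smaller coordinates on the plus strand
    # and toward larger coordinates on the minus strand.
    if strand == "-":
        left, right = downstream, upstream
    else:
        left, right = upstream, downstream

    # Clamp at zero so the window never starts before the chromosome.
    return max(anchor - left, 0), anchor + right


# 1 kb upstream of the 5' end of a gene at 1000-2000
print(end_window(1000, 2000, "+", "five_prime", upstream=1000, downstream=0))
print(end_window(1000, 2000, "-", "five_prime", upstream=1000, downstream=0))
# 1 bp downstream of the 3' end
print(end_window(1000, 2000, "+", "three_prime", upstream=0, downstream=1))
```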

align_with(other)

Align the dataframe’s index with another.

attach_db(db)

Attach a gffutils.FeatureDB for access to features.

Useful if you want to attach a db after this instance has already been created.

Parameters:

db : gffutils.FeatureDB

features(ignore_unknown=False)

Generator of features.

If a gffutils.FeatureDB is attached, returns a pybedtools.Interval for every feature in the dataframe’s index.

Parameters:

ignore_unknown : bool

If True, silently ignores features that are not found in the db.
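The effect of ignore_unknown can be illustrated with a small generator. This is a sketch under stated assumptions rather than the library's code: `iter_features` is a hypothetical name, a plain dict stands in for the gffutils.FeatureDB, and tuples stand in for pybedtools.Interval objects.

```python
def iter_features(index, db, ignore_unknown=False):
    """Yield the feature for each ID in `index`, looked up in `db`.

    Hypothetical stand-in: `db` is a dict mapping IDs to features.
    """
    for feature_id in index:
        try:
            yield db[feature_id]
        except KeyError:
            if ignore_unknown:
                # Skip IDs such as 'no_feature' that are not in the db.
                continue
            raise


# Toy database and table index (made-up values for illustration).
db = {"geneA": ("chr1", 100, 200, "+")}
index = ["geneA", "no_feature"]

print(list(iter_features(index, db, ignore_unknown=True)))

try:
    list(iter_features(index, db, ignore_unknown=False))
except KeyError as err:
    print("unknown feature:", err)
```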

five_prime(upstream=1, downstream=0)

Creates a BED/GFF file of the 5’ end of each feature represented in the table and returns the resulting pybedtools.BedTool object. Needs an attached database.

Parameters:

upstream, downstream : int

Number of base pairs to include upstream and downstream.

genes_in_common(other)

Convenience method for getting the genes found in both dataframes.

genes_with_peak(peaks, transform_func=None, split=False, intersect_kwargs=None, id_attribute='ID', *args, **kwargs)

Returns a boolean index of genes that have a peak nearby.

Parameters:

peaks : string or pybedtools.BedTool

If string, then assume it’s a filename to a BED/GFF/GTF file of intervals; otherwise use the pybedtools.BedTool object directly.

transform_func : callable

This function will be applied to each gene object returned by self.features(). Additional args and kwargs are passed to transform_func. For example, if you’re looking for peaks within 1kb upstream of TSSs, then pybedtools.featurefuncs.TSS would be a useful transform_func, and you could supply additional kwargs of upstream=1000 and downstream=0.

This function can return iterables of features, too. For example, you might want to look for peaks falling within the exons of a gene. In this case, transform_func should return an iterable of pybedtools.Interval objects. The only requirement is that the name field of any feature matches the index of the dataframe.

intersect_kwargs : dict

kwargs passed to pybedtools.BedTool.intersect.

id_attribute : str

The attribute in the GTF or GFF file that contains the id of the gene. For meaningful results to be returned, a gene’s ID must also be found in the index of the dataframe.

For GFF files, you’d typically use id_attribute="ID". For GTF files, you’d typically use id_attribute="gene_id".
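To make the idea of a boolean index built from a transform function concrete, here is a simplified sketch. It does not use pybedtools.BedTool.intersect as the real method does. Instead, plain tuples stand in for intervals, and the names `overlaps`, `genes_with_peak_sketch`, and `upstream_window` are hypothetical. The toy coordinates are made up for illustration.

```python
def overlaps(a, b):
    """True if two (chrom, start, stop) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def genes_with_peak_sketch(genes, peaks, transform_func=None, **kwargs):
    """Return one boolean per gene, in the same order as `genes`.

    genes: list of (gene_id, chrom, start, stop, strand) tuples
    peaks: list of (chrom, start, stop) tuples
    """
    mask = []
    for gene in genes:
        if transform_func is None:
            window = gene[1:4]  # use the gene body as-is
        else:
            # Extra kwargs are forwarded to the transform function.
            window = transform_func(gene, **kwargs)
        mask.append(any(overlaps(window, peak) for peak in peaks))
    return mask


def upstream_window(gene, upstream=1000):
    """Strand-aware window immediately upstream of the gene's 5' end."""
    gene_id, chrom, start, stop, strand = gene
    if strand == "-":
        return (chrom, stop, stop + upstream)
    return (chrom, max(start - upstream, 0), start)


genes = [
    ("geneA", "chr1", 5000, 6000, "+"),
    ("geneB", "chr1", 20000, 21000, "-"),
    ("geneC", "chr2", 5000, 6000, "+"),
]
peaks = [("chr1", 4500, 4600), ("chr1", 19000, 19100)]

mask = genes_with_peak_sketch(genes, peaks, upstream_window, upstream=1000)
print(mask)  # [True, False, False]
```

Only geneA has a peak within 1 kb upstream of its 5’ end. The peak near geneB lies on the wrong side of a minus-strand gene, which is why a strand-aware transform matters.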

radviz(column_names, transforms={}, **kwargs)

Radviz plot.

Useful for exploratory visualization, a radviz plot can show multivariate data in 2D. Conceptually, the variables (here, specified in column_names) are distributed evenly around the unit circle. Then each point (here, each row in the dataframe) is attached to each variable by a spring, where the stiffness of the spring is proportional to the value of corresponding variable. The final position of a point represents the equilibrium position with all springs pulling on it.

In practice, each variable is normalized to 0-1 (by subtracting the minimum and dividing by the range).

This is a very exploratory plot. The order of column_names will affect the results, so it’s best to try a couple different orderings. For other caveats, see [1].

Additional kwargs are passed to self.scatter, so subsetting, callbacks, and other configuration can be performed using options for that method (e.g., genes_to_highlight is particularly useful).

Parameters:

column_names : list

Which columns of the dataframe to consider. The columns provided should only include numeric data, and they should not contain any NaN, inf, or -inf values.

transforms : dict

Dictionary mapping column names to transformations that will be applied just for the radviz plot. For example, np.log1p is a useful function. If a column name is not in this dictionary, it will be used as-is.

ax : matplotlib.Axes

If not None, then plot the radviz on this axes. If None, then a new figure will be created.

kwargs : dict

Additional arguments are passed to self.scatter. Note that not all possible kwargs for self.scatter are necessarily useful for a radviz plot (for example, marginal histograms would not be meaningful).

Notes

This method adds two new variables to self.data: “radviz_x” and “radviz_y”. It then calls the self.scatter method, using these new variables.

The data transformation was adapted from the pandas.tools.plotting.radviz function.
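The spring analogy can be written out in a few lines. The following is a minimal sketch of the general radviz computation, not metaseq's implementation: the function name `radviz_coordinates` is hypothetical, and it assumes each column is min-max scaled to 0-1 before the weighted average of the anchor positions is taken.

```python
import math


def radviz_coordinates(rows):
    """Return an (x, y) position for each row.

    rows: list of equal-length lists of finite numbers, one per point.
    """
    n_vars = len(rows[0])
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]

    # One anchor per variable, spaced evenly around the unit circle.
    anchors = [
        (math.cos(2 * math.pi * i / n_vars), math.sin(2 * math.pi * i / n_vars))
        for i in range(n_vars)
    ]

    points = []
    for row in rows:
        # Min-max scale each value; constant columns contribute no pull.
        weights = [
            (value - low) / span if span else 0.0
            for value, low, span in zip(row, lows, spans)
        ]
        total = sum(weights)
        if total == 0:
            # No spring pulls on this point, so it rests at the center.
            points.append((0.0, 0.0))
            continue
        x = sum(w * a[0] for w, a in zip(weights, anchors)) / total
        y = sum(w * a[1] for w, a in zip(weights, anchors)) / total
        points.append((x, y))
    return points


rows = [[10, 0, 0], [0, 10, 0], [5, 5, 5], [0, 0, 10]]
points = radviz_coordinates(rows)
for row, point in zip(rows, points):
    print(row, "->", point)
```

A row dominated by one variable lands on that variable's anchor, while a row with equal values in every column lands at the center, which is why different column orderings can change the picture.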

References

[1] Hoffman, P.E. et al. (1997) DNA visual and analytic data mining. In: Proceedings of the IEEE Visualization. Phoenix, AZ, pp. 437-441.

[2] http://www.agocg.ac.uk/reports/visual/casestud/brunsdon/radviz.htm

[3] http://pandas.pydata.org/pandas-docs/stable/visualization.html#radviz

reindex_to(x, attribute='Name')

Returns a copy that only has rows corresponding to feature names in x.

Parameters:

x : str or pybedtools.BedTool

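BED, GFF, GTF, or VCF file (or the corresponding pybedtools.BedTool) in which the “Name” field (that is, the value returned by feature[‘Name’]), or any other attribute specified with attribute, holds the feature names to keep.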

attribute : str

Attribute containing the name of the feature to use as the index.

scatter(x, y, xfunc=None, yfunc=None, xscale=None, yscale=None, xlab=None, ylab=None, genes_to_highlight=None, label_genes=False, marginal_histograms=False, general_kwargs={'picker': True, 'alpha': 0.2, 'color': 'k'}, general_hist_kwargs=None, offset_kwargs={}, label_kwargs=None, ax=None, one_to_one=None, callback=None, xlab_prefix=None, ylab_prefix=None, sizefunc=None, hist_size=0.3, hist_pad=0.0, nan_offset=0.015, pos_offset=0.99, linelength=0.01, neg_offset=0.005, figure_kwargs=None)

Do-it-all method for making annotated scatterplots.

Parameters:

x, y : string

Variables to plot. Must be column names in the self.data DataFrame, for example “baseMeanA” and “baseMeanB”.

xfunc, yfunc : callable

Functions to apply to x and y, respectively. The default in the signature is None, which applies no transformation; pass a function such as a log2 transform to rescale the values.

xlab, ylab : string

Labels for the x and y axes. The default is to combine the function names of xfunc and yfunc with the variable names x and y, e.g., “log2(baseMeanA)”.

ax : None or Axes object

If ax=None, makes a new figure and returns the Axes object; otherwise, plots onto ax.

general_kwargs : dict

Kwargs for matplotlib.scatter; specifies how all points look

genes_to_highlight : list of (index, dict) tuples

Provides fine-grained control over colors. It is a list of (ind, kwargs) tuples, where each ind specifies genes to plot with kwargs. Each dictionary updates a copy of general_kwargs. If a kwargs dictionary has a “name” key, its value must be a list that’s the same length as ind. It will be used to label the genes in ind using label_kwargs.

callback : callable

Function to call upon clicking a point. Must accept a single argument which is the gene ID. Default is to print the gene name, but an example of another useful callback would be a mini-browser connected to a genomic_signal object from which the expression data were calculated.

one_to_one : None or dict

If not None, a dictionary of matplotlib.plot kwargs that will be used to plot a 1:1 line.

label_kwargs : dict

Kwargs for labeled genes (e.g., dict(style='italic')). Will only be used if an entry in genes_to_highlight has a name key.

offset_kwargs : dict

Kwargs to be passed to matplotlib.transforms.offset_copy, used for adjusting the positioning of gene labels in relation to the actual point.

xlab_prefix, ylab_prefix : str

Optional label prefix that will be added to the beginning of xlab and/or ylab.

hist_size : float

Size of marginal histograms

hist_pad : float

Spacing between marginal histograms

nan_offset, pos_offset, neg_offset : float

Offset, in units of “fraction of axes” for the NaN, +inf, and -inf “rug plots”

linelength : float

Line length for the rug plots
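The way each genes_to_highlight entry "updates a copy of general_kwargs" can be sketched as follows. This is an illustration of the merging rule only, under assumptions: `resolve_highlight_styles` is a hypothetical helper, and lists of gene IDs are used for ind, whereas the real method may accept other index types.

```python
def resolve_highlight_styles(general_kwargs, genes_to_highlight):
    """Return a list of (ind, style, names) tuples.

    Each style starts as a copy of general_kwargs and is then
    overridden by the per-group kwargs.
    """
    resolved = []
    for ind, kwargs in genes_to_highlight:
        style = dict(general_kwargs)  # copy, so shared defaults are not mutated
        style.update(kwargs)
        # "name" is used for labels, not for point styling.
        names = style.pop("name", None)
        if names is not None and len(names) != len(ind):
            raise ValueError("'name' must be the same length as ind")
        resolved.append((ind, style, names))
    return resolved


general_kwargs = {"picker": True, "alpha": 0.2, "color": "k"}
genes_to_highlight = [
    (["geneA", "geneB"], {"color": "r", "alpha": 1.0, "name": ["A", "B"]}),
    (["geneC"], {"color": "b"}),
]

resolved = resolve_highlight_styles(general_kwargs, genes_to_highlight)
for ind, style, names in resolved:
    print(ind, style, names)
```

Groups that set only a color still inherit the remaining defaults (here, picker and alpha) from general_kwargs.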

strip_unknown_features()

Remove features not found in the gffutils.FeatureDB. This will typically include ‘ambiguous’, ‘no_feature’, etc., but can also be useful if the database was created from a different annotation than the one used to create the table.

three_prime(upstream=0, downstream=1)

Creates a BED/GFF file of the 3’ end of each feature represented in the table and returns the resulting pybedtools.BedTool object. Needs an attached database.

Parameters:

upstream, downstream : int

Number of base pairs to include upstream and downstream.

update(dataframe)

Updates the current data with a new dataframe.

This extra step is required to get around the fancy pandas.DataFrame indexing (like .ix, .iloc, etc).
