.. _example_session: Example 1: Average ChIP-seq signal over promoters ================================================= This example demonstrates the use of `metaseq` for performing a common task when analyzing ChIP-seq data: what does transcription factor binding signal look like near transcription start sites? The IPython Notebook of this example can be found in the source directory (`doc/source/example_session.ipynb`) or at http://nbviewer.ipython.org/github/daler/metaseq/blob/master/doc/source/example\_session.ipynb The syntax of this document follows that of the IPython notebook. For example, calls out to the shell are prefixed with a "`!`\ "; if not then the code is run in Python. .. code:: python # Enable in-line plots for this example %matplotlib inline Example data ------------ This example uses data that can be downloaded from https://github.com/daler/metaseq-example-data. See that repository for details on how the data were prepared; here's how to download the prepared data: .. code:: python %%bash example_dir="metaseq-example" if [ -e $example_dir ]; then echo "already exists"; else mkdir -p $example_dir (cd $example_dir \ && wget --progress=dot:giga https://raw.githubusercontent.com/daler/metaseq-example-data/master/metaseq-example-data.tar.gz \ && tar -xzf metaseq-example-data.tar.gz \ && rm metaseq-example-data.tar.gz) fi .. parsed-literal:: --2015-08-06 11:11:20-- https://raw.githubusercontent.com/daler/metaseq-example-data/master/metaseq-example-data.tar.gz Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 23.235.46.133 Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|23.235.46.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 96655384 (92M) [application/octet-stream] Saving to: ‘metaseq-example-data.tar.gz’ 0K ........ ........ ........ ........ 34% 20.2M 3s 32768K ........ ........ ........ ........ 69% 13.6M 2s 65536K ........ ........ ........ .... 100% 31.7M=4.8s 2015-08-06 11:11:26 (19.1 MB/s) - ‘metaseq-example-data.tar.gz’ saved [96655384/96655384] .. code:: python # From now on, we can just use `data_dir` data_dir = 'metaseq-example/data' .. code:: python # What did we get? !ls $data_dir .. parsed-literal:: GSM847565_SL2585.gtf.gz GSM847565_SL2585.table GSM847566_SL2592.gtf.gz GSM847566_SL2592.table Homo_sapiens.GRCh37.66_chr17.gtf Homo_sapiens.GRCh37.66_chr17.gtf.db Homo_sapiens.GRCh37.66.gtf.gz wgEncodeHaibTfbsK562Atf3V0416101AlnRep1_chr17.bam wgEncodeHaibTfbsK562Atf3V0416101AlnRep1_chr17.bam.bai wgEncodeHaibTfbsK562Atf3V0416101RawRep1_chr17.bigWig wgEncodeHaibTfbsK562RxlchV0416101AlnRep1_chr17.bam wgEncodeHaibTfbsK562RxlchV0416101AlnRep1_chr17.bam.bai wgEncodeHaibTfbsK562RxlchV0416101RawRep1_chr17.bigWig - The `.bam` and `.bam.bai` files are from an ENCODE project ChIP-Seq experiment in the human erythroid K562 cell line for the ATF3 transcription factor and its associated input control. See the `ENCODE page `__ for details. - The `.bigWig` files are from the same experiment, downloaded from ENCODE - The GTF annotation files are downloaded from Ensembl (release 66) for hg19. - The `Homo_sapiens.GRCh37.66_chr17.gtf.db` file is a `gffutils` database that was prepared from the GTF file. - The `GSM*.gtf.gz` data are Cufflinks output from GEO accession GSM847565 and GSM847566. These were then parsed and reformatted into the corresponding `GSM*.table` files. Getting TSSes ------------- Our goal is to look at the ChIP-seq signal over transcription start sites (TSSes) of genes. Typically in this sort of analysis we start with annotations; here we're using the annotations from Ensembl. If we're lucky, TSSes will already be annotated. Failing that, perhaps 5'UTRs are annotated, so we could take the 5' end of the 5'UTR as the TSS. Let's see what the Ensembl data gives us. .. code:: python !head -n 3 $data_dir/Homo_sapiens.GRCh37.66_chr17.gtf .. parsed-literal:: chr17 protein_coding exon 30898 31270 . - . gene_id "ENSG00000187939"; transcript_id "ENST00000343572"; exon_number "1"; gene_name "DOC2B"; gene_biotype "protein_coding"; transcript_name "DOC2B-201"; chr17 protein_coding CDS 30898 31270 . - 0 gene_id "ENSG00000187939"; transcript_id "ENST00000343572"; exon_number "1"; gene_name "DOC2B"; gene_biotype "protein_coding"; transcript_name "DOC2B-201"; protein_id "ENSP00000343665"; chr17 protein_coding start_codon 31268 31270 . - 0 gene_id "ENSG00000187939"; transcript_id "ENST00000343572"; exon_number "1"; gene_name "DOC2B"; gene_biotype "protein_coding"; transcript_name "DOC2B-201"; GTF files have the feature type in the 3rd field. So what kind of featuretypes do we have here? .. code:: python !cut -f 3 $data_dir/Homo_sapiens.GRCh37.66_chr17.gtf | sort | uniq -c .. parsed-literal:: 34137 CDS 45801 exon 3355 start_codon 3265 stop_codon With only these featuretypes to work with, we would need to do the following to identify the TSS of each transcript: \* find all exons for the transcript \* sort the exons by start position \* if the transcript is on the "+" strand, TSS is the start position of the first exon \* if the transcript is on the "-" strand, TSS is the end position of the last exon Luckily, ``gffutils` `__ is able to infer transcripts and genes from a GTF file. The inferred transcripts and genes are already in the prepared `gffutils` database, at `$data_dir/Homo_sapiens.GRCh37.66_chr17.gtf.db`. First we connect to it: .. code:: python import os import gffutils db = gffutils.FeatureDB(os.path.join(data_dir, 'Homo_sapiens.GRCh37.66_chr17.gtf.db')) We'll use ``pybedtools` `__ for interval manipulation. Here we create a generator function that iterates through all annotated transcripts in the database. For each transcript, we convert it to a `pybedtools.Interval` and use the `TSS` function to give us the 1-bp position of the TSS, and save it as a new file. Here is a general usage pattern for `gffutils` and `pybedtools`: do the work in a generator function, and pass the generator to `pybedtools.BedTool`. This uses very little memory, and scales well to hundreds of thousands of features. .. code:: python import pybedtools from pybedtools.featurefuncs import TSS from gffutils.helpers import asinterval def tss_generator(): """ Generator function to yield TSS of each annotated transcript """ for transcript in db.features_of_type('transcript'): yield TSS(asinterval(transcript), upstream=1, downstream=0) # A BedTool made out of a generator, and saved to file. tsses = pybedtools.BedTool(tss_generator()).saveas('tsses.gtf') Now that we have a TSS file, we can modify it in different ways. Maybe we want to look at TSS +/- 1kb. Or 5kb. Or just 3kb upstream. For this example, let's use `pybedtools` to add 1kb to either side of the TSS. This uses the BEDTools `slop` routine; see the docs for that program for how to make changes to up/downstream distances. .. code:: python tsses_1kb = tsses.slop(b=1000, genome='hg19', output='tsses-1kb.gtf') Creating the arrays ------------------- `metaseq` works with the concepts of signal and windows. In this example, the signal is ChIP data, and the windows are TSS +/- 1kb. The first step is to create “genomic signal” objects out of the data. Since our example files are BAM files, we specify the kind=’bam’, but if you have your own data in a different format (bigWig, bigBed, BED, GFF, GTF, VCF) then specify that format instead (see :func:`metaseq.genomic_signal()`). We need to pass the filenames of the BAM files: .. code:: python import metaseq ip_signal = metaseq.genomic_signal( os.path.join(data_dir, 'wgEncodeHaibTfbsK562Atf3V0416101AlnRep1_chr17.bam'), 'bam') input_signal = metaseq.genomic_signal( os.path.join(data_dir, 'wgEncodeHaibTfbsK562RxlchV0416101AlnRep1_chr17.bam'), 'bam') Now we can create the arrays of signal over each window. Since this can be a time-consuming step, the first time this code is run it will cache the arrays on disk. The next time this code is run, it will be quickly loaded. Trigger a re-run by deleting the `.npz` file. Here, with the `BamSignal.array` method, we bin each promoter region into 100 bins, and calculate the signal in parallel across as many CPUs as are available. We do this for the IP signal and input signals separately. Then, since these are BAM files of mapped reads, we scale the arrays to the library size. The scaled arrays are then saved to disk, along with the windows that were used to create them. .. code:: python import multiprocessing processes = multiprocessing.cpu_count() if not os.path.exists('example.npz'): # The signal is the IP ChIP-seq BAM file. ip_array = ip_signal.array( # Look at signal over these windows tsses_1kb, # Bin signal into this many bins per window bins=100, # Use multiple CPUs. Dramatically speeds up run time. processes=processes) # Do the same thing for input. input_array = input_signal.array( tsses_1kb, bins=100, processes=processes) # Normalize to library size. The values in the array # will be in units of "reads per million mapped reads" ip_array /= ip_signal.mapped_read_count() / 1e6 input_array /= input_signal.mapped_read_count() / 1e6 # Cache to disk. The data will be saved as "example.npz" and "example.features". metaseq.persistence.save_features_and_arrays( features=tsses, arrays={'ip': ip_array, 'input': input_array}, prefix='example', link_features=True, overwrite=True) Loading the arrays ------------------ Now that we’ve saved to disk, at any time in the future we can load the data without having to regenerate them: .. code:: python features, arrays = metaseq.persistence.load_features_and_arrays(prefix='example') Let’s do some double-checks. .. code:: python # How many features? assert len(features) == 5708 # This ought to be exactly the same as the number of features in `tsses_1kb.gtf` assert len(features) == len(tsses_1kb) == 5708 # This shows that `arrays` acts like a dictionary assert sorted(arrays.keys()) == ['input', 'ip'] # This shows that the IP and input arrays have one row per feature, and one column per bin assert arrays['ip'].shape == (5708, 100) == arrays['input'].shape Line plot of average signal --------------------------- Now that we have NumPy arrays of signal over windows, there’s a lot we can do. One easy thing is to simply plot the mean signal of IP and of input. Let’s construct meaningful values for the x-axis, from -1000 to +1000 over 100 bins. We'll do this with a NumPy array. .. code:: python import numpy as np x = np.linspace(-1000, 1000, 100) Then plot, using standard ``matplotlib` `__ commands: .. code:: python # Import plotting tools from matplotlib import pyplot as plt # Create a figure and axes fig = plt.figure() ax = fig.add_subplot(111) # Plot the IP: ax.plot( # use the x-axis values we created x, # axis=0 takes the column-wise mean, so with # 100 columns we'll have 100 means to plot arrays['ip'].mean(axis=0), # Make it red color='r', # Label to show up in legend label='IP') # Do the same thing with the input ax.plot( x, arrays['input'].mean(axis=0), color='k', label='input') # Add a vertical line at the TSS, at position 0 ax.axvline(0, linestyle=':', color='k') # Add labels and legend ax.set_xlabel('Distance from TSS (bp)') ax.set_ylabel('Average read coverage (per million mapped reads)') ax.legend(loc='best'); .. image:: example_session_files/example_session_30_0.png Adding a heatmap ---------------- Let's work on improving this plot, one step at a time. We don't really know if this average signal is due to a handful of really strong peaks, or if it's moderate signal over many peaks. So one improvement would be to include a heatmap of the signal over all the TSSs. First, let's create a single normalized array by subtracting input from IP: .. code:: python normalized_subtracted = arrays['ip'] - arrays['input'] `metaseq` comes with some helper functions to simplify this kind of plotting. The `metaseq.plotutils.imshow` function is one of these; here the arguments are described: .. code:: python # Tweak some font settings so the results look nicer plt.rcParams['font.family'] = 'Arial' plt.rcParams['font.size'] = 10 # the metaseq.plotutils.imshow function does a lot of work, # we just have to give it the right arguments: fig = metaseq.plotutils.imshow( # The array to plot; here, we've subtracted input from IP. normalized_subtracted, # X-axis to use x=x, # Change the default figure size to something smaller for this example figsize=(3, 7), # Make the colorbar limits go from 5th to 99th percentile. # `percentile=True` means treat vmin/vmax as percentiles rather than # actual values. percentile=True, vmin=5, vmax=99, # Style for the average line plot (black line) line_kwargs=dict(color='k', label='All'), # Style for the +/- 95% CI band surrounding the # average line (transparent black) fill_kwargs=dict(color='k', alpha=0.3), ) .. image:: example_session_files/example_session_35_0.png .. code:: python print "asdf" .. parsed-literal:: asdf Sorting the array ----------------- The array is not very meaningful as currently sorted. We can adjust the sorting this either by re-ordering the array before plotting, or using the `sort_by` kwarg when calling `metaseq.plotutils.imshow`. Let's sort the rows by their mean value: .. code:: python fig = metaseq.plotutils.imshow( # These are the same arguments as above. normalized_subtracted, x=x, figsize=(3, 7), vmin=5, vmax=99, percentile=True, line_kwargs=dict(color='k', label='All'), fill_kwargs=dict(color='k', alpha=0.3), # This is new: sort by mean signal sort_by=normalized_subtracted.mean(axis=1) ) .. image:: example_session_files/example_session_38_0.png We can use any number of arbitrary sorting methods. For example, this sorts the rows by the position of the highest signal in the row. Note that the line plot, which is the column-wise average, remains unchanged since we're still using the same data. The rows are just sorted differently. .. code:: python fig = metaseq.plotutils.imshow( # These are the same arguments as above. normalized_subtracted, x=x, figsize=(3, 7), vmin=5, vmax=99, percentile=True, line_kwargs=dict(color='k', label='All'), fill_kwargs=dict(color='k', alpha=0.3), # This is new: sort by mean signal sort_by=np.argmax(normalized_subtracted, axis=1) ) .. image:: example_session_files/example_session_40_0.png Customizing the axes styles --------------------------- Let's go back to the sorted-by-mean version. .. code:: python fig = metaseq.plotutils.imshow( normalized_subtracted, x=x, figsize=(3, 7), vmin=5, vmax=99, percentile=True, line_kwargs=dict(color='k', label='All'), fill_kwargs=dict(color='k', alpha=0.3), sort_by=normalized_subtracted.mean(axis=1) ) .. image:: example_session_files/example_session_42_0.png Now we'll make some tweaks to the plot. The figure returned by `metaseq.plotutils.imshow` has attributes `array_axes`, `line_axes`, and `cax`, which can be used as an easy way to get handles to the axes for further configuration. Let's make some additional tweaks: .. code:: python # "line_axes" is our handle for working on the lower axes. # Add some nicer labels. fig.line_axes.set_ylabel('Average enrichment'); fig.line_axes.set_xlabel('Distance from TSS (bp)'); # "array_axes" is our handle for working on the upper array axes. # Add a nicer axis label fig.array_axes.set_ylabel('Transcripts on chr17') # Remove the x tick labels, since they're redundant # with the line axes fig.array_axes.set_xticklabels([]) # Add a vertical line to indicate zero in both the array axes # and the line axes fig.array_axes.axvline(0, linestyle=':', color='k') fig.line_axes.axvline(0, linestyle=':', color='k') fig.cax.set_ylabel("Enrichment") fig .. image:: example_session_files/example_session_44_0.png Integrating with RNA-seq expression data ======================================== Often we want to compare ChIP-seq data with RNA-seq data. But RNA-seq data typically is presented as gene ID, while ChIP-seq data is presented as genomic coords. These can be tricky to reconcile. We will use example data from ATF3 knockdown experiments them to subset the ChIP signal by those TSSs that were affected by knockdown and those that were not. This example uses pre-processed data downloaded from GEO. We'll use a simple (and naive) 2-fold cutoff to identify transcripts that went up, down, or were unchanged upon ATF3 knockdown. In real-world analysis, you'd probaby have a table from DESeq2 or edgeR analysis that you would use instead. RNA-seq data wrangling: loading data ------------------------------------ The `metaseq.results_table` module has tools for working with this kind of data (for example, the `metaseq.results_table.DESeq2Results` class). Here, we will make a generic `ResultsTable` which handles any kind of tab-delimited data. It's important to specify the index column. This is the column that contains the transcript IDs in these files. .. code:: python from metaseq.results_table import ResultsTable control = ResultsTable( os.path.join(data_dir, 'GSM847565_SL2585.table'), import_kwargs=dict(index_col=0)) knockdown = ResultsTable( os.path.join(data_dir, 'GSM847566_SL2592.table'), import_kwargs=dict(index_col=0)) `metaseq.results_table.ResultsTable` objects are wrappers around `pandas.DataFrame` objects, so if you already know `pandas` you know how to manipulate these objects. The `pandas.DataFrame` is always available as the `data` attribute. Here are the first 5 rows of the `control` object, which show that the index is `id`, which are Ensembl transcript IDs, and there are two columns, `score` and `fpkm`: .. code:: python # --------------------------------------------------------- # Inspect results to see what we're working with print len(control.data) control.data.head() .. parsed-literal:: 85699 .. raw:: html
score fpkm
id
ENST00000456328 108.293111 1.118336
ENST00000515242 87.233019 0.830617
ENST00000518655 175.175609 2.367682
ENST00000473358 343.232679 9.795265
ENST00000408384 0.000000 0.000000
RNA-seq data wrangling: aligning RNA-seq data with ChIP-seq data ---------------------------------------------------------------- We should ensure that `control` and `knockdown` have their transcript IDs in the same order as the rows in the heatmap array, and that they only contain transcript IDs from chr17. The `ResultsTable.reindex_to` method is very useful for this -- it takes a `pybedtools.BedTool` object and re-indexes the underlying dataframe so that the order of the dataframe matches the order of the features in the file. In this way we can re-align RNA-seq data to ChIP-seq data for more direct comparison. Remember the `tsses_1kb` object that we used to create the array? That defined the order of the rows in the array. We can use that to re-index the dataframes. Let's look at the first line from that file to see how the transcript ID information is stored: .. code:: python # --------------------------------------------------------- # Inspect the GTF file originally used to create the array print tsses_1kb[0] .. parsed-literal:: chr17 gffutils_derived transcript 37025255 37027255 . + . transcript_id "ENST00000318008"; gene_id "ENSG00000002834"; The Ensembl transcript ID is stored in the `transcript_id` field of the GTF attributes: :: transcript_id "ENST00000318008"; gene_id "ENSG00000002834"; The `ResultsTable` is indexed by transcript ID. Note that DESeq2 and edgeR results are typically indexed by gene, rather than trancscript, ID. So when working with your own data, be sure to select the GTF attribute whose values will be found in the `ResultsTable` index. Here, we tell the `ResultsTable.reindex_to` method which attribute it should pay attention to when realigning the data: .. code:: python # --------------------------------------------------------- # Re-align the ResultsTables to match the GTF file control = control.reindex_to(tsses, attribute='transcript_id') knockdown = knockdown.reindex_to(tsses, attribute='transcript_id') Note that we now have a different order -- the first 5 rows are now different compared to when we checked before. Also, the number of rows in the table has decreased dramatically. Recall that `tsses_1kb` only contained features from chr17. The original data table had all transcripts. By reindexing the table to match the `tsses_1kb`, we lose all of the non-chr17 transcripts. .. code:: python print len(control) control.data.head() .. parsed-literal:: 5708 .. raw:: html
score fpkm
id
ENST00000318008 433.958279 19.246250
ENST00000419929 NaN NaN
ENST00000433206 40.938322 0.328118
ENST00000435347 450.179142 21.655531
ENST00000443937 451.761068 21.905318
Also note that second transcript, with NaN values. It turns out that transcript was not in the original RNA-seq results data table: .. code:: python original_control = ResultsTable( os.path.join(data_dir, 'GSM847565_SL2585.table'), import_kwargs=dict(index_col=0)) 'ENST00000419929' in original_control.data.index .. parsed-literal:: False This may be because the experiment from GEO used something other than Ensembl annotations when running the analysis. It's actually not clear from the GEO entry what they used. Anyway, in order to make sure the rows in the table match the rows in the array, NaNs are added as values. Let's do some double-checks to make sure things are set up correctly: .. code:: python # Everything should be the same length assert len(control.data) == len(knockdown.data) == len(tsses_1kb) == 5708 # Spot-check some values to make sure the GTF file and the DataFrame match up. assert tsses[0]['transcript_id'] == control.data.index[0] assert tsses[100]['transcript_id'] == control.data.index[100] assert tsses[5000]['transcript_id'] == control.data.index[5000] RNA-seq data wrangling: join control and knockdown data ------------------------------------------------------- Now for some more data-wrangling. We'll use basic ``pandas` `__ operations to merge the control and knockdown data together into a single table. We'll also create a new log2foldchange column. .. code:: python # Join the dataframes and create a new pandas.DataFrame. data = control.data.join(knockdown.data, lsuffix='_control', rsuffix='_knockdown') # Add a log2 fold change variable data['log2foldchange'] = np.log2(data.fpkm_knockdown / data.fpkm_control) data.head() .. raw:: html
score_control fpkm_control score_knockdown fpkm_knockdown log2foldchange
id
ENST00000318008 433.958279 19.246250 386.088132 13.529179 -0.508503
ENST00000419929 NaN NaN NaN NaN NaN
ENST00000433206 40.938322 0.328118 181.442415 2.517192 2.939529
ENST00000435347 450.179142 21.655531 436.579186 19.617419 -0.142600
ENST00000443937 451.761068 21.905318 431.172759 18.859090 -0.216021
We can investigate some basic stats: .. code:: python # --------------------------------------------------------- # How many transcripts on chr17 changed expression? print "up:", sum(data.log2foldchange > 1) print "down:", sum(data.log2foldchange < -1) .. parsed-literal:: up: 735 down: 514 Integrating RNA-seq data with the heatmap ----------------------------------------- Let's return to the heatmap. In addition to the average coverage line we already have, we'd like to add additional lines in another panel. The `metaseq.plotutils.imshow` function is very flexible, and uses `matplotlib.gridspec` for organizing the axes. This means we can ask for an additional axes by overriding the default `height_ratios` tuple, using `(3, 1, 1)`. This says to make 3 axes, where the first one is 3x the height of the other two. .. code:: python fig = metaseq.plotutils.imshow( # Same as before... normalized_subtracted, x=x, figsize=(3, 7), vmin=5, vmax=99, percentile=True, line_kwargs=dict(color='k', label='All'), fill_kwargs=dict(color='k', alpha=0.3), sort_by=normalized_subtracted.mean(axis=1), # Default was (3,1); here we add another number height_ratios=(3, 1, 1) ) # `fig.gs` contains the `matplotlib.gridspec.GridSpec` object, # so we can now create the new axes. bottom_axes = plt.subplot(fig.gs[2, 0]) .. image:: example_session_files/example_session_66_0.png The `metaseq.plotutils.ci_plot` function takes an array and plots the mean signal +/- 95% CI bands. This was actually called automatically before for our line plot of average signal across all TSSes. Now, let's create a custom plot that separates TSSes into up, down, and unchanged in the ATF3 knockdown. Importantly, since we've aligned the RNA-seq data table and the array, we can calculate subsets in the RNA-seq data (as boolean indexes) and use those same indexes into the array itself. For clarity, let's split up each step separately for the upregulated genes. .. code:: python # This is a pandas.Series, True where the log2foldchange was >1 upregulated = (data.log2foldchange > 1) upregulated .. parsed-literal:: id ENST00000318008 False ENST00000419929 False ENST00000433206 True ENST00000435347 False ENST00000443937 False ENST00000359238 False ENST00000393405 True ENST00000439357 False ENST00000452859 True ENST00000003834 True ENST00000379061 False ENST00000457710 False ENST00000003607 False ENST00000540200 True ENST00000158166 True ... ENST00000562182 False ENST00000564549 False ENST00000566140 False ENST00000566930 False ENST00000567452 False ENST00000569893 False ENST00000569279 False ENST00000565271 False ENST00000567351 False ENST00000569284 False ENST00000569543 False ENST00000565120 False ENST00000562555 False ENST00000570002 False ENST00000565472 False Name: log2foldchange, Length: 5708, dtype: bool .. code:: python # This gets us the underlying boolean NumPy array which we # can use to subset the array index = upregulated.values index .. parsed-literal:: array([False, False, True, ..., False, False, False], dtype=bool) .. code:: python # This is the subset of the array where the TSS of the transcript # went up in the ATF3 knockdown upregulated_chipseq_signal = normalized_subtracted[index, :] upregulated_chipseq_signal .. parsed-literal:: array([[ 1.03915645, -1.84141782, 0.03746102, ..., -1.84141782, 3.11746936, 3.11746936], [-1.84141782, 2.07831291, 0. , ..., 1.03915645, 1.03915645, -2.88057427], [-2.88057427, 2.07831291, 2.07831291, ..., 0. , 1.03915645, -1.84141782], ..., [ 1.03915645, -1.84141782, 1.27605155, ..., 0. , 0. , -2.88057427], [ 0. , -2.88057427, 0. , ..., -0.80226136, 1.86838231, 4.15662582], [ 0. , 0. , 0. , ..., -1.84141782, -1.84141782, -8.64172281]]) .. code:: python # We can combine the above steps into the following: subset = normalized_subtracted[(data.log2foldchange > 1).values, :] Now we just use the same technique for the up, down, and unchanged transcripts. Each one of them gets passed to the `ci_plot` method, which plots the line in the color we specify (`line_kwargs`, `fill_kwargs`) on the axes we specify (`bottom_axes`). .. code:: python # Signal over TSSs of transcripts that were activated upon knockdown. metaseq.plotutils.ci_plot( x, normalized_subtracted[(data.log2foldchange > 1).values, :], line_kwargs=dict(color='#fe9829', label='up'), fill_kwargs=dict(color='#fe9829', alpha=0.3), ax=bottom_axes) # Signal over TSSs of transcripts that were repressed upon knockdown metaseq.plotutils.ci_plot( x, normalized_subtracted[(data.log2foldchange < -1).values, :], line_kwargs=dict(color='#8e3104', label='down'), fill_kwargs=dict(color='#8e3104', alpha=0.3), ax=bottom_axes) # Signal over TSSs tof transcripts that did not change upon knockdown metaseq.plotutils.ci_plot( x, normalized_subtracted[((data.log2foldchange >= -1) & (data.log2foldchange <= 1)).values, :], line_kwargs=dict(color='.5', label='unchanged'), fill_kwargs=dict(color='.5', alpha=0.3), ax=bottom_axes); Finally, we do some cleaning up to make the figure look nicer (axes labels, legend, vertical lines at zero): .. code:: python # Clean up redundant x tick labels, and add axes labels fig.line_axes.set_xticklabels([]) fig.array_axes.set_xticklabels([]) fig.line_axes.set_ylabel('Average\nenrichement') fig.array_axes.set_ylabel('Transcripts on chr17') bottom_axes.set_ylabel('Average\nenrichment') bottom_axes.set_xlabel('Distance from TSS (bp)') fig.cax.set_ylabel('Enrichment') # Add the vertical lines for TSS position to all axes for ax in [fig.line_axes, fig.array_axes, bottom_axes]: ax.axvline(0, linestyle=':', color='k') # Nice legend bottom_axes.legend(loc='best', frameon=False, fontsize=8, labelspacing=.3, handletextpad=0.2) fig.subplots_adjust(left=0.3, right=0.8, bottom=0.05) fig .. image:: example_session_files/example_session_75_0.png We can save the figure to disk in different formats for manuscript preparation: .. code:: python fig.savefig('demo.png') fig.savefig('demo.svg') It appears that transcripts unchanged by ATF3 knockdown have the strongest ChIP signal. Transcripts that went up upon knockdown (that is, ATF3 normally represses them) had a slightly higher signal than those transcripts that went down (normally activated by ATF3). Interestingly, even though we used a crude cutoff of 2-fold for a single replicate, and we only looked at chr17, the direction of the relationship we see here -- where ATF3-repressed genes have a higher signal than ATF3-activated -- is consistent with ATF3's known repressive role. Extras ====== This section shows some examples of more advanced `metaseq` usage without as much explanatory text as above. More knowledge about `pandas`, `numpy`, and `matplotlib` are expected here. For further details, see the `metaseq` docs and source code for the functions used below. K-means clustering of ChIP-seq signal ------------------------------------- Note that K-means clustering is non-deterministic -- running it multiple times will give different clusters since the initial state is set randomly. .. code:: python # K-means input data should be normalized (mean=0, stddev=1) from sklearn import preprocessing X_scaled = preprocessing.scale(normalized_subtracted) k = 4 ind, breaks = metaseq.plotutils.new_clustered_sortind( # The array to cluster X_scaled, # Within each cluster, how the rows should be sorted row_key=np.mean, # How each cluster should be sorted cluster_key=np.median, # Number of clusters k=k) .. code:: python # Plot the heatmap again fig = metaseq.plotutils.imshow( normalized_subtracted, x=x, figsize=(3, 9), vmin=5, vmax=99, percentile=True, line_kwargs=dict(color='k', label='All'), fill_kwargs=dict(color='k', alpha=0.3), # A little tricky: `sort_by` expects values to sort by # (say, expression values). But we've pre-calculated # our actual sort index based on clusters, so we transform # it like this sort_by=np.argsort(ind), # This adds a "strip" axes on the right side, useful # for adding extra information. We'll add cluster color # codes here. strip=True, ) # De-clutter by hiding labels plt.setp( fig.strip_axes.get_yticklabels() + fig.strip_axes.get_xticklabels() + fig.array_axes.get_xticklabels(), visible=False) # fig.line_axes.set_ylabel('Average\nenrichement') fig.array_axes.set_ylabel('Transcripts on chr17') fig.strip_axes.yaxis.set_label_position('right') fig.strip_axes.set_ylabel('Cluster') fig.cax.set_ylabel('Enrichment') # Make colors import matplotlib cmap = matplotlib.cm.Spectral colors = cmap(np.arange(k) / float(k)) # This figure will contain average signal for each cluster fig2 = plt.figure(figsize=(10,3)) last_break = 0 cluster_number = 1 n_panel_rows = 1 n_panel_cols = k for color, this_break in zip(colors, breaks): if cluster_number == 1: sharex = None sharey = None else: sharex = fig2.axes[0] sharey = fig2.axes[0] ax = fig2.add_subplot( n_panel_rows, n_panel_cols, cluster_number, sharex=sharex, sharey=sharey) # The y position is somewhat tricky: the array was # displayed using matplotlib.imshow with the argument # `origin="lower"`, which means the row in the plot at y=0 # corresponds to the last row in the array (index=-1). # But the breaks are in array coordinates. So we convert # them by subtracting from the total array size. xpos = 0 width = 1 ypos = len(normalized_subtracted) - this_break height = this_break - last_break rect = matplotlib.patches.Rectangle( (xpos, ypos), width=width, height=height, color=color) fig.strip_axes.add_patch(rect) fig.array_axes.axhline(ypos, color=color, linewidth=2) chunk = normalized_subtracted[last_break:this_break] metaseq.plotutils.ci_plot( x, chunk, ax=ax, line_kwargs=dict(color=color), fill_kwargs=dict(color=color, alpha=0.3), ) ax.axvline(0, color='k', linestyle=':') ax.set_title('cluster %s\n(N=%s)' % (cluster_number, len(chunk))) cluster_number += 1 last_break = this_break .. image:: example_session_files/example_session_82_0.png .. image:: example_session_files/example_session_82_1.png Scatterplots of RNA-seq and ChIP-seq signal ------------------------------------------- More examples of integrating ChIP-seq and RNA-seq. This uses the `data` dataframe created above, which contains RNA-seq data aligned with the ChIP-seq array. .. code:: python # Convert to ResultsTable so we can take advantage of its # `scatter` method rt = ResultsTable(data) # Get the up/down regulated up = rt.log2foldchange > 1 dn = rt.log2foldchange < -1 # Go back to the ChIP-seq data and create a boolean array # that is True only for the top TSSes with the strongest # mean signal tss_means = normalized_subtracted.mean(axis=1) strongest_signal = np.zeros(len(tss_means)) == 1 strongest_signal[np.argsort(tss_means)[-25:]] = True rt.scatter( x='fpkm_control', y='fpkm_knockdown', xfunc=np.log1p, yfunc=np.log1p, genes_to_highlight=[ (up, dict(color='#da3b3a', alpha=0.8)), (dn, dict(color='#00748e', alpha=0.8)), (strongest_signal, dict(color='k', s=50, alpha=1)), ], general_kwargs=dict(marker='.', color='0.5', alpha=0.2, s=5), one_to_one=dict(color='r', linestyle=':') ); .. image:: example_session_files/example_session_84_0.png .. code:: python # Perhaps a better analysis would be to plot average # ChIP-seq signal vs log2foldchange directly. In an imaginary # world where biology is simple, we might expect TSSes with stronger # log2foldchange upon knockdown to have stronger ChIP-seq signal # in the control. # # To take advantage of the `scatter` method of ResultsTable objects, # we simply add the TSS signal means as another variable in the # dataframe. Then we can refer to it by name in `scatter`. # # We'll also use the same colors and genes to highlight from # above. rt.data['tss_means'] = tss_means rt.scatter( x='log2foldchange', y='tss_means', genes_to_highlight=[ (up, dict(color='#da3b3a', alpha=0.8)), (dn, dict(color='#00748e', alpha=0.8)), (strongest_signal, dict(color='k', s=50, alpha=1)), ], general_kwargs=dict(marker='.', color='0.5', alpha=0.2, s=5), yfunc=np.log2); .. image:: example_session_files/example_session_85_0.png