pyDNase Analysis Scripts

pyDNase installs several scripts which also serve as examples of how to use the pyDNAse API. After you have installed pyDNase these will all be installed into your $PATH, so you can run these directly from your terminal by typing in the name of the script (including the .py extension)

Please have a rummage through the source - it’s all documented (and hopefully understandable!)

example_footprint_scores.py

This script tests that everything has been installed and will run correctly. Upon running it, you should see the following window

_images/example_footprint_scores_output.png

If so, congratulations! Everything has installed properly. The red and blue bars correspond to cuts on the positive and negative strand, respectively, and the black line represents the raw Wellington footprint scores.

dnase_cut_counter.py

This is a very simple script that take a BED and a BAM file and produce a new BED file where the score is the number of DNase cuts found in this region.

usage: dnase_cut_counter.py [-h] [-A] regions reads output

Annotates a BED file with with the number of DNase cuts in it

positional arguments:
  regions     BED file
  reads       The BAM file containing the DNase-seq data
  output      filename to write the output to

optional arguments:
  -h, --help  show this help message and exit
  -A          ATAC-seq mode (default: False)

dnase_average_profile.py

Important

Make sure you look at the next section (DNase I cleavage bias correction)

_images/K562CTCFCHIP.png

Average profile of DNase I activity surrounding ChIP-seq confirmed CTCF sites in K562 data.

Average profile plots illustrating DNase activity surrounding a set of regions are frequently used in papers. Here, we provide a simple way to generate one.

usage: dnase_average_profile.py [-h] [-w WINDOW_SIZE] [-bf BIAS_FILE] [-i]
                                [-c] [-n] [-b] [-A]
                                regions reads output

Plots average profile of DNase (or Tn5 for ATAC-seq) activity surrounding a
list of regions in a BED file

positional arguments:
  regions               BED file of the regions you want to generate the
                        average profile for
  reads                 The BAM file containing the DNase-seq data
  output                filename to write the output to

optional arguments:
  -h, --help            show this help message and exit
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Size of flanking area around centre of the regions to
                        plot (default: 100)
  -bf BIAS_FILE, --bias-file BIAS_FILE
                        Location of the sorted, index
  -i                    Ignores any strand information in BED file and plots
                        data relative to reference strand
  -c                    Combine the strand information into one graph
  -n                    Normalise cut counts to a fraction peaks
  -b                    Normalise for cutting bias
  -A                    ATAC-seq mode

Hopefully this is self-explanatory. This script uses matplotlib to generate the output, so it will write a filetype based on the file extension provided (e.g. out.png or output.pdf).

dnase_wig_tracks.py

Often, we want to visualise the raw cut data (just the 5’ most ends of the cuts) from a DNase-seq experiment, as visualising the pileups isn’t helpful here. Here’s the FMR1 promoter viewed as a BAM file in IGV

_images/FMR1a.png

and here’s the corresponding cut locations.

_images/FMR1b.png

We provide dnase_wig_tracks.py that generates a WIG file (we recommend you convert it to a BigWIG file using UCSC’s wigToBigWig) based on a BAM file a list of regions of interest

usage: dnase_wig_tracks.py [-h] [-r] [-A] regions reads fw_output rev_output

Writes two WIG files with the cut information based on the regions in reads
BED file and the reads in reads BAM file

positional arguments:
  regions     BED file of the regions you want to write wig tracks for
  reads       The BAM file containing the read data
  fw_output   Path to write the forward reads wig track to
  rev_output  Path to write the reverse reads wig track to

optional arguments:
  -h, --help  show this help message and exit
  -r, --real  Report cuts on the negative strand as positive numbers instead
              of negative (default: False)
  -A          ATAC-seq mode (default: False)

Note that by default, cuts on the reverse strand will be reported as negative numbers (for visualisation). If you want to be using this data for something else, you can pass the -r flag, which will use the real number of cuts.

dnase_to_JSON.py

This outputs DNase-seq data to JSON

usage: dnase_to_JSON.py [-h] [-w WINDOW_SIZE] [-i] [-A] regions reads output

Writes a JSON file of DNase I cuts for regions from a BED file

positional arguments:
  regions               BED file of the regions
  reads                 BAM file containing the read data
  output                filename to write the JSON output to

optional arguments:
  -h, --help            show this help message and exit
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Resize all regions to a specific length
  -i                    Ignores strand information in BED file
  -A                    ATAC-seq mode (default: False)

dnase_to_javatreeview.py

Important

Make sure you look at the next section (DNase I cleavage bias correction)

_images/K562AP1CHIP.png

Want to make a heatmap? Love JavaTreeView? So do we! This script will generate a CSV file that you can put straight into JavaTreeView to visualize your data.

The options to be aware of here are -i and -a

usage: dnase_to_javatreeview.py [-h] [-w WINDOW_SIZE] [-i] [-o] [-a] [-n] [-c]
                                [-b] [-A] [-bf BIAS_FILE] [-r]
                                regions reads output

Writes a JavaTreeView file based on the regions in reads BED file and the
reads in reads BAM file

positional arguments:
  regions               BED file of the regions you want to generate the
                        heatmap for
  reads                 The BAM file containing the read data
  output                filename to write the CSV output to

optional arguments:
  -h, --help            show this help message and exit
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Size of flanking area around centre of the regions to
                        plot (default: 100)
  -i                    Ignores strand information in BED file
  -o                    Orders output the same as the input (default: orders
                        by FOS)
  -a                    Write absolute cut counts instead strand imbalanced
                        counts
  -n                    Normalise the cut data for each region between 0 and 1
  -c                    Disable memory caching (low RAM mode)
  -b                    Normalise for cutting bias
  -A                    ATAC-seq mode (default: False)
  -bf BIAS_FILE, --bias-file BIAS_FILE
                        Location of the sorted, index
  -r                    Randomise the ordering of the output

wellington_footprints.py

So you want to get footprints from your data? No problem. We provide a handy script that will do this for you. There’s lots of options here, so please read through them carefully. The most basic usage of the script uses the default parameters described in our original paper. If anything goes wrong at any point, then there should be useful error messages telling you exactly what went wrong.

usage: wellington_footprints.py [-h] [-b] [-sh SHOULDER_SIZES]
                                [-fp FOOTPRINT_SIZES] [-d] [-fdr FDR_CUTOFF]
                                [-fdriter FDR_ITERATIONS]
                                [-fdrlimit FDR_LIMIT] [-pv PV_CUTOFFS] [-dm]
                                [-o OUTPUT_PREFIX] [-p P] [-A]
                                regions reads outputdir

Footprint the DHSs in a DNase-seq or ATAC-seq experiment using the Wellington
Algorithm.

positional arguments:
  regions               BED file of the regions you want to footprint
  reads                 The BAM file containing the DNase-seq reads
  outputdir             A writeable directory to write the results to

optional arguments:
  -h, --help            show this help message and exit
  -b, --bonferroni      Performs a bonferroni correction (default: False)
  -sh SHOULDER_SIZES, --shoulder-sizes SHOULDER_SIZES
                        Range of shoulder sizes to try in format
                        "from,to,step" (default: 35,36,1)
  -fp FOOTPRINT_SIZES, --footprint-sizes FOOTPRINT_SIZES
                        Range of footprint sizes to try in format
                        "from,to,step" (default: 11,26,2)
  -d, --one_dimension   Use Wellington 1D instead of Wellington (default:
                        False)
  -fdr FDR_CUTOFF, --FDR_cutoff FDR_CUTOFF
                        Write footprints using the FDR selection method at a
                        specific FDR (default: 0.01)
  -fdriter FDR_ITERATIONS, --FDR_iterations FDR_ITERATIONS
                        How many randomisations to use when performing FDR
                        calculations (default: 100)
  -fdrlimit FDR_LIMIT, --FDR_limit FDR_LIMIT
                        Minimum p-value to be considered significant for FDR
                        calculation (default: -20)
  -pv PV_CUTOFFS, --pv_cutoffs PV_CUTOFFS
                        Select footprints using a range of pvalue cutoffs
                        (default: -10,-20,-30,-40,-50,-75,-100,-300,-500,-700
  -dm, --dont-merge-footprints
                        Disables merging of overlapping footprints (Default:
                        False)
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        The prefix for results files (default:
                        <reads.regions>)
  -p P                  Number of processes to use (default: uses all CPUs)
  -A                    ATAC-seq mode (default: False)

wellington_bootstrap.py

This script will calculate the differential footprints between two DNase-seq datasets using the Wellington-bootstrap algorithm.

usage: wellington_bootstrap.py [-h] [-fp FOOTPRINT_SIZES] [-fdr FDR_CUTOFF]
                               [-fdriter FDR_ITERATIONS] [-fdrlimit FDR_LIMIT]
                               [-p PROCESSES] [-A]
                               treatment_bam control_bam bedsites
                               treatment_only_output control_only_output

Scores Differential Footprints using Wellington-Bootstrap.

positional arguments:
  treatment_bam         BAM file for treatment
  control_bam           BAM file for control
  bedsites              BED file of genomic locations to search in
  treatment_only_output
                        File to write treatment specific fooprints scores to
  control_only_output   File to write control specific footprint scores to

optional arguments:
  -h, --help            show this help message and exit
  -fp FOOTPRINT_SIZES, --footprint-sizes FOOTPRINT_SIZES
                        Range of footprint sizes to try in format
                        "from,to,step" (default: 11,26,2)
  -fdr FDR_CUTOFF, --FDR_cutoff FDR_CUTOFF
                        Detect footprints using the FDR selection method at a
                        specific FDR (default: 0.01)
  -fdriter FDR_ITERATIONS, --FDR_iterations FDR_ITERATIONS
                        How many randomisations to use when performing FDR
                        calculations (default: 100)
  -fdrlimit FDR_LIMIT, --FDR_limit FDR_LIMIT
                        Minimum p-value to be considered significant for FDR
                        calculation (default: -20)
  -p PROCESSES, --processes PROCESSES
                        Number of processes to use (default: uses all CPUs)
  -A                    ATAC-seq mode (default: False)

dnase_ddhs_scorer.py

This will score DHSs between two DNase-seq datasets according to the ∆DHS algorithm (He et al. 2012)

usage: dnase_ddhs_scorer.py [-h] [-l] [-A]
                            regions treat_dhs control_dhs reads_treat
                            reads_control output

Annotates a set of DHSs with the dDHS score (He et al. 2012)

positional arguments:
  regions        The set of BED files you wish to annotate with dDHS scores
  treat_dhs      The DHSs belonging to the Treatment
  control_dhs    The DHSs belonging to the control
  reads_treat    The BAM file containing the Treatment DNase-seq data
  reads_control  The BAM file containing the Control DNase-seq data
  output         filename to write the output to

optional arguments:
  -h, --help     show this help message and exit
  -l             low RAM mode (disables caching) (Default: False)
  -A             ATAC-seq mode (default: False)

dnase_bias_estimator.py

This is mainly here for curiosity reasons - the chances are that if you’re trying to calculate this then you’ll know how to do it.

usage: dnase_bias_estimator.py [-h]
                               regions reads genome_sequence genomesize output

Calculates the 6-mer 5' insertion bias for a NGS dataset

positional arguments:
  regions          BED file of the regions you want to exclude from
                   calculating the bias. This is usually the DHSs.
  reads            The sorted, indexed BAM file containing the DNase-seq data
  genome_sequence  The sorted, indexed FASTA file containing the genome
                   sequence
  genomesize       The .chrom.sizes file containing chromosome sizes generated
                   using something like "mysql --user=genome --host=genome-
                   mysql.cse.ucsc.edu -A -e "select chrom, size from
                   hg19.chromInfo" > hg19.chrom.sizes"
  output           output file prefix to write the observed/expected ratios to
                   (will append .txt and .pickle)

optional arguments:
  -h, --help       show this help message and exit