READemption’s subcommands

In general the subcommands need at least one argument - the analysis folder. If this is not given READemption assumes that the current folder is the analysis folder.

create

create generates the required folder structure for input and output files. Once these folders are created the input files have to be placed into the correct locations. As a minimal requirement, RNA-Seqs reads in FASTA format (can be compressed with bzip2 or gzip) must be placed in input/reads and the reference sequence in FASTA format must be copied or linked in input/reference_sequences. For the command gene_quanti annotation files in GFF3 format have to be put into input/annotations.

usage: reademption create [-h] project_path

positional arguments:
  project_path  Name/path of the project.

optional arguments:
  -h, --help    show this help message and exit

align

align performs the clipping and size filtering of the reads, as well as the actual aligning to the reference sequences. It also generates statistics about the steps (e.g. number of aligned reads, number of mappings). As the result of this steps are needed by the other subcommands it has to be run before the others. It requires reads in FASTA or FASTQ format (or counterparts compressed with gzip or bzip2) and reference sequences in FASTA format. align generates the read alignments in BAM format (*.bam) and also index files for those (*.bam.bai). Is also stores unmapped reads so that they can be inspected e.g. to search for contaminations. The file output/align/reports_and_stats/read_alignment_stats.csv lists several mapping statistics. The folder output/align/reports_and_stats/stats_data_json/ contains files with the original countings in JSON format. Please be aware that READemption can perform only basic quality trimming and adapter clipping. If this is not sufficient you can use the FASTX toolkit, cutadapt or other tools for the preprocessing.

usage: reademption align [-h] [--min_read_length MIN_READ_LENGTH]
                         [--processes PROCESSES]
                         [--segemehl_accuracy SEGEMEHL_ACCURACY]
                         [--segemehl_evalue SEGEMEHL_EVALUE]
                         [--segemehl_bin SEGEMEHL_BIN] [--paired_end]
                         [--split] [--poly_a_clipping] [--realign]
                         [--keep_original_alignments] [--lack_bin LACK_BIN]
                         [--fastq] [--check_for_existing_files] [--progress]
                         [--crossalign_cleaning CROSSALIGN_CLEANING_STRING]
                         [project_path]

positional arguments:
  project_path          Path of the project folder. If none is given the
                        current directory is used.

optional arguments:
  -h, --help            show this help message and exit
  --min_read_length MIN_READ_LENGTH, -l MIN_READ_LENGTH
                        Minimal read length after clipping (default 12).
                        Should be higher for eukaryotic species.
  --processes PROCESSES, -p PROCESSES
                        Number of processes that should be used (default 1).
  --segemehl_accuracy SEGEMEHL_ACCURACY, -a SEGEMEHL_ACCURACY
                        Segemehl's minimal accuracy (in %) (default 95).
  --segemehl_evalue SEGEMEHL_EVALUE, -e SEGEMEHL_EVALUE
                        Segemehl's maximal e-value (default 5.0).
  --segemehl_bin SEGEMEHL_BIN, -s SEGEMEHL_BIN
                        Segemehl's binary path (default 'segemehl.x').
  --paired_end, -P      Use this if reads are originating from a paired-end
                        sequencing. The members of a pair must be marked with
                        '_p1' and '_p2' in front of the file type suffixes
                        (e.g. 'my_sample_p1.fa' and 'my_sample_p2.fa' or
                        'my_sample_p1.fa.bz2' and 'my_sample_p2.fa.bz2'). This
                        option cannot be use with polyA tail clipping.
  --split, -S           Run segemehl with read splitting.
  --poly_a_clipping, -c
                        Perform polyA tail clipping. This option cannot be
                        used for paired-end reads.
  --realign, -r         Perform realignment of unmapped reads using 'lack'.
  --keep_original_alignments, -k
                        Only used with --realign/-r. Keep the alignment file
                        of the primary mapper (segemehl) and the realigner
                        (lack) after merging.
  --lack_bin LACK_BIN, -L LACK_BIN
                        Lack's binary path (default 'lack.x').
  --fastq, -q           Input reads are in FASTQ not FASTA format.
  --min_phred_score MIN_PHRED_SCORE, -Q MIN_PHRED_SCORE
                        Minimal Phred score. Works only if read are given in
                        FASTQ format. As soon as a based drop below this value
                        it and all the nucleotides downstream of it will be
                        trimmed off.
  --adapter ADAPTER, -A ADAPTER
                        Adapter sequence. If it is found in a read it and all
                        the nucleotides downstream will be trimmed off.
  --check_for_existing_files, -f
                        Check for existing files (e.g. from a interrupted
                        previous run) and do not overwrite them if they exits.
                        Attention! You have to take care that there are no
                        partially generated files left!
  --reverse_complement, -R
                        Map reverse complement of the input reads.
  --progress, -g          Show progress of the segemehl mapping.
  --crossalign_cleaning CROSSALIGN_CLEANING_STRING, -x CROSSALIGN_CLEANING_STRING
                        Remove reads that are cross-mapped to replicons of
                        different species. To associated species and replicons
                        give a string in the following format: '<ORG_NAME_1>:<
                        org_1_repl1>,<org_1_repl2>,..,<org_1_repl_n>;<ORG_NAME
                        _2>:<org_2_repl1>,<org_2_repl2>,..,<org_2_repl_n>'

coverage

coverage generates strand specific coverage files in wiggle format based on the read alignments. These wiggle files can be viewed in common genome browser like the Integrated genome browser (IGB) or the Integrative genome viewer (IGV). Three sets of wiggle files will be generated: raw counting values without normalization (located in the folder coverage-raw), normalized by the total number of aligned reads (abbreviated as tnoar) and the multiplied by the lowest number of aligned reads of all considered libraries (in folder coverage-tnoar_min_normalized) as well as normalized by the total number of aligned reads and multiplied by one million (coverage-tnoar_mil_normalized). The different normalizations make a visual semi-quantitative comparative possible and enable to perform transcription start site analysis (e.g. using tools like TSSPredator). For each library and set there will be coverage files for the forward and the reverse strand. The coverages for the forward strand have positive values while the one for the reverse stand have negative values in order to make a visual discrimination easy. Per default all reads and each position of them will be considered. To calculate the coverages only based on uniquely aligned read use the --unique_only parameter. If only the first base should be considered add --first_base_only. Reads are aligned to multiple location will account only in fraction to the values of the different positions. For example a read that is mapped to three different location will contribute a value of 1/3 to each of the nucleotiedes of these positions. To turn off this behavior use --skip_read_count_splitting.

usage: reademption coverage [-h] [--unique_only] [--normalize_by_uniquely]
                            [--processes PROCESSES]
                            [--skip_read_count_splitting] [--first_base_only]
                            [--check_for_existing_files]
                            [project_path]

positional arguments:
  project_path          Path of the project folder. If none is given the
                        current directory is used.

optional arguments:
  -h, --help            show this help message and exit
  --unique_only, -u     Use uniquely aligned reads only.
  --normalize_by_uniquely, -U
                        Normalize by the number of uniquely aligned reads. By
                        default the normalization is done based on the total
                        number of aligned reads even if only uniquely aligned
                        reads are used for the coverage calculation.
  --processes PROCESSES, -p PROCESSES
                        Number of processes that should be used (default 1).
  --skip_read_count_splitting, -s
                        Do not split the read counting between different
                        alignings. Default is to do the splitting.
  --non_strand_specific, -d
                        Do not distict between the coverage of the forward and
                        reverse strand but sum them to a single value for each
                        base.
  --first_base_only, -b
                        Only the first bases 5' base of each read aligning is
                        taken into account.
  --check_for_existing_files, -f
                        Check for existing files (e.g. from a interrupted
                        previous run) and do not overwrite them if they exits.
                        Attention! You have to take care that there are no
                        partially generated files left!

gene_quanti

With gene_quanti the number of reads overlapping with each of the annotation entries is counted and the results are combined in tables. At least one GGF3 file with annotations has to be placed in input/annotations. The sequence ID of the sequenced must be precisely the same as the IDs used in the reference sequence FASTA files. To specify the feature classes (the third column in the GFF3 file e.g. CDS, gene, rRNA, tRNA) that should be quantified the parameter --features can be used. Otherwise countings for all annotation entries are generated. Per default sense and anti-sense overlaps are counted and separately listed.

usage: reademption gene_quanti [-h] [--min_overlap MIN_OVERLAP]
                               [--no_count_split_by_alignment_no]
                               [--no_count_splitting_by_gene_no]
                               [--skip_antisense] [--processes PROCESSES]
                               [--features ALLOWED_FEATURES] [--unique_only]
                               [--pseudocounts] [--check_for_existing_files]
                               [project_path]

positional arguments:
  project_path          Path of the project folder. If none is given the
                        current directory is used.

optional arguments:
  -h, --help            show this help message and exit
  --min_overlap MIN_OVERLAP, -o MIN_OVERLAP
                        Minimal read-annotation-overlap (in nt) (default 1).
  --no_count_split_by_alignment_no, -n
                        Do not split read countings by the number of
                        alignments a read has. By default this count splitting
                        is performed.
  --no_count_splitting_by_gene_no, -l
                        Do not split read countings by the number of genes it
                        overlaps with. By default this count splitting is
                        performed.
  --skip_antisense, -a  Do not count anti-sense read-gene-overlaps. By default
                        sense and anti-sense overlaps are counted and
                        separately reported.
  --non_strand_specific
                        Use countings of reads overlapping with a gene on both
                        strands and sum them up.
  --processes PROCESSES, -p PROCESSES
                        Number of processes that should be used (default 1).
  --features ALLOWED_FEATURES, -t ALLOWED_FEATURES
                        Comma separated list of features that should be
                        considered (e.g. gene, cds, region, exon). Other
                        feature will be skipped. If not specified all features
                        will be considered.
  --unique_only, -u     Use uniquely aligned reads only.
  --pseudocounts, -c    Add a pseudocount of 1 to each gene.
  --check_for_existing_files, -f
                        Check for existing files (e.g. from a interrupted
                        previous run) and do not overwrite them if they exits.
                        Attention! You have to take care that there are no
                        partially generated files left!

deseq

Differential gene expression can be performed using deseq which will run a DESeq2 analyses for all possible combinations of conditions. To allocated the conditions to the libraries use the --libs and --conditions parameters (e.g. --libs SamA_R1.fa,SamA_R2.fa,SamB_R1.fa,SamB_R2.fa --conditions SamA,SamA,SamB,SamB).

usage: reademption deseq [-h] --libs LIBS --conditions CONDITIONS
                         [--cooks_cutoff_off]
                         [project_path]

positional arguments:
  project_path          Path of the project folder. If none is given the
                        current directory is used.

optional arguments:
  -h, --help            show this help message and exit
  --libs LIBS, -l LIBS  Comma separated list of libraries.
  --conditions CONDITIONS, -c CONDITIONS
                        Comma separated list of condition in the same order as
                        their corresponding libraries.
  --cooks_cutoff_off, -k

viz_align

viz_align plots histograms of the read length distributions of the reads before and after the read clipping.

usage: reademption viz_align [-h] [project_path]

positional arguments:
  project_path  Path of the project folder. If none is given the current
                directory is used.

optional arguments:
  -h, --help    show this help message and exit

viz_gene_quanti

viz_gene_quanti creates scatterplots in which the raw gene wise quantification values are compared for each library pair (all-against-all). For each comparison the pearson correllation (r) coefficiant is. Additionally, bar charts that visualize the distribution of the read counting of the different annotation classes are plotted.

usage: reademption viz_gene_quanti [-h] [project_path]

positional arguments:
  project_path  Path of the project folder. If none is given the current
                directory is used.

optional arguments:
  -h, --help    show this help message and exit

viz_deseq

viz_deseq generates MA-plots of the comparison (log2 fold changes vs. the base mean) as well as volcano plots (log2 fold changes vs. p-values / adjusted p-values).

usage: reademption viz_deseq [-h] [project_path]

positional arguments:
  project_path  Path of the project folder. If none is given the current
                directory is used.

optional arguments:
  -h, --help    show this help message and exit
--max_pvalue MAX_PVALUE
                        Maximum adjusted p-value for genes considered to be
                        regulated. Genes with adjusted p-values below will be
                        marked red. (default 0.05)