Manual

ST Pipeline is a tool to process Spatial Transcriptomics raw data or single cell data. The data is filtered, aligned to a genome, annotated against a reference and demultiplexed by array coordinates. Duplicates are then removed using the Unique Molecular Identifiers (UMIs) and the remaining reads are aggregated into counts. The output contains the counts matrix, a stats file, a log file and a BED file with all the transcripts.

The ST Pipeline requires two FASTQ files, an IDs file (BARCODE, X, Y), the path to a STAR genome index, the path to an annotation file in GTF format and a dataset name.
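
As an illustration of the IDs file described above, the following Python sketch reads a three-column file into a map from barcode to array coordinates. The whitespace-separated layout, the integer coordinates and the function name read_ids are assumptions made for this example; they are not confirmed by this manual.

```python
import io


def read_ids(handle):
    """Parse lines of the form 'BARCODE X Y' into {barcode: (x, y)}.

    Assumption: columns are whitespace-separated and X, Y are integers.
    """
    barcodes = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        barcode, x, y = fields
        barcodes[barcode] = (int(x), int(y))
    return barcodes


# Hypothetical two-spot example with 18-base barcodes.
example = io.StringIO("ACGTACGTACGTACGTAC 1 2\nTTGCATTGCATTGCATTG 3 4\n")
ids = read_ids(example)
print(ids["ACGTACGTACGTACGTAC"])
```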

The ST Pipeline has many parameters. To see a description of them, type: st_pipeline_run.py --help

Note that the minimum read length (--min-length-qual-trimming) depends on the type of kit used and should be adjusted accordingly, e.g. a 150bp kit should have a different minimum read length than a 75bp kit.

Soft clipping is also not recommended when using the 75bp kit, due to the shorter read length; it can be disabled with --disable-clipping.

The UMI filter can be used for array batches 1000L6 and earlier. It cannot be used for array batches 1000L7 and newer, as the UMIs in these arrays are fully randomised.
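
The kind of check that a template-based UMI filter implies can be sketched by expanding an IUPAC template into a regular expression. This is an illustration only, not the pipeline's own code: the function names are invented here, and the template used is the default listed for --umi-filter-template.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def template_to_regex(template):
    """Turn an IUPAC template into a compiled regular expression."""
    return re.compile("".join("[" + IUPAC[code] + "]" for code in template))


def umi_matches(umi, template="WSNNWSNNV"):
    """Return True if the whole UMI is compatible with the template."""
    return template_to_regex(template).fullmatch(umi) is not None


print(umi_matches("ACGTTGACA"))  # compatible with WSNNWSNNV
print(umi_matches("CCGTTGACA"))  # first base C is not W (A or T)
```

A fully randomised UMI has no such structure, which is consistent with the note above that the filter does not apply to newer array batches.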

@Author Jose Fernandez Navarro <jose.fernandez.navarro@scilifelab.se>

usage: st_pipeline_run.py [-h]
                        --ids [FILE]
                        --ref-map [FOLDER]
                        --ref-annotation [FILE]
                        --expName [STRING]
                        [--allowed-missed [INT]]
                        [--allowed-kmer [INT]]
                        [--overhang [INT]]
                        [--min-length-qual-trimming [INT]]
                        [--mapping-rv-trimming [INT]]
                        [--length-id [INT]]
                        [--contaminant-index [FOLDER]]
                        [--qual-64]
                        [--htseq-mode [STRING]]
                        [--htseq-no-ambiguous]
                        [--start-id [INT]]
                        [--no-clean-up]
                        [--verbose]
                        [--mapping-threads [INT]]
                        [--min-quality-trimming [INT]]
                        [--bin-path [FOLDER]]
                        [--log-file [STR]]
                        [--output-folder [FOLDER]]
                        [--temp-folder [FOLDER]]
                        [--umi-allowed-mismatches [INT]]
                        [--umi-start-position [INT]]
                        [--umi-end-position [INT]]
                        [--keep-discarded-files]
                        [--remove-polyA [INT]]
                        [--remove-polyT [INT]]
                        [--remove-polyG [INT]]
                        [--remove-polyC [INT]]
                        [--remove-polyN [INT]]
                        [--filter-AT-content [INT%]]
                        [--filter-GC-content [INT%]]
                        [--disable-multimap]
                        [--disable-clipping]
                        [--umi-cluster-algorithm [STRING]]
                        [--min-intron-size [INT]]
                        [--max-intron-size [INT]]
                        [--umi-filter]
                        [--umi-filter-template [STRING]]
                        [--compute-saturation]
                        [--include-non-annotated]
                        [--inverse-mapping-rv-trimming [INT]]
                        [--low-memory]
                        [--two-pass-mode]
                        [--strandness [STRING]]
                        [--umi-quality-bases [INT]]
                        [--umi-counting-offset [INT]]
                        [--demultiplexing-metric [STRING]]
                        [--demultiplexing-multiple-hits-keep-one]
                        [--demultiplexing-trim-sequences [INT]]
                        [--homopolymer-mismatches [INT]]
                        [--version]
                        fastq_file_fw fastq_file_rv
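
A minimal invocation supplies the four required options and the two FASTQ files. The Python sketch below assembles such a command from the options shown in the usage above. Every path and the dataset name are placeholders invented for this example, and the command is only printed, not executed.

```python
import shlex

# All file and folder names below are hypothetical placeholders.
command = [
    "st_pipeline_run.py",
    "--ids", "ids.txt",
    "--ref-map", "star_index/",
    "--ref-annotation", "annotation.gtf",
    "--expName", "my_dataset",
    "--output-folder", "output/",
    "--mapping-threads", "8",
    "R1.fastq",
    "R2.fastq",
]

# Print a copy-pasteable shell command. To run it instead, pass the
# list to subprocess.run(command, check=True).
print(shlex.join(command))
```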

positional arguments

fastq_file_fw
  Read_1 containing the spatial barcodes and UMIs for each sequence.

fastq_file_rv
  Read_2 containing the gene sequence corresponding to the sequence in
  Read_1.
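
With the default values of --start-id, --length-id, --umi-start-position and --umi-end-position listed under the optional arguments, the barcode and the UMI occupy fixed positions in Read_1. The following sketch shows how those defaults map onto Python slices. The read sequence is made up and the function name split_read1 is hypothetical.

```python
def split_read1(seq, start_id=0, length_id=18, umi_start=18, umi_end=27):
    """Return (barcode, umi) from a Read_1 sequence.

    umi_start is 0-based and umi_end is 1-based, which matches
    Python's half-open slice seq[umi_start:umi_end].
    """
    barcode = seq[start_id:start_id + length_id]
    umi = seq[umi_start:umi_end]
    return barcode, umi


# Made-up read: 18-base barcode, 9-base UMI, then a poly-T stretch.
read1 = "ACGTACGTACGTACGTAC" + "ATCGATCGA" + "TTTTTTTTTT"
barcode, umi = split_read1(read1)
print(barcode, umi)
```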

optional arguments

-h, --help                          Show this help message and exit.
--ids [FILE]                        Path to the file containing the map of
                                    barcodes to the array coordinates.
--ref-map [FOLDER]                  Path to the folder with the STAR index
                                    for the genome that you want to use to
                                    align the reads.
--ref-annotation [FILE]             Path to the reference annotation file
                                    (GTF or GFF format is required) to be
                                    used to annotate the reads.
--expName [STRING]                  Name of the experiment/dataset
                                    (this name is prepended to the
                                    output files).
--allowed-missed [INT]              Number of allowed mismatches when
                                    demultiplexing against the barcodes
                                    with TaggD (default: 2).
--allowed-kmer [INT]                KMer length when demultiplexing against
                                    the barcodes with TaggD (default: 6).
--overhang [INT]                    Extra flanking bases added when
                                    demultiplexing against the barcodes.
--min-length-qual-trimming [INT]    Minimum length of the reads after
                                    trimming, shorter reads will be
                                    discarded (default: 25).
--mapping-rv-trimming [INT]         Number of bases to trim in the reverse
                                    reads for the mapping step (5' end)
                                    (default: 0).
--length-id [INT]                   Length of IDs
                                    (the length of the barcodes)
                                    (default: 18).
--contaminant-index [FOLDER]        Path to the folder with a STAR index
                                    with a contaminant genome. Reads will
                                    be filtered against the specified
                                    genome and mapping reads will be
                                    discarded.
--qual-64                           Use phred-64 quality instead of
                                    phred-33 (default).
--htseq-mode [STRING]               Mode of annotation when using HTSeq.
                                    Modes = {union,
                                    intersection-nonempty (default),
                                    intersection-strict}.
--htseq-no-ambiguous                When using HTSeq, discard reads that
                                    are annotated to ambiguous genes
                                    (default: False).
--start-id [INT]                    Start position of the IDs (Barcodes)
                                    in the R1 (counting from 0)
                                    (default: 0).
--no-clean-up                       Do not remove temporary/intermediary
                                    files (useful for debugging).
--verbose                           Show extra information in the log file.
--mapping-threads [INT]             Number of threads to use in the mapping
                                    step (default: 4).
--min-quality-trimming [INT]        Minimum phred quality a base must have
                                    in the trimming step (default: 20).
--bin-path [FOLDER]                 Path to folder where binary executables
                                    are present (system path by default).
--log-file [STR]                    Name of the file that we want to use to
                                    store the logs
                                    (default output to screen).
--output-folder [FOLDER]            Path of the output folder.
--temp-folder [FOLDER]              Path of the location for temporary
                                    files.
--umi-allowed-mismatches [INT]      Number of allowed mismatches
                                    (hamming distance) that UMIs of the
                                    same gene-spot must have in order to
                                    cluster together (default: 1).
--umi-start-position [INT]          Position in R1 (base wise) of the first
                                    base of the UMI (counting from 0)
                                    (default: 18).
--umi-end-position [INT]            Position in R1 (base wise) of the last
                                    base of the UMI (counting from 1)
                                    (default: 27).
--keep-discarded-files              Write unaligned, un-annotated and
                                    un-demultiplexed reads to files.
--remove-polyA [INT]                Remove PolyA stretches of the given
                                    length from R2 (default: 15).
--remove-polyT [INT]                Remove PolyT stretches of the given
                                    length from R2 (default: 15).
--remove-polyG [INT]                Remove PolyG stretches of the given
                                    length from R2 (default: 15).
--remove-polyC [INT]                Remove PolyC stretches of the given
                                    length from R2 (default: 15).
--remove-polyN [INT]                Remove PolyN stretches of the given
                                    length from R2 (default: 15).
--filter-AT-content [INT%]          Discards reads whose total number of A
                                    and T bases is greater than or equal to
                                    the given percentage of the read
                                    (default: 90).
--filter-GC-content [INT%]          Discards reads whose total number of G
                                    and C bases is greater than or equal to
                                    the given percentage of the read
                                    (default: 90).
--disable-multimap                  If activated, reads that align to
                                    multiple locations during mapping are
                                    all discarded. Otherwise the
                                    highest-scoring alignment is kept.
--disable-clipping                  If activated, disable soft-clipping
                                    (local alignment) in the mapping step.
--umi-cluster-algorithm [STRING]    Type of clustering algorithm to use
                                    when removing UMI duplicates.
                                    Modes = {naive (default), hierarchical,
                                    Adjacent, AdjacentBi}.
--min-intron-size [INT]             Minimum allowed intron size when
                                    searching for splice variants with
                                    STAR. Spliced alignments are disabled
                                    by default (=1); to enable them, set
                                    this parameter to a bigger number,
                                    for example 10 or 20 (default: 1).
--max-intron-size [INT]             Maximum allowed intron size when
                                    searching for splice variants with
                                    STAR. Spliced alignments are disabled
                                    by default (=1); to enable them, set
                                    this parameter to a big number, for
                                    example 10000 or 100000 (default: 1).
--umi-filter                        Enables the UMI quality filter based on
                                    the template given in
                                    --umi-filter-template.
--umi-filter-template [STRING]      UMI template (IUPAC nucleotide code)
                                    for the UMI filter
                                    (default: WSNNWSNNV).
--compute-saturation                Performs a saturation curve computation
                                    by sub-sampling the annotated reads,
                                    computing unique molecules and then a
                                    saturation curve
                                    (included in the log file).
--include-non-annotated             Do not discard un-annotated reads
                                    (they will be labeled __no_feature)
--inverse-mapping-rv-trimming [INT] Number of bases to trim in the reverse
                                    reads for the mapping step on the
                                    3' end.
--low-memory                        Writes temporary records to disk in
                                    order to save memory, at the cost of
                                    speed.
--two-pass-mode                     Activates the 2 pass mode in STAR to
                                    also map against splice variants.
--strandness [STRING]               What strandness mode to use when
                                    annotating with htseq-count
                                    [no, yes(default), reverse].
--umi-quality-bases [INT]           Maximum number of low quality bases
                                    allowed in a UMI (default: 8).
--umi-counting-offset [INT]         Expression count for each gene-spot
                                    combination is expressed as the number
                                    of unique UMIs in each strand/start
                                    position. However some reads might have
                                    slightly different start positions due
                                    to amplification artifacts. This
                                    parameter allows one to define an
                                    offset from where to count unique UMIs
                                    (default: 150).
--demultiplexing-metric [STRING]    Distance metric for TaggD
                                    demultiplexing: Subglobal, Levenshtein
                                    or Hamming (default: Subglobal).
--demultiplexing-multiple-hits-keep-one
                                    When multiple ambiguous hits with the
                                    same score are found in the
                                    demultiplexing, keep one (random).
--demultiplexing-trim-sequences [INT]
                                    Trims the barcodes in the input file
                                    when demultiplexing. The bases to trim
                                    are given as a list of pairs
                                    START END START END ... where START is
                                    the integer position of the first base
                                    (0 based) and END is the integer
                                    position of the last base (1 based).
                                    Trimming sequences can be given
                                    several times.
--homopolymer-mismatches [INT]      Number of mismatches allowed when
                                    removing homopolymers (default: 0).
--version                           Show program's version number and exit
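
To make the meaning of --umi-allowed-mismatches more concrete, the sketch below groups the UMIs of one gene-spot by Hamming distance and counts the groups. It is a simple greedy grouping written for this manual and is not claimed to be identical to any of the pipeline's clustering algorithms. The function names and the example UMIs are invented.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_unique_umis(umis, allowed_mismatches=1):
    """Greedily group UMIs and return the number of groups.

    Each UMI joins the first group whose representative is within
    allowed_mismatches; otherwise it starts a new group.
    """
    representatives = []
    for umi in sorted(umis):
        if not any(hamming(umi, rep) <= allowed_mismatches
                   for rep in representatives):
            representatives.append(umi)
    return len(representatives)


# Made-up UMIs observed for a single gene-spot combination.
umis = ["AAAAAAAAA", "AAAAAAAAT", "CCCCCCCCC", "AAAAAAAAA"]
print(count_unique_umis(umis))
```

With one allowed mismatch, the first, second and fourth UMIs collapse into a single molecule; with zero allowed mismatches only exact duplicates collapse.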