Manual
ST Pipeline is a tool for processing raw Spatial Transcriptomics data or single-cell data. The data is filtered, aligned to a genome, annotated against a reference, demultiplexed by array coordinates and then aggregated into counts, with duplicates removed using the Unique Molecular Identifiers (UMIs). The output contains the counts matrix, a stats file, a log file and a BED file with all the transcripts.
The ST Pipeline requires two FASTQ files, an IDs file (BARCODE, X, Y), the path to a STAR genome index, the path to an annotation file in GTF format and a dataset name.
The ST Pipeline has many parameters; to see a description of them, type: st_pipeline_run.py --help
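As an illustration of the required inputs listed above, the following minimal Python sketch assembles a command line using only the mandatory options from the usage text. The helper name build_st_pipeline_command and all file paths and the dataset name are hypothetical placeholders, not part of the ST Pipeline itself.

```python
import shlex


def build_st_pipeline_command(fastq_fw, fastq_rv, ids, ref_map,
                              ref_annotation, exp_name):
    """Return the argument list for a minimal st_pipeline_run.py call.

    Only the required options and the two positional FASTQ files are set;
    every other parameter keeps its default.
    """
    return [
        "st_pipeline_run.py",
        "--ids", ids,
        "--ref-map", ref_map,
        "--ref-annotation", ref_annotation,
        "--expName", exp_name,
        fastq_fw,
        fastq_rv,
    ]


# All paths and the dataset name below are hypothetical placeholders.
command = build_st_pipeline_command(
    fastq_fw="sample_R1.fastq",
    fastq_rv="sample_R2.fastq",
    ids="ids.txt",
    ref_map="star_index",
    ref_annotation="annotation.gtf",
    exp_name="my_dataset",
)

# Print the command as it would be typed in a shell.
print(shlex.join(command))
```

The resulting list could be passed to subprocess.run(command, check=True) once the placeholders point to real files.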
Note that the minimum read length depends on the type of kit used and should be adjusted accordingly; e.g. a 150bp kit should have a different minimum read length than a 75bp kit.
Soft clipping is also not recommended when using the 75bp kit, due to the shorter read length.
The UMI filter can be used for array batches 1000L6 and earlier. It cannot be used for array batches 1000L7 and newer, as the UMI in these arrays is fully randomised.
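To make the idea of a template-based UMI filter concrete, here is a minimal Python sketch that checks a UMI against an IUPAC nucleotide template such as the default WSNNWSNNV documented for --umi-filter-template. It only illustrates IUPAC matching; the function name umi_matches_template is hypothetical and the pipeline's own implementation may differ.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def umi_matches_template(umi, template="WSNNWSNNV"):
    """Return True if every base of the UMI is allowed by the template.

    A UMI of a different length than the template, or one containing
    an uncalled base (N), is rejected.
    """
    if len(umi) != len(template):
        return False
    return all(
        base in IUPAC[code]
        for base, code in zip(umi.upper(), template.upper())
    )


print(umi_matches_template("ACGTTGCAA"))  # passes the default template
print(umi_matches_template("CCGTTGCAA"))  # fails: first base must be A or T
```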
@Author Jose Fernandez Navarro <jose.fernandez.navarro@scilifelab.se>
usage: st_pipeline_run.py [-h]
--ids [FILE]
--ref-map [FOLDER]
--ref-annotation [FILE]
--expName [STRING]
[--allowed-missed [INT]]
[--allowed-kmer [INT]]
[--overhang [INT]]
[--min-length-qual-trimming [INT]]
[--mapping-rv-trimming [INT]]
[--length-id [INT]]
[--contaminant-index [FOLDER]]
[--qual-64]
[--htseq-mode [STRING]]
[--htseq-no-ambiguous]
[--start-id [INT]]
[--no-clean-up]
[--verbose]
[--mapping-threads [INT]]
[--min-quality-trimming [INT]]
[--bin-path [FOLDER]]
[--log-file [STR]]
[--output-folder [FOLDER]]
[--temp-folder [FOLDER]]
[--umi-allowed-mismatches [INT]]
[--umi-start-position [INT]]
[--umi-end-position [INT]]
[--keep-discarded-files]
[--remove-polyA [INT]]
[--remove-polyT [INT]]
[--remove-polyG [INT]]
[--remove-polyC [INT]]
[--remove-polyN [INT]]
[--filter-AT-content [INT%]]
[--filter-GC-content [INT%]]
[--disable-multimap]
[--disable-clipping]
[--umi-cluster-algorithm [STRING]]
[--min-intron-size [INT]]
[--max-intron-size [INT]]
[--umi-filter]
[--umi-filter-template [STRING]]
[--compute-saturation]
[--include-non-annotated]
[--inverse-mapping-rv-trimming [INT]]
[--low-memory]
[--two-pass-mode]
[--strandness [STRING]]
[--umi-quality-bases [INT]]
[--umi-counting-offset [INT]]
[--demultiplexing-metric [STRING]]
[--demultiplexing-multiple-hits-keep-one]
[--demultiplexing-trim-sequences [INT]]
[--homopolymer-mismatches [INT]]
[--version]
fastq_file_fw fastq_file_rv
positional arguments
fastq_file_fw
Read_1 containing the spatial barcodes and UMIs for each sequence.
fastq_file_rv
Read_2 containing the gene sequence corresponding to the sequence in
Read_1.
optional arguments
-h, --help Show this help message and exit.
--ids [FILE] Path to the file containing the map of
barcodes to the array coordinates.
--ref-map [FOLDER] Path to the folder with the STAR index
for the genome that you want to use to
align the reads.
--ref-annotation [FILE] Path to the reference annotation file
(GTF or GFF format is required) to be
used to annotate the reads.
--expName [STRING] Name of the experiment/dataset
(the output files will be prefixed with
this name).
--allowed-missed [INT] Number of allowed mismatches when
demultiplexing against the barcodes
with TaggD (default: 2).
--allowed-kmer [INT] KMer length when demultiplexing against
the barcodes with TaggD (default: 6).
--overhang [INT] Extra flanking bases added when
demultiplexing against the barcodes.
--min-length-qual-trimming [INT] Minimum length of the reads after
trimming, shorter reads will be
discarded (default: 25).
--mapping-rv-trimming [INT] Number of bases to trim in the reverse
reads for the mapping step (5' end)
(default: 0).
--length-id [INT] Length of IDs
(the length of the barcodes)
(default: 18).
--contaminant-index [FOLDER] Path to the folder with a STAR index
with a contaminant genome. Reads will
be filtered against the specified
genome and mapping reads will be
discarded.
--qual-64 Use phred-64 quality instead of
phred-33 (default).
--htseq-mode [STRING] Mode of Annotation when using HTSeq.
Modes = {union ,
intersection-nonempty(default),
intersection-strict}.
--htseq-no-ambiguous When using HTSeq, discard reads that
annotate to ambiguous genes
(default: False).
--start-id [INT] Start position of the IDs (Barcodes)
in the R1 (counting from 0)
(default: 0).
--no-clean-up Do not remove temporary/intermediary
files (useful for debugging).
--verbose Show extra information on the log file.
--mapping-threads [INT] Number of threads to use in the mapping
step (default: 4).
--min-quality-trimming [INT] Minimum phred quality a base must have
in the trimming step (default: 20).
--bin-path [FOLDER] Path to folder where binary executables
are present (system path by default).
--log-file [STR] Name of the file that we want to use to
store the logs
(default output to screen).
--output-folder [FOLDER] Path of the output folder.
--temp-folder [FOLDER] Path of the location for temporary
files.
--umi-allowed-mismatches [INT] Number of allowed mismatches
(hamming distance) that UMIs of the
same gene-spot must have in order to
cluster together (default: 1).
--umi-start-position [INT] Position in R1 (base wise) of the first
base of the UMI (counting from 0)
(default: 18).
--umi-end-position [INT] Position in R1 (base wise) of the last
base of the UMI (counting from 1)
(default: 27).
--keep-discarded-files Writes unaligned, un-annotated
and un-demultiplexed reads to files.
--remove-polyA [INT] Remove PolyA stretches of the given
length from R2 (default: 15).
--remove-polyT [INT] Remove PolyT stretches of the given
length from R2 (default: 15).
--remove-polyG [INT] Remove PolyG stretches of the given
length from R2 (default: 15).
--remove-polyC [INT] Remove PolyC stretches of the given
length from R2 (default: 15).
--remove-polyN [INT] Remove PolyN stretches of the given
length from R2 (default: 15).
--filter-AT-content [INT%] Discards reads whose combined percentage
of A and T bases is greater than or
equal to the given value
(default: 90).
--filter-GC-content [INT%] Discards reads whose combined percentage
of G and C bases is greater than or
equal to the given value
(default: 90).
--disable-multimap If activated, all multi-mapped reads
obtained during mapping will be
discarded. Otherwise the highest-scoring
one will be kept.
--disable-clipping If activated, disable soft-clipping
(local alignment) in the mapping step.
--umi-cluster-algorithm [STRING] Type of clustering algorithm to use
when removing UMI duplicates.
Modes = {naive (default), hierarchical, Adjacent, AdjacentBi}.
--min-intron-size [INT] Minimum allowed intron size when searching for splice variants with STAR.
Spliced alignments are disabled by default (=1); to enable them, set this parameter
to a bigger number, for example 10 or 20 (default: 1).
--max-intron-size [INT] Maximum allowed intron size when searching for splice variants with STAR.
Spliced alignments are disabled by default (=1); to enable them, set this parameter
to a big number, for example 10000 or 100000 (default: 1).
--umi-filter Enables the UMI quality filter based on
the template given in
--umi-filter-template.
--umi-filter-template [STRING] UMI template (IUPAC nucleotide code)
for the UMI filter, default = WSNNWSNNV
--compute-saturation Performs a saturation curve computation
by sub-sampling the annotated reads,
computing unique molecules and then a
saturation curve
(included in the log file).
--include-non-annotated Do not discard un-annotated reads
(they will be labeled __no_feature)
--inverse-mapping-rv-trimming [INT] Number of bases to trim in the reverse
reads for the mapping step on the
3' end.
--low-memory Writes temporary records to disk in
order to save memory, at the cost of
speed.
--two-pass-mode Activates the 2 pass mode in STAR to
also map against splice variants.
--strandness [STRING] What strandness mode to use when
annotating with htseq-count
[no, yes(default), reverse].
--umi-quality-bases [INT] Maximum number of low quality bases
allowed in a UMI (default: 8).
--umi-counting-offset [INT] Expression count for each gene-spot
combination is expressed as the number
of unique UMIs in each strand/start
position. However some reads might have
slightly different start positions due
to amplification artifacts. This
parameter allows one to define an
offset from where to count unique UMIs
(default: 150).
--demultiplexing-metric Distance metric for TaggD demultiplexing:
Subglobal, Levenshtein or Hamming
(default: Subglobal)
--demultiplexing-multiple-hits-keep-one When multiple ambiguous hits with the same score are
found in the demultiplexing, keep one (chosen at random)
--demultiplexing-trim-sequences Trims the barcodes in the input file when demultiplexing.
The bases to trim are given as a list of tuples START END START END .., where
START is the integer position of the first base (0-based) and END is the integer
position of the last base (1-based).
Trimming sequences can be given several times.
--homopolymer-mismatches Number of mismatches allowed when removing homopolymers. (default: 0)
--version Show program's version number and exit
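
The position parameters above (--start-id, --length-id, --umi-start-position, --umi-end-position) and the mismatch threshold in --umi-allowed-mismatches can be illustrated with a short Python sketch. It assumes that a 0-based start and a 1-based end correspond to a half-open Python slice; the function names split_read1 and hamming_distance and the example sequence are hypothetical and do not reflect the pipeline's internal code.

```python
def split_read1(sequence, start_id=0, length_id=18, umi_start=18, umi_end=27):
    """Extract the spatial barcode and the UMI from a Read_1 sequence.

    start_id and umi_start count from 0 (first base), while umi_end
    counts from 1 (last base), so both regions map onto half-open slices.
    """
    barcode = sequence[start_id:start_id + length_id]
    umi = sequence[umi_start:umi_end]
    return barcode, umi


def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical Read_1: an 18-base barcode, a 9-base UMI, then extra bases.
read1 = "ACGTACGTACGTACGTAC" + "ATGGTCAAG" + "TTTT"
barcode, umi = split_read1(read1)
print(barcode, umi)

# Two UMIs within the default --umi-allowed-mismatches (1) of each other.
print(hamming_distance(umi, "ATGGTCAAC") <= 1)
```

With the documented defaults the UMI slice is 9 bases long, which matches the length of the default UMI filter template WSNNWSNNV.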