Manual ------ ST Pipeline is a tool to process the Spatial Transcriptomics raw data or single cell data. The data is filtered, aligned to a genome, annotated to a reference, demultiplexed by array coordinates and then aggregated by counts that are not duplicates using the Unique Molecular Indentifiers. The output contains the counts matrix, a stats file, a log file and a BED file with all the transcripts. The ST Pipeline requires two fastq files, an IDs files (BARCODE, X, Y), the path to a STAR genome index, the path to a annotation file in GTF format and a dataset name. The ST Pipeline has many parameters, you can see a description of them by typing : st_pipeline_run.py --help Note that the minimum read length is dependant on the type of kit used, and should be adjusted accordingly, i.e. a 150bp kit should have a different minimum read length than a 75bp kit. Soft clipping is also not recommended when using the 75bp kit, due to the shorter length. The UMI filter can be used for array batches 1000L6 and earlier. It is unable to be used for array batches 1000L7 and newer as the UMI in these arrays is fully randomised. @Author Jose Fernandez Navarro <jose.fernandez.navarro@scilifelab.se> .. code-block:: bash usage: st_pipeline_run.py [-h] --ids [FILE] --ref-map [FOLDER] --ref-annotation [FILE] --expName [STRING] [--allowed-missed [INT]] [--allowed-kmer [INT]] [--overhang [INT]] [--min-length-qual-trimming [INT]] [--mapping-rv-trimming [INT]] [--length-id [INT]] [--contaminant-index [FOLDER]] [--qual-64] [--htseq-mode [STRING]] [--htseq-no-ambiguous] [--start-id [INT]] [--no-clean-up] [--verbose] [--mapping-threads [INT]] [--min-quality-trimming [INT]] [--bin-path [FOLDER]] [--log-file [STR]] [--output-folder [FOLDER]] [--temp-folder [FOLDER]] [--umi-allowed-mismatches [INT]] [--umi-start-position [INT]] [--umi-end-position [INT]] [--keep-discarded-files] [--remove-polyA [INT]] [--remove-polyT [INT]] [--remove-polyG [INT]] [--remove-polyC [INT]] [--remove-polyN [INT]] [--filter-AT-content [INT%]] [--filter-GC-content [INT%]] [--disable-multimap] [--disable-clipping] [--umi-cluster-algorithm [STRING]] [--min-intron-size [INT]] [--max-intron-size [INT]] [--umi-filter] [--umi-filter-template [STRING]] [--compute-saturation] [--include-non-annotated] [--inverse-mapping-rv-trimming [INT]] [--low-memory] [--two-pass-mode] [--strandness [STRING]] [--umi-quality-bases [INT]] [--umi-counting-offset [INT]] [--demultiplexing-metric [STRING]] [--demultiplexing-multiple-hits-keep-one] [--demultiplexing-trim-sequences [INT]] [--homopolymer-mismatches [INT]]] [--version] fastq_file_fw fastq_file_rv **positional arguments** .. code-block:: bash fastq_file_fw Read_1 containing the spatial barcodes and UMIs for each sequence. fastq_file_rv Read_2 containing the gene sequence corresponding to the sequence in Read_1. **optional arguments** .. code-block:: bash -h, --help Show this help message and exit. --ids [FILE] Path to the file containing the map of barcodes to the array coordinates. --ref-map [FOLDER] Path to the folder with the STAR index for the genome that you want to use to align the reads. --ref-annotation [FILE] Path to the reference annotation file (GTF or GFF format is required) to be used to annotated the reads. --expName [STRING] Name of the experiment/dataset (The output files will prepend this name). --allowed-missed [INT] Number of allowed mismatches when demultiplexing against the barcodes with TaggD (default: 2). --allowed-kmer [INT] KMer length when demultiplexing against the barcodes with TaggD (default: 6). --overhang [INT] Extra flanking bases added when demultiplexing against the barcodes. --min-length-qual-trimming [INT] Minimum length of the reads after trimming, shorter reads will be discarded (default: 25). --mapping-rv-trimming [INT] Number of bases to trim in the reverse reads for the mapping step (5' end) (default: 0). --length-id [INT] Length of IDs (the length of the barcodes) (default: 18). --contaminant-index [FOLDER] Path to the folder with a STAR index with a contaminant genome. Reads will be filtered against the specified genome and mapping reads will be discarded. --qual-64 Use phred-64 quality instead of phred-33(default). --htseq-mode [STRING] Mode of Annotation when using HTSeq. Modes = {union , intersection-nonempty(default), intersection-strict}. --htseq-no-ambiguous When using htseq discard reads annotating ambiguous genes (default False). --start-id [INT] Start position of the IDs (Barcodes) in the R1 (counting from 0) (default: 0). --no-clean-up Do not remove temporary/intermediary files (useful for debugging). --verbose Show extra information on the log file. --mapping-threads [INT] Number of threads to use in the mapping step (default: 4). --min-quality-trimming [INT] Minimum phred quality a base must have in the trimming step (default: 20). --bin-path [FOLDER] Path to folder where binary executables are present (system path by default). --log-file [STR] Name of the file that we want to use to store the logs (default output to screen). --output-folder [FOLDER] Path of the output folder. --temp-folder [FOLDER] Path of the location for temporary files. --umi-allowed-mismatches [INT] Number of allowed mismatches (hamming distance) that UMIs of the same gene-spot must have in order to cluster together (default: 1). --umi-start-position [INT] Position in R1 (base wise) of the first base of the UMI (starting by 0) (default: 18). --umi-end-position [INT] Position in R1 (base wise) of the last base of the UMI (starting by 1) (default: 27). --keep-discarded-files Writes down unaligned, un-annotated and un-demultiplexed reads to files. --remove-polyA [INT] Remove PolyA stretches of the given length from R2 (default: 15). --remove-polyT [INT] Remove PolyT stretches of the given length from R2 (default: 15). --remove-polyG [INT] Remove PolyG stretches of the given length from R2 (default: 15). --remove-polyC [INT] Remove PolyC stretches of the given length from R2 (default: 15). --remove-polyN [INT] Remove PolyN stretches of the given length from R2 (default: 15). --filter-AT-content [INT%] Discards reads whose number of A and T bases in total are more or equal than the number given in percentage (default: 90). --filter-GC-content [INT%] Discards reads whose number of G and C bases in total are more or equal than the number given in percentage (default: 90). --disable-multimap If activated, multiple aligned reads obtained during mapping will be all discarded. Otherwise the highest scored one will be kept. --disable-clipping If activated, disable soft-clipping (local alignment) in the mapping step. --umi-cluster-algorithm [STRING] Type of clustering algorithm to use when performing UMIs duplicates removal. Modes = {naive(default), hierarchical, Adjacent and AdjacentBi}. --min-intron-size [INT] Minimum allowed intron size when searching for splice variants with STAR\ Splices alignments are disabled by default (=1) but to turn it on set this parameter to a bigger number, for example 10 or 20. (defauldt: 1) --max-intron-size [INT] Maximum allowed intron size when searching for splice variants with STAR Splices alignments are disabled by default (=1) but to turn it on set this parameter to a big number, for example 10000 or 100000. (default: 1). --umi-filter Enables the UMI quality filter based on the template given in --umi-filter-template. --umi-filter-template [STRING] UMI template (IUPAC nucleotide code) for the UMI filter, default = WSNNWSNNV --compute-saturation Performs a saturation curve computation by sub-sampling the annotated reads, computing unique molecules and then a saturation curve (included in the log file). --include-non-annotated Do not discard un-annotated reads (they will be labeled __no_feature) --inverse-mapping-rv-trimming [INT] Number of bases to trim in the reverse reads for the mapping step on the 3' end. --low-memory Writes temporary records into disk in order to save memory but gaining a speed penalty. --two-pass-mode Activates the 2 pass mode in STAR to also map against splice variants. --strandness [STRING] What strandness mode to use when annotating with htseq-count [no, yes(default), reverse]. --umi-quality-bases [INT] Maximum number of low quality bases allowed in an UMI (default: 8). --umi-counting-offset [INT] Expression count for each gene-spot combination is expressed as the number of unique UMIs in each strand/start position. However some reads might have slightly different start positions due to amplification artifacts. This parameters allows one to define an offset from where to count unique UMIs (default: 150). --demultiplexing-metric Distance metric for TaggD demultiplexing: Subglobal, Levenshtein or Hamming (default: Subglobal) --demultiplexing-multiple-hits-keep-one When multiple ambiguous hits with same score are found in the demultiplexing, keep one (random) --demultiplexing-trim-sequences Trims from the barcodes in the input file when doing demultiplexing. The bases given in the list of tuples as START END START END .. where START is the integer position of the first base (0 based) and END is the integer position of the last base (1 based). Trimmng sequences can be given several times. --homopolymer-mismatches Number of mismatches allowed when removing homopolymers. (default: 0) --version Show program's version number and exit