`ecotag`: assigns sequences to taxa

ecotag assigns each query sequence to a taxon based on sequence similarity. The program first searches the reference database for the reference sequence(s) showing the highest similarity with the query sequence (hereafter the ‘primary reference sequence(s)’). It then looks for all other reference sequences (hereafter the ‘secondary reference sequences’) whose similarity with the primary reference sequence(s) is equal to or higher than the similarity between the primary reference and the query sequences. Finally, it assigns the query sequence to the most recent common ancestor of the primary and secondary reference sequences.
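The three steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ecotag's actual implementation: the `identity` function is a toy position-by-position score for equal-length sequences (the source does not say how ecotag computes similarity), lineages are hypothetical root-first lists, and all sequence and taxon names are made up.

```python
def identity(a, b):
    """Toy similarity: fraction of identical positions (equal-length input)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def lineage_lca(lineages):
    """Deepest taxon shared by all lineages, each listed root first."""
    common = None
    for level in zip(*lineages):
        if len(set(level)) > 1:
            break
        common = level[0]
    return common


def assign(query, references, lineage_of, similarity=identity):
    # Step 1: primary references are the best matches to the query.
    scores = {name: similarity(query, seq) for name, seq in references.items()}
    best = max(scores.values())
    primary = [name for name, score in scores.items() if score == best]

    # Step 2: secondary references are at least as similar to a primary
    # reference as that primary reference is to the query.
    selected = set(primary)
    for p in primary:
        for name, seq in references.items():
            if similarity(references[p], seq) >= best:
                selected.add(name)

    # Step 3: assign to the most recent common ancestor of all selected.
    return lineage_lca([lineage_of[name] for name in selected]), best


# Hypothetical toy data, for illustration only.
references = {
    "ref_a": "ACGTACGTAC",
    "ref_b": "ACGTACGTAA",
    "ref_c": "TTTTACGTAA",
}
lineage_of = {
    "ref_a": ["root", "Fam1", "GenA", "sp_a"],
    "ref_b": ["root", "Fam1", "GenA", "sp_b"],
    "ref_c": ["root", "Fam2", "GenC", "sp_c"],
}

print(assign("ACGTACGTAC", references, lineage_of))
print(assign("ACGTACGTTT", references, lineage_of))
```

In this toy setting, a query identical to one reference is assigned at species level, while a query equally distant from two congeneric references is assigned to their shared genus.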

As input, ecotag requires three things: the sequences to be assigned; a reference database in fasta format, in which each sequence is associated with a taxon identified by a unique taxid; and a taxonomy database that stores the taxonomic information for each taxid.

Example:
> ecotag -d embl_r113  -R ReferenceDB.fasta \
  --sort=count -m 0.95 -r seq.fasta > seq_tag.fasta
The above command compares each sequence stored in seq.fasta with those in the reference database ReferenceDB.fasta for taxonomic assignment. In the output file seq_tag.fasta, the sequences are sorted from highest to lowest count (--sort=count combined with -r). When no reference sequence has a similarity equal to or higher than 0.95 with a given sequence, no taxonomic information is provided for this sequence in seq_tag.fasta.

`ecotag` specific options

-R <FILENAME>, --ref-database=<FILENAME>: <FILENAME> is the fasta file containing the reference sequences.

-m FLOAT, --minimum-identity=FLOAT: When the best match in the reference database has an identity level below FLOAT, no taxonomic assignment is computed for the sequence record. The sequence record is nevertheless included in the output file. FLOAT must lie in the interval [0,1].

--minimum-circle=FLOAT: Minimum identity considered for the assignment circle. FLOAT must lie in the interval [0,1].

-x RANK, --explain=RANK

-u, --uniq: When this option is specified, the program first dereplicates the sequence records to work on unique sequences only. This option greatly improves the program’s speed, especially for highly redundant datasets.
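To illustrate why dereplication helps, the sketch below collapses identical sequences so that each distinct sequence would only need to be compared with the reference database once. How ecotag dereplicates internally is not described here; the function name, the case-insensitive comparison, and the record layout are assumptions made for this example.

```python
from collections import Counter


def dereplicate(records):
    """Collapse identical sequences; keep the first identifier and a count.

    `records` is an iterable of (identifier, sequence) pairs. Sequences are
    compared case-insensitively, which is an assumption of this sketch.
    """
    first_id = {}
    counts = Counter()
    for seq_id, seq in records:
        key = seq.lower()
        first_id.setdefault(key, seq_id)  # remember the first identifier seen
        counts[key] += 1
    return [(first_id[seq], seq, counts[seq]) for seq in first_id]


# Hypothetical records, for illustration only.
reads = [("r1", "ACGT"), ("r2", "acgt"), ("r3", "GGCC"), ("r4", "ACGT")]
unique = dereplicate(reads)
print(len(reads), "records reduced to", len(unique), "unique sequences")
```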

--sort=<KEY>: The output is sorted on the values of the attribute <KEY>.

-r, --reverse: The output is sorted in reverse order. This option is intended to be used with the --sort option. It is also accepted when --sort is not set, but the key on which the output is then sorted is not documented.

-E FLOAT, --errors=FLOAT: FLOAT is the fraction of reference sequences that will be ignored when looking for the most recent common ancestor. This option is useful when a non-negligible proportion of reference sequences is expected to be assigned to the wrong taxon, for example because of taxonomic misidentification. FLOAT must lie in the interval [0,1].
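The text above does not say how the ignored fraction is applied, so the following is only one plausible reading, written as a sketch: walk down the taxonomy from the root and keep descending as long as the most frequent taxon at that level still covers all references except the tolerated number. The function name, the root-first lineage lists, and the taxon names are hypothetical.

```python
from collections import Counter


def tolerant_lca(lineages, errors=0.0):
    """Common ancestor that may ignore a fraction `errors` of the lineages."""
    max_ignored = int(errors * len(lineages))  # number of references we may drop
    needed = len(lineages) - max_ignored
    common = None
    kept = list(lineages)
    for depth in range(min(len(lineage) for lineage in lineages)):
        taxon, n = Counter(lineage[depth] for lineage in kept).most_common(1)[0]
        if n < needed:
            break  # too many references disagree at this level
        common = taxon
        kept = [lineage for lineage in kept if lineage[depth] == taxon]
    return common


# Hypothetical lineages: three agree, one is a suspected misidentification.
lineages = [
    ["root", "Fam1", "GenA", "sp_a"],
    ["root", "Fam1", "GenA", "sp_a"],
    ["root", "Fam1", "GenA", "sp_a"],
    ["root", "Fam2", "GenC", "sp_c"],
]
print(tolerant_lca(lineages, errors=0.0))
print(tolerant_lca(lineages, errors=0.25))
```

With no tolerance, the single discordant reference pulls the assignment up to the root; allowing a quarter of the references to be ignored keeps the assignment at species level.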

--cache-size=INTEGER: ecotag maintains a cache of computed similarities. The default size of this cache is 1,000,000 scores. This option changes the cache size.
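The idea of a bounded similarity cache can be sketched with Python's standard `functools.lru_cache`. The eviction policy ecotag uses is not stated in this page, so least-recently-used is an assumption here, as are the helper name `make_cached_similarity` and the toy `identity` score.

```python
from functools import lru_cache


def identity(a, b):
    """Toy similarity: fraction of identical positions (equal-length input)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def make_cached_similarity(similarity, cache_size=1_000_000):
    """Wrap `similarity` so that at most `cache_size` scores are remembered."""
    @lru_cache(maxsize=cache_size)
    def cached(a, b):
        return similarity(a, b)
    return cached


cached_identity = make_cached_similarity(identity, cache_size=1000)
first = cached_identity("ACGTACGT", "ACGTACGA")   # computed
second = cached_identity("ACGTACGT", "ACGTACGA")  # served from the cache
print(cached_identity.cache_info())
```

A larger cache trades memory for fewer repeated comparisons, which matters most when the same pairs of sequences are scored many times.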

Options to specify input format

Restrict the analysis to a sub-part of the input file

--skip <N>: The first N sequence records of the file are discarded from the analysis and are not reported in the output file.

--only <N>: Only the next N sequence records of the file are analyzed. The following sequences in the file are neither analyzed nor reported in the output file. This option can be combined with the --skip option.
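The combined effect of --skip and --only amounts to taking a window over the stream of records. A minimal sketch with `itertools.islice`, assuming records arrive as any Python iterable (the `restrict` helper is hypothetical and not part of OBITools):

```python
from itertools import islice


def restrict(records, skip=0, only=None):
    """Discard the first `skip` records, then keep at most `only` records."""
    stop = None if only is None else skip + only
    return islice(records, skip, stop)


# Hypothetical record identifiers, for illustration only.
records = ["seq%d" % i for i in range(1, 11)]
window = list(restrict(records, skip=2, only=3))
print(window)
```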

Sequence annotated format

--genbank: Input file is in genbank format.

--embl: Input file is in embl format.

Specifying the sequence type

--nuc: Input file contains nucleic sequences.

--prot: Input file contains protein sequences.

Options to specify output format

Standard output format

--fasta-output: Output sequences in OBITools fasta format.

--fastq-output: Output sequences in Sanger fastq format.

Generating an ecoPCR database

--ecopcrdb-output=<PREFIX_FILENAME>: Creates an ecoPCR database from sequence records results.

Miscellaneous option

--uppercase: Print sequences in upper case (default is lower case).

Common options

-h, --help: Shows this help message and exits.

--DEBUG: Sets logging in debug mode.

`ecotag` added sequence attributes

best_identity

best_match

family

family_name

genus

genus_name

id_status

order

order_name

rank

scientific_name

species

species_list

species_name

taxid
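To read these attributes back from an output file, the sketch below parses a single fasta header line, assuming the attributes are written as `key=value;` pairs after the sequence identifier. The `parse_header` helper is hypothetical, values are kept as plain strings, and the sample header is invented for illustration rather than taken from real ecotag output.

```python
def parse_header(line):
    """Split a fasta header into its identifier and a dict of attributes."""
    seq_id, _, rest = line.lstrip(">").partition(" ")
    attributes = {}
    for field in rest.split(";"):
        key, sep, value = field.strip().partition("=")
        if sep:  # fields without "=" (e.g. free-text definition) are skipped
            attributes[key] = value
    return seq_id, attributes


# Invented header line, for illustration only.
header = ">seq1 best_identity=0.98; taxid=9606; rank=species; "
seq_id, attributes = parse_header(header)
print(seq_id, attributes["taxid"], attributes["rank"])
```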
