obiaddtaxids: adds taxids to sequence records using an ecopcr database

The obiaddtaxids command annotates sequence records with a taxid based on a taxon scientific name stored in the sequence record header.

Taxonomic information linking a taxid to a taxon scientific name is stored in a database formatted as an ecoPCR database (see obitaxonomy) or an NCBI taxdump (see the NCBI FTP site).

The taxon scientific name is extracted from the sequence record header in one of three ways:

  • By default, the sequence identifier is used. Underscore characters (_) are replaced by spaces before the name is looked up in the taxonomic database.
  • If the input file is in the OBITools extended fasta format, the -k option specifies the attribute containing the taxon scientific name.
  • If the input file is a fasta file imported from the UNITE or SILVA web sites, the -f option identifies this source so that the associated taxonomic information is parsed correctly.
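
The first two extraction rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the obiaddtaxids source code: the function name extract_taxon_name and the dictionary representation of header attributes are assumptions made for the example, and the UNITE/SILVA parsing is left out.

```python
def extract_taxon_name(identifier, attributes, key=None):
    """Return the candidate taxon scientific name for one sequence record.

    identifier -- the sequence identifier from the header
    attributes -- header attributes as a dict (hypothetical representation)
    key        -- attribute name, playing the role of the -k option
    """
    if key is not None:
        # -k style: read the name from the named header attribute.
        # Returns None when the record does not carry that attribute.
        return attributes.get(key)
    # Default: use the identifier, with underscores turned into spaces.
    return identifier.replace("_", " ")
```

For instance, the identifier Abies_alba would be looked up as "Abies alba", whereas with a key of species_name the value of that attribute would be used instead.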

For each sequence record, obiaddtaxids tries to match the extracted taxon scientific name with those stored in the taxonomic database.

  • If a match is found, the sequence record is annotated with the corresponding taxid.

Otherwise,

  • If the -g option is set, the taxon name consists of two words, and only the first word is found in the taxonomic database at the ‘genus’ rank, obiaddtaxids treats that word as the genus of the sequence record and stores the record in the file specified by the -g option.
  • If the -u option is set and no taxonomic information is retrieved from the taxon scientific name, the sequence record is stored in the file specified by the -u option.
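
The decision sequence described above can be summarised as a small Python sketch. It is a simplified model written for this page, not the actual implementation: route_record, the name-to-taxid dictionary and the set of genus names are hypothetical, and the "skipped" outcome (neither -g nor -u applies) is an assumption, since the behaviour in that case is not described here.

```python
def route_record(name, taxid_by_name, genus_names,
                 genus_file=False, unidentified_file=False):
    """Decide where a record goes, given its extracted taxon name.

    taxid_by_name     -- dict mapping scientific names to taxids
    genus_names       -- set of names present at the 'genus' rank
    genus_file        -- True when the -g option is set
    unidentified_file -- True when the -u option is set
    Returns a (destination, taxid) pair.
    """
    # Exact match: the record is annotated with the matching taxid.
    if name in taxid_by_name:
        return ("identified", taxid_by_name[name])

    # Genus fallback: two-word name whose first word is a known genus.
    words = name.split()
    if genus_file and len(words) == 2 and words[0] in genus_names:
        return ("genus_found", None)

    # No taxonomic information retrieved.
    if unidentified_file:
        return ("unidentified", None)

    # Assumption: without -g/-u the record is simply not written anywhere.
    return ("skipped", None)


# Toy data: the taxid below is made up for illustration.
toy_taxids = {"Abies alba": 1001}
toy_genera = {"Abies"}
print(route_record("Abies alba", toy_taxids, toy_genera, True, True))
print(route_record("Abies incognita", toy_taxids, toy_genera, True, True))
print(route_record("Foo bar", toy_taxids, toy_genera, True, True))
```

Note that the genus test only runs after the exact match has failed, which mirrors the "Otherwise" in the description.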

Example

> obiaddtaxids -k species_name -g genus_identified.fasta \
               -u unidentified.fasta -d my_ecopcr_database \
               my_sequences.fasta > identified.fasta

Tries to match the value associated with the species_name key of each sequence record from the my_sequences.fasta file with a taxon name from the ecoPCR database my_ecopcr_database.

  • If there is an exact match, the sequence record is stored in the identified.fasta file.

  • If not, and the species_name value is composed of two words, obiaddtaxids considers the first word as a genus name and tries to find it in the taxonomic database.

    • If a genus is found, the sequence record is stored in the genus_identified.fasta file.
    • Otherwise the sequence record is stored in the unidentified.fasta file.

obiaddtaxids specific options

-f <FORMAT>, --format=<FORMAT>

Format of the sequence file. Possible formats are:

-k <KEY>, --key-name=<KEY>

Key of the attribute containing the taxon name in sequence files in the OBITools extended fasta format.

-a <ANCESTOR>, --restricting_ancestor=<ANCESTOR>

Restricts the search for taxids to taxa under a specified ancestor.

<ANCESTOR> can be a taxid (integer) or a key (string).

  • If it is a taxid, this taxid is used to restrict the search for all the sequence records.
  • If it is a key, obiaddtaxids looks for the ancestor taxid in the corresponding attribute. This allows having a different ancestor restriction for each sequence record.
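
The two readings of <ANCESTOR> can be illustrated with the following Python sketch. It is an assumption-laden model rather than the tool's code: resolve_ancestor, is_under and the parent_of dictionary (child taxid to parent taxid, with the root pointing to itself) are hypothetical, and counting the ancestor itself as "under" the ancestor is a choice made for the example.

```python
def resolve_ancestor(ancestor, attributes):
    """Return the restricting taxid for one sequence record.

    ancestor   -- the option value as a string: a taxid or an attribute key
    attributes -- header attributes of the record (hypothetical dict)
    """
    if ancestor.isdigit():
        # A taxid: the same restriction applies to every record.
        return int(ancestor)
    # A key: each record carries its own ancestor taxid in that attribute.
    value = attributes.get(ancestor)
    return int(value) if value is not None else None


def is_under(taxid, ancestor_taxid, parent_of):
    """Return True if taxid is ancestor_taxid or one of its descendants."""
    while taxid is not None:
        if taxid == ancestor_taxid:
            return True
        parent = parent_of.get(taxid)
        if parent == taxid:
            # Reached the root without meeting the ancestor.
            return False
        taxid = parent
    return False
```

A candidate taxid found by name would then be kept only when is_under(candidate, resolve_ancestor(...), parent_of) holds.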

-g <FILENAME>, --genus_found=<FILENAME>

File used to store sequences with a match found for the genus.

Caution

This option is not valid with the UNITE format.

-u <FILENAME>, --unidentified=<FILENAME>

File used to store sequences with no taxonomic match found.

Common options

-h, --help

Shows this help message and exits.

--DEBUG

Sets logging in debug mode.

obiaddtaxids added sequence attribute