obiaddtaxids: adds taxids to sequence records using an ecoPCR database
The obiaddtaxids command annotates sequence records with a taxid based on 
a taxon scientific name stored in the sequence record header.
Taxonomic information linking a taxid to a taxon scientific name is stored in a database formatted as an ecoPCR database (see obitaxonomy) or an NCBI taxdump (see NCBI ftp site).
How the taxon scientific name is extracted from the sequence record header depends on the input file and can be changed with two options (-k and -f):
- By default, the sequence identifier is used. Underscore characters (_) are replaced by spaces before the taxon scientific name is looked up in the taxonomic database.
- If the input file is in the OBITools extended fasta format, the -k option specifies the attribute containing the taxon scientific name.
- If the input file is a fasta file imported from the UNITE or SILVA web sites, the -f option specifies this source so that the associated taxonomic information is parsed correctly.
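The first two extraction modes can be illustrated with a minimal Python sketch. The function names and the dictionary used to hold record attributes are hypothetical and chosen for illustration only; this is not the OBITools implementation.

```python
def name_from_identifier(seq_id: str) -> str:
    """Default mode: derive a candidate taxon name from the sequence identifier."""
    # Identifiers cannot contain spaces, so underscores are turned back into spaces.
    return seq_id.replace("_", " ")


def name_from_attribute(attributes: dict, key: str):
    """Key mode (-k): read the candidate taxon name from a record attribute.

    `attributes` is a hypothetical dict of the key=value pairs found in an
    extended fasta header. Returns None when the attribute is absent.
    """
    value = attributes.get(key)
    return None if value is None else str(value).strip()
```

For instance, `name_from_identifier("Abies_alba")` yields the string `Abies alba`, which is the form that would then be looked up in the taxonomic database.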
For each sequence record, obiaddtaxids tries to match the extracted taxon scientific name 
with those stored in the taxonomic database.
- If a match is found, the sequence record is annotated with the corresponding taxid.
 
Otherwise:
- If the -g option is set, the taxon name is composed of two words, and only the first word is found in the taxonomic database at the ‘genus’ rank, obiaddtaxids considers that it has found the genus associated with this sequence record and stores the record in the file specified by the -g option.
- If the -u option is set and no taxonomic information can be retrieved from the taxon scientific name, the sequence record is stored in the file specified by the -u option.
Example
> obiaddtaxids -k species_name -g genus_identified.fasta \
      -u unidentified.fasta -d my_ecopcr_database \
      my_sequences.fasta > identified.fasta
Tries to match the value associated with the species_name key of each sequence record from the my_sequences.fasta file with a taxon name from the ecoPCR database my_ecopcr_database.
If there is an exact match, the sequence record is stored in the identified.fasta file.
If not and the species_name value is composed of two words, obiaddtaxids considers the first word as a genus name and tries to find it in the taxonomic database.
- If a genus is found, the sequence record is stored in the genus_identified.fasta file.
- Otherwise the sequence record is stored in the unidentified.fasta file.
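The routing described in this example can be summarised by the following Python sketch. It is only a model of the decision logic as stated in the text, not the actual obiaddtaxids code: the `route_record` function, the toy taxonomy dictionary, and the taxids in it are all made up for illustration.

```python
# Hypothetical toy taxonomy: scientific name -> (taxid, rank).
# The taxids are placeholders, not real NCBI identifiers.
TOY_TAXONOMY = {
    "Abies alba": (1001, "species"),
    "Abies": (1000, "genus"),
}


def route_record(name, taxonomy, genus_output=True, unidentified_output=True):
    """Return (destination, taxid) for one extracted taxon name.

    destination is "identified", "genus", "unidentified", or None when no
    output applies (what the tool does in that case is not stated in the
    text, so None is an assumption of this sketch).
    """
    hit = taxonomy.get(name)
    if hit is not None:
        # Exact match: the record is annotated with the matching taxid.
        return "identified", hit[0]

    words = name.split()
    if genus_output and len(words) == 2:
        genus_hit = taxonomy.get(words[0])
        if genus_hit is not None and genus_hit[1] == "genus":
            # The text does not say whether the genus taxid is written to
            # the record, so no taxid is returned here.
            return "genus", None

    if unidentified_output:
        return "unidentified", None
    return None, None
```

With the toy taxonomy, `Abies alba` is routed to the identified output, a two-word name whose first word is `Abies` but whose full form is unknown goes to the genus output, and any other name goes to the unidentified output.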
obiaddtaxids specific options
- -f <FORMAT>, --format=<FORMAT>
  Format of the sequence file. Possible formats are:
  - raw: for regular OBITools extended fasta files (default value).
  - UNITE: for fasta files downloaded from the UNITE web site.
  - SILVA: for fasta files downloaded from the SILVA web site.
- -k <KEY>, --key-name=<KEY>
  Key of the attribute containing the taxon name in sequence files in the OBITools extended fasta format.
- -a <ANCESTOR>, --restricting_ancestor=<ANCESTOR>
  Restricts the search for taxids to taxa under a specified ancestor. <ANCESTOR> can be a taxid (integer) or a key (string).
  - If it is a taxid, this taxid is used to restrict the search for all the sequence records.
  - If it is a key, obiaddtaxids looks for the ancestor taxid in the corresponding attribute. This allows a different ancestor restriction for each sequence record.
- -g <FILENAME>, --genus_found=<FILENAME>
  File used to store sequences with a match found for the genus.
  Caution: this option is not valid with the UNITE format.
- -u <FILENAME>, --unidentified=<FILENAME>
  File used to store sequences with no taxonomic match found.
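The two interpretations of <ANCESTOR> for the -a option (a fixed taxid versus a per-record key) can be sketched as follows. This is an illustrative Python model under stated assumptions: `resolve_ancestor` is a hypothetical helper, record attributes are represented as a plain dict, and the behaviour when the key is missing (returning None) is an assumption, since the text does not describe that case.

```python
def resolve_ancestor(ancestor_option: str, attributes: dict):
    """Return the ancestor taxid to apply to one sequence record.

    ancestor_option is the raw value given to -a: either an integer taxid
    (same restriction for every record) or an attribute key (restriction
    read from each record).
    """
    if ancestor_option.isdigit():
        # Integer value: a single taxid shared by all sequence records.
        return int(ancestor_option)

    # Otherwise treat the value as a key and read the taxid from the record.
    value = attributes.get(ancestor_option)
    try:
        return int(value)
    except (TypeError, ValueError):
        # Missing or non-numeric attribute: no restriction can be derived
        # (assumption of this sketch).
        return None
```

Here `resolve_ancestor("12345", {})` gives the same taxid for every record, whereas `resolve_ancestor("family_taxid", record_attributes)` reads a possibly different taxid from each record; `family_taxid` and the numeric values are placeholder names.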