obiaddtaxids: adds taxids to sequence records using an ecoPCR database
The obiaddtaxids command annotates sequence records with a taxid based on a taxon scientific name stored in the sequence record header.
Taxonomic information linking a taxid to a taxon scientific name is stored in a database formatted as an ecoPCR database (see obitaxonomy) or an NCBI taxdump (see NCBI ftp site).
The way the taxon scientific name is extracted from the sequence record header depends on the options used:
- By default, the sequence identifier is used. Underscore characters (_) are replaced by spaces before the taxon scientific name is looked up in the taxonomic database.
- If the input file is in the OBITools extended fasta format, the -k option specifies the attribute containing the taxon scientific name.
- If the input file is a fasta file imported from the UNITE or SILVA web sites, the -f option specifies this source so that the associated taxonomic information is parsed correctly.
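The first two extraction modes above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the code used by obiaddtaxids: the function name_from_record and its arguments are hypothetical, and a sequence record is reduced to an identifier plus a dictionary of attributes.

```python
def name_from_record(identifier, attributes, key=None):
    """Return the candidate taxon name for one record (hypothetical helper)."""
    if key is not None:
        # -k case: read the name from the named attribute.
        # Returns None when the attribute is absent (an assumption of this sketch).
        return attributes.get(key)
    # Default case: use the identifier, with underscores replaced by spaces.
    return identifier.replace("_", " ")


if __name__ == "__main__":
    print(name_from_record("Homo_sapiens", {}))
    print(name_from_record("seq1", {"species_name": "Abies alba"}, key="species_name"))
```

The UNITE and SILVA cases are left out because their header layouts are not described here.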
For each sequence record, obiaddtaxids tries to match the extracted taxon scientific name with those stored in the taxonomic database.
- If a match is found, the sequence record is annotated with the corresponding taxid.
Otherwise,
- If the -g option is set, the taxon name is composed of two words, and only the first one is found in the taxonomic database at the ‘genus’ rank, obiaddtaxids considers that it has found the genus associated with this sequence record and stores the record in the file specified by the -g option.
- If the -u option is set and no taxonomic information is retrieved from the scientific taxon name, the sequence record is stored in the file specified by the -u
option.

Example

> obiaddtaxids -k species_name -g genus_identified.fasta \
      -u unidentified.fasta -d my_ecopcr_database \
      my_sequences.fasta > identified.fasta

Tries to match the value associated with the species_name key of each sequence record from the my_sequences.fasta file with a taxon name from the ecoPCR database my_ecopcr_database.
If there is an exact match, the sequence record is stored in the identified.fasta file.
If not, and the species_name value is composed of two words, obiaddtaxids considers the first word as a genus name and looks for it in the taxonomic database.
- If a genus is found, the sequence record is stored in the genus_identified.fasta file.
- Otherwise, the sequence record is stored in the unidentified.fasta file.
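The routing described in this example can be summarised as a small decision function. The sketch below is a simplified model, not the obiaddtaxids implementation: the taxonomic database is replaced by two toy in-memory structures (species_taxids and genus_names, both hypothetical), and the function only reports which of the three output files a record would go to.

```python
# Toy stand-ins for the taxonomic database (assumptions of this sketch).
species_taxids = {"Homo sapiens": 9606}
genus_names = {"Homo"}


def route_record(name):
    """Return (destination, taxid) for a taxon name, following the rules above."""
    if name in species_taxids:
        # Exact match: the record is annotated with the taxid.
        return "identified", species_taxids[name]
    words = name.split()
    if len(words) == 2 and words[0] in genus_names:
        # Only the first word matches a genus: goes to the -g file.
        return "genus_identified", None
    # No taxonomic information retrieved: goes to the -u file.
    return "unidentified", None


if __name__ == "__main__":
    for candidate in ("Homo sapiens", "Homo habilis", "Unknown organism"):
        print(candidate, route_record(candidate))
```

In the real command, the last two destinations exist only when the -g and -u options are given; the sketch assumes both are set.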
obiaddtaxids specific options
-f <FORMAT>, --format=<FORMAT>
    Format of the sequence file. Possible formats are:
    - raw: for regular OBITools extended fasta files (default value).
    - UNITE: for fasta files downloaded from the UNITE web site.
    - SILVA: for fasta files downloaded from the SILVA web site.
-k <KEY>, --key-name=<KEY>
    Key of the attribute containing the taxon name in sequence files in the OBITools extended fasta format.
-a <ANCESTOR>, --restricting_ancestor=<ANCESTOR>
    Restricts the search for taxids to taxa below a specified ancestor. <ANCESTOR> can be a taxid (integer) or a key (string).
    - If it is a taxid, this taxid is used to restrict the search for all the sequence records.
    - If it is a key, obiaddtaxids looks for the ancestor taxid in the corresponding attribute. This allows having a different ancestor restriction for each sequence record.
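The two interpretations of <ANCESTOR> can be illustrated as follows. This is a sketch under stated assumptions: the helper resolve_ancestor is hypothetical, and telling a taxid from a key by attempting an integer conversion is a guess at a plausible rule, not a description of how obiaddtaxids actually decides.

```python
def resolve_ancestor(ancestor, attributes):
    """Return the ancestor taxid that applies to one record (hypothetical helper)."""
    try:
        # Integer value: the same taxid restricts the search for every record.
        return int(ancestor)
    except ValueError:
        # Otherwise treat the value as a key and read the taxid from the record.
        value = attributes.get(ancestor)
        return int(value) if value is not None else None


if __name__ == "__main__":
    print(resolve_ancestor("1234", {}))
    print(resolve_ancestor("order_taxid", {"order_taxid": "5678"}))
```

The taxids and the order_taxid attribute name are placeholders chosen for the illustration.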
-g <FILENAME>, --genus_found=<FILENAME>
    File used to store sequences with a match found for the genus.
    Caution: this option is not valid with the UNITE format.
-u <FILENAME>, --unidentified=<FILENAME>
    File used to store sequences with no taxonomic match found.