.. moduleauthor:: Jaime Huerta-Cepas

.. versionadded:: 2.3

.. currentmodule:: ete2

Dealing with the NCBI Taxonomy database
=================================================

ETE's `ncbi_taxonomy` module provides utilities to efficiently query a local
copy of the NCBI Taxonomy database. The class :class:`NCBITaxa` offers
methods to convert taxids to names (and vice versa), to fetch pruned
topologies connecting a given set of species, and to retrieve rank, name and
lineage information.
 
It is also fully integrated with :class:`PhyloTree` instances through the
:func:`PhyloNode.annotate_ncbi_taxa` method.

Setting up a local copy of the NCBI taxonomy database
-------------------------------------------------------

The first time you use :class:`NCBITaxa`, ETE will detect that your local
database is empty, download the latest NCBI taxonomy database (~300MB) and
store a parsed version of it in your home directory:
`~/.etetoolkit/taxa.sqlite`. Later uses of :class:`NCBITaxa` will detect the
local database and skip this step.

::

   from ete2 import NCBITaxa
   ncbi = NCBITaxa()
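
Because the initial download takes a while, it can be handy to check whether
the parsed database is already in place before creating the first
:class:`NCBITaxa` instance. The following is only a sketch using the standard
library. It assumes the default location mentioned above, and the helper name
``local_taxa_db`` is hypothetical, not part of ETE.

::

   import os

   # Default location described above; adjust it if your setup differs.
   DEFAULT_DB = os.path.join("~", ".etetoolkit", "taxa.sqlite")

   def local_taxa_db(path=DEFAULT_DB):
       """Return (expanded_path, exists) for the local taxonomy database."""
       full_path = os.path.expanduser(path)
       return full_path, os.path.isfile(full_path)

   db_path, db_exists = local_taxa_db()
   print(db_path)
   print("found" if db_exists else "missing: it will be downloaded on first use")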

Upgrading the local database
------------------------------

Use the method :func:`NCBITaxa.update_taxonomy_database` to download and parse
the latest database from the NCBI FTP site. Your current local database will be
overwritten.

::

   from ete2 import NCBITaxa
   ncbi = NCBITaxa()
   ncbi.update_taxonomy_database()


Getting taxid information 
-----------------------------

You can fetch species names, ranks and lineage information for your taxids
using the following methods:

 - :func:`NCBITaxa.get_rank`
 - :func:`NCBITaxa.get_lineage`
 - :func:`NCBITaxa.get_taxid_translator`
 - :func:`NCBITaxa.get_name_translator`
 - :func:`NCBITaxa.translate_to_names`

The translator methods (:func:`NCBITaxa.get_taxid_translator` and
:func:`NCBITaxa.get_name_translator`) return a dictionary converting between
taxids and names. Both species and higher-level lineage names/taxids are
accepted as input.

::
 
   from ete2 import NCBITaxa
   ncbi = NCBITaxa()
   taxid2name = ncbi.get_taxid_translator([9606, 9443])
   print taxid2name
   # {9443: u'Primates', 9606: u'Homo sapiens'}
   
   name2taxid = ncbi.get_name_translator(["Homo sapiens", "primates"])
   print name2taxid
   # {'Homo sapiens': 9606, 'primates': 9443}
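
When translating many names, it is often useful to know which queries were
not resolved. The sketch below works on a plain dictionary shaped like the one
printed above. It assumes that names which cannot be resolved are simply absent
from the returned dictionary, and the helper ``split_resolved`` is hypothetical,
not part of ETE.

::

   def split_resolved(queries, translator):
       """Split queries into those present in the translator and those missing."""
       resolved = [q for q in queries if q in translator]
       missing = [q for q in queries if q not in translator]
       return resolved, missing

   # Dictionary copied from the example output above.
   name2taxid = {'Homo sapiens': 9606, 'primates': 9443}
   queries = ["Homo sapiens", "primates", "Homo sapines"]  # last one is misspelled

   resolved, missing = split_resolved(queries, name2taxid)
   print(resolved)
   # ['Homo sapiens', 'primates']
   print(missing)
   # ['Homo sapines']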

Other methods extract further information using taxids as a query.

::

   from ete2 import NCBITaxa
   ncbi = NCBITaxa()

   print ncbi.get_rank([9606, 9443])
   # {9443: u'order', 9606: u'species'}

   print ncbi.get_lineage(9606) 

   # [1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742,
   # 7776, 117570, 117571, 8287, 1338369, 32523, 32524, 40674, 32525, 9347,
   # 1437010, 314146, 9443, 376913, 314293, 9526, 314295, 9604, 207598, 9605,
   # 9606]


And you can combine them all at once:

::

   from ete2 import NCBITaxa
   ncbi = NCBITaxa()

   lineage = ncbi.get_lineage(9606) 
   print lineage

   # [1, 131567, 2759, 33154, 33208, 6072, 33213, 33511, 7711, 89593, 7742,
   # 7776, 117570, 117571, 8287, 1338369, 32523, 32524, 40674, 32525, 9347,
   # 1437010, 314146, 9443, 376913, 314293, 9526, 314295, 9604, 207598, 9605,
   # 9606]

   names = ncbi.get_taxid_translator(lineage)
   print [names[taxid] for taxid in lineage]
 
   # [u'root', u'cellular organisms', u'Eukaryota', u'Opisthokonta', u'Metazoa',
   # u'Eumetazoa', u'Bilateria', u'Deuterostomia', u'Chordata', u'Craniata',
   # u'Vertebrata', u'Gnathostomata', u'Teleostomi', u'Euteleostomi',
   # u'Sarcopterygii', u'Dipnotetrapodomorpha', u'Tetrapoda', u'Amniota',
   # u'Mammalia', u'Theria', u'Eutheria', u'Boreoeutheria', u'Euarchontoglires',
   # u'Primates', u'Haplorrhini', u'Simiiformes', u'Catarrhini', u'Hominoidea',
   # u'Hominidae', u'Homininae', u'Homo', u'Homo sapiens']
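
Ranks can be attached to the same listing. The sketch below uses plain
dictionaries with a few entries taken from the outputs above. In practice they
would come from ``ncbi.get_taxid_translator(lineage)`` and
``ncbi.get_rank(lineage)``. The helper ``describe_lineage`` is hypothetical,
not part of ETE, and the three-taxid list is an excerpt, not a complete lineage.

::

   def describe_lineage(lineage, taxid2name, taxid2rank):
       """Return (taxid, name, rank) tuples, keeping the order of the lineage."""
       return [(taxid,
                taxid2name.get(taxid, "unknown"),
                taxid2rank.get(taxid, "unknown"))
               for taxid in lineage]

   # Excerpt of the human lineage, hard-coded for illustration.
   lineage = [9443, 9605, 9606]
   taxid2name = {9443: "Primates", 9605: "Homo", 9606: "Homo sapiens"}
   taxid2rank = {9443: "order", 9605: "genus", 9606: "species"}

   for taxid, name, rank in describe_lineage(lineage, taxid2name, taxid2rank):
       print("%s: %s (%s)" % (taxid, name, rank))

   # 9443: Primates (order)
   # 9605: Homo (genus)
   # 9606: Homo sapiens (species)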


Getting descendant taxa
-----------------------------

Given the taxid or name of an internal node in the NCBI taxonomy tree, its
descendants can be retrieved as follows:

::

   from ete2 import NCBITaxa
   ncbi = NCBITaxa()

   descendants = ncbi.get_descendant_taxa('Homo')
   print ncbi.translate_to_names(descendants)

   # [u'Homo heidelbergensis', u'Homo sapiens ssp. Denisova', u'Homo sapiens neanderthalensis']

   # you can easily ignore subspecies, so only taxa labeled as "species" will be reported:
   descendants = ncbi.get_descendant_taxa('Homo', collapse_subspecies=True)
   print ncbi.translate_to_names(descendants)

   # [u'Homo sapiens', u'Homo heidelbergensis']

   # or even returned as an annotated tree
   tree = ncbi.get_descendant_taxa('Homo', collapse_subspecies=True, return_tree=True)
   print tree.get_ascii(attributes=['sci_name', 'taxid'])

   #           /-Homo sapiens, 9606
   # -Homo, 9605
   #           \-Homo heidelbergensis, 1425170


Getting NCBI species tree topology
---------------------------------------

Getting the NCBI taxonomy tree for a given set of species is one of the most
useful ways to get all information at once. The method
:func:`NCBITaxa.get_topology` queries your local NCBI database and extracts
the smallest tree that connects all your query taxids. It returns a normal ETE
tree in which all nodes, internal or leaves, are annotated with lineage,
scientific name, rank, and so on.

::


   from ete2 import NCBITaxa
   ncbi = NCBITaxa()
   
   tree = ncbi.get_topology([9606, 9598, 10090, 7707, 8782])
   print tree.get_ascii(attributes=["sci_name", "rank"])

   #                     /-Dendrochirotida, order
   #                    |
   #                    |                                                                /-Pan troglodytes, species
   # -Deuterostomia, no rank                                           /Homininae, subfamily
   #                    |                /Euarchontoglires, superorder                   \-Homo sapiens, species
   #                    |               |                           |
   #                     \Amniota, no rank                           \-Mus musculus, species
   #                                    |
   #                                     \-Aves, class
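
The root of such a tree is the deepest taxon shared by all query lineages. The
sketch below illustrates that idea on plain lists. It is not how
:func:`NCBITaxa.get_topology` is implemented, the helper ``last_common_taxid``
is hypothetical, and the two abbreviated lineages are written by hand for
illustration (in practice they would come from :func:`NCBITaxa.get_lineage`).

::

   def last_common_taxid(lineages):
       """Return the deepest taxid shared by all root-to-tip lineages, or None."""
       common = None
       # Walk the lineages level by level until they diverge.
       for level in zip(*lineages):
           if len(set(level)) != 1:
               break
           common = level[0]
       return common

   # Abbreviated, hand-written lineages starting at Hominidae (9604).
   human = [9604, 207598, 9605, 9606]
   chimp = [9604, 207598, 9596, 9598]

   print(last_common_taxid([human, chimp]))
   # 207598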


If needed, all intermediate nodes connecting the species can also be kept in the tree: 

::


   from ete2 import NCBITaxa
   ncbi = NCBITaxa()
   
   tree = ncbi.get_topology([2, 33208], intermediate_nodes=True)
   print tree.get_ascii(attributes=["sci_name"])

   #                  /Eukaryota - Opisthokonta - Metazoa
   # -cellular organisms
   #                  \-Bacteria


Automatic tree annotation using NCBI taxonomy
--------------------------------------------------

NCBI taxonomy annotation consists of adding additional information to any
internal or leaf node in a given user tree. Only an attribute containing the
taxid associated with each node is required for the nodes in the query tree.
The annotation process will add the following features to the nodes:

 - sci_name  
 - taxid
 - named_lineage 
 - lineage 
 - rank

Note that, for internal nodes, the taxid can be inferred automatically from
their descendants. The easiest way to annotate a tree is to use a
:class:`PhyloTree` instance, where the species name attribute is transparently
used as the taxid attribute. Note that the :func:`PhyloNode.annotate_ncbi_taxa`
method also returns the name, lineage and rank translators that were used.

Remember that species names in `PhyloTree` instances are automatically extracted
from leaf names. The parsing method can be easily adapted to any formatting:

::


   from ete2 import PhyloTree

   # load the whole leaf name as species taxid
   tree = PhyloTree('((9606, 9598), 10090);', sp_naming_function=lambda name: name)
   tax2names, tax2lineages, tax2rank = tree.annotate_ncbi_taxa()

   # split names by '|' and return the first part as the species taxid 
   tree = PhyloTree('((9606|protA, 9598|protA), 10090|protB);', sp_naming_function=lambda name: name.split('|')[0])
   tax2names, tax2lineages, tax2rank = tree.annotate_ncbi_taxa()

   print tree.get_ascii(attributes=["name", "sci_name", "taxid"])


   #                                             /-9606|protA, Homo sapiens, 9606
   #                          /, Homininae, 207598
   #-, Euarchontoglires, 314146                  \-9598|protA, Pan troglodytes, 9598
   #                         |
   #                          \-10090|protB, Mus musculus, 10090
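
For anything more involved than a one-line lambda, the parsing rule can be
written as a named function and checked on its own before it is passed as
``sp_naming_function``. A minimal sketch follows. The name
``taxid_from_leaf_name`` and its ``separator`` parameter are hypothetical, not
part of ETE.

::

   def taxid_from_leaf_name(name, separator="|"):
       """Return the text before the first separator, without surrounding spaces."""
       return name.split(separator)[0].strip()

   print(taxid_from_leaf_name("9606|protA"))
   # 9606
   print(taxid_from_leaf_name("10090"))
   # 10090

Note that the function returns the taxid as a string, exactly as the lambda in
the example above does.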


Alternatively, you can also use the :func:`NCBITaxa.annotate_tree` function to
annotate a custom tree instance.

::

   from ete2 import Tree, NCBITaxa
   ncbi = NCBITaxa()

   # leaf names are NCBI taxids, so the "name" attribute is used as the taxid
   tree = Tree("((9606,9598),10090);")
   ncbi.annotate_tree(tree, taxid_attr="name")