.. role:: latex(raw) :format: latex Welcome to the ``OBITools`` =========================== :latex:`~` :latex:`\vspace{\fill}` The ``OBITools`` package is a set of programs specifically designed for analyzing NGS data in a DNA metabarcoding context, taking into account taxonomic information. It is distributed as an open source software available on the following website: http://metabarcoding.org/obitools. Citation: Boyer F., Mercier C., Bonin A., Taberlet P., Coissac E. (2014) OBITools: a Unix-inspired software package for DNA metabarcoding. *Molecular Ecology Resources*, submitted. :latex:`\vspace{\fill}` Installing the ``OBITools`` =========================== Availability of the ``OBITools`` ................................ The ``OBITools`` are open source and protected by the CeCILL 2.1 license (`http://www.cecill.info/licences/Licence_CeCILL_V2.1-en.html `_). The ``OBITools`` are deposited on the Python Package Index (PyPI : `https://pypi.python.org/pypi/obitools `_) and all the sources can be downloaded from our subversion server (`http://www.grenoble.prabi.fr/public-svn/OBISofts/OBITools `_). Prerequisites ............. To install the ``OBITools``, you need that these softwares are installed on your system: * Python 2.7 (installed by default on most ``Unix`` systems, available from `the Python website `_) * ``gcc`` (installed by default on most ``Unix`` systems, available from the GNU sites dedicated to `GCC `_ and `GMake `_) On a linux system ^^^^^^^^^^^^^^^^^ You have to take care that the Python-dev packages are installed. On MacOSX ^^^^^^^^^ The C compiler and all the other compilation tools are included in the `XCode `_ application not installed by default. The Python included in the system is not suitable for running the ``OBITools``. You have to install a complete distribution of Python that you can download as a `MacOSX package from the Python website `_. Downloading and installing the ``OBITools`` ........................................... The ``OBITools`` are downloaded and installed using the :download:`get-obitools.py <../../../get_obitools/get-obitools.py>` script. This is a user level installation that does not need administrator privilege. Once downloaded, move the file :download:`get-obitools.py <../../../get_obitools/get-obitools.py>` in the directory where you want to install the ``OBITools``. From a Unix terminal you must now run the command : .. code-block:: bash > python get-obitools.py The script will create a new directory at the place you are running it in which all the ``OBITools`` will be installed. No system privilege are required, and you system will not be altered in any way by the obitools installation. The newly created directory is named OBITools-VERSION where version is substituted by the latest version number available. Inside the newly created directory all the ``OBITools`` are installed. Close to this directory there is a shell script named ``obitools``. Running this script activate the ``OBITools`` by reconfiguring your Unix environment. .. code-block:: bash > ./obitools Once activated you can desactivate the ``OBITools`` byt typing the command ``exit``. .. code-block:: bash > exit OBITools are no more activated, Bye... ====================================== System level installation ......................... To install the ``OBITools`` at the system level you can follow two options : - copy the ``obitools`` script in a usual directory for installing program like ``/usr/local/bin`` but never move the ``OBITools`` directory itself after the installation by the :download:`get-obitools.py <../../../get_obitools/get-obitools.py>`. - The other solution is to add the ``export/bin`` directory located in the ``OBITools`` directory to the ``PATH``environment variable. Retrieving the sources of the OBITools ...................................... If you want to compile by yourself the ``OBITools``, you will need to install the same prerequisite: .. code-block:: bash > pip install -U virtualenv > pip install -U sphinx > pip install -U cython moreover you need to install any subversion client (a list of clients is available from `Wikipedia `_) Then you can download the .. code-block:: bash > svn co http://www.grenoble.prabi.fr/public-svn/OBISofts/OBITools/branches/OBITools-1.00/ OBITools This command will create a new directory called ``OBITools``. Compiling and installing the ``OBITools`` ......................................... From the directory where you retrieved the sources, execute the following commands: .. code-block:: bash > cd OBITools > python setup.py --serenity install Once installed, you can test your installation by running the commands of the :doc:`tutorials <./tutorials>`. Introduction ============ DNA metabarcoding is an emerging approach for biodiversity studies (Taberlet et al. 2012). Originally mainly developed by microbiologists (e.g. Sogin et al. 2006), it is now widely used for plants (e.g. Sonstebo et al. 2010, Parducci et al. 2012, Yoccoz et al. 2012) and animals from meiofauna (e.g. Chariton et al. 2010, Baldwin et al. 2013) to larger organisms (e.g. Andersen et al. 2012, Thomsen et al. 2012). Interestingly, this method is not limited to *sensu stricto* biodiversity surveys, but it can also be implemented in other ecological contexts such as for herbivore (e.g. Valentini et al. 2009, Kowalczyk et al. 2011) or carnivore (e.g. Deagle et al. 2009, Shehzad et al. 2012) diet analyses. Whatever the biological question under consideration, the DNA metabarcoding methodology relies heavily on next-generation sequencing (NGS), and generates considerable numbers of DNA sequence reads (typically million of reads). Manipulation of such large datasets requires dedicated programs usually running on a Unix system. Unix is an operating system, whose first version was created during the sixties. Since its early stages, it is dedicated to scientific computing and includes a large set of simple tools to efficiently process text files. Most of those programs can be viewed as filters extracting information from a text file to create a new text file. These programs process text files as streams, line per line, therefore allowing computation on a huge dataset without requiring a large memory. Unix programs usually print their results to their standard output (*stdout*), which by default is the terminal, so the results can be examined on screen. The main philosophy of the Unix environment is to allow easy redirection of the *stdout* either to a file, for saving the results, or to the standard input (*stdin*) of a second program thus allowing to easily create complex processing from simple base commands. Access to Unix computers is increasingly easier for scientists nowadays. Indeed, the Linux operating system, an open source version of Unix, can be freely installed on every PC machine and the MacOSX operating system, running on Apple computers, is also a Unix system. The ``OBITools`` programs imitate Unix standard programs because they usually act as filters, reading their data from text files or the *stdin* and writing their results to the *stdout*. The main difference with classical Unix programs is that text files are not analyzed line per line but sequence record per sequence record (see below for a detailed description of a sequence record). Compared to packages for similar purposes like mothur (Schloss et al. 2009) or QIIME (Caporaso et al. 2010), the ``OBITools`` mainly rely on filtering and sorting algorithms. This allows users to set up versatile data analysis pipelines (Figure 1), adjustable to the broad range of DNA metabarcoding applications. The innovation of the ``OBITools`` is their ability to take into account the taxonomic annotations, ultimately allowing sorting and filtering of sequence records based on the taxonomy. |Pipeline example for a standard biodiversity survey| References .......... Andersen K, Bird KL, Rasmussen M, Haile J, Breuning-Madsen H, Kj�r KH, Orlando L, Gilbert MTP, Willerslev E (2012) Meta-barcoding of "dirt" DNA from soil reflects vertebrate biodiversity. Molecular Ecology, 21, 1966-1979. Baldwin DS, Colloff MJ, Rees GN, Chariton AA, Watson GO, Court LN, Hartley DM, Morgan Mj, King AJ, Wilson JS, Hodda M, Hardy CM (2013) Impacts of inundation and drought on eukaryote biodiversity in semi-arid floodplain soils. Molecular Ecology, 22, 1746-1758. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Pena AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Tumbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R (2010) QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7, 335-336. Chariton AA, Court LN, Hartley DM, Colloff MJ, Hardy CM (2010) Ecological assessment of estuarine sediments by pyrosequencing eukaryotic ribosomal DNA. Frontiers in Ecology and the Environment, 8, 233-238. Deagle BE, Kirkwood R, Jarman SN (2009) Analysis of Australian fur seal diet by pyrosequencing prey DNA in faeces. Molecular Ecology, 18, 2022-2038. Kowalczyk R, Taberlet P, Coissac E, Valentini A, Miquel C, Kaminski T, W�jcik JM (2011) Influence of management practices on large herbivore diet - case of European bison in Bialowieza Primeval Forest (Poland). Forest Ecology and Management, 261, 821-828. Parducci L, Jorgensen T, Tollefsrud MM, Elverland E, Alm T, Fontana SL, Bennett KD, Haile J, Matetovici I, Suyama Y, Edwards ME, Andersen K, Rasmussen M, Boessenkool S, Coissac E, Brochmann C, Taberlet P, Houmark-Nielsen M, Larsen NK, Orlando L, Gilbert MTP, Kjaer KH, Alsos IG, Willerslev E (2012) Glacial Survival of Boreal Trees in Northern Scandinavia. Science, 335, 1083-1086. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology, 75, 7537-7541. Shehzad W, Riaz T, Nawaz MA, Miquel C, Poillot C, Shah SA, Pompanon F, Coissac E, Taberlet P (2012) Carnivore diet analysis based on next generation sequencing: application to the leopard cat (*Prionailurus bengalensis*) in Pakistan. Molecular Ecology, 21, 1951-1965. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ (2006) Microbial diversity in the deep sea and the underexplored "rare biosphere". Proceedings of the National Academy of Sciences of the United States of America, 103, 12115-12120. S�nsteb� JH, Gielly L, Brysting A, Reidar E, Edwards M, Haile J, Willerslev E, Coissac E, Rioux D, Sannier J, Taberlet P, Brochmann C (2010) Using next-generation sequencing for molecular reconstruction of past Arctic vegetation and climate. Molecular Ecology Resources, 10, 1009-1018. Taberlet P, Coissac E, Hajibabaei M, Rieseberg LH (2012) Environmental DNA. Molecular Ecology, 21, 1789-1793. Thomsen PF, Kielgast J, Iversen LL, Wiuf C, Rasmussen M, Gilbert MTP, Orlando L, Willerslev E (2012) Monitoring endangered freshwater biodiversity using environmental DNA. Molecular Ecology, 21, 2565-2573. Valentini A, Miquel C, Nawaz MA, Bellemain E, Coissac E, Pompanon F, Gielly L, Cruaud C, Nascetti G, Wincker P, Swenson JE, Taberlet P (2009) New perspectives in diet analysis based on DNA barcoding and parallel pyrosequencing: the trnL approach. Molecular Ecology Resources, 9, 51-60. Yoccoz NG, Br�then KA, Gielly L, Haile J, Edwards ME, Goslar T, von Stedingk H, Brysting AK, Coissac E, Pompanon F, S�nsteb� JH, Miquel C, Valentini A, de Bello F, Chave J, Thuiller W, Wincker P, Cruaud C, Gavory F, Rasmussen M, Gilbert MTP, Orlando L, Brochmann C, Willerslev E, Taberlet P (2012) DNA from soil mirrors plant taxonomic and growth form diversity. Molecular Ecology, 21, 3647-3655. Basic concepts of the ``OBITools`` ================================== Once installed, the ``OBITools`` enrich the Unix command line interface with a set of new commands dedicated to NGS data processing. Most of them have a name starting with the `obi` prefix. They automatically recognize the input file format amongst most of the standard sequence file formats (i.e. :doc:`fasta `, :doc:`fastq `, :doc:`EMBL `, and :doc:`GenBank ` formats). Nevertheless, options are available to enforce some format specificity such as the encoding system used in :doc:`fastq ` files for quality codes. Most of the basic Unix commands have their ``OBITools`` equivalent (e.g. `obihead` *vs* `head`, `obitail` *vs* `tail`, `obigrep` *vs* `grep`), which is convenient for scientists familiar with Unix. The main difference between any standard Unix command and its ``OBITools`` counterpart is that the treatment unit is no longer the text line but the sequence record. As a sequence record is more complex than a single text line, the ``OBITools`` programs have many supplementary options compared to their Unix equivalents. The structure of a sequence record .................................. The ``OBITools`` commands consider a sequence record as an entity composed of five distinct elements. Two of them are mandatory, the identifier (*id*) and the DNA or protein sequence itself. The *id* is a single word composed of characters, digits, and other symbols like dots or underscores excluding spaces. Formally, the *ids* should be unique within a dataset and should identify each sequence record unambiguously, but only a few ``OBITools`` actually rely on this property. The sequence is an ordered set of characters corresponding to nucleotides or amino-acids according to the International Union of Pure and Applied Chemistry (IUPAC) nomenclature (Cornish-Bowden 1985). The three other elements composing a sequence record are optional. They consist in a sequence definition, a quality vector, and a set of attributes. The sequence definition is a free text describing the sequence briefly. The quality vector associates a quality score to each nucleotide or amino-acid. Usually this quality score is the result of the base-calling process by the sequencer. The last element is a set of attributes qualifying the sequence, each attribute being described by a `key=value` pair. The set of attributes is the central concept of the ``OBITools`` system. When an ``OBITools`` command is run on the sequence records included in a dataset, the result of the computation often consist in the addition of new attributes completing the annotation of each sequence record. This strategy of sequence annotation allows the ``OBITools`` to return their results as a new sequence record file that can be used as the input of another ``OBITools`` program, ultimately creating complex pipelines. Managed sequence file formats ............................. Most of the ``OBITools`` commands read sequence records from a file or from the *stdin*, make some computations on the sequence records and output annotated sequence records. As inputs, the ``OBITools`` are able to automatically recognize the most common sequence file formats (i.e. :doc:`fasta `, :doc:`fastq `, :doc:`EMBL `, and :doc:`GenBank `). They are also able to read `ecoPCR` (Ficetola et al. 2010) result files and `ecoPCR`/`ecoPrimers` formatted sequence databases (Riaz et al. 2011) as ordinary sequence files. File format outputs are more limited. By default, sequences without and with quality information are written in :doc:`fasta ` and Sanger :doc:`fastq ` formats, respectively. However, dedicated options allow enforcing the output format, and the ``OBITools`` are also able to write sequences in the `ecoPCR`/`ecoPrimers` database format, to produce reference databases for these programs. In the :doc:`fasta ` or :doc:`fastq ` format, the attributes are written in the header line just after the *id*, following a `key=value;` format (Figure 2). |The structure of an OBITools sequence record and its representation in fasta and fastq formats| Taxonomical aspects ................... Filtering and annotation steps in the processing of DNA metabarcoding sequence data are greatly eased by the explicit association of taxonomic information to sequences together with an easy access to the taxonomy. Taxonomic information, including a taxonomic identifier, can thus be stored in the set of attributes of each sequence record. Specifically, the `taxid` attribute is used by the ``OBITools`` when querying taxonomic information of a sequence record, nevertheless several ``OBITools`` commands can annotate sequence records with taxonomy-related attributes for the user's convenience. The value of the `taxid` attribute must be a unique integer referring unambiguously to one taxon in the taxonomic associated database. Although this is not mandatory, the NCBI taxonomy is a preferred source of taxonomic information as the ``OBITools`` provide commands to easily extract the full taxonomic information from it. The command `obitaxonomy` is useful to build a taxonomic database in the ``OBITools`` format from a dump of the NCBI taxonomic database (downloadable at the following URL: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz). Implemented algorithms ...................... Most of the algorithms implemented in the ``OBITools`` are basic algorithms allowing sampling, filtering and annotation of sequence records based on their associated attribute set or sequence (e.g. `obisample`, `obigrep`, `obiannotate`). Some others implement algorithms directly related to NGS or to DNA metabarcoding (e.g. `illuminapairedend`, `ngsfilter`, `ecotag`). Finally, a few of them do not run on sequence records and/or do not provide their results as sequence records. Amongst them, `oligotag` (Coissac 2012) generates a set of short oligonucleotide sequences (hereafter referred to as `tags` useful to uniquely identify individual samples within a single NGS library containing many samples. Hereby, we will describe some of the implemented algorithms pertaining directly to DNA metabarcoding, as well as the corresponding programs. A full description of all programs included in the ``OBITools`` suite is available on the web http://metabarcoding.org/obitools/doc. Implementation of the ``OBITools`` ................................... The ``OBITools`` are a set of Python programs relying on an eponym Python library. The ``OBITools`` library is mainly developed in Python (version 2.7 see (http://www.python.org). For increasing the speed of execution, many parts of the ``OBITools`` library are developed using `cython` (http://cython.org/, a Python to C compiler) or the C language directly. The ``OBITools`` compile on Unix systems including Linux and MacOSX. References .......... Coissac E (2012) Oligotag: a program for designing sets of tags for next-generation sequencing of multiplexed samples. In: Data Production and Analysis in Population Genomics: Methods and Protocols (eds. Pompanon F, Bonin A), pp. 13-31. Springer Science+Business Media, New York. Cornish-Bowden A (1985) Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Research, 13, 3021-3030. Ficetola GF, Coissac E, Zundel S, Riaz T, Shehzad W, Bessi�re J, Taberlet P, Pompanon F (2010) An in silico approach for the evaluation of DNA barcodes. BMC Genomics, 11, 434. Riaz T, Shehzad W, Viari A, Pompanon F, Taberlet P, Coissac E (2011) ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic Acids Research, 39, e145. .. |Pipeline example for a standard biodiversity survey| image:: fig-Pipeline.* .. |The structure of an OBITools sequence record and its representation in fasta and fastq formats| image:: fig-Record.*