Welcome to the OBITools

The OBITools package is a set of programs specifically designed for analyzing NGS data in a DNA metabarcoding context, taking into account taxonomic information. It is distributed as open source software and is available from the following website: http://metabarcoding.org/obitools.

Citation: Boyer F., Mercier C., Bonin A., Taberlet P., Coissac E. (2014) OBITools: a Unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, submitted.

Installing the OBITools

Availability of the OBITools

The OBITools are open source and protected by the CeCILL 2.1 license (http://www.cecill.info/licences/Licence_CeCILL_V2.1-en.html).

The OBITools are deposited on the Python Package Index (PyPI: https://pypi.python.org/pypi/obitools), and all the sources can be downloaded from our Subversion server (http://www.grenoble.prabi.fr/public-svn/OBISofts/OBITools).

Prerequisites

To install the OBITools, the following software must be installed on your system:

  • Python 2.7 (installed by default on most Unix systems, available from the Python website)
  • gcc (installed by default on most Unix systems, available from the GNU sites dedicated to GCC and GMake)
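As a quick sanity check before installing, the two prerequisites above can be looked up on the PATH. The sketch below is only an illustration: the executable names python2.7 and gcc are assumptions (your system may name them differently), and the function missing_prerequisites is hypothetical, not part of the OBITools.

```python
import shutil

# Executable names are assumptions; adjust them to your system.
REQUIRED_TOOLS = ("python2.7", "gcc")


def missing_prerequisites(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_prerequisites()
if missing:
    print("Missing prerequisites: " + ", ".join(missing))
else:
    print("All prerequisites were found on the PATH.")
```

The which parameter only exists so that the lookup can be replaced in tests; by default the standard shutil.which function is used.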

On a Linux system

Make sure that the Python-dev packages are installed.

On MacOSX

The C compiler and all the other compilation tools are included in the Xcode application, which is not installed by default. The Python included in the system is not suitable for running the OBITools. You have to install a complete distribution of Python, which you can download as a MacOSX package from the Python website.

Downloading and installing the OBITools

The OBITools are downloaded and installed using the get-obitools.py script. This is a user level installation that does not need administrator privileges.

Once downloaded, move the file get-obitools.py into the directory where you want to install the OBITools. From a Unix terminal, run the following command:

> python get-obitools.py

The script creates a new directory, in the directory from which you run it, where all the OBITools are installed. No system privileges are required, and your system will not be altered in any way by the OBITools installation.

The newly created directory is named OBITools-VERSION, where VERSION is replaced by the latest available version number.

All the OBITools are installed inside the newly created directory. Next to this directory there is a shell script named obitools. Running this script activates the OBITools by reconfiguring your Unix environment.

> ./obitools

Once activated, you can deactivate the OBITools by typing the command exit.

> exit

OBITools are no more activated, Bye...
======================================

System level installation

To install the OBITools at the system level, you have two options:

  • Copy the obitools script into a standard directory for installed programs, such as /usr/local/bin, but never move the OBITools directory itself after it has been installed by get-obitools.py.
  • Add the export/bin directory located in the OBITools directory to the PATH environment variable.

Retrieving the sources of the OBITools

If you want to compile the OBITools yourself, you will need the same prerequisites as above, as well as the following Python packages:

> pip install -U virtualenv

> pip install -U sphinx

> pip install -U cython

Moreover, you need to install a Subversion client (a list of clients is available from Wikipedia).

Then you can download the sources:

> svn co http://www.grenoble.prabi.fr/public-svn/OBISofts/OBITools/branches/OBITools-1.00/ OBITools

This command will create a new directory called OBITools.

Compiling and installing the OBITools

From the directory where you retrieved the sources, execute the following commands:

> cd OBITools

> python setup.py --serenity install

Once installed, you can test your installation by running the commands of the tutorials.

Introduction

DNA metabarcoding is an emerging approach for biodiversity studies (Taberlet et al. 2012). Originally developed mainly by microbiologists (e.g. Sogin et al. 2006), it is now widely used for plants (e.g. Sønstebø et al. 2010, Parducci et al. 2012, Yoccoz et al. 2012) and animals, from meiofauna (e.g. Chariton et al. 2010, Baldwin et al. 2013) to larger organisms (e.g. Andersen et al. 2012, Thomsen et al. 2012). Interestingly, this method is not limited to sensu stricto biodiversity surveys; it can also be applied in other ecological contexts, such as herbivore (e.g. Valentini et al. 2009, Kowalczyk et al. 2011) or carnivore (e.g. Deagle et al. 2009, Shehzad et al. 2012) diet analyses.

Whatever the biological question under consideration, the DNA metabarcoding methodology relies heavily on next-generation sequencing (NGS) and generates considerable numbers of DNA sequence reads (typically millions of reads). Manipulating such large datasets requires dedicated programs, usually running on a Unix system. Unix is an operating system whose first version was created during the sixties. Since its early stages, it has been dedicated to scientific computing and has included a large set of simple tools to efficiently process text files. Most of those programs can be viewed as filters that extract information from a text file to create a new text file. These programs process text files as streams, line by line, therefore allowing computation on a huge dataset without requiring a large amount of memory. Unix programs usually print their results to their standard output (stdout), which by default is the terminal, so the results can be examined on screen. The main philosophy of the Unix environment is to allow easy redirection of the stdout either to a file, for saving the results, or to the standard input (stdin) of a second program, thus making it easy to build complex processing from simple base commands.

Access to Unix computers is increasingly easy for scientists nowadays. Indeed, the Linux operating system, an open source version of Unix, can be freely installed on any PC, and the MacOSX operating system, running on Apple computers, is also a Unix system. The OBITools programs imitate standard Unix programs because they usually act as filters, reading their data from text files or the stdin and writing their results to the stdout. The main difference from classical Unix programs is that text files are not analyzed line by line but sequence record by sequence record (see below for a detailed description of a sequence record). Compared to packages for similar purposes like mothur (Schloss et al. 2009) or QIIME (Caporaso et al. 2010), the OBITools mainly rely on filtering and sorting algorithms. This allows users to set up versatile data analysis pipelines (Figure 1), adjustable to the broad range of DNA metabarcoding applications. The innovation of the OBITools is their ability to take into account the taxonomic annotations, ultimately allowing sorting and filtering of sequence records based on the taxonomy.

Figure 1. Pipeline example for a standard biodiversity survey
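The filter idea described in this introduction can be sketched in a few lines of Python. This is an illustration only, not the OBITools implementation: the function names read_fasta, keep_long_records and main, as well as the default threshold of 10 symbols, are hypothetical. Records are read one at a time from the stdin and written to the stdout, so memory use does not grow with the size of the file.

```python
import sys


def read_fasta(lines):
    """Yield (header, sequence) pairs one record at a time from fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def keep_long_records(records, min_length):
    """Pass through only the records with at least min_length symbols."""
    for header, sequence in records:
        if len(sequence) >= min_length:
            yield header, sequence


def main(stdin=sys.stdin, stdout=sys.stdout, min_length=10):
    """Act as a Unix filter: fasta records in, selected fasta records out."""
    for header, sequence in keep_long_records(read_fasta(stdin), min_length):
        stdout.write(">%s\n%s\n" % (header, sequence))
```

Calling main() at the end of a script file turns the sketch into a filter that can be placed in a Unix pipe between two other commands. Because both functions are generators, each record is processed and released before the next one is read.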

References

Andersen K, Bird KL, Rasmussen M, Haile J, Breuning-Madsen H, Kjær KH, Orlando L, Gilbert MTP, Willerslev E (2012) Meta-barcoding of “dirt” DNA from soil reflects vertebrate biodiversity. Molecular Ecology, 21, 1966-1979.

Baldwin DS, Colloff MJ, Rees GN, Chariton AA, Watson GO, Court LN, Hartley DM, Morgan MJ, King AJ, Wilson JS, Hodda M, Hardy CM (2013) Impacts of inundation and drought on eukaryote biodiversity in semi-arid floodplain soils. Molecular Ecology, 22, 1746-1758.

Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Pena AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R (2010) QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7, 335-336.

Chariton AA, Court LN, Hartley DM, Colloff MJ, Hardy CM (2010) Ecological assessment of estuarine sediments by pyrosequencing eukaryotic ribosomal DNA. Frontiers in Ecology and the Environment, 8, 233-238.

Deagle BE, Kirkwood R, Jarman SN (2009) Analysis of Australian fur seal diet by pyrosequencing prey DNA in faeces. Molecular Ecology, 18, 2022-2038.

Kowalczyk R, Taberlet P, Coissac E, Valentini A, Miquel C, Kaminski T, Wójcik JM (2011) Influence of management practices on large herbivore diet - case of European bison in Bialowieza Primeval Forest (Poland). Forest Ecology and Management, 261, 821-828.

Parducci L, Jorgensen T, Tollefsrud MM, Elverland E, Alm T, Fontana SL, Bennett KD, Haile J, Matetovici I, Suyama Y, Edwards ME, Andersen K, Rasmussen M, Boessenkool S, Coissac E, Brochmann C, Taberlet P, Houmark-Nielsen M, Larsen NK, Orlando L, Gilbert MTP, Kjaer KH, Alsos IG, Willerslev E (2012) Glacial Survival of Boreal Trees in Northern Scandinavia. Science, 335, 1083-1086.

Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology, 75, 7537-7541.

Shehzad W, Riaz T, Nawaz MA, Miquel C, Poillot C, Shah SA, Pompanon F, Coissac E, Taberlet P (2012) Carnivore diet analysis based on next generation sequencing: application to the leopard cat (Prionailurus bengalensis) in Pakistan. Molecular Ecology, 21, 1951-1965.

Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ (2006) Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proceedings of the National Academy of Sciences of the United States of America, 103, 12115-12120.

Sønstebø JH, Gielly L, Brysting A, Reidar E, Edwards M, Haile J, Willerslev E, Coissac E, Rioux D, Sannier J, Taberlet P, Brochmann C (2010) Using next-generation sequencing for molecular reconstruction of past Arctic vegetation and climate. Molecular Ecology Resources, 10, 1009-1018.

Taberlet P, Coissac E, Hajibabaei M, Rieseberg LH (2012) Environmental DNA. Molecular Ecology, 21, 1789-1793.

Thomsen PF, Kielgast J, Iversen LL, Wiuf C, Rasmussen M, Gilbert MTP, Orlando L, Willerslev E (2012) Monitoring endangered freshwater biodiversity using environmental DNA. Molecular Ecology, 21, 2565-2573.

Valentini A, Miquel C, Nawaz MA, Bellemain E, Coissac E, Pompanon F, Gielly L, Cruaud C, Nascetti G, Wincker P, Swenson JE, Taberlet P (2009) New perspectives in diet analysis based on DNA barcoding and parallel pyrosequencing: the trnL approach. Molecular Ecology Resources, 9, 51-60.

Yoccoz NG, Bråthen KA, Gielly L, Haile J, Edwards ME, Goslar T, von Stedingk H, Brysting AK, Coissac E, Pompanon F, Sønstebø JH, Miquel C, Valentini A, de Bello F, Chave J, Thuiller W, Wincker P, Cruaud C, Gavory F, Rasmussen M, Gilbert MTP, Orlando L, Brochmann C, Willerslev E, Taberlet P (2012) DNA from soil mirrors plant taxonomic and growth form diversity. Molecular Ecology, 21, 3647-3655.

Basic concepts of the OBITools

Once installed, the OBITools enrich the Unix command line interface with a set of new commands dedicated to NGS data processing. Most of them have a name starting with the obi prefix. They automatically recognize the input file format amongst most of the standard sequence file formats (i.e. fasta, fastq, EMBL, and GenBank formats). Nevertheless, options are available to enforce some format specificity such as the encoding system used in fastq files for quality codes. Most of the basic Unix commands have their OBITools equivalent (e.g. obihead vs head, obitail vs tail, obigrep vs grep), which is convenient for scientists familiar with Unix. The main difference between any standard Unix command and its OBITools counterpart is that the treatment unit is no longer the text line but the sequence record. As a sequence record is more complex than a single text line, the OBITools programs have many supplementary options compared to their Unix equivalents.
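To make the change of processing unit concrete, here is a minimal sketch of head-like and tail-like helpers that count sequence records instead of text lines. The names record_head and record_tail are hypothetical and the code is not taken from obihead or obitail; records are represented as plain (id, sequence) tuples for simplicity.

```python
from collections import deque
from itertools import islice


def record_head(records, n=10):
    """Return the first n records of an iterable, as head does for lines."""
    return list(islice(records, n))


def record_tail(records, n=10):
    """Return the last n records, keeping at most n of them in memory."""
    return list(deque(records, maxlen=n))


# Five toy records with sequences of increasing length.
records = [("seq%d" % i, "ACGT" * i) for i in range(1, 6)]
print(record_head(records, 2))  # the first two records
print(record_tail(records, 2))  # the last two records
```

Both helpers accept any iterable, so they also work on a stream of records that is too large to hold in memory.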

The structure of a sequence record

The OBITools commands consider a sequence record as an entity composed of five distinct elements. Two of them are mandatory: the identifier (id) and the DNA or protein sequence itself. The id is a single word composed of letters, digits, and other symbols like dots or underscores, excluding spaces. Formally, the ids should be unique within a dataset and should identify each sequence record unambiguously, but only a few OBITools actually rely on this property. The sequence is an ordered set of characters corresponding to nucleotides or amino acids according to the International Union of Pure and Applied Chemistry (IUPAC) nomenclature (Cornish-Bowden 1985). The three other elements composing a sequence record are optional. They consist of a sequence definition, a quality vector, and a set of attributes. The sequence definition is a free text describing the sequence briefly. The quality vector associates a quality score with each nucleotide or amino acid. Usually this quality score is the result of the base-calling process by the sequencer. The last element is a set of attributes qualifying the sequence, each attribute being described by a key=value pair. The set of attributes is the central concept of the OBITools system. When an OBITools command is run on the sequence records included in a dataset, the result of the computation often consists of the addition of new attributes completing the annotation of each sequence record. This strategy of sequence annotation allows the OBITools to return their results as a new sequence record file that can be used as the input of another OBITools program, ultimately creating complex pipelines.
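The five elements can be pictured as a small data structure. The sketch below is an illustration under stated assumptions, not the class used by the OBITools library: the name SequenceRecord, its field names, the annotate method and the attribute keys used in the example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SequenceRecord:
    # The two mandatory elements.
    id: str
    sequence: str
    # The three optional elements.
    definition: str = ""
    quality: Optional[List[int]] = None
    attributes: Dict[str, object] = field(default_factory=dict)

    def __post_init__(self):
        if not self.id or any(c.isspace() for c in self.id):
            raise ValueError("the id must be a single word without spaces")
        if self.quality is not None and len(self.quality) != len(self.sequence):
            raise ValueError("the quality vector needs one score per symbol")

    def annotate(self, key, value):
        """Add or update one attribute and return the record for chaining."""
        self.attributes[key] = value
        return self


record = SequenceRecord("seq_001", "ACGTTGCA", definition="toy example")
# Each processing step adds attributes instead of producing a separate report.
record.annotate("count", 12).annotate("seq_length", len(record.sequence))
print(record.attributes)
```

Returning the record from annotate mirrors the pipeline idea: the output of one annotation step is a complete record that can be handed to the next step.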

Managed sequence file formats

Most of the OBITools commands read sequence records from a file or from the stdin, make some computations on the sequence records and output annotated sequence records. As input, the OBITools are able to automatically recognize the most common sequence file formats (i.e. fasta, fastq, EMBL, and GenBank). They are also able to read ecoPCR (Ficetola et al. 2010) result files and ecoPCR/ecoPrimers formatted sequence databases (Riaz et al. 2011) as ordinary sequence files. Output file formats are more limited. By default, sequences without and with quality information are written in fasta and Sanger fastq formats, respectively. However, dedicated options allow enforcing the output format, and the OBITools are also able to write sequences in the ecoPCR/ecoPrimers database format, to produce reference databases for these programs. In the fasta or fastq format, the attributes are written in the header line just after the id, following a key=value; format (Figure 2).

Figure 2. The structure of an OBITools sequence record and its representation in fasta and fastq formats
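A minimal sketch of how such a header line could be split into id, attributes and definition is shown here. It assumes that each attribute is written as key=value; and that whatever follows the last attribute is the definition. The function names are hypothetical, all values are kept as strings, and the real OBITools parser may behave differently (for instance regarding value types).

```python
import re

# One "key=value;" item at the current position; keys are assumed to be simple words.
ATTRIBUTE = re.compile(r"\s*(\w+)=([^;]*);")


def parse_header(line):
    """Split a fasta header line into (id, attributes, definition)."""
    text = line.lstrip(">").strip()
    parts = text.split(None, 1)
    if not parts:
        raise ValueError("empty header line")
    seq_id = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    attributes = {}
    position = 0
    while True:
        match = ATTRIBUTE.match(rest, position)
        if match is None:
            break
        attributes[match.group(1)] = match.group(2).strip()
        position = match.end()
    # Everything after the last attribute is treated as the definition.
    return seq_id, attributes, rest[position:].strip()


def format_header(seq_id, attributes, definition=""):
    """Write the header back in the same key=value; layout."""
    items = ["%s=%s;" % (key, value) for key, value in attributes.items()]
    return ">" + " ".join([seq_id] + items + ([definition] if definition else []))


header = ">seq_001 count=12; sample=A1; toy example record"
print(parse_header(header))
```

Because the attributes travel inside the header line, any program that can read fasta or fastq can still read the annotated file, even if it ignores the annotations.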

Taxonomical aspects

Filtering and annotation steps in the processing of DNA metabarcoding sequence data are greatly eased by the explicit association of taxonomic information with sequences, together with easy access to the taxonomy. Taxonomic information, including a taxonomic identifier, can thus be stored in the set of attributes of each sequence record. Specifically, the taxid attribute is used by the OBITools when querying the taxonomic information of a sequence record; nevertheless, several OBITools commands can annotate sequence records with taxonomy-related attributes for the user’s convenience. The value of the taxid attribute must be a unique integer referring unambiguously to one taxon in the associated taxonomic database. Although this is not mandatory, the NCBI taxonomy is a preferred source of taxonomic information, as the OBITools provide commands to easily extract the full taxonomic information from it. The command obitaxonomy is useful to build a taxonomic database in the OBITools format from a dump of the NCBI taxonomic database (downloadable at the following URL: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz).
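To illustrate why an integer taxid is enough to filter records by taxonomy, the sketch below walks up a parent map until it reaches the root. The taxonomy used here is a made-up toy tree, not NCBI data; the names lineage and belongs_to are hypothetical, and this is not how obitaxonomy or the OBITools store their database.

```python
# Toy taxonomy: child taxid -> parent taxid. The root is its own parent.
# All identifiers are invented for the example.
PARENT = {1: 1, 10: 1, 20: 1, 11: 10, 12: 10, 21: 20}


def lineage(taxid, parent=PARENT):
    """Return the list of taxids from taxid up to the root."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def belongs_to(taxid, ancestor, parent=PARENT):
    """True if ancestor is taxid itself or one of its ancestors."""
    return ancestor in lineage(taxid, parent)


# Records reduced to (id, attributes) pairs; taxid values are integers.
records = [("s1", {"taxid": 11}), ("s2", {"taxid": 21}), ("s3", {"taxid": 12})]
# Keep the records whose taxon lies under the toy taxon 10.
selected = [rid for rid, attrs in records if belongs_to(attrs["taxid"], 10)]
print(selected)
```

Since every taxid points to exactly one taxon, a single attribute per record is sufficient to answer questions about any rank above it.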

Implemented algorithms

Most of the algorithms implemented in the OBITools are basic algorithms allowing sampling, filtering and annotation of sequence records based on their associated attribute set or sequence (e.g. obisample, obigrep, obiannotate). Some others implement algorithms directly related to NGS or to DNA metabarcoding (e.g. illuminapairedend, ngsfilter, ecotag). Finally, a few of them do not run on sequence records and/or do not provide their results as sequence records. Amongst them, oligotag (Coissac 2012) generates a set of short oligonucleotide sequences (hereafter referred to as tags) useful to uniquely identify individual samples within a single NGS library containing many samples. Hereafter, we will describe some of the implemented algorithms pertaining directly to DNA metabarcoding, as well as the corresponding programs. A full description of all programs included in the OBITools suite is available on the web at http://metabarcoding.org/obitools/doc.
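The idea behind a tag set can be illustrated with a simple greedy search: a candidate oligonucleotide is kept only if it differs from every tag already kept at a minimum number of positions, so that a few sequencing errors cannot turn one sample tag into another. This is a generic sketch and not the algorithm implemented in oligotag; the function names and the parameters length and min_distance are assumptions made for the example.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two words of equal length differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_tags(length, min_distance, alphabet="ACGT"):
    """Greedily collect words that are pairwise at least min_distance apart."""
    tags = []
    for letters in product(alphabet, repeat=length):
        candidate = "".join(letters)
        # Keep the candidate only if it is far enough from every kept tag.
        if all(hamming(candidate, tag) >= min_distance for tag in tags):
            tags.append(candidate)
    return tags


tags = greedy_tags(length=4, min_distance=3)
print(len(tags), tags[:5])
```

A greedy search does not guarantee the largest possible set, and a real tag designer would typically add further constraints that are ignored here.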

Implementation of the OBITools

The OBITools are a set of Python programs relying on an eponymous Python library. The OBITools library is mainly developed in Python (version 2.7, see http://www.python.org). To increase the speed of execution, many parts of the OBITools library are developed using Cython (http://cython.org/, a Python to C compiler) or directly in the C language. The OBITools compile on Unix systems including Linux and MacOSX.

References

Coissac E (2012) Oligotag: a program for designing sets of tags for next-generation sequencing of multiplexed samples. In: Data Production and Analysis in Population Genomics: Methods and Protocols (eds. Pompanon F, Bonin A), pp. 13-31. Springer Science+Business Media, New York.

Cornish-Bowden A (1985) Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Research, 13, 3021-3030.

Ficetola GF, Coissac E, Zundel S, Riaz T, Shehzad W, Bessière J, Taberlet P, Pompanon F (2010) An in silico approach for the evaluation of DNA barcodes. BMC Genomics, 11, 434.

Riaz T, Shehzad W, Viari A, Pompanon F, Taberlet P, Coissac E (2011) ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic Acids Research, 39, e145.