About MOSAIC¶
Code is available from github or PyPI.
Introduction¶
Ortholog detection (OD) is a critical first step for a variety of comparative genomic and phylogenetic inference tasks. In general, existing OD methods can be classified as tree-based, graph-based, or a hybrid of the two. Comparative performance between these approaches varies by dataset, however, frustrating attempts to identify a single best method. In contrast, MOSAIC improves ortholog detection by leveraging complementary between diverse OD methods. The result is striking gains in the number of detected orthologs relative to existing approaches, while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality.
How MOSAIC works¶
MOSAIC is based on the concept of cluster optimization. Given a set of multiple sequence alignments (MSAs), MOSAIC builds a cluster of orthologous sequences, and optimizes that cluster according to specified measures of cluster quality and completeness.
- Note:
- All MSAs must have at least one sequence in common which can be defined as the reference sequence. This makes intuitive sense for MSA integration. It also roots the cluster, obviating problems with non-identifiability.
MOSAIC is designed to provide a great deal of flexibility in the construction and optimization of orthogolous clusters. By default, users may construct clusters using similarity statistics based on percent identity or a blast-style bit score. In addition, MOSAIC provides two default methods for cluster optimization. For each species, MOSAIC can choose the ortholog that has the highest similarity to the reference sequence (‘to ref’ optimization), or it can optimize the pairwise similarities between all sequences (‘pairwise’ optimization). Alternatively, MOSAIC can accept a user-defined methods for similarity quantification and cluster optimization. In all cases, it is recommended to apply a species-specific score cutoff prior to optimization. This option is provided through the speccutoffs option to Mosaic.
Installing MOSAIC¶
Optional steps¶
- If one desires a custom scoring matrix for bit score similarities, pandas must be installed.
Note: pandas is included in the anaconda and Enthought distributions mentioned below.
Required steps¶
Download and install python, or a “kitchen sink” distribution of python, such as anaconda or Enthought.
- Install Biopython.
Note: Biopython is included in the anaconda and academic Enthought distributions mentioned above.
Obtain stretcher, a fast global aligner that can be installed as part of the EMBOSS toolkit.
Install MOSAIC using easy_install bio-mosaic from the command-line.
Tutorial¶
A common task in the analysis of protein-coding regions is to produce a cDNA alignment from the alignment protein sequences. Since functional amino acid sequences are more conserved over long evolutionary time scales than their underlying cDNA, amino acid sequences tend to produce much more accurate alignments than their cDNA counterparts. However, since cDNA alignments contain information about mutations that did not affect the protein coding sequence (so-called synonymous mutations), these alignments are often much more informative for inferring evolutionary history. Thus, we will present the steps necessary for building cluster-optimized cDNA alignments from diverse AA alignments.
Reading in MSAs¶
First we will read in our sets of multiple sequence alignments:
import mosaic
filenamelistAA = ['method1_AA/filename1.fasta', 'method2_AA/filename2.fasta']
filenamelistDNA = ['method1_DNA/filename1.fasta', 'method2_DNA/filename2.fasta']
methodnames = ['method1', 'method2']
MSAlistAA = [AlignIO.read(f, "fasta") for f in filenamelistAA]
MSAlistDNA = [AlignIO.read(f, "fasta") for f in filenamelistDNA]
multiMSA = mosaic.MultiMSA(MSAlistAA, MSAlist2=MSAlistDNA, specfunc1=myspecfunc,
methodnames1=methodnames, methodnames2=methodnames)
We have now read in all our MSAs and combined them within a MultiMSA object. Note the use of myspecfunc, which parses sequence identifiers and returns a species name. If naming conventions conflict across MSAs, one may also provide, using the options methodnames1 and methodnames2 a list of lists defining the species order for each MSA Alternatively, one can define an arbitrary number of MultiMSA objects, each with its own parsing function. These MultiMSA objects can them be catenated together using the + operator:
multiMSA1 = mosaic.MultiMSA(MSAlistAA[:1], MSAlistDNA[:1], specfunc1=myspecfunc1,
methodnames1=methodnames, methodnames2=methodnames)
multiMSA2 = mosaic.MultiMSA(MSAlistAA[1:], MSAlistDNA[1:], specfunc1=myspecfunc2,
methodnames1=methodnames, methodnames2=methodnames)
multiMSA = multiMSA1 + multiMSA2
Defining and optimizing clusters¶
One may then define and optimize the resulting sequence clusters in several ways. The first step in building a cluster is to define similarities between sequences. By default, MOSAIC allows the user to calculate similarities by length-scaled bit scores or by percent identity. Orthologs can then be chosen to optimize similarity to the reference sequence or to maximize the average pairwise similarity between all accepted sequences:
optimizedcluster = mosaic.Mosaic(multiMSA, optrule='toref', ref='Hom', edgefunc='perID')
Output and realignment¶
Once MOSAIC determines the optimal orthologous sequences, these sequences can be written out to a file. The inclspec and inclmet options specify whether or not to explicitly include the species- and method-of-origin in the output name of a given sequence. Alternatively, one may use the labelfunc option to provide a custom function for creating output labels.:
optimizedcluster.write_unaligned(filename1='../alignedAA_MOSAIC/%s' %testfile,
filename2='../alignedDNA_MOSAIC/%s' %testfile,
inclspec=True, inclmet=True,
specorder=['Hom', 'Pan', 'Gor', 'Pon']
)
After unaligned sequences have been written, one may then align those sequences using msaprobs.
optimizedcluster.align(filename1_aligned, filename2=filename2_aligned)
When two filenames are provided, MOSAIC assumes the second is a DNA sequence and automatically performs an AA to DNA alignment conversion using pal2nal.
Customizing MOSAIC¶
Custom edge functions may be defined using the edgefunc argument to the MOSAIC constructor. In addition, one may specify a custom optimization function using the optfunc argument. Alignment scoring parameters may also be modified via the scoremat, stretcher_gapopen, and stretcher_gapextend parameters.
Example script¶
A working example of a MOSAIC script is available in the code repository. From within the mosaic directory, this script can be invoked with the following syntax: python mosaic_example.py testfiles.txt. This will read in genes from testfiles.txt and perform the full MOSAIC pipeline for all default scoring and optimization rules. This script may easily be adapted to user-supplied alignments by specifying a different input file and altering the getMSAlist_for_me function to correctly retrieve MSAs given the provided list of gene names. The methodnames, specnames, and specorder variables should also be changed to reflect the methods and species of interest. Finally, cutoffs_perID and cutoffs_bit should be modified if species cutoffs are desired for percent identity or bit score edge weights, respectively.
- Note:
- If you do not need MOSAIC to align the optimal ortholog set, the call optimizedcluster.align(...) may be commented out, thus obviating the need to install msaprobs. Note, however, that pal2nal.pl is included in the code repository).