MOSAIC 1.0 documentation

Module description

«  About MOSAIC   ::   Contents

_images/MOSAIC.jpg

Module description

Code is available here

Helper functions

mosaic.pairwisealign(seq1, seq2, **kwargs)

Globally align two sequences.

Parameters:
  • seq1 (str) – Sequence 1
  • seq2 (str) – Sequence 2
  • gapopen – The cost for opening a gap.
  • gapextend – The cost for extending a gap.
Parm AA:

True if protein sequences, false otherwise.

Returns:

Sequence 1 aligned, sequence 2 aligned

Return type:

tuple

MultiMSA class

class mosaic.MultiMSA(MSAlist1, MSAlist2=None, MSAspec1=None, MSAspec2=None, specfunc1=None, specfunc2=None, methodnames1=None, methodnames2=None, specorder=None)

Data object for containing sets of multiple sequence alignments.

Parameters:
  • MSAlist1 – A list of multiple sequence alignments (class Bio.Align.MultipleSeqAlignment, read in with Bio.AlignIO.read)
  • MSAlist2 – The corresponding AA or DNA alignments (optional)
  • MSAspec1 – A list of lists specifying the order of species in each MSAlist1 alignment(This or specfunc1 must be set)
  • MSAspec2 – Same as above for MSAlist2. If neither MSAspec2 or specfunc2 are set, species order is assumed to be the same as MSAlist1
  • specfunc1 – A parsing function that returns species names from sequence labels (This or MSAspec1 must be set)
  • specfunc2 – Same as above for MSAlist2. If neither MSAspec2 or specfunc2 are set, species order is assumed to be the same as MSAlist1
  • methodnames1 – The names of the methods used corresponding to alignments in MSAlist1 (optional)
  • methodnames2 – The names of the methods used corresponding to alignments in MSAlist2 (optional)
  • specorder – The desired order of species for downstream output.
Returns:

A MultiMSA object.

Mosaic class

class mosaic.Mosaic(multiMSA, ref, useonlyspec=None, speccutoffs=None, edgefunc='perID', optrule='pairwise', ignoregaps=False, customoptfunc=None, AA=True, scoremat=None, stretcher_gapopen=8, stretcher_gapextend=1, similaritythresh=-1000000.0)

Allows for integration of sets of multiple sequence alignments

Parameters:
  • multiMSA – a MultiMSA object
  • ref – The species to use as the reference to anchor the sequence cluster.
  • useonlyspec – If specified, use only the supplied subset of species.
  • speccutoffs – Pass a dictionary of cutoffs. Corresponds to species-specific cutoffs.
  • edgefunc – The function to calculate similarities between species. Can be ‘perID’, ‘bitscore’, or a user-specified function of the form: func(seq1, seq2, **kwargs)
  • (‘pairwise’) (optrule) – The rule for optimization. Can be ‘pairwise’, ‘toref’, or a user-specified function of the form: func(edgeweightmatrix, \*\*kwargs)
  • ignoregaps – Ignore gaps in alignment scoring?
  • AA – True if the primary MSA set is amino acid. False otherwise.
  • customoptfunc – Custom optimization function
  • scoremat – Custom scoring matrix for ‘bitscore’
  • stretcher_gapopen – Gap opening penalty for global pairwise sequence alignment.
  • stretcher_gapextend – Gap extension penalty for global pairwise sequence alignment.
Returns:

Mosaic object

AAtoDNA(f_AA_aligned, f_DNA_unaligned, f_DNA_out)
Create a DNA multiple sequence alignment
from an amino acid multiple sequence alignment
Parameters:
  • f_AA_aligned – File containing aligned amino acid sequences
  • f_DNA_unaligned – The file containing unaligned DNA sequences
  • f_DNA_out – The desired output DNA filename.

Note

This function requires pal2nal.

align(filename1, filename2=None, AAtoDNA=True)

Align orthologous sequences.

Parameters:
  • filename1 – Output filename for primary alignments
  • filename2 – Output filename for secondary alignments
  • AAtoDNA – Specifies that secondary sequences are DNA and should be aligned based on AA alignment.
alignfunc(f_in, f_out, c=5, ir=500, **kwargs)

Create multiple sequence alignment from unaligned sequences

Parameters:
  • f_in – The file of unaligned sequence.
  • f_out – The desired output filename.
  • ir – Specifies the -ir flag to msaprobs
  • c – Specifies the -c flag to msaprobs

Note

This function requires msaprobs.

calc_sim_mat_pairwise()
Internal function: For pairwise optimization,
calculate the similarity matrix that will define the cluster of sequences.
Parameters:
  • self.multiMSA.MSAdict1 – A dictionary of dictionaries of sequences.
  • self.multiMSA.methodnames1 – A list of the names of the methods producing each MSA.
  • self.edgefunc – The function used to calculate the similarity between two sequences.
Returns:

A (nmethods*nspec) x (nmethods*nspec) matrix of edgeweights, stored to self.edgeweights_pairwise

Note

Sequences are blocked by method (according to self.methodnames).
These blocks are ordered by the specified species order (self.allspecs).
calc_sim_mat_toref()
Internal function: For “to reference” optimization,
calculate the similarity vector that will relate each sequence to the reference.
Parameters:
  • self.multiMSA.MSAdict1 – A dictionary of dictionaries of sequences.
  • self.multiMSA.methodnames1 – A list of the names of the methods producing each MSA.
  • self.edgefunc – The function used to calculate the similarity between two sequences.
Returns:

A (nspec) x (nspec) matrix of edgeweights, stored to self.edgeweights_toref

Note

This is the stage at which filtering takes place. Any sequence below the similarity cutoff is not assigned to the self.edgeweights_toref matrix.

getbitscore(seq1, seq2)

Calculate a bitscore between two (unaligned) sequences.

Parameters:
  • seq1 – Sequence 1
  • seq2 – Sequence 2
Variables:
  • self.AA – True if sequence alphabet is amino acids, false otherwise.
  • self.stretcher_gapopen – The penalty for opening a gap in the alignment.
  • self.stretcher_gapextend – The penalty for extending a gap in the alignment.
  • self.scoremat – The score matrix to use for the scoring of the alignment.
Returns:

A bit score for the aligned sequences.

Note

pandas is required to manage scoring matrices.

getperID(seq1, seq2)

Calculate percent identity between two (unaligned) sequences.

Parameters:
  • seq1 – Sequence 1
  • seq2 – Sequence 2
Variables:
  • self.AA – True if sequence alphabet is amino acids, false otherwise.
  • self.ignoregaps – True if gaps in the first (reference) sequence are to be ignored.
Returns:

A percent identity for the aligned sequences.

opt_cluster_toref()
Internal function: Takes sequences from each species with
the highest similarity to the reference.
optimize_cluster()

Internal function: optimize the sequence cluster.

If self.optrule is ‘pairwise’, optimize cluster by picking the
sequence for each species that optimizes the pairwise distance to current best sequences. This is repeated cyclically until convergence is reached.
If self.optrule is ‘toref’, pick the sequence from each species
that is most similar to the reference sequence
If self.optrule is ‘custom’, apply the function defined
in self.customoptfunc to the self.edgeweights_toref and/or self.edgeweights_pairwise matrices
optloop_pairwise()
Internal function: optimize cluster using pairwise similarities
and Gibbs sampling.
write_unaligned(filename1, filename2=None, inclspec=False, inclmet=False, specorder=None, labelfunc=None)

Write (unaligned) optimal sequences to a file.

Parameters:
  • filename1 – The file to which to write the primary MSAlist
  • filename2 – The file to which to write the secondary MSAlist
  • inclspec – Whether to include the species name in the sequence labels
  • inclmet – Whether to include the method name in the sequence labels
  • specorder – If specified, a different order for species in the output
  • labelfunc – If specified, a function to output sequence labels. Should be of the form labelfunc(seq.name, species, method)

«  About MOSAIC   ::   Contents