Module description¶
Code is available here
Helper functions¶
- mosaic.pairwisealign(seq1, seq2, **kwargs)¶
Globally align two sequences.
Parameters: - seq1 (str) – Sequence 1
- seq2 (str) – Sequence 2
- gapopen – The cost for opening a gap.
- gapextend – The cost for extending a gap.
Parm AA: True if protein sequences, false otherwise.
Returns: Sequence 1 aligned, sequence 2 aligned
Return type: tuple
MultiMSA class¶
- class mosaic.MultiMSA(MSAlist1, MSAlist2=None, MSAspec1=None, MSAspec2=None, specfunc1=None, specfunc2=None, methodnames1=None, methodnames2=None, specorder=None)¶
Data object for containing sets of multiple sequence alignments.
Parameters: - MSAlist1 – A list of multiple sequence alignments (class Bio.Align.MultipleSeqAlignment, read in with Bio.AlignIO.read)
- MSAlist2 – The corresponding AA or DNA alignments (optional)
- MSAspec1 – A list of lists specifying the order of species in each MSAlist1 alignment(This or specfunc1 must be set)
- MSAspec2 – Same as above for MSAlist2. If neither MSAspec2 or specfunc2 are set, species order is assumed to be the same as MSAlist1
- specfunc1 – A parsing function that returns species names from sequence labels (This or MSAspec1 must be set)
- specfunc2 – Same as above for MSAlist2. If neither MSAspec2 or specfunc2 are set, species order is assumed to be the same as MSAlist1
- methodnames1 – The names of the methods used corresponding to alignments in MSAlist1 (optional)
- methodnames2 – The names of the methods used corresponding to alignments in MSAlist2 (optional)
- specorder – The desired order of species for downstream output.
Returns: A MultiMSA object.
Mosaic class¶
- class mosaic.Mosaic(multiMSA, ref, useonlyspec=None, speccutoffs=None, edgefunc='perID', optrule='pairwise', ignoregaps=False, customoptfunc=None, AA=True, scoremat=None, stretcher_gapopen=8, stretcher_gapextend=1, similaritythresh=-1000000.0)¶
Allows for integration of sets of multiple sequence alignments
Parameters: - multiMSA – a MultiMSA object
- ref – The species to use as the reference to anchor the sequence cluster.
- useonlyspec – If specified, use only the supplied subset of species.
- speccutoffs – Pass a dictionary of cutoffs. Corresponds to species-specific cutoffs.
- edgefunc – The function to calculate similarities between species. Can be ‘perID’, ‘bitscore’, or a user-specified function of the form: func(seq1, seq2, **kwargs)
- (‘pairwise’) (optrule) – The rule for optimization. Can be ‘pairwise’, ‘toref’, or a user-specified function of the form: func(edgeweightmatrix, \*\*kwargs)
- ignoregaps – Ignore gaps in alignment scoring?
- AA – True if the primary MSA set is amino acid. False otherwise.
- customoptfunc – Custom optimization function
- scoremat – Custom scoring matrix for ‘bitscore’
- stretcher_gapopen – Gap opening penalty for global pairwise sequence alignment.
- stretcher_gapextend – Gap extension penalty for global pairwise sequence alignment.
Returns: Mosaic object
- AAtoDNA(f_AA_aligned, f_DNA_unaligned, f_DNA_out)¶
- Create a DNA multiple sequence alignment
- from an amino acid multiple sequence alignment
Parameters: - f_AA_aligned – File containing aligned amino acid sequences
- f_DNA_unaligned – The file containing unaligned DNA sequences
- f_DNA_out – The desired output DNA filename.
Note
This function requires pal2nal.
- align(filename1, filename2=None, AAtoDNA=True)¶
Align orthologous sequences.
Parameters: - filename1 – Output filename for primary alignments
- filename2 – Output filename for secondary alignments
- AAtoDNA – Specifies that secondary sequences are DNA and should be aligned based on AA alignment.
- alignfunc(f_in, f_out, c=5, ir=500, **kwargs)¶
Create multiple sequence alignment from unaligned sequences
Parameters: - f_in – The file of unaligned sequence.
- f_out – The desired output filename.
- ir – Specifies the -ir flag to msaprobs
- c – Specifies the -c flag to msaprobs
Note
This function requires msaprobs.
- calc_sim_mat_pairwise()¶
- Internal function: For pairwise optimization,
- calculate the similarity matrix that will define the cluster of sequences.
Parameters: - self.multiMSA.MSAdict1 – A dictionary of dictionaries of sequences.
- self.multiMSA.methodnames1 – A list of the names of the methods producing each MSA.
- self.edgefunc – The function used to calculate the similarity between two sequences.
Returns: A (nmethods*nspec) x (nmethods*nspec) matrix of edgeweights, stored to self.edgeweights_pairwise
Note
Sequences are blocked by method (according to self.methodnames).These blocks are ordered by the specified species order (self.allspecs).
- calc_sim_mat_toref()¶
- Internal function: For “to reference” optimization,
- calculate the similarity vector that will relate each sequence to the reference.
Parameters: - self.multiMSA.MSAdict1 – A dictionary of dictionaries of sequences.
- self.multiMSA.methodnames1 – A list of the names of the methods producing each MSA.
- self.edgefunc – The function used to calculate the similarity between two sequences.
Returns: A (nspec) x (nspec) matrix of edgeweights, stored to self.edgeweights_toref
Note
This is the stage at which filtering takes place. Any sequence below the similarity cutoff is not assigned to the self.edgeweights_toref matrix.
- getbitscore(seq1, seq2)¶
Calculate a bitscore between two (unaligned) sequences.
Parameters: - seq1 – Sequence 1
- seq2 – Sequence 2
Variables: - self.AA – True if sequence alphabet is amino acids, false otherwise.
- self.stretcher_gapopen – The penalty for opening a gap in the alignment.
- self.stretcher_gapextend – The penalty for extending a gap in the alignment.
- self.scoremat – The score matrix to use for the scoring of the alignment.
Returns: A bit score for the aligned sequences.
Note
pandas is required to manage scoring matrices.
- getperID(seq1, seq2)¶
Calculate percent identity between two (unaligned) sequences.
Parameters: - seq1 – Sequence 1
- seq2 – Sequence 2
Variables: - self.AA – True if sequence alphabet is amino acids, false otherwise.
- self.ignoregaps – True if gaps in the first (reference) sequence are to be ignored.
Returns: A percent identity for the aligned sequences.
- opt_cluster_toref()¶
- Internal function: Takes sequences from each species with
- the highest similarity to the reference.
- optimize_cluster()¶
Internal function: optimize the sequence cluster.
- If self.optrule is ‘pairwise’, optimize cluster by picking the
- sequence for each species that optimizes the pairwise distance to current best sequences. This is repeated cyclically until convergence is reached.
- If self.optrule is ‘toref’, pick the sequence from each species
- that is most similar to the reference sequence
- If self.optrule is ‘custom’, apply the function defined
- in self.customoptfunc to the self.edgeweights_toref and/or self.edgeweights_pairwise matrices
- optloop_pairwise()¶
- Internal function: optimize cluster using pairwise similarities
- and Gibbs sampling.
- write_unaligned(filename1, filename2=None, inclspec=False, inclmet=False, specorder=None, labelfunc=None)¶
Write (unaligned) optimal sequences to a file.
Parameters: - filename1 – The file to which to write the primary MSAlist
- filename2 – The file to which to write the secondary MSAlist
- inclspec – Whether to include the species name in the sequence labels
- inclmet – Whether to include the method name in the sequence labels
- specorder – If specified, a different order for species in the output
- labelfunc – If specified, a function to output sequence labels. Should be of the form labelfunc(seq.name, species, method)