Module description

Code is available here

Helper functions

mosaic.pairwisealign(seq1, seq2, **kwargs)

Globally align two sequences.

Parameters:
seq1 (str) – Sequence 1
seq2 (str) – Sequence 2
gapopen – The cost for opening a gap.
gapextend – The cost for extending a gap.
AA – True for protein sequences, False otherwise.
Returns:	Sequence 1 aligned, sequence 2 aligned
Return type:	tuple
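To illustrate what a global alignment returning a tuple of two gapped strings looks like, here is a minimal sketch. It is not the implementation behind `pairwisealign`: it is a textbook Needleman-Wunsch alignment with a single linear gap penalty rather than separate gap-open and gap-extend costs, and the function name `global_align` and its scoring defaults are assumptions made for the example.

```python
def global_align(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty (illustrative)."""
    n, m = len(seq1), len(seq2)
    # score[i][j] is the best score for aligning seq1[:i] with seq2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in seq2
                score[i][j - 1] + gap,      # gap in seq1
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out1.append(seq1[i - 1])
                out2.append(seq2[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(out1)), "".join(reversed(out2))


aligned1, aligned2 = global_align("GATTACA", "GATACA")
```

Both returned strings have the same length, and removing the `-` characters recovers the original inputs.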

MultiMSA class

class mosaic.MultiMSA(MSAlist1, MSAlist2=None, MSAspec1=None, MSAspec2=None, specfunc1=None, specfunc2=None, methodnames1=None, methodnames2=None, specorder=None)

Data object for containing sets of multiple sequence alignments.

Parameters:


MSAlist1 – A list of multiple sequence alignments (class Bio.Align.MultipleSeqAlignment, read in with Bio.AlignIO.read).
MSAlist2 – The corresponding AA or DNA alignments (optional).
MSAspec1 – A list of lists specifying the order of species in each MSAlist1 alignment. Either this or specfunc1 must be set.
MSAspec2 – Same as MSAspec1, for MSAlist2. If neither MSAspec2 nor specfunc2 is set, species order is assumed to be the same as in MSAlist1.
specfunc1 – A parsing function that returns species names from sequence labels. Either this or MSAspec1 must be set.
specfunc2 – Same as specfunc1, for MSAlist2. If neither MSAspec2 nor specfunc2 is set, species order is assumed to be the same as in MSAlist1.
methodnames1 – The names of the methods used, corresponding to the alignments in MSAlist1 (optional).
methodnames2 – The names of the methods used, corresponding to the alignments in MSAlist2 (optional).
specorder – The desired order of species for downstream output.

Returns:

A MultiMSA object.
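A `specfunc1`-style parsing function might look like the sketch below. The label format (`species|identifier`) and the name `species_from_label` are hypothetical, and the sketch assumes the function receives one sequence label as a string; the source does not state which label attribute MultiMSA passes in.

```python
def species_from_label(label):
    """Return the species part of a label such as 'human|ENSP0001' (assumed format)."""
    return label.split("|", 1)[0].strip()


# Hypothetical labels as they might appear in an alignment file.
labels = ["human|ENSP0001", "mouse|ENSMUSP0002", "chimp|ENSPTRP0003"]
species_order = [species_from_label(label) for label in labels]
```

A function like this would take the place of an explicit `MSAspec1` list when species names can be derived from the labels themselves.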

Mosaic class

class mosaic.Mosaic(multiMSA, ref, useonlyspec=None, speccutoffs=None, edgefunc='perID', optrule='pairwise', ignoregaps=False, customoptfunc=None, AA=True, scoremat=None, stretcher_gapopen=8, stretcher_gapextend=1, similaritythresh=-1000000.0)

Integrates sets of multiple sequence alignments.

Parameters:


multiMSA – A MultiMSA object.
ref – The species to use as the reference to anchor the sequence cluster.
useonlyspec – If specified, use only the supplied subset of species.
speccutoffs – A dictionary of species-specific cutoffs.
edgefunc – The function to calculate similarities between species. Can be ‘perID’, ‘bitscore’, or a user-specified function of the form func(seq1, seq2, **kwargs).
optrule – The rule for optimization (default: ‘pairwise’). Can be ‘pairwise’, ‘toref’, or a user-specified function of the form func(edgeweightmatrix, **kwargs).
ignoregaps – Whether to ignore gaps in alignment scoring.
AA – True if the primary MSA set is amino acid, False otherwise.
customoptfunc – Custom optimization function.
scoremat – Custom scoring matrix for ‘bitscore’.
stretcher_gapopen – Gap opening penalty for global pairwise sequence alignment.
stretcher_gapextend – Gap extension penalty for global pairwise sequence alignment.

Returns:

Mosaic object
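As a sketch of a user-specified `edgefunc` with the documented form `func(seq1, seq2, **kwargs)`, the function below scores two sequences by the Jaccard index of their k-mer sets. The name `kmer_jaccard` and the `k` keyword are assumptions for illustration; this measure is not one of the built-in options, and how Mosaic forwards keyword arguments to the function is not specified in this reference.

```python
def kmer_jaccard(seq1, seq2, **kwargs):
    """Similarity in [0, 1]: shared k-mers divided by all distinct k-mers."""
    k = kwargs.get("k", 3)  # hypothetical keyword, defaults to 3-mers

    def kmers(seq):
        seq = seq.replace("-", "")  # ignore gap characters if present
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    a, b = kmers(seq1), kmers(seq2)
    if not a and not b:
        return 0.0  # both sequences shorter than k: nothing to compare
    return len(a & b) / len(a | b)


same = kmer_jaccard("ACGT", "ACGT")
partial = kmer_jaccard("ACGTA", "ACGTT", k=3)
```

Any function with this shape that returns a larger number for more similar sequences could, in principle, be supplied in place of ‘perID’ or ‘bitscore’.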

AAtoDNA(f_AA_aligned, f_DNA_unaligned, f_DNA_out)

Create a DNA multiple sequence alignment from an amino acid multiple sequence alignment.

Parameters:
f_AA_aligned – The file containing aligned amino acid sequences.
f_DNA_unaligned – The file containing unaligned DNA sequences.
f_DNA_out – The desired output DNA filename.

Note

This function requires pal2nal.

align(filename1, filename2=None, AAtoDNA=True)

Align orthologous sequences.

Parameters:
filename1 – Output filename for primary alignments.
filename2 – Output filename for secondary alignments.
AAtoDNA – Specifies that secondary sequences are DNA and should be aligned based on the AA alignment.

alignfunc(f_in, f_out, c=5, ir=500, **kwargs)

Create a multiple sequence alignment from unaligned sequences.

Parameters:
f_in – The file of unaligned sequences.
f_out – The desired output filename.
c – Specifies the -c flag to msaprobs.
ir – Specifies the -ir flag to msaprobs.

Note

This function requires msaprobs.

calc_sim_mat_pairwise()

Internal function: for pairwise optimization, calculate the similarity matrix that will define the cluster of sequences.

Parameters:
self.multiMSA.MSAdict1 – A dictionary of dictionaries of sequences.
self.multiMSA.methodnames1 – A list of the names of the methods producing each MSA.
self.edgefunc – The function used to calculate the similarity between two sequences.
Returns:	A (nmethods*nspec) x (nmethods*nspec) matrix of edge weights, stored to `self.edgeweights_pairwise`

Note

Sequences are blocked by method (according to self.methodnames).

These blocks are ordered by the specified species order (self.allspecs).
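The blocked layout described in this note can be pictured with a small index helper. This is only a sketch of one layout consistent with the note (method-major blocks, species ordered within each block); the helper name `block_index` and the example method and species names are made up for illustration.

```python
def block_index(method_idx, species_idx, nspec):
    """Row/column position of (method, species) in a method-blocked matrix."""
    return method_idx * nspec + species_idx


methods = ["methodA", "methodB"]        # hypothetical method names
species = ["human", "mouse", "chimp"]   # hypothetical species order

# Flattened axis labels: one block per method, species ordered inside each block.
layout = [(m, s) for m in methods for s in species]

# Position of methodB / chimp on either axis of the matrix.
position = block_index(1, 2, len(species))
```

Under this layout, the matrix has `len(methods) * len(species)` rows and columns, and `layout[position]` names the sequence at a given row or column.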

calc_sim_mat_toref()

Internal function: for “to reference” optimization, calculate the similarity vector that will relate each sequence to the reference.

Parameters:
self.multiMSA.MSAdict1 – A dictionary of dictionaries of sequences.
self.multiMSA.methodnames1 – A list of the names of the methods producing each MSA.
self.edgefunc – The function used to calculate the similarity between two sequences.
Returns:	A (nspec) x (nspec) matrix of edge weights, stored to `self.edgeweights_toref`

Note

This is the stage at which filtering takes place. Any sequence below the similarity cutoff is not assigned to the self.edgeweights_toref matrix.

getbitscore(seq1, seq2)

Calculate a bitscore between two (unaligned) sequences.

Parameters:
seq1 – Sequence 1
seq2 – Sequence 2
Variables:
self.AA – True if the sequence alphabet is amino acids, False otherwise.
self.stretcher_gapopen – The penalty for opening a gap in the alignment.
self.stretcher_gapextend – The penalty for extending a gap in the alignment.
self.scoremat – The score matrix to use for scoring the alignment.
Returns:	A bit score for the aligned sequences.

Note

pandas is required to manage scoring matrices.

getperID(seq1, seq2)

Calculate percent identity between two (unaligned) sequences.

Parameters:
seq1 – Sequence 1
seq2 – Sequence 2
Variables:
self.AA – True if the sequence alphabet is amino acids, False otherwise.
self.ignoregaps – True if gaps in the first (reference) sequence are to be ignored.
Returns:	A percent identity for the aligned sequences.
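The scoring step can be sketched as follows for two sequences that have already been aligned. This is an illustration under stated assumptions, not the `getperID` implementation: the function name `percent_identity`, the choice of denominator (all compared columns), and the handling of columns that are gaps in both sequences are assumptions. The `ignoregaps` behaviour follows the description above: columns with a gap in the first (reference) sequence are skipped.

```python
def percent_identity(aln1, aln2, ignoregaps=False):
    """Percent of compared columns in which two aligned sequences agree."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aln1, aln2):
        if a == "-" and b == "-":
            continue  # assumption: shared gap columns carry no information
        if ignoregaps and a == "-":
            continue  # skip gaps in the first (reference) sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


with_gaps = percent_identity("AC-T", "ACGT")
without_ref_gaps = percent_identity("AC-T", "ACGT", ignoregaps=True)
```

In this sketch the gap column counts as a mismatch unless `ignoregaps` is set and the gap lies in the reference.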

opt_cluster_toref()

Internal function: takes the sequence from each species with the highest similarity to the reference.
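A minimal sketch of this selection rule, combined with the cutoff filtering described for `calc_sim_mat_toref`, is shown below. The data layout (a dictionary mapping species to per-method similarities to the reference), the function name `pick_closest_to_ref`, and the example values are assumptions for illustration and do not reflect Mosaic's internal data structures.

```python
def pick_closest_to_ref(sims, cutoffs=None, default_cutoff=float("-inf")):
    """For each species, pick the method whose sequence is most similar to the reference.

    sims: {species: {method: similarity to reference}}
    cutoffs: optional {species: minimum similarity}; candidates below it are dropped.
    """
    cutoffs = cutoffs or {}
    chosen = {}
    for sp, by_method in sims.items():
        cutoff = cutoffs.get(sp, default_cutoff)
        passing = {m: s for m, s in by_method.items() if s >= cutoff}
        if passing:  # a species with no passing candidate is left out
            chosen[sp] = max(passing, key=passing.get)
    return chosen


# Hypothetical similarities to the reference species.
sims = {
    "mouse": {"methodA": 80.0, "methodB": 85.0},
    "fly": {"methodA": 30.0, "methodB": 25.0},
}
unfiltered = pick_closest_to_ref(sims)
filtered = pick_closest_to_ref(sims, cutoffs={"fly": 40.0})
```

With the species-specific cutoff in place, the fly candidates fall below the threshold and the species is omitted from the result.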

optimize_cluster()

Internal function: optimize the sequence cluster.

If self.optrule is ‘pairwise’, optimize the cluster by picking the sequence for each species that optimizes the pairwise distance to the current best sequences. This is repeated cyclically until convergence is reached.
If self.optrule is ‘toref’, pick the sequence from each species that is most similar to the reference sequence.
If self.optrule is ‘custom’, apply the function defined in self.customoptfunc to the self.edgeweights_toref and/or self.edgeweights_pairwise matrices.
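The ‘pairwise’ rule can be sketched as a cyclic, one-species-at-a-time update. The version below is a deterministic greedy variant written for illustration; it is not Mosaic's implementation (the documentation for `optloop_pairwise` mentions Gibbs sampling, which this sketch does not use). The function name `optimize_pairwise`, the data layout, and the toy similarity values are assumptions, and the similarity function is assumed to be symmetric.

```python
def optimize_pairwise(candidates, sim, max_rounds=100):
    """Cyclically pick, per species, the candidate most similar to the other picks.

    candidates: {species: [candidate, ...]}; sim(a, b) -> float (symmetric).
    """
    species = list(candidates)
    # Start from the first candidate of every species.
    current = {sp: candidates[sp][0] for sp in species}
    for _ in range(max_rounds):
        changed = False
        for sp in species:
            others = [current[o] for o in species if o != sp]

            def total(c):
                return sum(sim(c, o) for o in others)

            best = max(candidates[sp], key=total)
            # Switch only on strict improvement so the loop cannot cycle forever.
            if total(best) > total(current[sp]):
                current[sp] = best
                changed = True
        if not changed:
            break  # converged: no species changed its pick in a full pass
    return current


# Toy similarities between hypothetical candidate sequences.
scores = {
    frozenset(("h1", "m1")): 0.2, frozenset(("h1", "m2")): 0.9,
    frozenset(("h1", "f1")): 0.5, frozenset(("h1", "f2")): 0.6,
    frozenset(("m1", "f1")): 0.9, frozenset(("m1", "f2")): 0.1,
    frozenset(("m2", "f1")): 0.3, frozenset(("m2", "f2")): 0.8,
}


def sim(a, b):
    return scores[frozenset((a, b))]


best = optimize_pairwise(
    {"human": ["h1"], "mouse": ["m1", "m2"], "fly": ["f1", "f2"]}, sim
)
```

A greedy update of this kind can settle in a local optimum that depends on the starting picks, which is one reason a stochastic scheme may be preferred in practice.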

optloop_pairwise()

Internal function: optimize the cluster using pairwise similarities and Gibbs sampling.

write_unaligned(filename1, filename2=None, inclspec=False, inclmet=False, specorder=None, labelfunc=None)

Write (unaligned) optimal sequences to a file.

Parameters:


filename1 – The file to which to write the primary MSAlist
filename2 – The file to which to write the secondary MSAlist
inclspec – Whether to include the species name in the sequence labels
inclmet – Whether to include the method name in the sequence labels
specorder – If specified, a different order for species in the output
labelfunc – If specified, a function to output sequence labels. Should be of the form labelfunc(seq.name, species, method)
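A `labelfunc` with the documented form `labelfunc(seq.name, species, method)` could be as simple as the sketch below. The output format and the name `make_label` are illustrative assumptions; the source does not prescribe how the three pieces should be combined.

```python
def make_label(name, species, method):
    """Build an output sequence label from the sequence name, species and method."""
    return f"{species}|{method}|{name}"


# Hypothetical inputs for one output sequence.
example_label = make_label("ENSP0001", "human", "methodA")
```

Putting the species first keeps labels easy to sort and to parse back with a simple split.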

MOSAIC 1.0 documentation