parser - operations on modX peptide sequences
modX is a simple extension of the IUPAC one-letter peptide sequence representation.
The labels (or codes) for the 20 standard amino acids in modX are the same as in IUPAC nomenclature. A label for a modified amino acid has the general form ‘modX’, i.e.:
- it starts with an arbitrary number of lower-case symbols or numbers (a modification);
- it ends with a single upper-case symbol (an amino acid residue).
Valid examples of modX amino acid labels are: ‘G’, ‘pS’, ‘oxM’. This rule keeps sequences both human-readable and easy to parse.
Besides the sequence of amino acid residues, modX has a rule for specifying terminal modifications of a polypeptide. The label of an N-terminal modification ends with a hyphen, and the label of a C-terminal modification starts with one. The default N-terminal amine group and C-terminal carboxyl group need not be shown explicitly.
Therefore, the valid examples of peptide sequences in modX are: “GAGA”, “H-PEPTIDE”, “TEST-NH2”.
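The label rules above can be written down as regular expressions. The following is only a minimal sketch of that grammar: the pattern names and the helper functions looks_like_modx() and looks_like_term_mod() are hypothetical and are not part of pyteomics, whose own is_modX() and is_term_mod() may behave differently in edge cases.

```python
import re

# Assumed patterns derived from the textual rules above, not from the library.
MODX_LABEL = re.compile(r'^([a-z0-9]*)([A-Z])$')  # modification + one residue
NTERM_LABEL = re.compile(r'^[^-]+-$')             # e.g. 'H-'
CTERM_LABEL = re.compile(r'^-[^-]+$')             # e.g. '-NH2'


def looks_like_modx(label):
    """True if label is a lower-case/digit prefix followed by one upper-case letter."""
    return MODX_LABEL.match(label) is not None


def looks_like_term_mod(label):
    """True if label ends with a hyphen (N-terminal) or starts with one (C-terminal)."""
    return bool(NTERM_LABEL.match(label) or CTERM_LABEL.match(label))
```

Under these assumptions, ‘G’, ‘pS’ and ‘oxM’ are accepted as residue labels, while ‘H-’ and ‘-NH2’ are recognized as terminal modifications.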
Operations on polypeptide sequences
parse() - convert a sequence string into a list of amino acid residues.
tostring() - convert a parsed sequence to a string.
amino_acid_composition() - get numbers of each amino acid residue in a peptide.
cleave() - cleave a polypeptide using a given rule of enzymatic digestion.
isoforms() - generate all unique modified peptide sequences given the initial sequence and modifications.
Auxiliary commands
length() - calculate the number of amino acid residues in a polypeptide.
valid() - check if a sequence can be parsed successfully.
fast_valid() - check if a sequence consists of known one-letter codes.
is_modX() - check if supplied code corresponds to a modX label.
is_term_mod() - check if supplied code corresponds to a terminal modification.
Data
std_amino_acids - a list of the 20 standard amino acid IUPAC codes.
std_nterm - the standard N-terminal modification (the unmodified group is a single atom of hydrogen).
std_cterm - the standard C-terminal modification (the unmodified group is hydroxyl).
std_labels - a list of all standard sequence elements, amino acid residues and terminal modifications.
expasy_rules - a dict with the regular expressions of cleavage rules for the most popular proteolytic enzymes.
- pyteomics.parser.amino_acid_composition(sequence, show_unmodified_termini=False, term_aa=False, allow_unknown_modifications=False, **kwargs)
Calculate amino acid composition of a polypeptide.
Parameters : sequence : str or list
The sequence of a polypeptide or a list with a parsed sequence.
show_unmodified_termini : bool, optional
If True then the unmodified N- and C-terminus are explicitly shown in the returned dict. Default value is False.
term_aa : bool, optional
If True then the terminal amino acid residues are artificially modified with nterm or cterm modification. Default value is False.
allow_unknown_modifications : bool, optional
If True then do not raise an exception when an unknown modification of a known amino acid residue is found in the sequence. Default value is False.
labels : list, optional
A list of allowed labels for amino acids and terminal modifications (default is the 20 standard amino acids, N-terminal ‘H-’ and C-terminal ‘-OH’).
Returns : out : dict
A dictionary of amino acid composition.
Examples
>>> amino_acid_composition('PEPTIDE')
{'I': 1, 'P': 2, 'E': 2, 'T': 1, 'D': 1}
>>> amino_acid_composition('PEPTDE', term_aa=True)
{'ctermE': 1, 'E': 1, 'D': 1, 'P': 1, 'T': 1, 'ntermP': 1}
>>> amino_acid_composition('PEPpTIDE', labels=std_labels+['pT'])
{'I': 1, 'P': 2, 'E': 2, 'D': 1, 'pT': 1}
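To make the effect of term_aa concrete, the counting step can be sketched with collections.Counter applied to an already parsed list of labels. The function composition_sketch() is hypothetical and is not the library's implementation. It assumes that the ‘nterm’ and ‘cterm’ prefixes are applied exactly as in the example above, and that a one-residue sequence receives only the ‘nterm’ prefix (the source does not describe that case).

```python
from collections import Counter


def composition_sketch(parsed, term_aa=False):
    """Count labels in a parsed sequence, optionally marking terminal residues."""
    labels = list(parsed)
    if term_aa and labels:
        # Mark the first residue as N-terminal.
        labels[0] = 'nterm' + labels[0]
        # Assumption: mark the last residue only if it is a different residue.
        if len(labels) > 1:
            labels[-1] = 'cterm' + labels[-1]
    return dict(Counter(labels))
```

Called with list('PEPTDE') and term_aa=True, this sketch yields the same counts as the second example above.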
- pyteomics.parser.cleave(*args, **kwargs)
Cleaves a polypeptide sequence using a given rule.
Parameters : sequence : str
The sequence of a polypeptide.
rule : str
A string with a regular expression describing the C-terminal site of cleavage.
missed_cleavages : int, optional
The maximal number of allowed missed cleavages. Defaults to 0.
overlap : bool, optional
Set this to True if the cleavage rule is complex and it is important to get all possible peptides when the matching subsequences overlap (e.g. ‘XX’ produces overlapping matches when the sequence contains ‘XXX’). Default is False. Use with caution: enabling this results in exponentially growing execution time.
Returns : out : set
A set of unique (!) peptides.
Examples
>>> cleave('AKAKBK', expasy_rules['trypsin'], 0)
set(['AK', 'BK'])
>>> cleave('AKAKBKCK', expasy_rules['trypsin'], 2)
set(['CK', 'AKBK', 'BKCK', 'AKAK', 'AKBKCK', 'AK', 'AKAKBK', 'BK'])
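The interplay of rule and missed_cleavages can be illustrated with a short standalone sketch. It is not the library's code: cleave_sketch() is a hypothetical helper, TRYPSIN_LIKE is a deliberately simplified stand-in for expasy_rules['trypsin'], and the overlap option is not handled.

```python
import re
from collections import deque

# Simplified assumption: cut after K or R unless followed by P.
TRYPSIN_LIKE = r'[KR](?!P)'


def cleave_sketch(sequence, rule, missed_cleavages=0):
    """Return the set of peptides produced by cutting after each match of rule."""
    peptides = set()
    # Cleavage sites are the end positions of the regex matches.
    sites = [m.end() for m in re.finditer(rule, sequence)]
    # The end of the sequence always closes the last peptide.
    if not sites or sites[-1] != len(sequence):
        sites.append(len(sequence))
    # Remember the start positions of the last (missed_cleavages + 1) fragments.
    starts = deque([0], maxlen=missed_cleavages + 1)
    for end in sites:
        for start in starts:
            if start < end:  # skip empty peptides
                peptides.add(sequence[start:end])
        starts.append(end)
    return peptides
```

With the simplified rule, cleave_sketch('AKAKBK', TRYPSIN_LIKE, 0) gives the two peptides of the first example above, and allowing two missed cleavages on 'AKAKBKCK' gives the eight peptides of the second.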
- pyteomics.parser.expasy_rules
This dict contains regular expressions for cleavage rules of the most popular proteolytic enzymes. The rules were taken from the PeptideCutter tool at Expasy.
- pyteomics.parser.fast_valid(sequence, labels=['Q', 'W', 'E', 'R', 'T', 'Y', 'I', 'P', 'A', 'S', 'D', 'F', 'G', 'H', 'K', 'L', 'C', 'V', 'N', 'M', 'H-', '-OH'])
Iterate over sequence and check if all items are in labels. With strings, this only works as expected on sequences without modifications or terminal groups.
- pyteomics.parser.is_modX(label)
Check if label is a valid ‘modX’ label.
Parameters : label : str
Returns : out : bool
- pyteomics.parser.is_term_mod(label)
Check if label corresponds to a terminal modification.
Parameters : label : str
Returns : out : bool
- pyteomics.parser.isoforms(sequence, **kwargs)
Apply variable and fixed modifications to the polypeptide and yield the unique modified sequences.
Parameters : sequence : str
Peptide sequence to modify.
variable_mods : dict, optional
A dict of variable modifications in the following format: {'label1': ['X', 'Y', ...], 'label2': ['X', 'A', 'B', ...]}
Note: several variable modifications can occur on amino acids of the same type, but in the output each amino acid residue will be modified at most once (apart from terminal modifications).
fixed_mods : dict, optional
A dict of fixed modifications in the same format.
Note: if a residue is affected by a fixed modification, no variable modifications will be applied to it (apart from terminal modifications).
labels : list, optional
A list of amino acid labels containing all the labels present in sequence. Modified entries will be added automatically. Defaults to std_labels.
override : bool, optional
Defines how to handle the residues that are modified in the input. False means that they will be preserved (default). True means they will be treated as unmodified.
Note: If True, then supplying fixed mods is pointless.
show_unmodified_termini : bool, optional
If True then the unmodified N- and C-termini are explicitly shown in the returned sequences. Default value is False.
Returns : out : iterator over strings
All possible unique polypeptide sequences resulting from the specified modifications are yielded one by one.
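The combinatorial idea behind variable modifications can be sketched with itertools.product: every residue contributes its unmodified label plus one option per applicable modification, so each residue is modified at most once. The generator isoforms_sketch() is hypothetical and covers only the variable_mods case. Fixed modifications, terminal modifications, override and residues that are already modified in the input are left out.

```python
from itertools import product


def isoforms_sketch(residues, variable_mods):
    """Yield each sequence in which every residue carries at most one modification."""
    options = []
    for aa in residues:
        # The unmodified residue is always allowed.
        choices = [aa]
        # Add one choice per variable modification that targets this residue.
        for mod, targets in sorted(variable_mods.items()):
            if aa in targets:
                choices.append(mod + aa)
        options.append(choices)
    # The Cartesian product picks one choice per position.
    for combo in product(*options):
        yield ''.join(combo)
```

For example, list(isoforms_sketch('AST', {'p': ['S', 'T']})) contains four sequences: the unmodified peptide, the two singly phosphorylated forms and the doubly phosphorylated one.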
- pyteomics.parser.length(sequence, **kwargs)
Calculate the number of amino acid residues in a polypeptide written in modX notation.
Parameters : sequence : str or list or dict
A string with a polypeptide sequence, a list with a parsed sequence or a dict of amino acid composition.
labels : list, optional
A list of allowed labels for amino acids and terminal modifications (default is std_labels, the 20 standard amino acids, N-terminal H- and C-terminal -OH).
Examples
>>> length('PEPTIDE')
7
>>> length('H-PEPTIDE-OH')
7
- pyteomics.parser.parse(sequence, show_unmodified_termini=False, split=False, allow_unknown_modifications=False, **kwargs)
Parse a sequence string written in modX notation into a list of labels or (if split argument is True) into a list of tuples representing amino acid residues and their modifications.
Parameters : sequence : str
The sequence of a polypeptide.
show_unmodified_termini : bool, optional
If True then the unmodified N- and C-termini are explicitly shown in the returned list. Default value is False.
split : bool, optional
If True then the result will be a list of tuples with 1 to 4 elements: N-terminal modification, modification, residue, C-terminal modification. Only the elements that are present appear in each tuple. Default value is False.
allow_unknown_modifications : bool, optional
If True then do not raise an exception when an unknown modification of a known amino acid residue is found in the sequence. Default value is False.
labels : container, optional
A list (set, tuple, etc.) of allowed labels for amino acids, modifications and terminal modifications (default is the 20 standard amino acids, N-terminal ‘H-’ and C-terminal ‘-OH’).
New in ver. 1.2.2: separate labels for modifications (such as ‘p’ or ‘ox’) are now allowed.
Returns : out : list
List of labels or, if split is True, list of tuples with labels of modifications and amino acid residues.
Examples
>>> parse('PEPTIDE', split=True)
[('P',), ('E',), ('P',), ('T',), ('I',), ('D',), ('E',)]
>>> parse('H-PEPTIDE')
['P', 'E', 'P', 'T', 'I', 'D', 'E']
>>> parse('PEPTIDE', show_unmodified_termini=True)
['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']
>>> parse('TEpSToxM', labels=std_labels + ['pS', 'oxM'])
['T', 'E', 'pS', 'T', 'oxM']
>>> parse('zPEPzTIDzE', True, True, labels=std_labels+['z'])
[('H-', 'z', 'P'), ('E',), ('P',), ('z', 'T'), ('I',), ('D',), ('z', 'E', '-OH')]
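The tokenization that such parsing relies on can be approximated with two regular expressions. This is a rough sketch under stated assumptions, not the library's algorithm: parse_sketch() is a hypothetical function, it does not validate labels against a labels container, and it keeps terminal groups exactly as written instead of adding or hiding the standard ‘H-’ and ‘-OH’.

```python
import re

# Optional N-terminal group, one or more modX residues, optional C-terminal group.
_SEQUENCE = re.compile(r'^([^-]+-)?((?:[a-z0-9]*[A-Z])+)(-[^-]+)?$')
_RESIDUE = re.compile(r'([a-z0-9]*)([A-Z])')


def parse_sketch(sequence, split=False):
    """Split a modX string into labels, or into tuples if split is True."""
    match = _SEQUENCE.match(sequence)
    if match is None:
        raise ValueError('Not a modX sequence: {!r}'.format(sequence))
    nterm, body, cterm = match.groups()
    residues = _RESIDUE.findall(body)  # list of (modification, residue) pairs
    if split:
        groups = [(mod, aa) if mod else (aa,) for mod, aa in residues]
        # Terminal groups are attached to the first and last residue tuples.
        if nterm:
            groups[0] = (nterm,) + groups[0]
        if cterm:
            groups[-1] = groups[-1] + (cterm,)
        return groups
    flat = [mod + aa for mod, aa in residues]
    return ([nterm] if nterm else []) + flat + ([cterm] if cterm else [])
```

For instance, parse_sketch('TEpSToxM') returns the same labels as the fourth example above, whereas parse_sketch('H-PEPTIDE') keeps the explicit ‘H-’ that the library hides by default.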
- pyteomics.parser.std_amino_acids
modX labels for the 20 standard amino acids.
- pyteomics.parser.std_cterm
modX label for the unmodified C-terminus.
- pyteomics.parser.std_labels
modX labels for the standard amino acids and unmodified termini.
- pyteomics.parser.std_nterm
modX label for the unmodified N-terminus.
- pyteomics.parser.tostring(parsed_sequence, show_unmodified_termini=True)
Create a string from a parsed sequence.
Parameters : parsed_sequence : iterable
Expected to be in one of the formats returned by parse(), i.e. list of labels or list of tuples.
show_unmodified_termini : bool, optional
Defines the behavior towards standard terminal groups in the input. True means that they will be preserved if present (default). False means that they will be removed. Standard terminal groups will not be added if not shown in parsed_sequence, regardless of this setting.
Returns : sequence : str
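The behavior described for show_unmodified_termini can be sketched as follows. The function tostring_sketch() is hypothetical and is not the library's implementation. It assumes that the standard terminal labels are ‘H-’ and ‘-OH’, as stated for std_labels above.

```python
def tostring_sketch(parsed, show_unmodified_termini=True):
    """Join a parsed sequence (labels or tuples of labels) back into a string."""
    labels = []
    for item in parsed:
        # A split item is a tuple of labels; a plain item is a single label.
        labels.extend(item if isinstance(item, tuple) else (item,))
    if not show_unmodified_termini:
        # Drop standard terminal groups if present; never add missing ones.
        if labels and labels[0] == 'H-':
            labels = labels[1:]
        if labels and labels[-1] == '-OH':
            labels = labels[:-1]
    return ''.join(labels)
```

Both input formats give the same kind of result: a list of labels such as ['H-', 'P', 'E', '-OH'] and a list of tuples such as [('H-', 'z', 'P'), ('E',)] are each flattened and concatenated.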