Pyteomics documentation v3.4.1

parser - operations on modX peptide sequences

modX is a simple extension of the IUPAC one-letter peptide sequence representation.

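The labels (or codes) for the 20 standard amino acids in modX are the same as in IUPAC nomenclature. A label for a modified amino acid has the general form ‘modX’, i.e. a lowercase modification prefix (‘mod’) followed by a single uppercase letter for the amino acid residue (‘X’).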

Valid examples of modX amino acid labels are: ‘G’, ‘pS’, ‘oxM’. This rule keeps labels both human-readable and easy to parse.
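The label rule can be illustrated with a short regular expression. This is a minimal sketch and not the library's own implementation: the helper name `is_modx_label` is hypothetical, and allowing digits in the modification prefix is an assumption.

```python
import re

# Assumed pattern: optional lowercase/digit prefix, then exactly one uppercase letter.
_MODX_LABEL = re.compile(r'[a-z0-9]*[A-Z]')


def is_modx_label(label):
    """Return True if `label` looks like a modX amino acid label (hypothetical helper)."""
    return _MODX_LABEL.fullmatch(label) is not None


for candidate in ('G', 'pS', 'oxM', 'ps', 'PS'):
    print(candidate, is_modx_label(candidate))
```

The first three candidates are the labels quoted above; the last two are rejected because they do not end with exactly one uppercase letter.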

Besides the sequence of amino acid residues, modX has a rule for specifying terminal modifications of a polypeptide. The label of an N-terminal group ends with a hyphen (e.g. ‘H-’), and the label of a C-terminal group starts with one (e.g. ‘-OH’). The default N-terminal amine group and C-terminal carboxyl group may be omitted.

Therefore, valid examples of peptide sequences in modX are: “GAGA”, “H-PEPTIDE-OH”, “H-TEST-NH2”. It is not recommended to specify only one terminal group.

Operations on polypeptide sequences

parse() - convert a sequence string into a list of amino acid residues.

tostring() - convert a parsed sequence to a string.

amino_acid_composition() - get numbers of each amino acid residue in a peptide.

cleave() - cleave a polypeptide using a given rule of enzymatic digestion.

isoforms() - generate all unique modified peptide sequences given the initial sequence and modifications.

Auxiliary commands

coverage() - calculate the sequence coverage of a protein by peptides.

length() - calculate the number of amino acid residues in a polypeptide.

valid() - check if a sequence can be parsed successfully.

fast_valid() - check if a sequence consists only of known one-letter codes.

is_modX() - check if supplied code corresponds to a modX label.

is_term_mod() - check if supplied code corresponds to a terminal modification.

Data

std_amino_acids - a list of the 20 standard amino acid IUPAC codes.

std_nterm - the standard N-terminal modification (the unmodified group is a single atom of hydrogen).

std_cterm - the standard C-terminal modification (the unmodified group is hydroxyl).

std_labels - a list of all standard sequence elements, amino acid residues and terminal modifications.

expasy_rules - a dict with the regular expressions of cleavage rules for the most popular proteolytic enzymes.


pyteomics.parser.amino_acid_composition(sequence, show_unmodified_termini=False, term_aa=False, allow_unknown_modifications=False, **kwargs)[source]

Calculate amino acid composition of a polypeptide.

Parameters:

sequence : str or list

The sequence of a polypeptide or a list with a parsed sequence.

show_unmodified_termini : bool, optional

If True then the unmodified N- and C-terminus are explicitly shown in the returned dict. Default value is False.

term_aa : bool, optional

If True then the terminal amino acid residues are artificially modified with nterm or cterm modification. Default value is False.

allow_unknown_modifications : bool, optional

If True then do not raise an exception when an unknown modification of a known amino acid residue is found in the sequence. Default value is False.

labels : list, optional

A list of allowed labels for amino acids and terminal modifications.

Returns:

out : dict

A dictionary of amino acid composition.

Examples

>>> amino_acid_composition('PEPTIDE') == {'I': 1, 'P': 2, 'E': 2, 'T': 1, 'D': 1}
True
>>> amino_acid_composition('PEPTDE', term_aa=True) == {'ctermE': 1, 'E': 1, 'D': 1, 'P': 1, 'T': 1, 'ntermP': 1}
True
>>> amino_acid_composition('PEPpTIDE', labels=std_labels+['pT']) == {'I': 1, 'P': 2, 'E': 2, 'D': 1, 'pT': 1}
True
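
To make the effect of term_aa concrete, the counting step can be sketched with collections.Counter. This is an illustration only and not the library's code: `composition_sketch` is a hypothetical helper that takes an already parsed list of labels (or a plain one-letter string), and its handling of one-residue peptides is an assumption.

```python
from collections import Counter


def composition_sketch(labels, term_aa=False):
    """Count labels; optionally mark the first and last residue as terminal."""
    labels = list(labels)
    if term_aa and labels:
        labels[0] = 'nterm' + labels[0]
        # Assumption: a one-residue peptide only receives the 'nterm' prefix.
        if len(labels) > 1:
            labels[-1] = 'cterm' + labels[-1]
    return dict(Counter(labels))


print(composition_sketch('PEPTIDE'))
print(composition_sketch('PEPTDE', term_aa=True))
```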
pyteomics.parser.cleave(*args, **kwargs)[source]

Cleaves a polypeptide sequence using a given rule.

Parameters:

sequence : str

The sequence of a polypeptide.

Note

The sequence is expected to be in one-letter uppercase notation. Otherwise, some of the cleavage rules in expasy_rules will not work as expected.

rule : str or compiled regex

A regular expression describing the site of cleavage. It is recommended to design the regex so that it matches only the residue whose C-terminal bond is to be cleaved. All additional requirements should be specified using lookaround assertions. expasy_rules contains cleavage rules for popular cleavage agents.

missed_cleavages : int, optional

Maximum number of allowed missed cleavages. Defaults to 0.

min_length : int or None, optional

Minimum peptide length. Defaults to None.

Note

This checks the string length, which is only correct for one-letter notation and not for full modX. If you apply cleave() to modX sequences, filter the resulting peptides with length() yourself.

Returns:

out : set

A set of unique (!) peptides.

Examples

>>> cleave('AKAKBK', expasy_rules['trypsin'], 0) == {'AK', 'BK'}
True
>>> cleave('GKGKYKCK', expasy_rules['trypsin'], 2) == {'CK', 'GKYK', 'YKCK', 'GKGK', 'GKYKCK', 'GK', 'GKGKYK', 'YK'}
True
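
The way a cleavage regex and missed_cleavages interact can be sketched in plain Python. The code below is an assumption-laden illustration, not the library's implementation: `cleave_sketch` is a hypothetical helper, and `TRYPSIN_LIKE` is a simplified rule written for this example, not the entry stored in expasy_rules.

```python
import re

# Simplified, hypothetical rule: cut after K or R unless the next residue is P.
TRYPSIN_LIKE = r'[KR](?!P)'


def cleave_sketch(sequence, rule, missed_cleavages=0, min_length=None):
    """Return the set of peptides obtained by cutting after every regex match."""
    # Cut positions: sequence start, the end of every match, sequence end.
    sites = [0] + [m.end() for m in re.finditer(rule, sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))
    peptides = set()
    for i in range(len(sites) - 1):
        # Joining up to `missed_cleavages` extra fragments skips that many sites.
        last = min(i + missed_cleavages + 2, len(sites))
        for j in range(i + 1, last):
            peptide = sequence[sites[i]:sites[j]]
            if peptide and (min_length is None or len(peptide) >= min_length):
                peptides.add(peptide)
    return peptides


print(sorted(cleave_sketch('GKGKYKCK', TRYPSIN_LIKE, missed_cleavages=2)))
```

Because the result is a set, repeated fragments such as the two ‘GK’ pieces collapse into a single entry.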
pyteomics.parser.coverage(protein, peptides)[source]

Calculate how much of protein is covered by peptides. Peptides can overlap. If a peptide is found multiple times in protein, it contributes more to the overall coverage.

Requires numpy.

Note

Modifications and terminal groups are discarded.

Parameters:

protein : str

A protein sequence.

peptides : iterable

An iterable of peptide sequences.

Returns:

out : float

The sequence coverage, between 0 and 1.

Examples

>>> coverage('PEPTIDES'*100, ['PEP', 'EPT'])
0.5
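
The coverage value in the example can be reproduced with a boolean mask over the protein. This is a sketch without numpy and not the library's code: `coverage_sketch` is a hypothetical helper that assumes plain one-letter sequences with no modifications or terminal groups.

```python
import re


def coverage_sketch(protein, peptides):
    """Fraction of protein positions covered by at least one peptide occurrence."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        # A lookahead finds every occurrence, including overlapping ones.
        for match in re.finditer('(?=' + re.escape(peptide) + ')', protein):
            start = match.start()
            covered[start:start + len(peptide)] = [True] * len(peptide)
    return sum(covered) / len(protein)


print(coverage_sketch('PEPTIDES' * 100, ['PEP', 'EPT']))
```

In each ‘PEPTIDES’ repeat, ‘PEP’ and ‘EPT’ together cover the first four of eight residues, which gives the value 0.5.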
pyteomics.parser.expasy_rules

This dict contains regular expressions for cleavage rules of the most popular proteolytic enzymes. The rules were taken from the PeptideCutter tool at Expasy.

pyteomics.parser.fast_valid(sequence, labels=set(['A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'M', 'L', 'N', 'Q', 'P', 'S', '-OH', 'T', 'W', 'V', 'Y', 'R', 'H-']))[source]

Iterate over sequence and check if all items are in labels. With strings, this only works as expected on sequences without modifications or terminal groups.

Parameters:

sequence : iterable (expectedly, str)

The sequence to check. A valid sequence would be a string of labels, all present in labels.

labels : iterable, optional

An iterable of known labels.

Returns:

out : bool
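
The check described above amounts to a membership test per item, which also shows why modified sequences fail when passed as strings. This is a minimal sketch, not the library's code; `fast_valid_sketch` and `STD_LABELS` are hypothetical names, with the label set copied from the default shown in the signature.

```python
# The 20 standard one-letter codes plus the standard terminal groups.
STD_LABELS = set('ACDEFGHIKLMNPQRSTVWY') | {'H-', '-OH'}


def fast_valid_sketch(sequence, labels=STD_LABELS):
    """Return True if every item of `sequence` is a known label."""
    return all(item in labels for item in sequence)


print(fast_valid_sketch('PEPTIDE'))
# Iterating over a string yields single characters, so 'p' is tested on its own.
print(fast_valid_sketch('PEPpTIDE', STD_LABELS | {'pT'}))
# A parsed list keeps 'pT' as one item.
print(fast_valid_sketch(['P', 'E', 'P', 'pT', 'I', 'D', 'E'], STD_LABELS | {'pT'}))
```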

pyteomics.parser.is_modX(label)[source]

Check if label is a valid ‘modX’ label.

Parameters:

label : str

Returns:

out : bool

pyteomics.parser.is_term_mod(label)[source]

Check if label corresponds to a terminal modification.

Parameters:

label : str

Returns:

out : bool

pyteomics.parser.isoforms(sequence, **kwargs)[source]

Apply variable and fixed modifications to the polypeptide and yield the unique modified sequences.

Parameters:

sequence : str

Peptide sequence to modify.

variable_mods : dict, optional

A dict of variable modifications in the following format: {'label1': ['X', 'Y', ...], 'label2': ['X', 'A', 'B', ...]}

Keys in the dict are modification labels (terminal modifications allowed). Values are iterables of residue labels (one letter each) or True. If a value for a modification is True, it is applicable to any residue (useful for terminal modifications). You can use values such as ‘ntermX’ or ‘ctermY’ to specify that a modification only occurs when the residue is in the terminal position. This is not needed for terminal modifications.

Note

Several variable modifications can occur on amino acids of the same type, but in the output each amino acid residue will be modified at most once (apart from terminal modifications).

fixed_mods : dict, optional

A dict of fixed modifications in the same format.

Note: if a residue is affected by a fixed modification, no variable modifications will be applied to it (apart from terminal modifications).

labels : list, optional

A list of amino acid labels containing all the labels present in sequence. Modified entries will be added automatically. Defaults to std_labels. Not required since version 2.5.

max_mods : int or None, optional

Number of modifications that can occur simultaneously on a peptide, excluding fixed modifications. If None or if max_mods is greater than the number of modification sites, all possible isoforms are generated. Default is None.

override : bool, optional

Defines how to handle the residues that are modified in the input. False means that they will be preserved (default). True means they will be treated as unmodified.

show_unmodified_termini : bool, optional

If True then the unmodified N- and C-termini are explicitly shown in the returned sequences. Default value is False.

format : str, optional

If 'str' (default), an iterator over sequences is returned. If 'split', the iterator will yield results in the same format as parse() with the ‘split’ option, with unmodified terminal groups shown.

Returns:

out : iterator over strings or lists

All possible unique polypeptide sequences resulting from the specified modifications are yielded one by one.
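
The enumeration of variable modifications can be pictured as a Cartesian product of per-residue options. The sketch below illustrates that idea only and is not the library's algorithm: `isoforms_sketch` is a hypothetical helper that handles plain one-letter input and non-terminal modifications, and ignores fixed_mods, override and the ‘ntermX’/‘ctermY’ forms.

```python
from itertools import product


def isoforms_sketch(sequence, variable_mods, max_mods=None):
    """Yield modified sequences with at most one modification per residue."""
    options = []
    for residue in sequence:
        # Each residue may stay unmodified or take any applicable modification.
        choices = [residue]
        for mod, targets in variable_mods.items():
            if targets is True or residue in targets:
                choices.append(mod + residue)
        options.append(choices)
    for combination in product(*options):
        n_mods = sum(1 for label, residue in zip(combination, sequence)
                     if label != residue)
        if max_mods is None or n_mods <= max_mods:
            yield ''.join(combination)


for isoform in isoforms_sketch('PSM', {'p': ['S'], 'ox': ['M']}):
    print(isoform)
```

With max_mods=1 the doubly modified form is filtered out, which mirrors the description of max_mods above.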

pyteomics.parser.length(sequence, **kwargs)[source]

Calculate the number of amino acid residues in a polypeptide written in modX notation.

Parameters:

sequence : str or list or dict

A string with a polypeptide sequence, a list with a parsed sequence or a dict of amino acid composition.

labels : list, optional

A list of allowed labels for amino acids and terminal modifications.

Examples

>>> length('PEPTIDE')
7
>>> length('H-PEPTIDE-OH')
7
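
For the list and dict input forms, the residue count reduces to skipping terminal labels. The following is a sketch under that assumption and not the library's code; all three helper names are hypothetical, and a label is treated as terminal when it starts or ends with a hyphen.

```python
def is_terminal(label):
    """Terminal groups start with a hyphen (C-terminal) or end with one (N-terminal)."""
    return label.startswith('-') or label.endswith('-')


def length_from_parsed(parsed):
    """Count residue labels in a flat list such as ['H-', 'P', 'E', '-OH']."""
    return sum(1 for label in parsed if not is_terminal(label))


def length_from_composition(composition):
    """Sum residue counts in a dict such as {'H-': 1, 'P': 2, 'E': 1}."""
    return sum(count for label, count in composition.items()
               if not is_terminal(label))


print(length_from_parsed(['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']))
print(length_from_composition({'H-': 1, 'P': 2, 'E': 2, 'T': 1, 'I': 1, 'D': 1, '-OH': 1}))
```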
pyteomics.parser.match_modX(label)[source]

Check if label is a valid ‘modX’ label.

Parameters:

label : str

Returns:

out : re.match or None

pyteomics.parser.num_sites(sequence, rule, **kwargs)[source]

Count the number of sites where sequence can be cleaved using the given rule (e.g. the number of missed cleavages in a peptide).

Parameters:

sequence : str

The sequence of a polypeptide.

rule : str or compiled regex

A regular expression describing the site of cleavage. It is recommended to design the regex so that it matches only the residue whose C-terminal bond is to be cleaved. All additional requirements should be specified using lookaround assertions.

labels : list, optional

A list of allowed labels for amino acids and terminal modifications.

Returns:

out : int

Number of cleavage sites.
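
Counting sites can be sketched as counting regex matches. The code below is an illustration and not the library's implementation: `num_sites_sketch` is a hypothetical helper, the rule is a simplified one written for this example, and ignoring a match at the very end of the sequence (where no bond follows) is an assumption.

```python
import re


def num_sites_sketch(sequence, rule):
    """Count matches of `rule` that are followed by at least one more residue."""
    return sum(1 for match in re.finditer(rule, sequence)
               if match.end() < len(sequence))


# Simplified, hypothetical rule: K or R not followed by P.
print(num_sites_sketch('GKGKYKCK', r'[KR](?!P)'))
```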

pyteomics.parser.parse(sequence, show_unmodified_termini=False, split=False, allow_unknown_modifications=False, **kwargs)[source]

Parse a sequence string written in modX notation into a list of labels or (if split argument is True) into a list of tuples representing amino acid residues and their modifications.

Parameters:

sequence : str

The sequence of a polypeptide.

show_unmodified_termini : bool, optional

If True then the unmodified N- and C-termini are explicitly shown in the returned list. Default value is False.

split : bool, optional

If True then the result will be a list of tuples with 1 to 4 elements, in the order: N-terminal modification, modification, residue, C-terminal modification. Parts that are absent are omitted from the tuple. Default value is False.

allow_unknown_modifications : bool, optional

If True then do not raise an exception when an unknown modification of a known amino acid residue is found in the sequence. This also includes terminal groups. Default value is False.

Note

Since version 2.5, this parameter has effect only if labels are provided.

labels : container, optional

A container of allowed labels for amino acids, modifications and terminal modifications. If not provided, no checks will be done. Separate labels for modifications (such as ‘p’ or ‘ox’) can be supplied, which means they are applicable to all residues.

Warning

If show_unmodified_termini is set to True, standard terminal groups need to be present in labels.

Warning

Avoid using sequences with only one terminal group, as they are ambiguous. If you provide one, labels (or std_labels) will be used to resolve the ambiguity.

Returns:

out : list

A list of labels or, if split is True, a list of tuples with labels of modifications and amino acid residues.

Examples

>>> parse('PEPTIDE', split=True)
[('P',), ('E',), ('P',), ('T',), ('I',), ('D',), ('E',)]
>>> parse('H-PEPTIDE')
['P', 'E', 'P', 'T', 'I', 'D', 'E']
>>> parse('PEPTIDE', show_unmodified_termini=True)
['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']
>>> parse('TEpSToxM', labels=std_labels + ['pS', 'oxM'])
['T', 'E', 'pS', 'T', 'oxM']
>>> parse('zPEPzTIDzE', True, True, labels=std_labels+['z'])
[('H-', 'z', 'P'), ('E',), ('P',), ('z', 'T'), ('I',), ('D',), ('z', 'E', '-OH')]
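
The two output shapes can be reproduced with a pair of regular expressions. This is a simplified sketch and not the library's parser: `parse_sketch` is a hypothetical helper that performs no label checks and keeps terminal groups exactly as written, so it does not emulate show_unmodified_termini.

```python
import re

# Optional N-terminal group, one or more modX residues, optional C-terminal group.
_SEQUENCE = re.compile(r'^([^-]+-)?((?:[^A-Z-]*[A-Z])+)(-[^-]+)?$')
# One residue: an optional modification prefix and a single uppercase letter.
_RESIDUE = re.compile(r'([^A-Z-]*)([A-Z])')


def parse_sketch(sequence, split=False):
    """Tokenize a modX string into labels, or into tuples if `split` is True."""
    match = _SEQUENCE.match(sequence)
    if match is None:
        raise ValueError('not a modX sequence: {!r}'.format(sequence))
    nterm, body, cterm = match.groups()
    residues = _RESIDUE.findall(body)  # list of (modification, residue) pairs
    if split:
        parsed = [(mod, aa) if mod else (aa,) for mod, aa in residues]
        if nterm:
            parsed[0] = (nterm,) + parsed[0]
        if cterm:
            parsed[-1] = parsed[-1] + (cterm,)
        return parsed
    parsed = [mod + aa for mod, aa in residues]
    if nterm:
        parsed.insert(0, nterm)
    if cterm:
        parsed.append(cterm)
    return parsed


print(parse_sketch('TEpSToxM'))
print(parse_sketch('H-zPEPzTIDzE-OH', split=True))
```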
pyteomics.parser.std_amino_acids

modX labels for the 20 standard amino acids.

pyteomics.parser.std_cterm

modX label for the unmodified C-terminus.

pyteomics.parser.std_labels

modX labels for the standard amino acids and unmodified termini.

pyteomics.parser.std_nterm

modX label for the unmodified N-terminus.

pyteomics.parser.tostring(parsed_sequence, show_unmodified_termini=True)[source]

Create a string from a parsed sequence.

Parameters:

parsed_sequence : iterable

Expected to be in one of the formats returned by parse(), i.e. list of labels or list of tuples.

show_unmodified_termini : bool, optional

Defines the behavior towards standard terminal groups in the input. True means that they will be preserved if present (default). False means that they will be removed. Standard terminal groups will not be added if not shown in parsed_sequence, regardless of this setting.

Returns:

sequence : str
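
The behaviour described for show_unmodified_termini can be sketched as flattening followed by an optional filter. This is an illustration and not the library's code: `tostring_sketch` is a hypothetical helper, and the standard terminal labels ‘H-’ and ‘-OH’ are hard-coded here for simplicity.

```python
STD_NTERM, STD_CTERM = 'H-', '-OH'


def tostring_sketch(parsed_sequence, show_unmodified_termini=True):
    """Join a list of labels, or a list of tuples of labels, into one string."""
    flat = []
    for item in parsed_sequence:
        if isinstance(item, str):
            flat.append(item)   # plain label
        else:
            flat.extend(item)   # tuple from the split format
    if not show_unmodified_termini:
        # Standard terminal groups are dropped; nothing is ever added.
        flat = [label for label in flat if label not in (STD_NTERM, STD_CTERM)]
    return ''.join(flat)


print(tostring_sketch(['H-', 'P', 'E', 'pT', '-OH']))
print(tostring_sketch(['H-', 'P', 'E', 'pT', '-OH'], show_unmodified_termini=False))
```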

pyteomics.parser.valid(*args, **kwargs)[source]

Try to parse sequence and catch the exceptions. All parameters are passed to parse().

Returns:

out : bool

True if the sequence was parsed successfully, and False otherwise.
