Pyteomics documentation v3.4.2

mzid - mzIdentML file reader

Summary

mzIdentML is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standards Initiative.

This module provides a minimalistic way to extract information from mzIdentML files. You can use the old functional interface (read()) or the new object-oriented interface (MzIdentML) to iterate over entries in <SpectrumIdentificationResult> elements, i.e. groups of identifications for a certain spectrum. Note that each entry can contain more than one PSM (peptide-spectrum match); the PSMs are accessible under the “SpectrumIdentificationItem” key. MzIdentML objects also support direct indexing by element ID.
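A minimal sketch of walking entries and the PSMs inside them. The helper name iter_psms is hypothetical, and the sketch assumes every entry is a dict with a 'spectrumID' key next to 'SpectrumIdentificationItem'. It accepts any iterable of such dicts, so it can be tried on a plain list as well as on a reader.

```python
def iter_psms(entries):
    """Yield (spectrum ID, PSM dict) pairs from SpectrumIdentificationResult entries."""
    for entry in entries:
        # One entry may hold several PSMs for the same spectrum.
        for psm in entry.get('SpectrumIdentificationItem', []):
            yield entry.get('spectrumID'), psm


# Hand-built stand-in for what a reader would yield (values are illustrative).
example_entries = [
    {'spectrumID': 'index=0',
     'SpectrumIdentificationItem': [{'rank': 1}, {'rank': 2}]},
    {'spectrumID': 'index=1',
     'SpectrumIdentificationItem': [{'rank': 1}]},
]
pairs = list(iter_psms(example_entries))
```

With the library installed, the entries would presumably come from a reader instead of a list, for example by passing an open MzIdentML object (or the result of read()) to iter_psms().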

Data access

MzIdentML - a class representing a single MzIdentML file. Other data access functions use this class internally.

read() - iterate through peptide-spectrum matches in an mzIdentML file. Data from a single PSM group are converted to a human-readable dict. Internally, it creates an MzIdentML object and reads from it.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

DataFrame() - read MzIdentML files into a pandas.DataFrame.

Target-decoy approach

filter() - read a chain of mzIdentML files and filter to a certain FDR using TDA.

filter.chain() - chain a series of filters applied independently to several files.

filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.

filter_df() - filter MzIdentML files and return a pandas.DataFrame.

is_decoy() - determine if a “SpectrumIdentificationResult” should be considered decoy.

fdr() - estimate the false discovery rate of a set of identifications using the target-decoy approach.

qvalues() - get an array of scores and q-values for a PSM set using the target-decoy approach.

Deprecated functions

version_info() - get information about mzIdentML version and schema. You can just read the corresponding attribute of the MzIdentML object.

get_by_id() - get an element by its ID and extract the data from it. You can just call the corresponding method of the MzIdentML object.

iterfind() - iterate over elements in an mzIdentML file. You can just call the corresponding method of the MzIdentML object.

Dependencies

This module requires lxml.


pyteomics.mzid.version_info(source)

Provide version information about the mzIdentML file.

Note

This function is provided for backward compatibility only. It simply creates an MzIdentML instance and returns its version_info attribute.

Parameters:

source : str or file

File name or file-like object.

Returns:

out : tuple

A (version, schema URL) tuple, both elements are strings or None.
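As a rough illustration of where such a tuple can come from, the sketch below reads the version attribute and the xsi:schemaLocation attribute from the root element using only the standard library. This is an assumption about the general approach, not the library's code; the function name and the example.org URLs are placeholders.

```python
import io
import xml.etree.ElementTree as ET

# Clark-notation name of the xsi:schemaLocation attribute.
XSI_SCHEMA_LOCATION = '{http://www.w3.org/2001/XMLSchema-instance}schemaLocation'


def version_info_sketch(source):
    """Return (version, schema URL) from the root element; missing values are None."""
    for _event, root in ET.iterparse(source, events=('start',)):
        version = root.attrib.get('version')
        location = root.attrib.get(XSI_SCHEMA_LOCATION)
        # schemaLocation holds "namespace URL" pairs; take the last URL.
        schema = location.split()[-1] if location else None
        return version, schema
    return None, None


# Placeholder document; the example.org URLs are not real schema locations.
doc = (b'<MzIdentML xmlns="http://example.org/ns" '
       b'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
       b'xsi:schemaLocation="http://example.org/ns http://example.org/schema.xsd" '
       b'version="1.1.0"/>')
info = version_info_sketch(io.BytesIO(doc))
```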

pyteomics.mzid.fdr(psms=None, formula=1, is_decoy=<function is_decoy>, ratio=1, correction=0, pep=None)

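Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:

\[ FDR = \frac{N_{decoy}}{N_{target} \cdot ratio} \]

The second formula is:

\[ FDR = \frac{N_{decoy} \cdot (1 + \frac{1}{ratio})}{N_{total}} \]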

Note

This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.

Parameters:

psms : iterable, optional

An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.

formula : int, optional

Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.

is_decoy : callable, iterable, or str, optional

If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

Warning

The default function may not work with your files, because format flavours are diverse.

pep : callable, iterable, or str, optional

If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

Note

If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.

ratio : float, optional

The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.

correction : int or float, optional

Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

See this paper for further explanation.

Note

Requires numpy, if correction is a float or 2.

Note

Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).

Returns:

out : float

The estimation of FDR, (roughly) between 0 and 1.
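The decoy-counting estimate can be sketched in plain Python. The sketch assumes formula 1 is N_decoy / (N_target * ratio) and formula 2 is N_decoy * (1 + 1/ratio) / N_total, and it leaves out the correction and pep options entirely. The function name fdr_sketch is hypothetical.

```python
def fdr_sketch(decoy_flags, formula=1, ratio=1.0):
    """Estimate FDR by decoy counting (no correction applied)."""
    flags = [bool(flag) for flag in decoy_flags]
    n_total = len(flags)
    n_decoy = sum(flags)
    n_target = n_total - n_decoy
    if formula == 1:
        # Decoys relative to targets, scaled by the database size ratio.
        return n_decoy / (n_target * ratio)
    if formula == 2:
        # Decoys plus the expected false targets, relative to all PSMs.
        return n_decoy * (1 + 1 / ratio) / n_total
    raise ValueError('formula must be 1 or 2')


# 95 target PSMs and 5 decoy PSMs, equal-sized databases.
flags = [False] * 95 + [True] * 5
fdr1 = fdr_sketch(flags, formula=1)
fdr2 = fdr_sketch(flags, formula=2)
```

Under these assumptions, formula 1 suits a set from which decoys will be removed, while formula 2 counts the decoys as part of the reported set.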

pyteomics.mzid.qvalues(*args, **kwargs)

Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.

Requires numpy (and optionally pandas).
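The TDA-based calculation can be illustrated without numpy. The sketch below assumes the common definition of a q-value as the smallest FDR of any score-ordered set that still contains the PSM, and it uses a plain decoy/target ratio as the FDR estimate. The function name and the tuple output are hypothetical simplifications of the record array described below.

```python
def qvalues_sketch(scores, decoy_flags, reverse=False):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    order = sorted(zip(scores, decoy_flags), key=lambda pair: pair[0],
                   reverse=reverse)
    fdrs = []
    n_decoy = n_target = 0
    for _score, is_decoy in order:
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
        # Running decoy/target ratio for the set ending at this PSM.
        fdrs.append(n_decoy / n_target if n_target else float('inf'))
    # A q-value is the smallest FDR of any set that still includes the PSM.
    qs, best = [], float('inf')
    for value in reversed(fdrs):
        best = min(best, value)
        qs.append(best)
    qs.reverse()
    return [(s, d, q) for (s, d), q in zip(order, qs)]


# Four PSMs scored by e-value (smaller is better); the second one is a decoy.
records = qvalues_sketch([0.01, 0.02, 0.03, 0.04],
                         [False, True, False, False])
```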

Parameters:

positional args : file or str

Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.

key : callable / array-like / iterable / str, keyword only, optional

If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

Warning

The default function may not work with your files, because format flavours are diverse.

reverse : bool, keyword only, optional

If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.

is_decoy : callable / array-like / iterable / str, keyword only, optional

If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

Warning

The default function may not work with your files, because format flavours are diverse.

pep : callable / array-like / iterable / str, keyword only, optional

If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

Note

If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

remove_decoy : bool, keyword only, optional

Defines whether decoy matches should be removed from the output. Default is False.

Note

If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

formula : int, keyword only, optional

Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).

ratio : float, keyword only, optional

The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.

correction : int or float, keyword only, optional

Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

See this paper for further explanation.

q_label : str, optional

Field name for q-value in the output. Default is 'q'.

score_label : str, optional

Field name for score in the output. Default is 'score'.

decoy_label : str, optional

Field name for the decoy flag in the output. Default is 'is decoy'.

pep_label : str, optional

Field name for PEP in the output. Default is 'PEP'.

full_output : bool, keyword only, optional

If True, then the returned array has PSM objects along with scores and q-values. Default is False.

**kwargs : passed to the chain() function.

Returns:

out : numpy.ndarray

A sorted array of records with the following fields:

  • ‘score’: np.float64
  • ‘is decoy’: np.bool_
  • ‘q’: np.float64
  • ‘psm’: np.object_ (if full_output is True)

pyteomics.mzid.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:

files : iterable

Iterable of file names or file objects.
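Conceptually, chaining means opening one source at a time and yielding its entries before moving on to the next. The sketch below shows that idea with a hypothetical open_reader argument standing in for read(); it is not the library's implementation, and fake_reader exists only so the example can run without files.

```python
from contextlib import contextmanager


def chain_from_iterable(open_reader, sources, **kwargs):
    """Yield entries from each source in turn, one open reader at a time."""
    for source in sources:
        with open_reader(source, **kwargs) as reader:
            yield from reader


def chain(open_reader, *sources, **kwargs):
    """Same as chain_from_iterable, with sources as positional arguments."""
    return chain_from_iterable(open_reader, sources, **kwargs)


@contextmanager
def fake_reader(source, **kwargs):
    """Hypothetical stand-in for read(): yields two labelled entries."""
    yield iter([(source, 0), (source, 1)])


entries = list(chain(fake_reader, 'a.mzid', 'b.mzid'))
```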

pyteomics.mzid.filter(*args, **kwargs)

Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.

Requires numpy and, optionally, pandas.

Parameters:

positional args : file or str

Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.

fdr : float, keyword only, 0 <= fdr <= 1

Desired FDR level.

key : callable / array-like / iterable / str, keyword only, optional

A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.

Warning

The default function may not work with your files, because format flavours are diverse.

reverse : bool, keyword only, optional

If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.

is_decoy : callable / array-like / iterable / str, keyword only, optional

A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.

Warning

The default function may not work with your files, because format flavours are diverse.

remove_decoy : bool, keyword only, optional

Defines whether decoy matches should be removed from the output. Default is True.

Note

If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

formula : int, keyword only, optional

Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).

ratio : float, keyword only, optional

The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.

correction : int or float, keyword only, optional

Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

See this paper for further explanation.

pep : callable / array-like / iterable / str, keyword only, optional

If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

Note

If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

full_output : bool, keyword only, optional

If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.

Note

The name for the parameter comes from the fact that it is internally passed to qvalues().

q_label : str, optional

Field name for q-value in the output. Default is 'q'.

score_label : str, optional

Field name for score in the output. Default is 'score'.

decoy_label : str, optional

Field name for the decoy flag in the output. Default is 'is decoy'.

pep_label : str, optional

Field name for PEP in the output. Default is 'PEP'.

**kwargs : passed to the chain() function.

Returns:

out : iterator or numpy.ndarray or pandas.DataFrame
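Because score names differ between search engines, a custom key function is often needed. The sketch below is one possible key that returns the lowest e-value among the PSMs of an entry. The function name is hypothetical, and the score name 'mascot:expectation value' is only an assumption about the file at hand.

```python
def best_evalue(entry, score_name='mascot:expectation value'):
    """Return the lowest e-value among the PSMs of one entry."""
    return min(float(psm[score_name])
               for psm in entry['SpectrumIdentificationItem'])


# Illustrative entries; real files may expose a different score name.
entries = [
    {'spectrumID': 'index=0',
     'SpectrumIdentificationItem': [
         {'mascot:expectation value': 0.5},
         {'mascot:expectation value': 0.002}]},
    {'spectrumID': 'index=1',
     'SpectrumIdentificationItem': [{'mascot:expectation value': 0.0001}]},
]
ranked = sorted(entries, key=best_evalue)
```

A function of this shape returns a number where smaller is better, so it matches what the key parameter expects when reverse is left at False.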

filter.chain(*files, **kwargs)

Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.

filter.chain.from_iterable(files, **kwargs)

Chain filter() for several files. Keyword arguments are passed to the filter() function.

Parameters:

files : iterable

Iterable of file names or file objects.

pyteomics.mzid.DataFrame(*args, **kwargs)

Read MzIdentML files into a pandas.DataFrame.

Requires pandas.

Warning

Only the first ‘SpectrumIdentificationItem’ element is considered in every ‘SpectrumIdentificationResult’.

Parameters:

*args, **kwargs : passed to chain()

sep : str or None, optional

Some values related to PSMs (such as protein information) are variable-length lists. If sep is a str, they will be packed into single string using this delimiter. If sep is None, they are kept as lists. Default is None.

Returns:

out : pandas.DataFrame
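The effect of sep can be pictured on a single row. The sketch below is a hypothetical helper, not part of the library, that joins list values into one string when sep is a str and leaves them untouched when sep is None. The column names are illustrative.

```python
def pack_lists(row, sep=None):
    """Join list values of a row dict into strings when sep is a str."""
    if sep is None:
        return dict(row)
    return {key: sep.join(map(str, value)) if isinstance(value, list) else value
            for key, value in row.items()}


row = {'PeptideSequence': 'PEPTIDE', 'accession': ['P1', 'P2']}
packed = pack_lists(row, sep=';')
unpacked = pack_lists(row)
```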

class pyteomics.mzid.MzIdentML(*args, **kwargs)

Bases: pyteomics.xml.IndexedXML

Parser class for MzIdentML files.

Methods

build_id_cache(*args, **kwargs) - Construct a cache for each element in the document, indexed by id.
build_tree(*args, **kwargs) - Build and store the ElementTree instance.
clear_id_cache() - Clear the element ID cache.
clear_tree() - Remove the saved ElementTree.
get_by_id(*args, **kwargs) - Retrieve the requested entity by its id.
iterfind(*args, **kwargs) - Parse the XML and yield info on elements with specified local name or by specified “XPath”.
next()
reset()

__init__(*args, **kwargs)

Create an XML parser object.

Parameters:

source : str or file

File name or file-like object corresponding to an XML file.

read_schema : bool, optional

Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is True.

iterative : bool, optional

Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.

use_index : bool, optional

Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.

indexed_tags : container of bytes, optional

If use_index is True, elements listed in this parameter will be indexed. Empty set by default.

build_id_cache(*args, **kwargs)

Construct a cache for each element in the document, indexed by id attribute

build_tree(*args, **kwargs)

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(*args, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:

elem_id : str

The id value of the entity to retrieve.

Returns:

dict :
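The ID cache idea can be sketched with the standard library: build a mapping from id attribute values to elements once, then look entities up directly. This illustrates only the cache approach (not the byte-offset index), and the document and IDs below are made up.

```python
import xml.etree.ElementTree as ET


def build_id_cache(root):
    """Map each id attribute value to its element."""
    return {elem.attrib['id']: elem
            for elem in root.iter() if 'id' in elem.attrib}


# Tiny placeholder document; element names mimic mzIdentML but IDs are made up.
root = ET.fromstring(
    '<SequenceCollection>'
    '<Peptide id="pep_1"><PeptideSequence>PEPTIDE</PeptideSequence></Peptide>'
    '<Peptide id="pep_2"><PeptideSequence>SEQUENCE</PeptideSequence></Peptide>'
    '</SequenceCollection>')
cache = build_id_cache(root)
sequence = cache['pep_2'].findtext('PeptideSequence')
```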

iterfind(*args, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:

path : str

Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition at the end, such as: "/path/to/element[some_value>1.5]". Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.

**kwargs : passed to self._get_info_smart().

Returns:

out : iterator
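To show what matching by local name means, the sketch below iterates over a namespaced document with the standard library and ignores the namespace part of each tag. It is a simplified stand-in, not the library's parser: it handles a single element name rather than a path, and the document is a placeholder.

```python
import io
import xml.etree.ElementTree as ET


def iterfind_local(source, name):
    """Yield attribute dicts of elements whose local name equals name."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        # Drop the '{namespace}' prefix, if any, to get the local name.
        if elem.tag.rsplit('}', 1)[-1] == name:
            yield dict(elem.attrib)


doc = (b'<root xmlns="http://example.org/ns">'
       b'<Item id="a" score="1.0"/><Other id="x"/><Item id="b" score="2.5"/>'
       b'</root>')
items = list(iterfind_local(io.BytesIO(doc), 'Item'))
# The kind of filtering a path condition expresses, done in plain Python:
high = [item for item in items if float(item['score']) > 1.5]
```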

pyteomics.mzid.filter_df(*args, **kwargs)

Read MzIdentML files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be MzIdentML files or DataFrames.

Requires pandas.

Warning

Only the first ‘SpectrumIdentificationItem’ element is considered in every ‘SpectrumIdentificationResult’.

Parameters:

key : str / iterable / callable, optional

Default is ‘mascot:expectation value’.

is_decoy : str / iterable / callable, optional

Default is ‘isDecoy’.

*args, **kwargs : passed to auxiliary.filter() and/or DataFrame().

Returns:

out : pandas.DataFrame

pyteomics.mzid.get_by_id(source, elem_id, **kwargs)

Parse source and return the element with id attribute equal to elem_id. Returns None if no such element is found.

Note

This function is provided for backward compatibility only. If you do multiple get_by_id() calls on one file, you should create an MzIdentML object and use its get_by_id() method.

Parameters:

source : str or file

A path to a target mzIdentML file or the file object itself.

elem_id : str

The value of the id attribute to match.

Returns:

out : dict or None

pyteomics.mzid.is_decoy(psm)

Given a PSM dict, return True if all proteins in the dict are marked as decoy, and False otherwise.

Parameters:

psm : dict

A dict, as yielded by read().

Returns:

out : bool
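The "all proteins are decoy" rule can be written as a single all() call. The sketch assumes that protein-level evidence sits under a 'PeptideEvidenceRef' list inside each PSM and carries an 'isDecoy' flag, which is an assumption about the dict layout. Files that mark decoys differently would need their own function, and all_decoy is a hypothetical name.

```python
def all_decoy(entry):
    """Return True only if every protein match in the entry is a decoy."""
    return all(evidence['isDecoy']
               for psm in entry['SpectrumIdentificationItem']
               for evidence in psm['PeptideEvidenceRef'])


# Hand-built entries for illustration.
decoy_entry = {'SpectrumIdentificationItem': [
    {'PeptideEvidenceRef': [{'isDecoy': True}, {'isDecoy': True}]}]}
mixed_entry = {'SpectrumIdentificationItem': [
    {'PeptideEvidenceRef': [{'isDecoy': True}, {'isDecoy': False}]}]}
```

With this rule, a peptide shared between a target and a decoy protein counts as a target match.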

pyteomics.mzid.iterfind(source, path, **kwargs)

Parse source and yield info on elements with specified local name or by specified “XPath”.

Note

This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an MzIdentML object and use its iterfind() method.

Parameters:

source : str or file

File name or file-like object.

path : str

Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition at the end, such as: "/path/to/element[some_value>1.5]". Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.

recursive : bool, optional

If False, subelements will not be processed when extracting info from elements. Default is True.

retrieve_refs : bool, optional

If True, additional information from references will be automatically added to the results. The file processing time will increase. Default is False.

iterative : bool, optional

Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.

read_schema : bool, optional

If True, attempt to extract information from the XML schema mentioned in the mzIdentML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings.

build_id_cache : bool, optional

Defines whether a cache of element IDs should be built and stored on the created MzIdentML instance. Default value is the value of retrieve_refs.

Returns:

out : iterator

pyteomics.mzid.read(source, **kwargs)

Parse source and iterate through peptide-spectrum matches.

Note

This function is provided for backward compatibility only. It simply creates an MzIdentML instance using provided arguments and returns it.

Parameters:

source : str or file

A path to a target mzIdentML file or the file object itself.

recursive : bool, optional

If False, subelements will not be processed when extracting info from elements. Default is True.

retrieve_refs : bool, optional

If True, additional information from references will be automatically added to the results. The file processing time will increase. Default is True.

iterative : bool, optional

Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.

read_schema : bool, optional

If True, attempt to extract information from the XML schema mentioned in the mzIdentML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings.

build_id_cache : bool, optional

Defines whether a cache of element IDs should be built and stored on the created MzIdentML instance. Default value is the value of retrieve_refs.

Note

This parameter is ignored when use_index is True (default).

use_index : bool, optional

Defines whether an index of byte offsets needs to be created for the indexed elements. If True (default), build_id_cache is ignored.

indexed_tags : container of bytes, optional

Defines which elements need to be indexed. Empty set by default.

Returns:

out : MzIdentML

An iterator over the dicts with PSM properties.
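What retrieve_refs adds can be pictured as follows: a key ending in '_ref' points to another element by ID, and resolving it merges that element's data into the PSM dict. The sketch below is a simplified guess at this behaviour, with hypothetical names and made-up IDs; it is not the library's code.

```python
def resolve_refs(info, by_id):
    """Return a copy of info with '*_ref' keys replaced by referenced data."""
    resolved = dict(info)
    for key, value in info.items():
        if key.endswith('_ref') and value in by_id:
            # Swap the reference for the data it points to.
            del resolved[key]
            resolved.update(by_id[value])
    return resolved


# Made-up IDs and fields, for illustration only.
by_id = {'pep_1': {'PeptideSequence': 'PEPTIDE'}}
psm = {'rank': 1, 'peptide_ref': 'pep_1'}
full = resolve_refs(psm, by_id)
```

Each lookup costs time, which fits the note above that processing slows down when references are retrieved.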
