Pyteomics documentation v3.4.1


Data Access

The following section is dedicated to data manipulation. Pyteomics aims to support the most common formats of (LC-)MS/MS data, peptide identification results and protein databases.

General Notes

  • Each module mentioned below corresponds to a file format. In each module, the top-level function read() allows iteration over entries in a file. It works like the built-in open(), allowing direct iteration and supporting the with syntax, which we recommend using. So you can do:

    >>> from pyteomics import mgf
>>> reader ='tests/test.mgf')
    >>> for spectrum in reader:
    >>>    ...
    >>> reader.close()

    ... but it is recommended to do:

    >>> from pyteomics import mgf
>>> with'tests/test.mgf') as reader:
    >>>     for spectrum in reader:
    >>>        ...
  • Apart from read(), which reads just one file, all modules described here have functions for reading multiple files: chain() and chain.from_iterable(). chain('f1', 'f2') is equivalent to chain.from_iterable(['f1', 'f2']). chain() and chain.from_iterable() only support the with syntax. If you don’t want to use the with syntax, you can use the itertools functions of the same names instead.

  • Throughout this section we use pyteomics.auxiliary.print_tree() to display the structure of the data returned by various parsers.
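The idea behind print_tree() is to display the keys of a nested dict, marking nesting depth with "->" arrows and lists with a "[list]" suffix. A minimal sketch of the concept (a hypothetical helper, not the actual pyteomics implementation):

```python
def print_tree(obj, indent=' -> ', prefix=''):
    """Print the key structure of a nested dict, in the style of the trees below."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            print(prefix + str(key) + (' [list]' if isinstance(value, list) else ''))
            # recurse into nested dicts, and into the first element of lists of dicts
            child = value[0] if isinstance(value, list) and value else value
            if isinstance(child, dict):
                print_tree(child, indent, prefix + indent)

spectrum = {'m/z array': (846.6, 846.8),
            'params': {'title': 'Spectrum 1', 'charge': '2+'}}
print_tree(spectrum)
```

This prints "m/z array", "params", and then " -> title" and " -> charge" as children of "params", mirroring the output shown throughout this section.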

Text-based formats


MGF

Mascot Generic Format (MGF) is a simple human-readable format for MS/MS data. It allows storing MS/MS peak lists and experimental parameters. pyteomics.mgf is a module that implements reading and writing MGF files.

The read() function allows iterating through spectrum entries. Spectra are represented as dicts. By default, MS/MS peak lists are stored as numpy.ndarray objects under the 'm/z array' and 'intensity array' keys. Fragment charges are stored in a masked array under the 'charge array' key. Parameters are stored as a dict under the 'params' key.

Here is an example of use:

>>> from pyteomics import mgf, auxiliary
>>> with'tests/test.mgf') as reader:
>>>     auxiliary.print_tree(next(reader))
m/z array
params
 -> username
 -> useremail
 -> mods
 -> pepmass
 -> title
 -> itol
 -> charge
 -> mass
 -> itolu
 -> it_mods
 -> com
intensity array
charge array

To speed up parsing, or if you want to avoid using numpy, you can tweak the behaviour of with the parameters convert_arrays and read_charges.

Reading file headers

Also, pyteomics.mgf allows extracting headers with general parameters from MGF files with the pyteomics.mgf.read_header() function. It also returns a dict.

>>> header = mgf.read_header('tests/test.mgf')
>>> auxiliary.print_tree(header)


Writing

Creation of MGF files is implemented in the pyteomics.mgf.write() function. The user can specify the header, an iterable of spectra in the same format as returned by read(), and the output path.

>>> spectra ='tests/test.mgf')
>>> mgf.write(spectra=spectra, header=header)
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
CHARGE=2+ and 3+
COM=Taken from

TITLE=Spectrum 1
846.6 73.0
846.8 44.0
847.6 67.0
1640.1 291.0
1640.6 54.0
1895.5 49.0

TITLE=Spectrum 2
345.1 237.0
370.2 128.0
460.2 108.0
1673.3 1007.0
1674.0 974.0
1675.3 79.0
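In an MGF file, each spectrum entry is delimited by BEGIN IONS and END IONS lines, with local parameters such as TITLE followed by the "m/z intensity" peak list. As a rough illustration of the layout (plain Python, not pyteomics.mgf.write() itself):

```python
def format_mgf_block(title, peaks):
    """Format one MGF spectrum entry: BEGIN IONS, local parameters, peaks, END IONS."""
    lines = ['BEGIN IONS', 'TITLE=' + title]
    lines.extend('{} {}'.format(mz, intensity) for mz, intensity in peaks)
    lines.append('END IONS')
    return '\n'.join(lines)

print(format_mgf_block('Spectrum 1', [(846.6, 73.0), (846.8, 44.0)]))
```

The real writer also emits header parameters (MODS, CHARGE, COM, etc.) and per-spectrum fields like PEPMASS, as shown in the output above.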


FASTA

FASTA is a common format for protein sequence databases.


Reading

To extract data from FASTA databases, use the function.

>>> from pyteomics import fasta
>>> with'/path/to/file/my.fasta') as db:
>>>     for entry in db:
>>>         ...

Just like other parsers in Pyteomics, returns a generator object instead of a list to prevent excessive memory use. The generator yields (description, sequence) tuples, so it’s natural to use it as follows:

>>> with'/path/to/file/my.fasta') as db:
>>>     for descr, seq in db:
>>>         ...

You can also use attributes to access description and sequence:

>>> with'my.fasta') as reader:
>>>     descriptions = [item.description for item in reader]

Description parsing

You can specify a function that will be applied to the FASTA headers, using the parser keyword argument. pyteomics.fasta.std_parsers has some pre-defined parsers that can be used for this purpose.

>>> with'HUMAN.fasta', parser=fasta.std_parsers['uniprotkb']) as r:
>>>    print(next(r).description)
{'PE': 2, 'gene_id': 'LCE6A', 'GN': 'LCE6A', 'id': 'A0A183', 'taxon': 'HUMAN',
 'SV': 1, 'OS': 'Homo sapiens', 'entry': 'LCE6A_HUMAN',
 'name': 'Late cornified envelope protein 6A', 'db': 'sp'}

or try guessing the header format:

>>> with'HUMAN.fasta', parser=fasta.parse) as r:
>>>    print(next(r).description)
{'PE': 2, 'gene_id': 'LCE6A', 'GN': 'LCE6A', 'id': 'A0A183', 'taxon': 'HUMAN',
 'SV': 1, 'OS': 'Homo sapiens', 'entry': 'LCE6A_HUMAN',
 'name': 'Late cornified envelope protein 6A', 'db': 'sp'}
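A UniProtKB header such as "sp|A0A183|LCE6A_HUMAN Late cornified envelope protein 6A GN=LCE6A PE=2 SV=1" decomposes into a db tag, accession, entry name, protein name, and KEY=value fields. A simplified regex sketch of the idea (single-token values only; the real parsers in pyteomics.fasta.std_parsers handle more cases, such as multi-word OS values):

```python
import re

def parse_uniprot_header(header):
    """Split a UniProtKB-style description into db, id, entry, name and KEY=value fields."""
    main, _, tail = header.partition(' ')
    db, accession, entry = main.split('|')
    fields = dict(re.findall(r'(\w+)=(\S+)', tail))   # simplified: single-token values
    name = re.split(r'\s+\w+=', tail)[0]              # text before the first KEY= field
    return {'db': db, 'id': accession, 'entry': entry, 'name': name, **fields}

info = parse_uniprot_header(
    'sp|A0A183|LCE6A_HUMAN Late cornified envelope protein 6A GN=LCE6A PE=2 SV=1')
```

Here info['db'] is 'sp', info['id'] is 'A0A183', and info['name'] is the free-text protein name, matching the keys shown in the parsed output above.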


Writing

You can also create a FASTA file using a sequence of (description, sequence) tuples.

>>> entries = [('Protein 1', 'PEPTIDE'*1000), ('Protein 2', 'PEPTIDE'*2000)]
>>> fasta.write(entries, 'target-file.fasta')
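The resulting file consists of '>' description lines followed by wrapped sequence lines. A rough stand-in for the output format (a sketch, not the actual fasta.write() implementation):

```python
def format_fasta(entries, width=70):
    """Render (description, sequence) pairs as FASTA text with wrapped sequence lines."""
    out = []
    for description, sequence in entries:
        out.append('>' + description)
        # wrap the sequence into fixed-width lines
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return '\n'.join(out)

text = format_fasta([('Protein 1', 'PEPTIDE' * 3)], width=10)
print(text)
```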

Decoy databases

Another common task is to generate a decoy database. Pyteomics allows that by means of the pyteomics.fasta.decoy_db() and pyteomics.fasta.write_decoy_db() functions.

>>> fasta.write_decoy_db('mydb.fasta', 'mydb-with-decoy.fasta')

The only required argument is the first one, indicating the source database. The second argument is the target file and defaults to system standard output.

If you need to modify a single sequence, use the pyteomics.fasta.decoy_sequence() function. It supports two modes: 'reverse' and 'random'.

>>> fasta.decoy_sequence('PEPTIDE', 'reverse')
'EDITPEP'
>>> fasta.decoy_sequence('PEPTIDE', 'random')
>>> fasta.decoy_sequence('PEPTIDE', 'random')
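Conceptually, the two modes amount to reversing or randomly shuffling the residues (a plain-Python sketch, not the pyteomics implementation):

```python
import random

def decoy_sequence(sequence, mode='reverse'):
    """Return a decoy version of a sequence: reversed or randomly shuffled."""
    if mode == 'reverse':
        return sequence[::-1]
    if mode == 'random':
        residues = list(sequence)
        random.shuffle(residues)  # in-place random permutation
        return ''.join(residues)
    raise ValueError('Unsupported mode: {}'.format(mode))

decoy_sequence('PEPTIDE', 'reverse')  # 'EDITPEP'
```

Note that 'random' preserves the amino acid composition while destroying the order, whereas 'reverse' is deterministic.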

XML formats

XML parsers are implemented as classes and provide an object-oriented interface. The functional interface is preserved for backward compatibility and wraps the actual class-based machinery. That means that reader objects returned by read() functions have additional methods.

One of the most important methods is iterfind(). It allows reading elements other than the default ones from XML files, given the element name or path.

mzML and mzXML

mzML and mzXML are XML-based formats for experimental data obtained on MS/MS or LC-MS setups. Pyteomics offers the functionality of the pyteomics.mzml and pyteomics.mzxml modules to access the information contained in those files from Python. The interfaces of the two modules are very similar; this section uses mzML for demonstration.

The user can iterate through MS/MS spectra contained in a file via the function or the pyteomics.mzml.MzML class. Here is an example of the output:

>>> from pyteomics import mzml, auxiliary
>>> with'tests/test.mzML') as reader:
>>>     auxiliary.print_tree(next(reader))
highest observed m/z
ms level
total ion current
intensity array
lowest observed m/z
profile spectrum
MSn spectrum
positive scan
base peak intensity
m/z array
base peak m/z
 -> count
 -> scan [list]
 ->  -> scan start time
 ->  -> preset scan configuration
 ->  -> filter string
 ->  -> instrumentConfigurationRef
 ->  -> scanWindowList
 ->  ->  -> count
 ->  ->  -> scanWindow [list]
 ->  ->  ->  -> scan window lower limit
 ->  ->  ->  -> scan window upper limit
 ->  -> [Thermo Trailer Extra]Monoisotopic M/Z:
 -> no combination

Additionally, pyteomics.mzml.MzML objects support direct indexing with spectrum IDs:

>>> from pyteomics import mzml
>>> with mzml.MzML('tests/test.mzML') as reader:
>>>    spectrum = reader["controllerType=0 controllerNumber=1 scan=1"]
>>>    mz = spectrum['m/z array'] # do something with "spectrum"
>>>    ...

pyteomics.mzml.PreIndexedMzML offers the same functionality, but it uses byte offset information found at the end of the file. Unlike the rest of the functions and classes, pyteomics.mzml.PreIndexedMzML does not have a counterpart in pyteomics.mzxml.


pepXML

pepXML is a widely used XML-based format for peptide identifications. It contains information about the MS data, the parameters of the search engine used, and the assigned sequences. To access these data, use the pyteomics.pepxml module.

The function iterates through peptide-spectrum matches in a pepXML file and returns them as custom dicts. Alternatively, you can use the pyteomics.pepxml.PepXML interface.

>>> from pyteomics import pepxml, auxiliary
>>> with'tests/test.pep.xml') as reader:
>>>     auxiliary.print_tree(next(reader))
search_hit [list]
 -> hit_rank
 -> calc_neutral_pep_mass
 -> modifications
 -> modified_peptide
 -> peptide
 -> num_matched_ions
 -> search_score
 ->  -> deltacn
 ->  -> spscore
 ->  -> sprank
 ->  -> deltacnstar
 ->  -> xcorr
 -> num_missed_cleavages
 -> analysis_result [list]
 ->  -> peptideprophet_result
 ->  ->  -> all_ntt_prob
 ->  ->  -> parameter
 ->  ->  ->  -> massd
 ->  ->  ->  -> fval
 ->  ->  ->  -> nmc
 ->  ->  ->  -> ntt
 ->  ->  -> probability
 ->  -> analysis
 -> tot_num_ions
 -> num_tot_proteins
 -> is_rejected
 -> proteins [list]
 ->  -> num_tol_term
 ->  -> protein
 ->  -> peptide_next_aa
 ->  -> protein_descr
 ->  -> peptide_prev_aa
 -> massdiff

Reading into a pandas.DataFrame

If you like working with tabular data using pandas, you can load data from pepXML files directly into pandas.DataFrames using the pyteomics.pepxml.DataFrame() function. It can read multiple files at once (using pyteomics.pepxml.chain()) and return a combined table with essential information about search results. This function requires pandas.


X!Tandem

The X!Tandem search engine has its own output format that contains more info than pepXML. Pyteomics has a reader for it in the pyteomics.tandem module.

>>> from pyteomics import tandem, auxiliary
>>> with'tests/test.t.xml') as reader:
...     auxiliary.print_tree(next(reader))
support
 -> fragment ion mass spectrum
 ->  -> M+H
 ->  -> note
 ->  -> charge
 ->  -> Ydata
 ->  ->  -> units
 ->  ->  -> values
 ->  -> Xdata
 ->  ->  -> units
 ->  ->  -> values
 ->  -> label
 ->  -> id
 -> supporting data
 ->  -> convolution survival function
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
 ->  -> b ion histogram
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
 ->  -> y ion histogram
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
 ->  -> hyperscore expectation function
 ->  ->  -> a1
 ->  ->  -> a0
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
protein [list]
 -> peptide
 ->  -> pre
 ->  -> end
 ->  -> seq
 ->  -> b_ions
 ->  -> nextscore
 ->  -> mh
 ->  -> y_ions
 ->  -> start
 ->  -> hyperscore
 ->  -> expect
 ->  -> delta
 ->  -> id
 ->  -> post
 ->  -> missed_cleavages
 ->  -> b_score
 ->  -> y_score
 -> uid
 -> sumI
 -> label
 -> note
 -> expect
 -> file
 ->  -> URL
 ->  -> type
 -> id

The function returns a pyteomics.tandem.TandemXML instance, which can also be created directly.

Reading into a pandas.DataFrame

You can also load data from X!Tandem files directly into pandas.DataFrames using the pyteomics.tandem.DataFrame() function. It can read multiple files at once (using pyteomics.tandem.chain()) and return a combined table with essential information about search results. Of course, this function requires pandas.


mzIdentML

mzIdentML is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standards Initiative.

The module interface is similar to that of the other reader modules. The function returns a pyteomics.mzid.MzIdentML instance, which you can just as easily use directly.

>>> from pyteomics import mzid, auxiliary
>>> with'tests/test.mzid') as reader:
>>>     auxiliary.print_tree(next(reader))
SpectrumIdentificationItem [list]
 -> PeptideEvidenceRef [list]
 ->  -> peptideEvidence_ref
 -> ProteinScape:SequestMetaScore
 -> chargeState
 -> rank
 -> ProteinScape:IntensityCoverage
 -> calculatedMassToCharge
 -> peptide_ref
 -> passThreshold
 -> experimentalMassToCharge
 -> id

Element IDs and references

In mzIdentML, some elements contain references to other elements in the same file. A reference is simply an XML attribute whose name ends with _ref and whose value is an ID, identical to the value of the id attribute of some other element.
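Resolving such references amounts to looking up elements by their id attribute and merging in their content. Schematically, with plain dicts standing in for parsed elements (a sketch, not the MzIdentML internals):

```python
def retrieve_refs(element, index):
    """Replace each *_ref attribute with the content of the referenced element."""
    resolved = {}
    for key, value in element.items():
        if key.endswith('_ref') and value in index:
            referenced = retrieve_refs(index[value], index)  # resolve nested refs too
            referenced.pop('id', None)  # drop the id of the referenced element
            resolved.update(referenced)
        else:
            resolved[key] = value
    return resolved

# toy index mapping element IDs to parsed elements
index = {'PEP_1': {'id': 'PEP_1', 'PeptideSequence': 'PEPTIDE'}}
psm = {'rank': 1, 'peptide_ref': 'PEP_1'}
retrieve_refs(psm, index)  # {'rank': 1, 'PeptideSequence': 'PEPTIDE'}
```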

The parser can retrieve information from these references on the fly. This can be enabled by passing retrieve_refs=True to the pyteomics.mzid.MzIdentML.iterfind() method, to the pyteomics.mzid.MzIdentML constructor, or to Retrieval of data by ID is implemented in the pyteomics.mzid.MzIdentML.get_by_id() method. Alternatively, the MzIdentML object itself can be indexed with element IDs:

>>> from pyteomics import mzid
>>> m = mzid.MzIdentML('tests/test.mzid')
>>> m['ipi.HUMAN_decoy']
{'DatabaseName': 'database IPI_human',
 'decoy DB accession regexp': '^SHD',
 'decoy DB generation algorithm': 'PeakQuant.DecoyDatabaseBuilder',
 'id': 'ipi.HUMAN_decoy',
 'location': 'file://',
 'name': ['decoy DB from IPI_human',
  'DB composition target+decoy',
  'decoy DB type shuffle'],
 'numDatabaseSequences': 58099,
 'releaseDate': '2006-02-22T09:30:47Z',
 'version': '3.15'}
>>> m.close()


Since version 3.3, pyteomics.mzid.MzIdentML objects keep an index of byte offsets for some of the elements. It helps achieve acceptable performance when using retrieve_refs=True, or when accessing individual elements by their ID.

This behavior can be disabled by passing use_index=False to the object constructor. An alternative, older mechanism is caching of element IDs. To build a cache for a file, you can pass build_id_cache=True and use_index=False to the MzIdentML constructor, or to, or call the pyteomics.mzid.MzIdentML.build_id_cache() method prior to reading the data.
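The idea behind the index is simple: record the offset at which each element with an id attribute starts, so it can later be re-read directly instead of re-parsing the whole file. A toy illustration on a string (character offsets here; real files use byte offsets, and this is not the actual pyteomics machinery):

```python
import re

document = (
    '<SpectrumIdentificationItem id="SII_1" rank="1"/>\n'
    '<SpectrumIdentificationItem id="SII_2" rank="2"/>\n'
)

# map each element ID to the offset where its element starts
offsets = {m.group(1): m.start()
           for m in re.finditer(r'<\w+ id="([^"]+)"', document)}
```

Given offsets['SII_2'], a reader can jump straight to that element and parse only it, which is what makes indexed access by ID fast.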


FeatureXML

pyteomics.featurexml implements a simple parser for .featureXML files used in the OpenMS framework. The usage is identical to other XML parsing modules. Since featureXML has feature IDs, FeatureXML objects also support direct indexing as well as iteration:

>>> from pyteomics import featurexml

>>> # function style, iteration
... with'tests/test.featureXML') as f:
...     qual = [feat['overallquality'] for feat in f]

>>> qual # qualities of the two features in the test file
[0.791454, 0.945634]

>>> # object-oriented style, direct indexing
>>> f = featurexml.FeatureXML('tests/test.featureXML')
>>> f['f_189396504510444007']['overallquality']
>>> f.close()

As always, and pyteomics.featurexml.FeatureXML are interchangeable.

FDR estimation and filtering

Three modules for reading proteomics search engine output (tandem, pepxml and mzid) expose similar functions is_decoy(), fdr() and filter(). These functions implement the widely used Target-Decoy Approach (TDA) to estimation of False Discovery Rate (FDR).

The is_decoy() function determines whether a particular spectrum identification comes from the decoy database. In tandem and pepxml, this is done by checking whether the protein description/name starts with a certain prefix. In mzid, a boolean value that stores this information in the PSM dict is used.
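For the prefix-based formats, the check is essentially the following (a sketch using a hypothetical PSM layout resembling the pepXML output shown above; the default prefix is an assumption here):

```python
def is_decoy(psm, prefix='DECOY_'):
    """Return True if every protein matched by the PSM carries the decoy prefix."""
    return all(protein['protein'].startswith(prefix)
               for hit in psm['search_hit']
               for protein in hit['proteins'])

decoy_psm = {'search_hit': [{'proteins': [{'protein': 'DECOY_sp|A0A183|LCE6A_HUMAN'}]}]}
target_psm = {'search_hit': [{'proteins': [{'protein': 'sp|A0A183|LCE6A_HUMAN'}]}]}
is_decoy(decoy_psm)   # True
is_decoy(target_psm)  # False
```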


Note: because of the variety of software producing files in pepXML and mzIdentML formats, the is_decoy() function provided in the corresponding modules may not work for your specific files. In that case you will have to refer to the source of pyteomics.pepxml.is_decoy() and pyteomics.mzid.is_decoy() and create your own function in a similar manner.

The fdr() function estimates the FDR in a set of PSMs by counting the decoy matches. Since it uses the is_decoy() function, the warning above applies. You can supply a custom function so that fdr() works for your data. fdr() can also be imported from pyteomics.auxiliary, where it has no default for is_decoy().
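The estimate itself is the ratio of decoy matches to target matches. As a minimal sketch of the counting logic (assuming a simple target-decoy search; not the pyteomics implementation):

```python
def fdr(psms, is_decoy):
    """Estimate FDR as (number of decoy PSMs) / (number of target PSMs)."""
    decoy = sum(1 for psm in psms if is_decoy(psm))
    target = len(psms) - decoy
    return float(decoy) / target

# 98 target PSMs and 2 decoy PSMs give an estimated FDR of about 2%
psms = [{'decoy': False}] * 98 + [{'decoy': True}] * 2
fdr(psms, lambda psm: psm['decoy'])
```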

The filter() function works like chain(), but instead of yielding all PSMs, it filters them to a certain level of FDR. PSM filtering requires counting decoy matches, too (see above), but it also implies sorting the PSMs by some kind of score. This score cannot be universal for the reasons mentioned above, and it can be specified as a user-defined function. For instance, the default sorting key in pyteomics.mzid.filter() is only expected to work with mzIdentML files created with Mascot. So once again:


Note: the default parameters of filter() may not work for your files.

There are also filter.chain() and filter.chain.from_iterable(). These differ from filter() in that they apply FDR filtering to each file separately and then iterate over the top PSMs of all files, whereas filter() pools all PSMs together and applies a single threshold.

If you want to filter a list representing PSMs in arbitrary format, you can use pyteomics.auxiliary.filter(). Instead of files it takes lists (or other iterables) of PSMs. The rest is the same as for other filter() functions.
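Conceptually, filtering sorts the PSMs by score and keeps the largest score-sorted prefix whose estimated FDR stays within the threshold. A simplified sketch of that idea on a plain list (hypothetical helper, far cruder than auxiliary.filter()):

```python
def filter_psms(psms, key, is_decoy, fdr=0.01):
    """Keep targets from the longest score-sorted prefix with decoy/target <= fdr."""
    ranked = sorted(psms, key=key)  # best (lowest) scores first
    best, decoy, target, passed = [], 0, 0, []
    for psm in ranked:
        if is_decoy(psm):
            decoy += 1
        else:
            target += 1
        passed.append(psm)
        if target and float(decoy) / target <= fdr:
            best = [p for p in passed if not is_decoy(p)]  # remember the best cutoff so far
    return best

psms = [{'e': 1e-5, 'd': False}, {'e': 1e-4, 'd': False},
        {'e': 1e-3, 'd': True}, {'e': 1e-2, 'd': False}]
result = filter_psms(psms, key=lambda p: p['e'], is_decoy=lambda p: p['d'], fdr=0.5)
```

Here the e-value plays the role of the sorting key; with the (deliberately loose) 50% threshold, all three target PSMs survive.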

NumPy and Pandas support, etc.

pyteomics.auxiliary.filter() supports structured numpy arrays and pandas.DataFrames of PSMs. This makes it easy to filter search results stored as CSV files (see Example 3: Search engines and PSM filtering for more info).

Generally, PSMs can be provided as iterators, lists, arrays, or DataFrames, and the key and is_decoy parameters to filter() can be functions, strings, lists, arrays, or iterators. If a string is given, the PSMs must be in a structured array or a DataFrame.
