Pyteomics documentation v3.4.2

History of changes



  • New module pyteomics.ms1 for parsing of MS1 files.
  • mass.Composition constructor now accepts ion_type and charge parameters.
  • New functions pyteomics.mzid.DataFrame() and pyteomics.mzid.filter_df(). Their behavior may be refined later on.
  • Changes in behavior of pyteomics.auxiliary.filter() and pyteomics.auxiliary.qvalues():
    • both functions now always return DataFrames when given pandas.DataFrame input and full_output=True;
    • string values of key, is_decoy and pep are substituted with simple itemgetter functions for non-pandas, non-numpy input;
    • additional parameters score_label, decoy_label, pep_label, and q_label for output control.
  • Performance optimizations in XML parsing code.
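
The substitution of string arguments with itemgetter functions mentioned above can be pictured with a minimal sketch. This is not the Pyteomics implementation: the helper name as_callable and the sample PSM dicts are hypothetical, and only the general idea (a string becomes a field lookup, a callable is used as given) is taken from the note.

```python
from operator import itemgetter


def as_callable(key):
    """Return `key` unchanged if it is callable; wrap a string in itemgetter."""
    if isinstance(key, str):
        return itemgetter(key)
    return key


# Hypothetical PSMs stored as plain dicts (non-pandas, non-numpy input).
psms = [
    {'score': 12.5, 'decoy': False},
    {'score': 3.1, 'decoy': True},
]

score_of = as_callable('score')                   # string -> itemgetter('score')
is_decoy = as_callable(lambda psm: psm['decoy'])  # callable is passed through

scores = [score_of(p) for p in psms]
decoys = [is_decoy(p) for p in psms]
```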




New submodule pyteomics.featurexml with a parser for OpenMS featureXML files.


  • mzML and mzIdentML parsers can now create an index of element offsets. This allows quick random access to elements by unique ID.
  • mzML parsers now come in two flavors: pyteomics.mzml.MzML and pyteomics.mzml.PreIndexedMzML. The latter uses the byte offsets listed at the end of the file.
  • New parameters convert_arrays and read_charges in allow using it without numpy and possibly improve performance. The default behavior is retained.
  • Performance optimizations in and parser.cleave().
  • New decoy generation mode called “fused decoy”, described in the paper accepted to JASMS.
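
The idea behind an index of element offsets can be sketched with the standard library alone. The example below assumes a toy line-based format in which each record starts with a line beginning with ">"; real mzML and mzIdentML files are XML, and the function names here are hypothetical rather than part of Pyteomics. It only illustrates why a stored byte offset makes random access by ID cheap.

```python
import io


def build_offset_index(stream, id_prefix=b'>'):
    """Map record ID -> byte offset of the line that starts the record."""
    index = {}
    stream.seek(0)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        if line.startswith(id_prefix):
            record_id = line[len(id_prefix):].strip().decode()
            index[record_id] = offset
    return index


def read_record(stream, index, record_id, id_prefix=b'>'):
    """Seek straight to a record and return its body lines."""
    stream.seek(index[record_id])
    stream.readline()  # skip the ID line itself
    body = []
    for line in iter(stream.readline, b''):
        if line.startswith(id_prefix):
            break  # the next record begins here
        body.append(line.strip().decode())
    return body


# Toy data standing in for a large file on disk.
data = io.BytesIO(b'>scan=1\n100.0 5\n>scan=2\n200.0 7\n201.0 9\n')
index = build_offset_index(data)
second = read_record(data, index, 'scan=2')
```

The index is built in one sequential pass; afterwards any record can be fetched with a single seek instead of re-reading the file from the start.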

API changes

  • pyteomics.parser.cleave() no longer accepts the labels argument. It is emphasized that the input sequences are expected to be in plain one-letter notation, but no checks are performed.
  • DataFrame() functions in pepxml and tandem now extract more protein-related information. The list-like protein-related values can be reported as lists or packed into strings, depending on the optional parameter sep. Some column names have changed as a result.
  • Call signatures of pyteomics.fasta.decoy_sequence() and the functions using it are slightly changed. Standard modes are now also exposed as individual functions.


New submodule pyteomics.mass.unimod contains rewritten machinery for handling Unimod relational databases (contributed by Joshua Klein). It replaces and extends the old mass.Unimod class. pyteomics.mass.unimod requires SQLAlchemy.

Other changes:



This release offers integration with the great pandas library. Working with qvalues() and filter() functions is now much easier if you have your PSMs in a DataFrame. Many search engines use CSV as their output format, allowing direct creation of DataFrame objects. New functions pyteomics.tandem.DataFrame() and pyteomics.pepxml.DataFrame() facilitate creation of DataFrames from corresponding formats.

Also, qvalues(), filter() and fdr() functions can now use posterior error probabilities (PEPs) instead of using decoys for q-value calculation.

  • In qvalues() and filter() functions, key and is_decoy can now be array-like objects or strings (as well as functions and iterators). If a string is given, it is used as a field name in the PSM array or DataFrame. fdr() functions also support strings and iterables as arguments.
  • New parameter pep in qvalues(), filter() and fdr() functions. It can be a callable, an array-like object, or an iterator. It conflicts with decoy-related parameters. It is compatible with key, but makes key optional.
  • Fixed the behavior of filter.chain() functions. They now treat the full_output argument the same way as filter() functions.
  • Fixed the issue that caused exceptions when calling fasta.decoy_db() and fasta.write_decoy_db() with explicitly given mode (signature for creation of pyteomics.auxiliary.FileReader objects slightly changed).
  • Pyteomics now uses setuptools and is a namespace package.
  • Minor fixes.
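
One common way to turn posterior error probabilities into q-values, shown here as a sketch and not necessarily the exact procedure used by Pyteomics, is to sort PSMs by PEP and take the running mean: the expected fraction of false hits among the best n PSMs is the average of their PEPs. The function name qvalues_from_pep and the sample values are hypothetical.

```python
def qvalues_from_pep(peps):
    """Return (pep, q) pairs sorted by ascending PEP.

    q for the n-th best PSM is the mean PEP of the n best PSMs.
    """
    ordered = sorted(peps)
    result = []
    total = 0.0
    for n, pep in enumerate(ordered, start=1):
        total += pep
        result.append((pep, total / n))
    return result


# Hypothetical PEPs for three PSMs.
pairs = qvalues_from_pep([0.2, 0.01, 0.05])
```

Because the PEPs are sorted in ascending order, the running mean never decreases, so the resulting q-values are monotonic without any extra step. No decoy labels are needed, which is why pep conflicts with the decoy-related parameters.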

API changes

  • Default value of remove_decoy in qvalues() is now False.



  • XML parsers are now implemented as objects, each format has its own class. Those classes can be instantiated using the same arguments as read() functions accepted, and support direct iteration and the with syntax. The read() functions are now simple aliases to the corresponding constructors.
  • As a result, functions iterfind(), version_info() and get_by_id() are now deprecated in favor of methods iterfind() and get_by_id() and attribute version_info of corresponding instances.
  • In pyteomics.mgf.write(), the order of keys and the format of values are now controlled via module-level variables.
  • In pyteomics.electrochem, correction for pK of terminal groups depending on the terminal residue is implemented; example set of pK and corrected pK added.
  • Imports of external dependencies are delayed where possible, so that unnecessary ImportErrors do not occur.
  • local_fdr() renamed to qvalues() in pepxml, mzid, tandem and auxiliary. The name local_fdr() did not reflect the semantics of the function. The algorithm has also been corrected so that the array of q-values is always sorted (as it should be by definition).
  • qvalues() now also accepts a parameter full_output which keeps the PSMs alongside their scores and associated q-values.
  • All fdr(), qvalues(), and filter() functions now accept a new parameter correction. It is used for more accurate estimation of the number of false positives using TDA (paper with explanation).
  • filter() functions now support both iterator protocol and context manager protocol. They now also accept the full_output parameter, which has the following meaning: if True (default), then an array of PSMs is directly returned by the function. Otherwise, an iterator is returned, as before. The array takes some memory, but this way is usually around 2x faster.
  • New function pyteomics.pylab_aux.plot_qvalue_curve().
  • pyteomics.mass.Composition objects now have a mass() method (equivalent to pyteomics.mass.calculate_mass()).
  • Also, Composition and objects returned by pyteomics.parser.amino_acid_composition() now inherit from collections.defaultdict and collections.Counter.
  • Decoy-related functions in pyteomics.fasta now accept a new parameter keep_nterm that preserves the N-terminal residue in the generated decoy sequences.
  • Minor fixes.
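
The note on q-values being sorted by definition can be illustrated with a small target-decoy sketch. It assumes that higher scores are better and uses the plain decoys/targets ratio as the FDR estimate, without the correction parameter; the function name tda_qvalues and the sample PSMs are hypothetical, and this is not the Pyteomics code.

```python
def tda_qvalues(scored):
    """scored: iterable of (score, is_decoy); higher score means better.

    Returns a list of (score, is_decoy, q) sorted from best to worst.
    """
    ordered = sorted(scored, key=lambda item: item[0], reverse=True)

    # Raw FDR estimate at each score threshold: decoys / targets so far.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this or any more permissive threshold.
    # The backward running minimum is what makes the array monotonic.
    qvals = fdrs[:]
    for i in range(len(qvals) - 2, -1, -1):
        qvals[i] = min(qvals[i], qvals[i + 1])

    return [(s, d, q) for (s, d), q in zip(ordered, qvals)]


# Hypothetical PSMs: one decoy in third place.
psms = [(7, False), (10, False), (8, True), (6, False), (9, False)]
table = tda_qvalues(psms)
qs = [q for _, _, q in table]
```

Here the raw FDR rises to 0.5 right after the decoy and then falls again as more targets are accepted; the backward minimum flattens that bump, so a PSM never gets a worse q-value than a lower-scoring one.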

API changes


Fix for a memory leak in pyteomics.mzid.get_by_id(), which affects with retrieve_refs=True.


  • New functions local_fdr() in pepxml, mzid, and tandem. The function returns a NumPy array with PSM scores and corresponding values of local FDR.
  • New parameter iterative in read() functions of XML parsing modules. Parsing of mzIdentML files with retrieve_refs=True got significantly faster.


  • Universally applicable modifications are now allowed in pyteomics.parser.isoforms().
  • It is now also possible to specify non-terminal modifications which are only applicable to terminal residues.
  • Fix in pyteomics.parser.parse(): if the labels argument is provided, it needs to contain standard terminal groups if they are present in the sequence or if show_unmodified_termini is set to True.
  • pyteomics.mass.Composition instances are now pickleable.
  • Performance improvements.


  • New parameter reverse in all filter() functions.
  • New function pyteomics.mass.fast_mass2(), which is analogous to pyteomics.mass.fast_mass(), but supports full modX notation and is several times slower.
  • Fix in for compatibility with files produced with Mascot2XML utility.
  • Unknown labels now allowed in pyteomics.electrochem and pyteomics.achrom functions in accordance with new general policy.



API changes

  • The boolean overlap parameter in pyteomics.parser.cleave() is replaced with an integer min_length. Since min_length uses pyteomics.parser.length(), the labels keyword argument is now accepted by cleave() and num_sites(), if needed. With carefully designed cleavage rules, all cleavage functions work with modX sequences.
  • The labels argument in pyteomics.parser.parse() and related functions has changed its meaning. parse() won’t raise an exception for non-standard labels in sequences if the labels keyword argument is not given.
  • The modX notation specification is now more strict to avoid ambiguity: only zero or two terminal groups can be present in a modX sequence. Sequences with one terminal group specified will be supported where possible, but be advised that sequences such as “H-OH” are intrinsically ambiguous.
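
The effect of a minimum-length filter on cleavage products can be sketched as follows. The rule used here is a simplified trypsin-like regular expression (cleave after K or R unless followed by P), missed cleavages are not considered, and the function cleave_simple is hypothetical; it is not pyteomics.parser.cleave() and assumes plain one-letter sequences.

```python
import re


def cleave_simple(sequence, rule=r'[KR](?!P)', min_length=1):
    """Split `sequence` after every match of `rule`; keep long-enough pieces."""
    # Cleavage happens right after each matched residue.
    sites = [m.end() for m in re.finditer(rule, sequence)]
    bounds = [0] + sites
    if bounds[-1] != len(sequence):
        bounds.append(len(sequence))
    peptides = {sequence[a:b] for a, b in zip(bounds, bounds[1:])}
    return {p for p in peptides if len(p) >= min_length}


all_peptides = cleave_simple('AKRPGKC')
long_peptides = cleave_simple('AKRPGKC', min_length=2)
```

With modX sequences, peptide length is no longer the same as string length, which is why the note above ties min_length to pyteomics.parser.length() rather than to len().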


  • Added the ratio keyword argument for FDR calculation.
  • Minor changes in iterfind() functions of file parsers.
  • Bugfix in pyteomics.mgf.write() (duplication of pepmass key).
  • Removed non-functional parameter read_schema for


  • Bugfix in pyteomics.mass.most_probable_isotopic_composition(). The bug manifested itself after version 2.4.0, when pyteomics.mass.nist_mass was expanded. Also, the format of the returned value is now in accordance with the documentation.



  • New functions for filtering to a certain FDR level based on target-decoy strategy, as well as for FDR estimation, in pyteomics.tandem, pyteomics.pepxml and pyteomics.mzid. The functions are called filter() (beware of shadowing the built-in function) and fdr() (in each of the modules). Chained versions filter.chain() and filter.chain.from_iterable() are also available. See Data Access for more info.

  • New function pyteomics.parser.coverage() for sequence coverage calculation.

  • New function pyteomics.fasta.decoy_chain(), a chained version of pyteomics.fasta.decoy_db().

  • New elements in pyteomics.mass.nist_mass. Pretty much all elements are there now.

  • Fix in pyteomics.parser.parse() to cover some fancy corner cases.

  • Bugfix in pyteomics.tandem: modification info is now fully extracted.

  • pyteomics.mass.isotopic_composition_abundance() is now able to calculate abundances for larger molecules.


    Rounding errors may be significant in this case.
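
Sequence coverage, as introduced above, is the fraction of protein residues that fall inside at least one identified peptide. A minimal sketch under simplifying assumptions (plain one-letter sequences, exact substring matching, hypothetical function name simple_coverage) could look like this; it is not the code of pyteomics.parser.coverage().

```python
def simple_coverage(protein, peptides):
    """Fraction of residues in `protein` covered by any of `peptides`."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        # Mark every occurrence, including overlapping ones.
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


cov = simple_coverage('PEPTIDEK', ['PEP', 'DEK'])
```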


  • New parameter “read_schema” in read() functions of XML parsing modules. When set to False, disables the attempts to fetch an auxiliary file and obtain structure information about the file being parsed.
  • New function chain() in all modules that have a read() function, for convenient chaining of multiple files. chain() only works as a context manager. Use itertools.chain() in other cases. The chain.from_iterable form is also available as a context manager.
  • New function pyteomics.auxiliary.print_tree() for exploration of complex nested dicts produced by XML parsers.
  • New sets of retention coefficients in pyteomics.achrom.
  • Bugfix in pyteomics.pepxml. The bug caused an exception when parsing some pepXML files.
  • The output of now always contains a masked array of charges.
  • Other minor fixes.
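
A chained reader that works as a context manager can be sketched with contextlib.ExitStack and itertools.chain. The example reads plain text lines instead of parsed spectra or PSMs, and the name chain_files is hypothetical; it only illustrates the pattern of opening several files, iterating over them as one sequence, and closing all of them on exit.

```python
import contextlib
import itertools
import os
import tempfile


@contextlib.contextmanager
def chain_files(*paths):
    """Open all paths, yield one iterator over their lines, close on exit."""
    with contextlib.ExitStack() as stack:
        handles = [stack.enter_context(open(p)) for p in paths]
        yield itertools.chain.from_iterable(handles)


# Two small temporary files stand in for real data files.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [('a.txt', 'one\ntwo\n'), ('b.txt', 'three\n')]:
    path = os.path.join(tmpdir, name)
    with open(path, 'w') as f:
        f.write(text)
    paths.append(path)

with chain_files(*paths) as lines:
    collected = [line.strip() for line in lines]
```

Outside a with block there is nothing to guarantee that the files get closed, which is in line with the advice above to use itertools.chain() directly in other cases.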

API change

  • In the precursor charge is now always represented by a list of ints (a ChargeList object).



  • Update parsers for FASTA headers.
  • NamedTuple for FASTA entries is now defined globally, which should solve pickling problems.


  • New module pyteomics.tandem for reading output files of X!Tandem search engine.


  • Fix in pyteomics.pepxml. pepXML files generated by TPP are now processed without errors.


  • Fix in pyteomics.pepxml. ‘modified_peptide’ is now always available.
  • Fix in pyteomics.mass (issue #2 in the bug tracker).
  • Improved arithmetic for Composition objects.


  • In fasta, decoy_db() now doesn’t write to file, but returns an iterator over FASTA records. The old decoy_db() is now called write_decoy_db(), which is equivalent to decoy_db() combined with write().
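
The split between generating decoys and writing them can be sketched with a generator and a separate writer. The sketch assumes records are (description, sequence) tuples, that decoys are reversed sequences, that targets are yielded alongside decoys, and that the decoy prefix is "DECOY_"; all of these, and the function names, are illustrative assumptions rather than the behavior of pyteomics.fasta.

```python
import io


def decoy_records(records, prefix='DECOY_'):
    """Yield each target record followed by its reversed decoy."""
    for description, sequence in records:
        yield description, sequence
        yield prefix + description, sequence[::-1]


def write_records(records, out):
    """Write (description, sequence) records in FASTA-like form."""
    for description, sequence in records:
        out.write('>' + description + '\n' + sequence + '\n')


targets = [('sp|P1', 'PEPTIDE')]

# Iterating gives records without touching any file...
as_list = list(decoy_records(targets))

# ...and combining the generator with a writer reproduces the old behavior.
out = io.StringIO()
write_records(decoy_records(targets), out)
fasta_text = out.getvalue()
```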


  • In, the charges, if present, are returned as a masked array now. Previously, an exception occurred if charges were missing for some of the fragments.
  • Values in mass.nist_mass corrected.
  • Other minor corrections.


  • Adjust the behavior affected by the bug fixed in 2.1.2. name attributes of <cvParam> elements in the absence of value attributes are now collected in a list under the ‘name’ key.
  • Add support for overlapping matches in parser.cleave().


  • Bugfix in XML parsers. The bug caused the mzML parser to break on some files. The fix can slightly change the format of the output.


  • Rename keys in the dicts returned by to facilitate writing code working with both MGF and mzML.
  • The items yielded by now have attributes description and sequence.


  • New sets of retention coefficients in achrom.
  • mass.Composition now only stores non-zero ints.
  • fasta now has tools for parsing of FASTA headers.
  • File parsers now implement the context manager protocol. We recommend using with statements to avoid resource leaks.

API changes

  • ‘pepmass’ is now a tuple in the output of (to allow reading precursor intensities).
  • new function fasta.parse() for convenient parsing of FASTA headers.
  • fasta.std_parsers stores parsers for common UniProt header formats.
  • new parameter parser in allows to apply parsing while reading a FASTA file.
  • close parameter removed in all functions that do file I/O. The unified behavior is: if the parameter is a file object, it won’t be closed by the function. If a file path is given, the file object will be created and closed inside the corresponding function.
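
The unified file-handling rule described above (a file object passed in is left open, a path is opened and closed internally) can be sketched as a small context manager. The names file_or_path and count_lines are hypothetical, and the check for read/write attributes is just one simple way to tell a file object from a path.

```python
import contextlib
import io


@contextlib.contextmanager
def file_or_path(source, mode='r'):
    """Yield a file object; close it only if it was opened here."""
    if hasattr(source, 'read') or hasattr(source, 'write'):
        yield source  # the caller owns this object, so leave it open
    else:
        with open(source, mode) as handle:
            yield handle  # opened here, closed here


def count_lines(source):
    with file_or_path(source) as handle:
        return sum(1 for _ in handle)


buffer = io.StringIO('a\nb\nc\n')
n_lines = count_lines(buffer)
still_open = not buffer.closed
```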


  • Added new class pyteomics.mass.Unimod. The interface is experimental and may change.
  • Improved iterfind() function in XML-reading modules.
  • pyteomics.mass.Composition objects now support multiplication by int.
  • Bugfix in auxiliary.linear_regression().



API changes


  • Added mzid module for parsing of mzIdentML files.
  • Fixed bugs, improved tests.

API changes

  • top-module functions in fasta, mgf, mzml, pepxml, as well as mzid, are now called read().
  • in parser, parse_sequence() renamed to parse(). It now accepts an optional parameter allow_unknown_modifications.
  • mgf.write_mgf() and fasta.write_fasta() renamed to write().
  • the output format of all read() functions has changed.



  • Changes in pyteomics.mass.

API changes

  • Composition objects can be created using positional first argument, which will be treated as a sequence or (upon failure) as a formula. This means that all functions relying on Composition (calculate_mass(), most_probable_isotopic_composition(), isotopic_composition_abundance()) allow that as well. However, it’s of no use for the latter.
  • Composition entries for modifications can be added to aa_comp and used in composition and mass calculations. This way the specified group will be added to any residue bearing this modification.
  • That being said, the add_modifications() function is not needed anymore and has been removed.
  • Addition and subtraction of Composition objects now produces a Composition object, allowing addition/subtraction of multiple objects.
  • Composition is now a subclass of collections.defaultdict so one can safely retrieve values without checking if a key exists.
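
The arithmetic and missing-key behavior described above can be mimicked with a small defaultdict subclass. MiniComposition is a hypothetical stand-in, not pyteomics.mass.Composition: it has no sequence or formula parsing and only shows that addition and subtraction return the same type, so operations can be chained, and that absent elements read as zero.

```python
from collections import defaultdict


class MiniComposition(defaultdict):
    """Element counts with +, - and zero for missing elements."""

    def __init__(self, *args, **kwargs):
        super().__init__(int)
        self.update(dict(*args, **kwargs))

    def __missing__(self, key):
        # Report zero without storing the key.
        return 0

    def _combine(self, other, sign):
        result = MiniComposition(self)
        for element, count in other.items():
            result[element] += sign * count
        # Keep only non-zero counts.
        for element in [e for e, c in result.items() if c == 0]:
            del result[element]
        return result

    def __add__(self, other):
        return self._combine(other, 1)

    def __sub__(self, other):
        return self._combine(other, -1)


water = MiniComposition(H=2, O=1)
glycine = MiniComposition(C=2, H=5, N=1, O=2)

residue = glycine - water            # glycine residue in a peptide chain
total = glycine + glycine - water    # chained operations keep the type
missing = residue['S']               # no KeyError for an absent element
```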


API changes


  • Bugfix in pyteomics.pepxml: modification info is now extracted.
  • New optional bool argument ‘split’ in pyteomics.parser.parse_sequence() makes it possible to generate a list of tuples where modifications are separated from the residues instead of a regular list of labels. labels may now contain not only modX labels but also separate mod prefixes. Such modifications are assumed to be applicable to any residue.


  • Memory usage significantly decreased when parsing large mzML and pepXML files.


  • Added support for Python 3. Python 2.7 is still supported, Python 2.6 is not.


  • New function called add_modifications() added in pyteomics.mass. It updates aa_comp.
  • Also, pyteomics.parser.isoforms() is a new function to get all possible modified sequences of a peptide.
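
Enumerating all possible modified forms of a peptide is a Cartesian product over the per-residue choices. The sketch below assumes a plain one-letter input sequence and a dict that maps a modification prefix to the residues it may decorate; the function name simple_isoforms is hypothetical, and terminal modifications, fixed modifications and limits on the number of modifications are left out.

```python
from itertools import product


def simple_isoforms(sequence, variable_mods):
    """Return the set of modX strings for all variable-modification states.

    variable_mods: dict mapping a mod prefix (e.g. 'ox') to residues.
    """
    options = []
    for residue in sequence:
        choices = [residue]  # the unmodified residue is always an option
        for mod, residues in variable_mods.items():
            if residue in residues:
                choices.append(mod + residue)
        options.append(choices)
    return {''.join(combo) for combo in product(*options)}


forms = simple_isoforms('MAM', {'ox': ['M']})
```

The number of isoforms grows exponentially with the number of modifiable residues, so for long peptides it is usually worth consuming such a result lazily or capping the number of modifications.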


  • New module added: pyteomics.mgf. It is intended for reading and writing files in Mascot Generic Format.


  • In pyteomics.pepxml module, now all search hits are read from file (not only the top hit).

API changes:

  • information specific to search hits is now stored in a list under the 'search_hits' key. The list is sorted by hit rank.



  • The first public release of Pyteomics.

API changes:
