Pyteomics documentation v2.1.5

Data access

The following section is dedicated to data manipulation. Pyteomics aims to support the most common formats of (LC-)MS/MS data, peptide identification files, and protein databases.

mzML

mzML is an XML-based format for experimental data obtained on MS/MS or LC-MS setups. The pyteomics.mzml module provides access to the information contained in .mzML files from Python.

The main function in this module is pyteomics.mzml.read(). It allows the user to iterate through MS/MS spectra contained in an mzML file. Here is an example of its output:

>>> from pyteomics import mzml
>>> with mzml.read('tests/test.mzML') as reader:
...     print(next(reader))  # Retrieve the first spectrum from the file and print it.
{'MSn spectrum': '',
 'base peak intensity': 1471973.875,
 'base peak m/z': 810.415283203125,
 'defaultArrayLength': '19914',
 'highest observed m/z': 2000.0099466203771,
 'id': 'controllerType=0 controllerNumber=1 scan=1',
 'index': '0',
 'intensity array': array([ 0.,  0.,  0., ...,  0.,  0.,  0.], dtype=float32),
 'lowest observed m/z': 200.00018816645022,
 'm/z array': array([  200.00018817,   200.00043034,   200.00067252, ...,  1999.96151259,
                      1999.98572931,  2000.00994662]),
 'ms level': 1,
 'no combination': '',
 'positive scan': '',
 'precursorList': [],
 'profile spectrum': '',
 'scanList': [{'[Thermo Trailer Extra]Monoisotopic M/Z:': '810.41522216796875',
               'filter string': 'FTMS + p ESI Full ms [200.00-2000.00]',
               'preset scan configuration': '1',
               'scan start time': '0.0049350000000000002',
               'scanWindowList': [{'scan window lower limit': '200',
                                   'scan window upper limit': '2000'}]}],
 'total ion current': 15245068.0}

At the moment, the interface of pyteomics.mzml.read() is relatively low-level: it iterates through the spectra in the file and returns each one as a dict containing selected fields. This interface may change in subsequent releases of Pyteomics.
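Because each spectrum arrives as a plain dict, downstream filtering is ordinary dictionary work. The following is a minimal sketch that assumes the field names shown in the example output ('ms level', 'scanList', 'scan start time'). The helper names scan_start_time and select_ms_level are hypothetical and are not part of Pyteomics. Some numeric values appear as strings in the example output, so the sketch converts them explicitly.

```python
def scan_start_time(spectrum):
    """Return the start time of the first scan as a float, or None if absent."""
    scans = spectrum.get('scanList') or []
    if not scans or 'scan start time' not in scans[0]:
        return None
    # The value is a string in the example output, so convert it.
    return float(scans[0]['scan start time'])


def select_ms_level(spectra, level):
    """Lazily yield only the spectra recorded at the given MS level."""
    for spectrum in spectra:
        if spectrum.get('ms level') == level:
            yield spectrum


# A trimmed-down stand-in for one dict yielded by mzml.read().
example = {
    'id': 'controllerType=0 controllerNumber=1 scan=1',
    'ms level': 1,
    'scanList': [{'scan start time': '0.0049350000000000002'}],
}

ms1 = list(select_ms_level([example], 1))
start = scan_start_time(example)
```

In practice, the spectra argument would be the reader object obtained from mzml.read() rather than a hand-written list.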

MGF

Mascot Generic Format (MGF) is a simple human-readable format for MS/MS data. It allows storing MS/MS peak lists and experimental parameters. pyteomics.mgf is a module that implements reading and writing MGF files.

Similar to pyteomics.mzml, pyteomics.mgf has a read() function that iterates through spectrum entries. Spectra are represented as dictionaries. MS/MS peak lists are stored as numpy.array objects under the 'm/z array' and 'intensity array' keys, with fragment charges under 'charge array'. Parameters are stored as a dict under the 'params' key.

Here is an example of use:

>>> from pyteomics import mgf
>>> with mgf.read('tests/test.mgf') as reader:
...     print(next(reader))  # Retrieve the first spectrum from the file and print it.
{'m/z array': array([  345.1,   370.2,   460.2,  1673.3,  1674. ,  1675.3]),
'charge array': array([ 3,  2,  1,  1,  1,  1]),
'params': {'username': 'Lou Scene', 'useremail': 'leu@altered-state.edu',
'mods': 'Carbamidomethyl (C)', 'itolu': 'Da', 'title': 'Spectrum 2',
'rtinseconds': '25', 'itol': '1', 'charge': '2+ and 3+',
'mass': 'Monoisotopic', 'it_mods': 'Oxidation (M)',
'pepmass': (1084.9, 1234.0),
'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
'scans': '3'},
'intensity array': array([  237.,   128.,   108.,  1007.,   974.,    79.])}
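As an illustration of working with this structure, the sketch below finds the most intense peak of a spectrum dict. It assumes only the 'm/z array' and 'intensity array' keys shown in the output above. The helper name base_peak is hypothetical, and plain lists stand in for the numpy arrays so that the snippet has no dependencies.

```python
def base_peak(spectrum):
    """Return (m/z, intensity) of the most intense peak, or None if empty."""
    mz = spectrum['m/z array']
    intensity = spectrum['intensity array']
    if len(mz) == 0:
        return None
    # Find the position of the largest intensity, then report both values.
    best = max(range(len(intensity)), key=lambda i: intensity[i])
    return float(mz[best]), float(intensity[best])


# Plain lists stand in for the numpy arrays from the example output.
spectrum = {
    'm/z array': [345.1, 370.2, 460.2, 1673.3, 1674.0, 1675.3],
    'intensity array': [237.0, 128.0, 108.0, 1007.0, 974.0, 79.0],
    'params': {'title': 'Spectrum 2', 'scans': '3'},
}

peak = base_peak(spectrum)
```

Indexing by position works the same way on lists and on one-dimensional numpy arrays, so the helper should apply unchanged to dicts coming from read().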

pyteomics.mgf can also extract the header with general parameters from an MGF file using the read_header() function, which likewise returns a dict.

>>> header = mgf.read_header('tests/test.mgf')
>>> print(header)
{'username': 'Lou Scene', 'itol': '1', 'useremail': 'leu@altered-state.edu',
'mods': 'Carbamidomethyl (C)', 'it_mods': 'Oxidation (M)',
'charge': '2+ and 3+', 'mass': 'Monoisotopic', 'itolu': 'Da',
'com': 'Taken from http://www.matrixscience.com/help/data_file_help.html'}

MGF files are created with the write() function. The user can specify the header, a list of spectra in the same format as returned by read(), and the output path.

>>> spectra = mgf.read('tests/test.mgf')
>>> mgf.write(spectra=spectra, header=header)
USERNAME=Lou Scene
ITOL=1
USEREMAIL=leu@altered-state.edu
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
CHARGE=2+ and 3+
MASS=Monoisotopic
ITOLU=Da
COM=Taken from http://www.matrixscience.com/help/data_file_help.html

BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.6 73.0
846.8 44.0
847.6 67.0
1640.1 291.0
1640.6 54.0
1895.5 49.0
END IONS

BEGIN IONS
TITLE=Spectrum 2
RTINSECONDS=25
PEPMASS=1084.9
SCANS=3
345.1 237.0
370.2 128.0
460.2 108.0
1673.3 1007.0
1674.0 974.0
1675.3 79.0
END IONS
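Since write() takes spectra in the same format that read() returns, spectra can be filtered or edited as ordinary dicts before they are written. The sketch below keeps only spectra with a minimum number of peaks. The helper name with_min_peaks and the second, truncated spectrum are hypothetical and serve only as an illustration.

```python
def with_min_peaks(spectra, min_peaks):
    """Yield only the spectra whose peak list has at least min_peaks entries."""
    for spectrum in spectra:
        if len(spectrum['m/z array']) >= min_peaks:
            yield spectrum


# Stand-ins for dicts yielded by mgf.read(); lists replace numpy arrays.
spectra = [
    {'m/z array': [846.6, 846.8, 847.6, 1640.1, 1640.6, 1895.5],
     'intensity array': [73.0, 44.0, 67.0, 291.0, 54.0, 49.0],
     'params': {'title': 'Spectrum 1'}},
    # Hypothetical entry, not present in the test file.
    {'m/z array': [345.1],
     'intensity array': [237.0],
     'params': {'title': 'Truncated spectrum'}},
]

kept = list(with_min_peaks(spectra, 3))
```

The resulting sequence could then be passed as the spectra argument of mgf.write(), in the same way as in the call shown above.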

pepXML

pepXML is a widely used XML-based format for peptide identifications. It contains information about the MS data, the parameters of the search engine used, and the assigned sequences. To access these data, use the pyteomics.pepxml module.

pyteomics.pepxml has the same structure as pyteomics.mzml. The function pyteomics.pepxml.read() iterates through peptide-spectrum matches in a .pepXML file and returns each of them as a dict.

>>> from pyteomics import pepxml
>>> reader = pepxml.read('tests/test.pep.xml')
>>> print(next(reader))
{'end_scan': 100,
'index': 1,
'assumed_charge': 1,
'spectrum': 'pps_sl20060731_18mix_25ul_r1_1154456409.0100.0100.1',
'search_hit': [
    {'hit_rank': 1,
    'calc_neutral_pep_mass': 860.892,
    'modifications': [],
    'modified_peptide': 'SLNGEWR',
    'peptide': 'SLNGEWR',
    'num_matched_ions': 11,
    'search_score': {
        'spscore': 894.0,
        'sprank': 1.0,
        'deltacnstar': 0.0,
        'deltacn': 0.081,
        'xcorr': 1.553},
    'proteins': [
        {'num_tol_term': 2,
        'protein': 'sp|P00722|BGAL_ECOLI',
        'peptide_next_aa': 'F',
        'protein_descr': 'BETA-GALACTOSIDASE (EC 3.2.1.23) (LACTASE) - Escherichia coli.',
        'peptide_prev_aa': 'R'}],
    'num_missed_cleavages': 0,
    'analysis_result': [
        {'peptideprophet_result':
            {'parameter': {'massd': -0.5, 'nmc': 0.0, 'ntt': 2.0, 'fval': 1.4723},
            'all_ntt_prob': [0.0422, 0.509, 0.96],
            'probability': 0.96},
            'analysis': 'peptideprophet'}],
    'tot_num_ions': 12,
    'num_tot_proteins': 1,
    'is_rejected': False,
    'massdiff': -0.5}],
'precursor_neutral_mass': 860.392,
'start_scan': 100}
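The nesting of this dict can be navigated with plain indexing. The following sketch pulls the peptide, the xcorr score and the PeptideProphet probability of the best-ranked hit. It assumes the key layout shown in the output above. The helper name top_hit_summary is hypothetical, and the sample dict is a trimmed copy of that output.

```python
def top_hit_summary(psm):
    """Return (spectrum, peptide, xcorr, probability) for the best-ranked hit."""
    hits = psm.get('search_hit') or []
    if not hits:
        return None
    best = min(hits, key=lambda hit: hit['hit_rank'])
    probability = None
    # analysis_result is a list; pick the PeptideProphet entry if present.
    for result in best.get('analysis_result', []):
        if 'peptideprophet_result' in result:
            probability = result['peptideprophet_result']['probability']
    xcorr = best.get('search_score', {}).get('xcorr')
    return psm['spectrum'], best['peptide'], xcorr, probability


# Trimmed copy of the example output above.
psm = {
    'spectrum': 'pps_sl20060731_18mix_25ul_r1_1154456409.0100.0100.1',
    'search_hit': [
        {'hit_rank': 1,
         'peptide': 'SLNGEWR',
         'search_score': {'xcorr': 1.553, 'deltacn': 0.081},
         'analysis_result': [
             {'peptideprophet_result': {'probability': 0.96},
              'analysis': 'peptideprophet'}]}],
}

summary = top_hit_summary(psm)
```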

mzIdentML

mzIdentML is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standards Initiative.

The interface of the pyteomics.mzid module is similar to that of the other reader modules.

>>> from pyteomics import mzid
>>> with mzid.read('tests/test.mzid') as reader:
...     print(next(reader))
{'SpectrumIdentificationItem': [
    {'ProteinScape:IntensityCoverage': 0.3919545603809718,
    'PeptideEvidenceRef': [
        {'peptideEvidence_ref': 'PE1_SEQ_spec1_pep1'}],
    'passThreshold': True,
    'rank': 1,
    'chargeState': 1,
    'calculatedMassToCharge': 1507.695,
    'peptide_ref': 'prot1_pep1',
    'experimentalMassToCharge': 1507.696,
    'id': 'SEQ_spec1_pep1',
    'ProteinScape:SequestMetaScore': 7.59488518903425}],
'spectrumID': 'databasekey=1',
'id': 'SEQ_spec1',
'spectraData_ref': 'LCMALDI_spectra'}

You can tune the amount of information you get from the file. The available options to the read() function are recursive (True by default) and retrieve_refs (False by default). The latter pulls additional info from the file that is present only as references in the example above.

An additional function, get_by_id(), extracts information from any element by its unique ID.
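As an illustration of working with the dicts returned by read(), the sketch below computes the relative m/z error of each identification item in parts per million. It assumes the keys 'SpectrumIdentificationItem', 'calculatedMassToCharge' and 'experimentalMassToCharge' from the example output. The helper name mass_errors_ppm is hypothetical.

```python
def mass_errors_ppm(result):
    """Return a list of (item id, m/z error in ppm) for one spectrum result."""
    errors = []
    for item in result.get('SpectrumIdentificationItem', []):
        calc = item['calculatedMassToCharge']
        exp = item['experimentalMassToCharge']
        # Relative deviation of the measured value from the calculated one.
        errors.append((item['id'], (exp - calc) / calc * 1e6))
    return errors


# Trimmed copy of the example output above.
result = {
    'SpectrumIdentificationItem': [
        {'id': 'SEQ_spec1_pep1',
         'rank': 1,
         'chargeState': 1,
         'calculatedMassToCharge': 1507.695,
         'experimentalMassToCharge': 1507.696}],
    'spectrumID': 'databasekey=1',
    'id': 'SEQ_spec1',
}

errors = mass_errors_ppm(result)
```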

FASTA

To extract data from FASTA databases, use the pyteomics.fasta.read() function.

>>> from pyteomics import fasta
>>> proteins = list(fasta.read('/path/to/file/my.fasta'))

pyteomics.fasta.read() returns a generator object instead of a list to prevent excessive memory use. The generator yields (description, sequence) tuples, so it’s natural to use it as follows:

>>> from pyteomics import fasta
>>> for descr, seq in fasta.read('my.fasta'):
...     ...

You can also use attributes to access description and sequence:

>>> from pyteomics import fasta
>>> for protein in fasta.read('my.fasta'):
...     print(protein.description)
...     print(protein.sequence)

Note the newer, recommended syntax using the with statement:

>>> from pyteomics import fasta
>>> with fasta.read('my.fasta') as reader:
...     for descr, seq in reader:
...         ...
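Because entries are yielded one at a time, a database can be summarized without holding it in memory. The sketch below counts entries and residues from any iterable of (description, sequence) tuples. The helper name summarize is hypothetical, and a small in-memory list stands in for the reader.

```python
def summarize(entries):
    """Count entries and residues from an iterable of (description, sequence)."""
    count = 0
    residues = 0
    for descr, seq in entries:
        count += 1
        residues += len(seq)
    return count, residues


# An in-memory stand-in for the tuples yielded by fasta.read().
entries = [('Protein 1', 'PEPTIDE' * 1000), ('Protein 2', 'PEPTIDE' * 2000)]
totals = summarize(entries)
```

Passing the reader itself instead of a list keeps only one entry in memory at a time.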

You can specify a function that will be applied to the FASTA headers for your convenience. pyteomics.fasta.std_parsers has some pre-defined parsers that can be used for this purpose.
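A header-processing function of this kind is simply a callable that takes the description string and returns a parsed value. The sketch below is a hypothetical parser for headers of the form 'db|accession|entry_name description'. It is not one of the parsers in pyteomics.fasta.std_parsers, and the way such a function is passed to read() is not shown here. The sample header is composed for illustration from the protein name and description in the pepXML example.

```python
def parse_pipe_header(header):
    """Split 'db|accession|entry_name description' into a dict of parts."""
    identifier, _, description = header.partition(' ')
    parts = identifier.split('|')
    if len(parts) != 3:
        raise ValueError('unexpected header layout: %r' % header)
    db, accession, entry_name = parts
    return {'db': db,
            'id': accession,
            'entry': entry_name,
            'description': description}


header = ('sp|P00722|BGAL_ECOLI BETA-GALACTOSIDASE (EC 3.2.1.23) '
          '(LACTASE) - Escherichia coli.')
parsed = parse_pipe_header(header)
```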

You can also create a FASTA file using a sequence of (description, sequence) tuples.

>>> from pyteomics import fasta
>>> entries = [('Protein 1', 'PEPTIDE'*1000), ('Protein 2', 'PEPTIDE'*2000)]
>>> fasta.write(entries, 'target-file.fasta')

Another common task is to generate a decoy database. Pyteomics allows that by means of the pyteomics.fasta.decoy_db() function.

>>> from pyteomics import fasta
>>> fasta.decoy_db('mydb.fasta', 'mydb-with-decoy.fasta')

The only required argument is the first one, indicating the source database. The second argument is the target file and defaults to the system standard output.

If you need to modify a single sequence, use the pyteomics.fasta.decoy_sequence() function. It currently supports two modes: ‘reverse’ and ‘random’.

>>> from pyteomics import fasta
>>> fasta.decoy_sequence('PEPTIDE', 'reverse')
'EDITPEP'
>>> fasta.decoy_sequence('PEPTIDE', 'random')
'TPPIDEE'
>>> fasta.decoy_sequence('PEPTIDE', 'random')
'PTIDEPE'
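The two modes can be pictured with a few lines of plain Python. This is a sketch of the idea and not the library's implementation. In particular, treating the 'random' mode as a shuffle of the original residues is an assumption, although it is consistent with the outputs above, which are permutations of the input. The function names are hypothetical.

```python
import random


def reverse_decoy(sequence):
    """Return the sequence read backwards."""
    return sequence[::-1]


def shuffled_decoy(sequence, rng=None):
    """Return a random permutation of the residues (assumed 'random' mode)."""
    if rng is None:
        rng = random
    residues = list(sequence)
    rng.shuffle(residues)
    return ''.join(residues)


reversed_seq = reverse_decoy('PEPTIDE')
# A seeded generator makes the shuffle repeatable between runs.
shuffled_seq = shuffled_decoy('PEPTIDE', random.Random(0))
```

Both variants preserve the length and amino acid composition of the original sequence.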
