Pyteomics documentation v3.4.1

auxiliary - common functions and objects

Math

linear_regression_vertical() - a wrapper for NumPy linear regression that minimizes the sum of squared y errors.

linear_regression() - alias for linear_regression_vertical().

linear_regression_perpendicular() - a wrapper for NumPy linear regression that minimizes the sum of squared (perpendicular) distances between the points and the line.

Target-Decoy Approach

qvalues() - estimate q-values for a set of PSMs.

filter() - filter PSMs to specified FDR level using TDA or given PEPs.

filter.chain() - a chained version of filter().

fdr() - estimate FDR in a set of PSMs using TDA or given PEPs.

Project infrastructure

PyteomicsError - a pyteomics-specific exception.

Helpers

Charge - a subclass of int for charge states.

ChargeList - a subclass of list for lists of charges.

print_tree() - display the structure of a complex nested dict.

memoize() - makes a memoization function decorator.


exception pyteomics.auxiliary.PyteomicsError(msg)

Bases: exceptions.Exception

Exception raised for errors in Pyteomics library.

Attributes

message

__init__(msg)

pyteomics.auxiliary.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None)

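Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:

    FDR = \frac{N_{decoy}}{N_{target} \cdot ratio}

The second formula is:

    FDR = \frac{N_{decoy} \cdot (1 + \frac{1}{ratio})}{N_{total}}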

Note

This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.

Parameters:

psms : iterable, optional

An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.

formula : int, optional

Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.

is_decoy : callable, iterable, or str

If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

pep : callable, iterable, or str, optional

If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

Note

If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.

ratio : float, optional

The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.

correction : int or float, optional

Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

See this paper for further explanation.

Note

Requires numpy, if correction is a float or 2.

Note

Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).

Returns:

out : float

The estimated FDR, (roughly) between 0 and 1.
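To make the two estimates concrete, here is a minimal pure-Python sketch. It is not the library's implementation: it assumes formula 1 is decoys / (targets * ratio) and formula 2 is decoys * (1 + 1/ratio) / total, applies no correction, and takes the PEP-based estimate to be the mean PEP. The names tda_fdr and pep_fdr are hypothetical.

```python
def tda_fdr(decoy_flags, formula=1, ratio=1.0):
    """Estimate FDR from per-PSM decoy flags (sketch, no correction).

    Assumed formulas (not taken from the library source):
      1: N_decoy / (N_target * ratio)
      2: N_decoy * (1 + 1 / ratio) / N_total
    """
    flags = [bool(f) for f in decoy_flags]
    n_total = len(flags)
    n_decoy = sum(flags)
    n_target = n_total - n_decoy
    if formula == 1:
        # Raises ZeroDivisionError if there are no target PSMs.
        return n_decoy / (n_target * ratio)
    if formula == 2:
        return n_decoy * (1 + 1 / ratio) / n_total
    raise ValueError("formula must be 1 or 2")


def pep_fdr(peps):
    """Estimate FDR as the mean posterior error probability (assumption)."""
    peps = list(peps)
    return sum(peps) / len(peps)


# 95 target PSMs and 5 decoy PSMs, equal database sizes.
flags = [False] * 95 + [True] * 5
print(tda_fdr(flags, formula=1))  # 5 / 95
print(tda_fdr(flags, formula=2))  # 5 * 2 / 100
```

With ratio=1, formula 2 counts each decoy twice (once for itself, once for the false target it stands for) and divides by all PSMs, which is why it suits sets that still contain decoys.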

pyteomics.auxiliary.filter(*args, **kwargs)

Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.

Requires numpy and, optionally, pandas.

Parameters:

positional args : iterables

Iterables to read PSMs from. All positional arguments are chained. The rest of the arguments must be named.

fdr : float, keyword only, 0 <= fdr <= 1

Desired FDR level.

key : callable, keyword only

A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.

reverse : bool, keyword only, optional

If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.

is_decoy : callable, keyword only

A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.

remove_decoy : bool, keyword only, optional

Defines whether decoy matches should be removed from the output. Default is True.

Note

If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

formula : int, keyword only, optional

Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).

ratio : float, keyword only, optional

The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.

correction : int or float, keyword only, optional

Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

See this paper for further explanation.

pep : callable / array-like / iterable / str, keyword only, optional

If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

Note

If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

full_output : bool, keyword only, optional

If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.

Note

The name for the parameter comes from the fact that it is internally passed to qvalues().

**kwargs : passed to the chain() function.

Returns:

out : iterator or numpy.ndarray or pandas.DataFrame
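The filtering idea described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: it uses only the decoys-over-targets estimate with ratio 1 and no correction, keeps the longest score-ordered prefix whose estimated FDR does not exceed the threshold, and ignores tied scores. The function name filter_by_fdr and the dict-based PSMs are hypothetical.

```python
def filter_by_fdr(psms, fdr, key, is_decoy, reverse=False, remove_decoy=True):
    """Keep the longest best-scoring prefix whose estimated FDR <= fdr."""
    ordered = sorted(psms, key=key, reverse=reverse)
    n_decoy = 0
    n_target = 0
    cutoff = 0  # number of PSMs to keep from the top of the ranking
    for rank, psm in enumerate(ordered, start=1):
        if is_decoy(psm):
            n_decoy += 1
        else:
            n_target += 1
        # Running estimate: decoys / targets (infinite while no targets seen).
        estimate = n_decoy / n_target if n_target else float("inf")
        if estimate <= fdr:
            cutoff = rank
    kept = ordered[:cutoff]
    if remove_decoy:
        kept = [psm for psm in kept if not is_decoy(psm)]
    return kept


# Ten toy PSMs ranked by e-value; ranks 4 and 8 are decoys.
psms = [{"evalue": e, "decoy": e in (4, 8)} for e in range(1, 11)]
kept = filter_by_fdr(
    psms,
    fdr=0.2,
    key=lambda p: p["evalue"],
    is_decoy=lambda p: p["decoy"],
)
print([p["evalue"] for p in kept])
```

Here the prefix ends at rank 7 (one decoy against six targets); including rank 8 would raise the estimate to 2/6.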

pyteomics.auxiliary.linear_regression(x, y=None, a=None, b=None)

Alias of linear_regression_vertical().

pyteomics.auxiliary.linear_regression_perpendicular(x, y=None)

Calculate coefficients of a linear regression y = a * x + b. The fit minimizes perpendicular distances between the points and the line.

Requires numpy.

Parameters:

x, y : array_like of float

1-D arrays of floats. If y is omitted, x must be a 2-D array of shape (N, 2).

Returns:

out : 4-tuple of float

The structure is (a, b, r, stderr), where a is the slope coefficient, b is the free term, r is the Pearson correlation coefficient, and stderr is the standard deviation.
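For readers who want to see what a perpendicular (orthogonal) fit computes, here is a small NumPy-free sketch based on the standard closed-form solution. It is an assumption-laden illustration rather than the library's code: the name perpendicular_fit is hypothetical, it returns only (a, b, r), and it does not handle the degenerate case of zero covariance.

```python
import math


def perpendicular_fit(x, y):
    """Fit y = a * x + b by minimizing squared perpendicular distances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered sums of squares and cross-products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxy == 0:
        raise ValueError("slope is undefined when the covariance is zero")
    # Closed-form orthogonal-regression slope.
    a = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    b = mean_y - a * mean_x  # the line passes through the centroid
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return a, b, r


# Points lying exactly on y = 2x + 1.
a, b, r = perpendicular_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b, r)
```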

pyteomics.auxiliary.linear_regression_vertical(x, y=None, a=None, b=None)

Calculate coefficients of a linear regression y = a * x + b. The fit minimizes vertical distances between the points and the line.

Requires numpy.

Parameters:

x, y : array_like of float

1-D arrays of floats. If y is omitted, x must be a 2-D array of shape (N, 2).

a : float, optional

If specified then the slope coefficient is fixed and equals a.

b : float, optional

If specified then the free term is fixed and equals b.

Returns:

out : 4-tuple of float

The structure is (a, b, r, stderr), where a is the slope coefficient, b is the free term, r is the Pearson correlation coefficient, and stderr is the standard deviation.
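The following NumPy-free sketch illustrates an ordinary (vertical) least-squares fit with an optionally fixed slope or free term. It is not the library's implementation. The name vertical_fit is hypothetical, the handling of a fixed b (least squares with the intercept held constant) is an assumption, and stderr is assumed to be the population standard deviation of the residuals.

```python
import math


def vertical_fit(x, y, a=None, b=None):
    """Fit y = a * x + b by minimizing squared vertical distances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if a is None and b is None:
        # Ordinary least squares for both coefficients.
        a = sxy / sxx
        b = mean_y - a * mean_x
    elif b is None:
        # Slope fixed: the best free term is the mean of y - a * x.
        b = mean_y - a * mean_x
    elif a is None:
        # Free term fixed (assumed behavior): least-squares slope for y - b.
        a = sum(xi * (yi - b) for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    residuals = [yi - a * xi - b for xi, yi in zip(x, y)]
    mean_res = sum(residuals) / n
    stderr = math.sqrt(sum((e - mean_res) ** 2 for e in residuals) / n)
    return a, b, r, stderr


xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]
print(vertical_fit(xs, ys))         # free fit
print(vertical_fit(xs, ys, a=1.0))  # slope fixed at 1
print(vertical_fit(xs, ys, b=0.0))  # free term fixed at 0
```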

pyteomics.auxiliary.memoize(maxsize=1000)

Make a memoization decorator. A negative value of maxsize means no size limit.
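As an illustration of what a size-limited memoization decorator factory looks like, here is a minimal sketch. It is not the library's code: the name memoize_sketch is hypothetical, arguments are assumed to be hashable, and evicting an arbitrary entry when the cache is full is an assumed policy.

```python
from functools import wraps


def memoize_sketch(maxsize=1000):
    """Return a decorator that caches results; negative maxsize = no limit."""
    def decorator(func):
        cache = {}

        @wraps(func)
        def wrapper(*args, **kwargs):
            # Arguments must be hashable to serve as a cache key.
            key = (args, frozenset(kwargs.items()))
            if key not in cache:
                if maxsize >= 0 and cache and len(cache) >= maxsize:
                    cache.popitem()  # assumed eviction policy: drop one entry
                cache[key] = func(*args, **kwargs)
            return cache[key]

        return wrapper
    return decorator


calls = []


@memoize_sketch(maxsize=2)
def square(n):
    calls.append(n)  # record every real (non-cached) evaluation
    return n * n


square(3)
square(3)  # served from the cache, so the function body is not run again
square(4)
print(calls)
```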

pyteomics.auxiliary.print_tree(d, indent_str=' -> ', indent_count=1)

Read a nested dict (with strings as keys) and print its structure.
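The kind of output such a function produces can be illustrated with a short recursive sketch. This is an assumption about the general approach, not the library's implementation; the helper name tree_lines and the sample dict are hypothetical, and the helper returns lines instead of printing them so that it is easy to inspect.

```python
def tree_lines(d, indent_str=" -> ", indent_count=1, level=0):
    """Return one line per key, indented according to nesting depth."""
    lines = []
    for key, value in d.items():
        lines.append(indent_str * indent_count * level + str(key))
        if isinstance(value, dict):
            # Recurse into nested dicts one level deeper.
            lines.extend(tree_lines(value, indent_str, indent_count, level + 1))
    return lines


psm = {
    "spectrum": "scan=1",
    "search_hit": {"peptide": "PEPTIDE", "score": {"expect": 0.01}},
}
for line in tree_lines(psm):
    print(line)
```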

pyteomics.auxiliary.qvalues(*args, **kwargs)

Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.

Requires numpy (and optionally pandas).

Parameters:

positional args : iterables

Iterables to read PSMs from. All positional arguments are chained. The rest of the arguments must be named.

key : callable / array-like / iterable / str, keyword only

If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

reverse : bool, keyword only, optional

If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.

is_decoy : callable / array-like / iterable / str, keyword only

If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

pep : callable / array-like / iterable / str, keyword only, optional

If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

Note

If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

remove_decoy : bool, keyword only, optional

Defines whether decoy matches should be removed from the output. Default is False.

Note

If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

formula : int, keyword only, optional

Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).

ratio : float, keyword only, optional

The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.

correction : int or float, keyword only, optional

Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

See this paper for further explanation.

full_output : bool, keyword only, optional

If True, then the returned array has PSM objects along with scores and q-values. Default is False.

**kwargs : passed to the chain() function.

Returns:

out : numpy.ndarray

A sorted array of records with the following fields:

  • ‘score’: np.float64
  • ‘is decoy’: np.bool_
  • ‘q’: np.float64
  • ‘psm’: np.object_ (if full_output is True)
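To show how q-values relate to running FDR estimates, here is a pure-Python sketch of the target-decoy calculation. It is not the library's implementation: it assumes the decoys-over-targets estimate with ratio 1 and no correction, ignores tied scores, and returns plain tuples rather than a NumPy record array. The name tda_qvalues is hypothetical.

```python
def tda_qvalues(scores, decoy_flags, reverse=False):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)
    n_decoy = 0
    n_target = 0
    fdrs = []
    for i in order:
        if decoy_flags[i]:
            n_decoy += 1
        else:
            n_target += 1
        # Running FDR estimate at this score threshold.
        fdrs.append(n_decoy / n_target if n_target else float("inf"))
    # q-value: the smallest FDR at this threshold or any more permissive one.
    q = fdrs[:]
    for k in range(len(q) - 2, -1, -1):
        q[k] = min(q[k], q[k + 1])
    return [(scores[i], bool(decoy_flags[i]), q[k]) for k, i in enumerate(order)]


scores = [1, 2, 3, 4, 5]
decoys = [False, False, True, False, False]
result = tda_qvalues(scores, decoys)
for row in result:
    print(row)
```

The backward minimum makes q-values monotonic in score, so keeping every PSM with q not exceeding a threshold selects a contiguous prefix of the ranking.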
