# auxiliary - common functions and objects

## Math

`linear_regression_vertical()` - a wrapper for NumPy linear regression, minimizes the sum of squares of y errors.

`linear_regression_perpendicular()` - a wrapper for NumPy linear regression, minimizes the sum of squares of (perpendicular) distances between the points and the line.

## Target-Decoy Approach

`qvalues()` - estimate q-values for a set of PSMs.

`filter()` - filter PSMs to specified FDR level using TDA or given PEPs.

`filter.chain()` - a chained version of `filter()`.

`fdr()` - estimate FDR in a set of PSMs using TDA or given PEPs.

## Project infrastructure

`PyteomicsError` - a pyteomics-specific exception.

## Helpers

`Charge` - a subclass of `int` for charge states.

`ChargeList` - a subclass of `list` for lists of charges.

`print_tree()` - display the structure of a complex nested `dict`.

exception `pyteomics.auxiliary.PyteomicsError(msg)`

Exception raised for errors in the Pyteomics library.

Attributes:

- `message`

`__init__(msg)`

`pyteomics.auxiliary.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None)`

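Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:

$$FDR = \frac{N_{decoy}}{N_{target} \cdot ratio}$$

The second formula is:

$$FDR = \frac{N_{decoy} \cdot \left(1 + \frac{1}{ratio}\right)}{N_{total}}$$

Here $N_{decoy}$ and $N_{target}$ are the numbers of decoy and target PSMs, $N_{total}$ is the number of all PSMs, and ratio is the ratio parameter described below.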

Note

This function is less versatile than `qvalues()`. To obtain FDR, you can call `qvalues()` and take the last q-value. This function can be used (with correction = 0 or 1) when `numpy` is not available.

Parameters:

- psms : iterable, optional. An iterable of PSMs, e.g. as returned by `read()`. Not needed if is_decoy is an iterable.
- formula : int, optional. Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
- is_decoy : callable, iterable, or str. If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is `is_decoy()`. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a `pandas.DataFrame`).
- pep : callable, iterable, or str, optional. If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a `pandas.DataFrame`). Note: If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.
- ratio : float, optional. The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction : int or float, optional. Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as the desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability. See this paper for further explanation. Note: Requires `numpy` if correction is a float or 2. Note: Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using `filter()` without correction).

Returns:

- out : float. The estimation of FDR, (roughly) between 0 and 1.

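The decoy-counting estimate can be illustrated without `numpy`. The following is a minimal sketch, not the library implementation: it assumes formula 1 is N_decoy / (N_target * ratio) and formula 2 is N_decoy * (1 + 1/ratio) / N_total, applies no correction, and uses a hypothetical function name `tda_fdr`.

```python
def tda_fdr(decoy_flags, formula=1, ratio=1.0):
    """Estimate FDR by decoy counting (hypothetical helper, no correction).

    decoy_flags: iterable of truthy/falsy values, truthy means "decoy PSM".
    """
    flags = [bool(f) for f in decoy_flags]
    n_total = len(flags)
    n_decoy = sum(flags)
    n_target = n_total - n_decoy
    if formula == 1:
        # Assumed formula 1: decoys relative to targets, scaled by database ratio.
        return n_decoy / (n_target * ratio)
    if formula == 2:
        # Assumed formula 2: decoys relative to all PSMs.
        return n_decoy * (1 + 1 / ratio) / n_total
    raise ValueError("formula must be 1 or 2")


# 95 target PSMs and 5 decoy PSMs.
flags = [False] * 95 + [True] * 5
fdr_1 = tda_fdr(flags, formula=1)
fdr_2 = tda_fdr(flags, formula=2)
print(fdr_1, fdr_2)
```

With equal database sizes (ratio = 1), the second estimate counts each decoy twice against the whole set, so it is larger than the first whenever decoys are present.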
`pyteomics.auxiliary.filter(*args, **kwargs)`

Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.

Requires `numpy` and, optionally, `pandas`.

Parameters:

- positional args : iterables. Iterables to read PSMs from. All positional arguments are chained. The rest of the arguments must be named.
- fdr : float, keyword only, 0 <= fdr <= 1. Desired FDR level.
- key : callable / array-like / iterable / str, keyword only. A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.
- reverse : bool, keyword only, optional. If `True`, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is `False`.
- is_decoy : callable / array-like / iterable / str, keyword only. A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.
- remove_decoy : bool, keyword only, optional. Defines whether decoy matches should be removed from the output. Default is `True`. Note: If set to `False`, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of `fdr()` for math; basically, if remove_decoy is `True`, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
- formula : int, keyword only, optional. Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is `True`, else 2 (see `fdr()` for definitions).
- ratio : float, keyword only, optional. The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction : int or float, keyword only, optional. Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as the desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability. See this paper for further explanation.
- pep : callable / array-like / iterable / str, keyword only, optional. If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a `DataFrame`). Note: If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- full_output : bool, keyword only, optional. If `True`, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is `True`. Note: The name for the parameter comes from the fact that it is internally passed to `qvalues()`.
- q_label : str, optional. Field name for q-value in the output. Default is `'q'`.
- score_label : str, optional. Field name for score in the output. Default is `'score'`.
- decoy_label : str, optional. Field name for the decoy flag in the output. Default is `'is decoy'`.
- pep_label : str, optional. Field name for PEP in the output. Default is `'PEP'`.
- `**kwargs` : passed to the `chain()` function.

Returns:

- out : iterator or `numpy.ndarray` or `pandas.DataFrame`

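The idea behind TDA filtering can be sketched in plain Python: sort the PSMs by score, count decoys and targets along the sorted list, and keep the longest prefix whose estimated FDR does not exceed the threshold. This is only an illustration under stated assumptions, not the library code: the function `filter_by_fdr` and the PSM dictionaries are hypothetical, the decoy-counting estimate N_decoy / (N_target * ratio) is assumed, ties in score are ignored, and no correction is applied.

```python
def filter_by_fdr(psms, fdr, key, is_decoy, reverse=False, ratio=1.0):
    """Return target PSMs from the largest score-sorted prefix with FDR <= fdr."""
    ordered = sorted(psms, key=key, reverse=reverse)
    n_decoy = n_target = 0
    cutoff = 0  # number of best-scoring PSMs that pass the threshold
    for i, psm in enumerate(ordered, start=1):
        if is_decoy(psm):
            n_decoy += 1
        else:
            n_target += 1
        # Estimated FDR of the first i PSMs (assumed decoy-counting formula).
        if n_target and n_decoy / (n_target * ratio) <= fdr:
            cutoff = i
    # Decoys are dropped from the output, mirroring remove_decoy=True.
    return [p for p in ordered[:cutoff] if not is_decoy(p)]


# Hypothetical PSMs: smaller e-value means a better match.
psms = [
    {"id": "a", "evalue": 1e-9, "decoy": False},
    {"id": "b", "evalue": 1e-7, "decoy": False},
    {"id": "c", "evalue": 1e-5, "decoy": False},
    {"id": "d", "evalue": 1e-3, "decoy": True},
    {"id": "e", "evalue": 1e-2, "decoy": False},
    {"id": "f", "evalue": 1e-1, "decoy": True},
]
kept = filter_by_fdr(psms, fdr=0.25,
                     key=lambda p: p["evalue"],
                     is_decoy=lambda p: p["decoy"])
kept_ids = [p["id"] for p in kept]
print(kept_ids)
```

Note that the cutoff is the last position where the estimate is within the threshold, not the first position where it is exceeded: PSM "e" is kept even though the prefix ending at decoy "d" is above 0.25.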
`pyteomics.auxiliary.linear_regression(x, y=None, a=None, b=None)`

`pyteomics.auxiliary.linear_regression_perpendicular(x, y=None)`

Calculate coefficients of a linear regression y = a * x + b. The fit minimizes perpendicular distances between the points and the line.

Requires `numpy`.

Parameters:

- x, y : array_like of float. 1-D arrays of floats. If y is omitted, x must be a 2-D array of shape (N, 2).

Returns:

- out : 4-tuple of float. The structure is (a, b, r, stderr), where a is the slope coefficient, b is the free term, r is the Pearson correlation coefficient, and stderr is the standard deviation.

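To make the difference from an ordinary fit concrete, here is a minimal pure-Python sketch of a perpendicular (orthogonal) least-squares line, using the standard closed-form slope for orthogonal regression. It is not the library implementation: the name `perpendicular_fit` is hypothetical, and only (a, b) are returned, without r and stderr.

```python
import math


def perpendicular_fit(x, y):
    """Fit y = a * x + b by minimizing squared perpendicular distances."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Centered sums of squares and cross-products.
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxy == 0:
        raise ValueError("slope is undefined when the covariance is zero")
    # Closed-form slope of the orthogonal regression line.
    a = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    # The fitted line passes through the centroid of the points.
    b = my - a * mx
    return a, b


x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 1.0, 3.0]
a_perp, b_perp = perpendicular_fit(x, y)
print(a_perp, b_perp)
```

For these points the perpendicular fit has slope 1, while a fit minimizing vertical distances to the same points has slope 0.8. The perpendicular fit treats x and y symmetrically, so swapping the two arrays inverts the slope.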
`pyteomics.auxiliary.linear_regression_vertical(x, y=None, a=None, b=None)`

Calculate coefficients of a linear regression y = a * x + b. The fit minimizes vertical distances between the points and the line.

Requires `numpy`.

Parameters:

- x, y : array_like of float. 1-D arrays of floats. If y is omitted, x must be a 2-D array of shape (N, 2).
- a : float, optional. If specified, then the slope coefficient is fixed and equals a.
- b : float, optional. If specified, then the free term is fixed and equals b.

Returns:

- out : 4-tuple of float. The structure is (a, b, r, stderr), where a is the slope coefficient, b is the free term, r is the Pearson correlation coefficient, and stderr is the standard deviation.

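The effect of fixing a or b can be shown with a small pure-Python sketch of a vertical least-squares fit. This is an illustration, not the library implementation: the name `vertical_fit` is hypothetical, the fixed-parameter cases use the textbook least-squares solutions, and stderr is omitted because its exact definition is not given here.

```python
import math


def vertical_fit(x, y, a=None, b=None):
    """Fit y = a * x + b by minimizing squared vertical distances.

    If a or b is given, that coefficient is held fixed and only the
    other one is fitted. Returns (a, b, r) with r the Pearson correlation.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if a is None and b is None:
        # Ordinary least squares.
        a = sxy / sxx
        b = my - a * mx
    elif b is None:
        # Slope fixed: the best intercept puts the line through the centroid.
        b = my - a * mx
    elif a is None:
        # Intercept fixed: least squares for the slope alone.
        a = sum(xi * (yi - b) for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 1.0, 3.0]
a_free, b_free, r = vertical_fit(x, y)
a_origin, _, _ = vertical_fit(x, y, b=0.0)  # force the line through the origin
print(a_free, b_free, r, a_origin)
```

Fixing b = 0 is a common choice when the line is known to pass through the origin; the slope then changes because the intercept can no longer absorb the offset.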
`pyteomics.auxiliary.memoize(maxsize=1000)`

Make a memoization decorator. A negative value of maxsize means no size limit.
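A decorator factory of this kind can be sketched as follows. This is a hypothetical stand-in, not the library code: the name `memoize_sketch`, the oldest-first eviction policy, the restriction to positional hashable arguments, and the exposed `cache` attribute are all assumptions made for the illustration.

```python
from functools import wraps


def memoize_sketch(maxsize=1000):
    """Make a memoization decorator; a negative maxsize means no size limit."""
    def decorator(func):
        cache = {}

        @wraps(func)
        def wrapper(*args):
            if args in cache:
                return cache[args]
            result = func(*args)
            if maxsize != 0:
                if 0 < maxsize <= len(cache):
                    # Cache is full: evict the oldest inserted entry (assumed policy).
                    del cache[next(iter(cache))]
                cache[args] = result
            return result

        wrapper.cache = cache  # exposed only for inspection in this sketch
        return wrapper
    return decorator


calls = []


@memoize_sketch(maxsize=2)
def square(n):
    calls.append(n)  # record every real computation
    return n * n


for n in (3, 3, 4, 5, 3):
    square(n)
print(calls)
```

The second call with 3 is served from the cache, but the last one is recomputed because the entry for 3 was evicted when 5 was stored.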

`pyteomics.auxiliary.print_tree(d, indent_str=' -> ', indent_count=1)`

Read a nested dict (with strings as keys) and print its structure.
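The traversal can be illustrated with a short recursive sketch. It is an assumption-laden stand-in rather than the library function: the name `tree_lines` is hypothetical, it returns the lines instead of printing them, and the exact output format of `print_tree()` may differ.

```python
def tree_lines(d, indent_str=" -> ", indent_count=1, _depth=0):
    """Return one line per key of a nested dict, indented by nesting depth."""
    lines = []
    for key, value in d.items():
        lines.append(indent_str * (indent_count * _depth) + str(key))
        if isinstance(value, dict):
            # Recurse into nested dicts; other values are leaves.
            lines.extend(tree_lines(value, indent_str, indent_count, _depth + 1))
    return lines


# Hypothetical nested record.
record = {"spectrum": {"mz": [1.0], "charge": 2}, "score": 0.5}
lines = tree_lines(record)
print("\n".join(lines))
```

Only keys are shown, which makes this kind of output useful for getting an overview of a deeply nested record without dumping its values.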

`pyteomics.auxiliary.qvalues(*args, **kwargs)`

Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.

Requires `numpy` (and optionally `pandas`).

Parameters:

- positional args : iterables. Iterables to read PSMs from. All positional arguments are chained. The rest of the arguments must be named.
- key : callable / array-like / iterable / str, keyword only. If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a `DataFrame`).
- reverse : bool, keyword only, optional. If `True`, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is `False`.
- is_decoy : callable / array-like / iterable / str, keyword only. If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a `DataFrame`).
- pep : callable / array-like / iterable / str, keyword only, optional. If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a `DataFrame`). Note: If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- remove_decoy : bool, keyword only, optional. Defines whether decoy matches should be removed from the output. Default is `False`. Note: If set to `False`, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of `fdr()` for math; basically, if remove_decoy is `True`, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
- formula : int, keyword only, optional. Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is `True`, else 2 (see `fdr()` for definitions).
- ratio : float, keyword only, optional. The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction : int or float, keyword only, optional. Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as the desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability. See this paper for further explanation.
- q_label : str, optional. Field name for q-value in the output. Default is `'q'`.
- score_label : str, optional. Field name for score in the output. Default is `'score'`.
- decoy_label : str, optional. Field name for the decoy flag in the output. Default is `'is decoy'`.
- pep_label : str, optional. Field name for PEP in the output. Default is `'PEP'`.
- full_output : bool, keyword only, optional. If `True`, then the returned array has PSM objects along with scores and q-values. Default is `False`.
- `**kwargs` : passed to the `chain()` function.

Returns:

- out : numpy.ndarray. A sorted array of records with the following fields:
  - `'score'`: `np.float64`
  - `'is decoy'`: `np.bool_`
  - `'q'`: `np.float64`
  - `'psm'`: `np.object_` (if full_output is `True`)
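How q-values follow from decoy counting can be sketched in plain Python: compute the estimated FDR at every position of the score-sorted list, then take the running minimum from the worst score upward so that q-values never decrease with worsening score. This is an illustration under stated assumptions, not the library implementation: the name `qvalues_sketch` is hypothetical, the estimate N_decoy / (N_target * ratio) is assumed, no correction is applied, and plain tuples are returned instead of a NumPy record array.

```python
def qvalues_sketch(scores, decoy_flags, reverse=False, ratio=1.0):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)
    fdrs = []
    n_decoy = n_target = 0
    for i in order:
        if decoy_flags[i]:
            n_decoy += 1
        else:
            n_target += 1
        # max(..., 1) guards against division by zero before the first target.
        fdrs.append(n_decoy / (max(n_target, 1) * ratio))
    # q-value: the lowest FDR among all thresholds that still include this PSM.
    q = fdrs[:]
    for k in range(len(q) - 2, -1, -1):
        q[k] = min(q[k], q[k + 1])
    return [(scores[i], bool(decoy_flags[i]), q_k) for i, q_k in zip(order, q)]


# Hypothetical e-values (smaller is better) and decoy flags.
scores = [1e-9, 1e-7, 1e-5, 1e-3, 1e-2, 1e-1]
decoys = [False, False, False, True, False, True]
rows = qvalues_sketch(scores, decoys)
q_only = [row[2] for row in rows]
print(q_only)
```

The last q-value equals the estimated FDR of the whole set under the assumed formula, which is why the final element can serve as an overall FDR estimate.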