auxiliary - common functions and objects
Math
linear_regression_vertical() - a wrapper for NumPy linear regression, minimizes the sum of squares of y errors.
linear_regression() - alias for linear_regression_vertical().
linear_regression_perpendicular() - a wrapper for NumPy linear regression, minimizes the sum of squares of (perpendicular) distances between the points and the line.
Target-Decoy Approach
Project infrastructure
PyteomicsError - a pyteomics-specific exception.
Helpers
Charge - a subclass of int for charge states.
ChargeList - a subclass of list for lists of charges.
print_tree() - display the structure of a complex nested dict.
memoize() - makes a memoization function decorator.

exception pyteomics.auxiliary.PyteomicsError(msg)
Bases: exceptions.Exception
Exception raised for errors in the Pyteomics library.
Attributes
message

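pyteomics.auxiliary.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None)
Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:
$$FDR = \frac{N_{decoy}}{N_{target} \cdot ratio}$$
The second formula is:
$$FDR = \frac{N_{decoy} \cdot \left(1 + \frac{1}{ratio}\right)}{N_{total}}$$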
Note
This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.
Parameters:
psms : iterable, optional
An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
formula : int, optional
Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
is_decoy : callable, iterable, or str
If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
pep : callable, iterable, or str, optional
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.
ratio : float, optional
The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
correction : int or float, optional
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as the desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the "real" q-values with 95% probability.
See this paper for further explanation.
Note
Requires numpy, if correction is a float or 2.
Note
Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).
Returns:
out : float
The estimation of FDR, (roughly) between 0 and 1.
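As an illustration of the decoy-counting estimate described above, here is a minimal pure-Python sketch with no correction applied. The function name fdr_sketch and its list-of-flags input are hypothetical and are not part of Pyteomics. The two branches assume formula 1 is N_decoy / (N_target * ratio) and formula 2 is N_decoy * (1 + 1 / ratio) / N_total.

```python
def fdr_sketch(is_decoy_flags, formula=1, ratio=1.0):
    """Estimate FDR from per-PSM decoy flags (hypothetical helper, no correction)."""
    flags = [bool(f) for f in is_decoy_flags]
    total = len(flags)
    decoy = sum(flags)
    target = total - decoy
    if formula == 1:
        if target == 0:
            raise ValueError("formula 1 needs at least one target PSM")
        # Assumed formula 1: decoys over targets, scaled by the database size ratio.
        return decoy / (target * ratio)
    if formula == 2:
        if total == 0:
            raise ValueError("formula 2 needs at least one PSM")
        # Assumed formula 2: decoys plus estimated false targets, over all PSMs.
        return decoy * (1 + 1 / ratio) / total
    raise ValueError("formula must be 1 or 2")


# 95 target PSMs and 5 decoy PSMs, equal-sized target and decoy databases.
flags = [False] * 95 + [True] * 5
print(fdr_sketch(flags, formula=1))
print(fdr_sketch(flags, formula=2))
```

With equal database sizes, formula 2 counts each decoy twice (once as itself, once as an estimated false target) and divides by the full set, which is why it suits output that still contains the decoys.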

pyteomics.auxiliary.filter(*args, **kwargs)
Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.
Requires numpy and, optionally, pandas.
Parameters:
positional args : iterables
Iterables to read PSMs from. All positional arguments are chained. The rest of the arguments must be named.
fdr : float, keyword only, 0 <= fdr <= 1
Desired FDR level.
key : callable / array-like / iterable / str, keyword only
A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract the e-value from the PSM.
reverse : bool, keyword only, optional
If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
is_decoy : callable / array-like / iterable / str, keyword only
A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.
remove_decoy : bool, keyword only, optional
Defines whether decoy matches should be removed from the output. Default is True.
Note
If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it's formula 2. This can be changed by overriding the formula argument.
formula : int, keyword only, optional
Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
ratio : float, keyword only, optional
The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
correction : int or float, keyword only, optional
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as the desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the "real" q-values with 95% probability.
See this paper for further explanation.
pep : callable / array-like / iterable / str, keyword only, optional
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
full_output : bool, keyword only, optional
If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.
Note
The name for the parameter comes from the fact that it is internally passed to qvalues().
q_label : str, optional
Field name for q-value in the output. Default is 'q'.
score_label : str, optional
Field name for score in the output. Default is 'score'.
decoy_label : str, optional
Field name for the decoy flag in the output. Default is 'is decoy'.
pep_label : str, optional
Field name for PEP in the output. Default is 'PEP'.
**kwargs : passed to the chain() function.
Returns:
out : iterator or numpy.ndarray or pandas.DataFrame
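The filtering idea described above can be sketched in pure Python: sort PSMs by score, track the running decoy-based FDR estimate, and keep the largest top-scoring set that stays within the requested level. This is only an illustration under simplifying assumptions (formula 1 taken as N_decoy / (N_target * ratio), no correction, score ties ignored, decoys always removed). The name filter_sketch and the dict-based PSMs are hypothetical and do not reflect the library's implementation.

```python
def filter_sketch(psms, fdr, key, is_decoy, ratio=1.0):
    """Return target PSMs from the largest top-scoring set with estimated FDR <= fdr."""
    ordered = sorted(psms, key=key)  # smaller key value means a better PSM
    decoys = targets = 0
    cutoff = 0  # number of leading PSMs that form the accepted set
    for i, psm in enumerate(ordered, start=1):
        if is_decoy(psm):
            decoys += 1
        else:
            targets += 1
        # Assumed formula 1 estimate for the first i PSMs.
        if targets and decoys / (targets * ratio) <= fdr:
            cutoff = i
    # Decoys are dropped from the output, as with remove_decoy=True.
    return [p for p in ordered[:cutoff] if not is_decoy(p)]


# Hypothetical PSM records; real parsers return richer structures.
psms = [
    {"evalue": 1e-9, "decoy": False},
    {"evalue": 1e-7, "decoy": False},
    {"evalue": 1e-5, "decoy": True},
    {"evalue": 1e-4, "decoy": False},
    {"evalue": 1e-2, "decoy": True},
]
kept = filter_sketch(
    psms,
    fdr=0.4,
    key=lambda p: p["evalue"],
    is_decoy=lambda p: p["decoy"],
)
print([p["evalue"] for p in kept])
```

Note that the accepted set extends past the first decoy: the estimate exceeds the threshold at the third PSM but drops back below it once another target is added.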

pyteomics.auxiliary.linear_regression(x, y=None, a=None, b=None)
Alias of linear_regression_vertical().

pyteomics.auxiliary.linear_regression_perpendicular(x, y=None)
Calculate coefficients of a linear regression y = a * x + b. The fit minimizes perpendicular distances between the points and the line.
Requires numpy.
Parameters:
x, y : array_like of float
1D arrays of floats. If y is omitted, x must be a 2D array of shape (N, 2).
Returns:
out : 4-tuple of float
The structure is (a, b, r, stderr), where a is the slope coefficient, b is the free term, r is the Pearson correlation coefficient, and stderr is the standard deviation.
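To make the perpendicular criterion concrete, the sketch below uses the textbook closed-form solution for an orthogonal (total least squares) line fit in pure Python. It is not the library's implementation, which is documented as a NumPy wrapper. The name perpendicular_fit_sketch is hypothetical, and only the slope and free term are computed.

```python
import math


def perpendicular_fit_sketch(x, y):
    """Fit y = a * x + b by minimizing perpendicular distances to the line."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Centered sums of squares and cross-products.
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxy == 0:
        # This sketch does not handle the degenerate uncorrelated case.
        raise ValueError("slope is undefined when x and y are uncorrelated")
    # Closed-form slope of the orthogonal regression line.
    a = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    # The fitted line always passes through the centroid of the data.
    b = my - a * mx
    return a, b


a, b = perpendicular_fit_sketch([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Unlike the vertical fit, this result does not change if the roles of x and y are swapped (up to inverting the line), which is the usual reason to prefer it when both coordinates carry measurement error.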

pyteomics.auxiliary.linear_regression_vertical(x, y=None, a=None, b=None)
Calculate coefficients of a linear regression y = a * x + b. The fit minimizes vertical distances between the points and the line.
Requires numpy.
Parameters:
x, y : array_like of float
1D arrays of floats. If y is omitted, x must be a 2D array of shape (N, 2).
a : float, optional
If specified, the slope coefficient is fixed and equals a.
b : float, optional
If specified, the free term is fixed and equals b.
Returns:
out : 4-tuple of float
The structure is (a, b, r, stderr), where a is the slope coefficient, b is the free term, r is the Pearson correlation coefficient, and stderr is the standard deviation.
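The following pure-Python sketch shows one way the vertical (ordinary least squares) fit with an optionally fixed slope or free term could be computed. It is an illustration only: the name vertical_fit_sketch is hypothetical, and interpreting stderr as the population standard deviation of the residuals is an assumption, not something the text above states.

```python
import math


def vertical_fit_sketch(x, y, a=None, b=None):
    """Fit y = a * x + b by least squares on y; a and/or b may be fixed."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if a is None and b is None:
        a = sxy / sxx          # ordinary least squares slope
        b = my - a * mx
    elif a is None:
        # Free term fixed: least squares slope for the model y - b = a * x.
        a = sum(xi * (yi - b) for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    elif b is None:
        # Slope fixed: the best free term is the mean of y - a * x.
        b = my - a * mx
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation (needs non-constant x and y)
    residuals = [yi - a * xi - b for xi, yi in zip(x, y)]
    mean_res = sum(residuals) / n
    # Assumption: stderr is the population standard deviation of the residuals.
    stderr = math.sqrt(sum((e - mean_res) ** 2 for e in residuals) / n)
    return a, b, r, stderr


x = [0, 1, 2, 3]
y = [1, 3, 5, 7]
print(vertical_fit_sketch(x, y))
print(vertical_fit_sketch(x, y, b=1.0))
```

Fixing b is useful, for example, when the line is known to pass through a calibration point; the slope is then the only free parameter.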

pyteomics.auxiliary.memoize(maxsize=1000)
Make a memoization decorator. A negative value of maxsize means no size limit.
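For readers unfamiliar with the pattern, a memoization decorator factory with a size limit can be sketched as follows. This is a generic illustration, not the Pyteomics source: the name memoize_sketch is hypothetical, and the eviction policy (drop the oldest entry) and the handling of maxsize=0 (no caching) are assumptions.

```python
from functools import wraps


def memoize_sketch(maxsize=1000):
    """Return a decorator that caches results; negative maxsize means no limit."""
    def decorator(func):
        cache = {}

        @wraps(func)
        def wrapper(*args, **kwargs):
            # Arguments must be hashable to serve as a cache key.
            key = (args, frozenset(kwargs.items()))
            if key in cache:
                return cache[key]
            result = func(*args, **kwargs)
            if maxsize != 0:
                if 0 < maxsize <= len(cache):
                    # Assumed policy: evict the oldest inserted entry.
                    cache.pop(next(iter(cache)))
                cache[key] = result
            return result

        return wrapper
    return decorator


calls = []  # records every real (uncached) evaluation


@memoize_sketch(maxsize=2)
def square(n):
    calls.append(n)
    return n * n


square(3)
square(3)  # served from the cache, so no new entry in calls
square(4)
print(calls)
```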

pyteomics.auxiliary.print_tree(d, indent_str=' -> ', indent_count=1)
Read a nested dict (with strings as keys) and print its structure.
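A rough sketch of how the structure of a nested dict might be displayed is shown below. The helper tree_lines is hypothetical and is not the library's implementation. It assumes that each nesting level repeats indent_str indent_count times and that only dict values are descended into. The default indent string used here is also an assumption.

```python
def tree_lines(d, indent_str=" -> ", indent_count=1, level=0):
    """Return one line per key, indented according to nesting depth."""
    lines = []
    for key, value in d.items():
        lines.append(indent_str * indent_count * level + key)
        if isinstance(value, dict):
            # Recurse into nested dicts; other values are leaves.
            lines.extend(tree_lines(value, indent_str, indent_count, level + 1))
    return lines


# Hypothetical nested record, loosely resembling a parsed spectrum entry.
nested = {
    "spectrum": {"mz": [100.1, 200.2], "params": {"charge": 2}},
    "id": "scan=1",
}
for line in tree_lines(nested):
    print(line)
```

Returning the lines instead of printing them directly keeps the sketch easy to test; a printing wrapper is a one-line loop, as shown.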

pyteomics.auxiliary.qvalues(*args, **kwargs)
Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.
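The TDA-based calculation can be illustrated with a short pure-Python sketch: sort PSMs by score, compute the running FDR estimate, then take the running minimum from the worst score upward so that q-values never decrease with worsening score. The function qvalues_sketch is hypothetical. It assumes formula 2 in the form N_decoy * (1 + 1 / ratio) / N_total, applies no correction, and ignores score ties.

```python
def qvalues_sketch(scores, decoy_flags, ratio=1.0):
    """Return (score, is_decoy, q) tuples sorted by score (smaller is better)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fdrs = []
    decoys = 0
    for rank, i in enumerate(order, start=1):
        if decoy_flags[i]:
            decoys += 1
        # Assumed formula 2 estimate for the top `rank` PSMs (decoys kept in the set).
        fdrs.append(decoys * (1 + 1 / ratio) / rank)
    # q-value: the smallest FDR at which this PSM is still accepted.
    q = fdrs[:]
    for k in range(len(q) - 2, -1, -1):
        q[k] = min(q[k], q[k + 1])
    return [(scores[i], bool(decoy_flags[i]), qk) for i, qk in zip(order, q)]


scores = [0.01, 0.2, 0.05, 0.5]
decoy_flags = [False, True, False, False]
for row in qvalues_sketch(scores, decoy_flags):
    print(row)
```

In this example the decoy at score 0.2 has a raw FDR estimate of 2/3, but its q-value is 0.5 because accepting one more PSM lowers the estimate.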
Requires numpy (and optionally pandas).
Parameters:
positional args : iterables
Iterables to read PSMs from. All positional arguments are chained. The rest of the arguments must be named.
key : callable / array-like / iterable / str, keyword only
If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
reverse : bool, keyword only, optional
If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
is_decoy : callable / array-like / iterable / str, keyword only
If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
pep : callable / array-like / iterable / str, keyword only, optional
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
remove_decoy : bool, keyword only, optional
Defines whether decoy matches should be removed from the output. Default is False.
Note
If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it's formula 2. This can be changed by overriding the formula argument.
formula : int, keyword only, optional
Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
ratio : float, keyword only, optional
The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
correction : int or float, keyword only, optional
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1. Default is 0 (no correction); 1 accounts for the probability that a false positive scores better than the first excluded decoy PSM; 2 also corrects that probability for finite size of the sample. If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as the desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the "real" q-values with 95% probability.
See this paper for further explanation.
q_label : str, optional
Field name for q-value in the output. Default is 'q'.
score_label : str, optional
Field name for score in the output. Default is 'score'.
decoy_label : str, optional
Field name for the decoy flag in the output. Default is 'is decoy'.
pep_label : str, optional
Field name for PEP in the output. Default is 'PEP'.
full_output : bool, keyword only, optional
If True, then the returned array has PSM objects along with scores and q-values. Default is False.
**kwargs : passed to the chain() function.
Returns:
out : numpy.ndarray
A sorted array of records with the following fields:
- 'score': np.float64
- 'is decoy': np.bool_
- 'q': np.float64
- 'psm': np.object_ (if full_output is True)
 ‘score’: