$\DeclareMathOperator{\erf}{erf} \DeclareMathOperator{\argmin}{argmin} \newcommand{\R}{\mathbb{R}} \newcommand{\n}{\boldsymbol{n}}$

# Module pyqt_fit.kde

Author: Pierre Barbier de Reuille

Module implementing kernel-based probability density estimation.

Given a kernel $$K$$, the density function is estimated from a sample $$X = \{X_i \in \mathbb{R}^n\}_{i\in\{1,\ldots,m\}}$$ as:

$f(\mathbf{z}) \triangleq \frac{1}{hW} \sum_{i=1}^m \frac{w_i}{\lambda_i} K\left(\frac{X_i-\mathbf{z}}{h\lambda_i}\right)$

$W = \sum_{i=1}^m w_i$

where $$h$$ is the bandwidth of the kernel, $$w_i$$ are the weights of the data points and $$\lambda_i$$ are the adaptation factors of the kernel width.
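As a rough illustration of the estimator above, here is a minimal 1D sketch in plain Python. The Gaussian kernel and the names `gaussian_kernel` and `kde_estimate` are assumptions made for this example; this is not the module's own implementation.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: integrates to 1, zero mean, unit variance."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_estimate(z, xs, h, weights=None, lambdas=None, kernel=gaussian_kernel):
    """Evaluate f(z) = 1/(h W) * sum_i (w_i / lambda_i) K((X_i - z) / (h lambda_i))."""
    m = len(xs)
    # Default to unit weights and no per-point bandwidth adaptation.
    weights = [1.0] * m if weights is None else list(weights)
    lambdas = [1.0] * m if lambdas is None else list(lambdas)
    total_weight = sum(weights)  # W in the formula
    acc = 0.0
    for x_i, w_i, lam_i in zip(xs, weights, lambdas):
        acc += (w_i / lam_i) * kernel((x_i - z) / (h * lam_i))
    return acc / (h * total_weight)


xs = [1.0, 2.0, 3.0]
density_at_2 = kde_estimate(2.0, xs, h=0.5)
```

Because the sum is divided by $$W$$, scaling all weights by a common constant leaves the estimate unchanged.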

The kernel is a function of $$\mathbb{R}^n$$ such that:

$\begin{split}\begin{array}{rclcl} \idotsint_{\mathbb{R}^n} K(\mathbf{z}) d\mathbf{z} & = & 1 & \Longleftrightarrow & \text{K is a probability density}\\ \idotsint_{\mathbb{R}^n} \mathbf{z}K(\mathbf{z}) d\mathbf{z} &=& \mathbf{0} & \Longleftrightarrow & \text{K is centered}\\ \forall \mathbf{u}\in\mathbb{R}^n, \|\mathbf{u}\| = 1\qquad\int_{\mathbb{R}} t^2K(t \mathbf{u}) dt &\approx& 1 & \Longleftrightarrow & \text{the covariance matrix of K is close to the identity} \end{array}\end{split}$

The constraint on the covariance is only required to provide a uniform meaning for the bandwidth of the kernel.

If the domain of the density estimation is bounded to the interval $$[L,U]$$, the density is then estimated with:

$f(x) \triangleq \frac{1}{hW} \sum_{i=1}^m \frac{w_i}{\lambda_i} \hat{K}(x;X,\lambda_i h,L,U)$

where $$\hat{K}$$ is a modified kernel that depends on the exact method used. Currently, only 1D KDE supports bounded domains.
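To make the idea of a modified kernel concrete, the sketch below uses boundary reflection, one common way to handle a bounded domain. It is an illustrative assumption, not necessarily the method this module applies; the function `reflected_kde` is hypothetical, and the sketch is unweighted and non-adaptive (all $$w_i = \lambda_i = 1$$).

```python
import math


def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def reflected_kde(x, xs, h, lower=-math.inf, upper=math.inf):
    """Estimate the density on [lower, upper] by mirroring each point at finite bounds."""
    if x < lower or x > upper:
        return 0.0  # no density outside the domain
    acc = 0.0
    for x_i in xs:
        acc += gaussian_kernel((x - x_i) / h)
        if math.isfinite(lower):
            # Mirror image of x_i across the lower bound.
            acc += gaussian_kernel((x - (2.0 * lower - x_i)) / h)
        if math.isfinite(upper):
            # Mirror image of x_i across the upper bound.
            acc += gaussian_kernel((x - (2.0 * upper - x_i)) / h)
    return acc / (h * len(xs))


data = [0.5, 1.0, 2.0]
density_at_bound = reflected_kde(0.0, data, h=0.5, lower=0.0)
```

With a single finite bound and a symmetric kernel, the reflected mass exactly replaces the mass that would leak past the bound. With two finite bounds, a single reflection per side is only an approximation.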

## Kernel Density Estimation Methods

class pyqt_fit.kde.KDE1D(xdata, **kwords)

Perform a kernel-based density estimation in 1D, possibly on a bounded domain $$[L,U]$$.

Parameters:

xdata (ndarray) – 1D array with the data points

kwords (dict) – attributes to set at construction time. Passing a named argument is equivalent to setting the corresponding property after construction. For example:

    >>> xs = [1,2,3]
    >>> k = KDE1D(xs, lower=0)

is equivalent to:

    >>> k = KDE1D(xs)
    >>> k.lower = 0

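The calculation is separated in three parts: the kernel (kernel), the bandwidth or covariance estimation (bandwidth, covariance), and the estimation method (method).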

__call__(points, out=None)

This method is an alias for KDE1D.evaluate().

bandwidth

Bandwidth of the kernel. Can be set either as a fixed value or using a bandwidth calculator, i.e. a function with signature w(xdata) that returns a single value.

Note

An ndarray with a single value will be converted to a floating point value.
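A bandwidth calculator in the sense described above is just a callable that takes the data and returns one number. The sketch below is a hypothetical example: the name `scott_like_bandwidth` and the choice of the sample standard deviation are assumptions made for illustration, not part of the module.

```python
import statistics


def scott_like_bandwidth(xdata):
    """Hypothetical bandwidth calculator: takes the data, returns a single float."""
    n = len(xdata)
    # Sample standard deviation scaled by n^(-1/5), a 1D rule of thumb.
    return statistics.stdev(xdata) * n ** (-1.0 / 5.0)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
h = scott_like_bandwidth(xs)

# Per the description above, such a function would then be assigned to the
# property (not run here, since it requires pyqt_fit):
#     k.bandwidth = scott_like_bandwidth
```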

cdf_grid(N=None, cut=None)

Compute the cumulative distribution function (CDF) on a grid, accumulated from the lower bound.

closed

Returns True if the density domain is closed (i.e. lower and upper are both finite).

copy()

Shallow copy of the KDE object.

covariance

Covariance of the Gaussian kernel. Can be set either as a fixed value or using a bandwidth calculator, i.e. a function with signature w(xdata) that returns a single value.

Note

An ndarray with a single value will be converted to a floating point value.

evaluate(points, out=None)

Compute the PDF of the distribution at the points given in the points argument.

fit()

Compute the various parameters needed by the KDE method.

grid(N=None, cut=None)

Evaluate the density on a grid of N points spanning the whole dataset.

Returns: a tuple with the mesh on which the density is evaluated and the density itself

icdf_grid(N=None, cut=None)

Compute the inverse cumulative distribution (quantile) function on a grid.
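One generic way to obtain a quantile function on a grid is to tabulate the CDF numerically and then invert the table by interpolation. The sketch below illustrates that idea in plain Python on a standard normal density. The helper names `cdf_on_grid` and `icdf_from_grid` are hypothetical, and this is not claimed to be how the module computes its grids.

```python
import bisect
import math


def normal_pdf(x):
    """Standard normal density, used here as a stand-in for a fitted KDE."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf_on_grid(pdf, lower, upper, n):
    """Return (grid, cdf), with the CDF accumulated by the trapezoidal rule."""
    step = (upper - lower) / (n - 1)
    grid = [lower + k * step for k in range(n)]
    dens = [pdf(x) for x in grid]
    cdf = [0.0]
    for k in range(1, n):
        cdf.append(cdf[-1] + 0.5 * (dens[k - 1] + dens[k]) * step)
    return grid, cdf


def icdf_from_grid(grid, cdf, q):
    """Invert the tabulated CDF at probability q by linear interpolation."""
    k = bisect.bisect_left(cdf, q)  # first index with cdf[k] >= q
    if k == 0:
        return grid[0]
    if k >= len(cdf):
        return grid[-1]
    c0, c1 = cdf[k - 1], cdf[k]
    t = (q - c0) / (c1 - c0)
    return grid[k - 1] + t * (grid[k] - grid[k - 1])


grid, cdf = cdf_on_grid(normal_pdf, -8.0, 8.0, 2001)
median = icdf_from_grid(grid, cdf, 0.5)
```

The accuracy of the inverse depends on the grid resolution, which is presumably what a parameter such as N controls.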

kernel

Kernel object. This must be an object modeled on pyqt_fit.kernels.Kernel1D. It is recommended to inherit from this class, so that numerical approximations are available for all methods.

By default, the kernel is an instance of pyqt_fit.kernels.normal_kernel1d.

lambdas

Scaling of the bandwidth, per data point. It can be either a single value or an array with one value per data point.

When deleted, the lambdas are reset to 1.

lower

Lower bound of the density domain. If deleted, it is reset to $$-\infty$$.

method

Select the method to use. The method should be an object modeled on pyqt_fit.kde_methods.KDE1DMethod, and it is recommended to inherit from that class.

The available methods are in the pyqt_fit.kde_methods sub-module.

upper

Upper bound of the density domain. If deleted, it is reset to $$\infty$$.

weights

Weights associated to each data point. It can be either a single value, or an array with a value per data point. If a single value is provided, the weights will always be set to 1.

## Bandwidth Estimation Methods

pyqt_fit.kde.variance_bandwidth(factor, xdata)

Returns the covariance matrix:

$\mathcal{C} = \tau^2 cov(X)$

where $$\tau$$ is a correcting factor that depends on the method.
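In 1D the covariance matrix reduces to a single variance, so the formula can be sketched in a few lines of plain Python. The two factor functions follow the Silverman and Scott formulas given in this section, reading $$n$$ as the number of data points and $$d$$ as the dimension. That reading, the use of the unbiased sample variance, and all function names here are assumptions made for illustration.

```python
import statistics


def silverman_factor(n, d):
    """tau = (n * (d + 2) / 4) ** (-1 / (d + 4))"""
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))


def scott_factor(n, d):
    """tau = n ** (-1 / (d + 4))"""
    return n ** (-1.0 / (d + 4))


def variance_bandwidth_1d(factor, xdata):
    """1D version of C = tau^2 * cov(X), using the unbiased sample variance."""
    return factor ** 2 * statistics.variance(xdata)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
c_silverman = variance_bandwidth_1d(silverman_factor(len(xs), 1), xs)
c_scott = variance_bandwidth_1d(scott_factor(len(xs), 1), xs)
```

Note that the result is a variance. The corresponding bandwidth on the scale of the data is its square root.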

pyqt_fit.kde.silverman_covariance(xdata, model=None)

The Silverman bandwidth is defined as a variance bandwidth with factor:

$\tau = \left( n \frac{d+2}{4} \right)^\frac{-1}{d+4}$

pyqt_fit.kde.scotts_covariance(xdata, model=None)

Scott's bandwidth is defined as a variance bandwidth with factor:

$\tau = n^\frac{-1}{d+4}$

pyqt_fit.kde.botev_bandwidth(N=None, **kword)

Implementation of the KDE bandwidth selection method outlined in:

Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916-2957, 2010.

Based on the implementation of Daniel B. Smith, PhD.

The object is a callable returning the bandwidth for a 1D kernel.