..
.. Copyright John Reid 2012
..
.. This is a reStructuredText document. If you are reading this in text format, it can be
.. converted into a more readable format by using Docutils_ tools such as rst2html.
..
.. _Docutils: http://docutils.sourceforge.net/docs/user/tools.html


What are Gaussian processes?
============================

Often we have an inference problem involving :math:`n` data,

.. math::

    \mathcal{D} = \{(\boldsymbol{x_i}, y_i) \,|\, i = 1, \ldots, n,\ \boldsymbol{x_i} \in \mathcal{X},\ y_i \in \mathbb{R}\}

where the :math:`\boldsymbol{x_i}` are the inputs and the :math:`y_i` are the targets.
We wish to make predictions, :math:`y_*`\ , for new inputs :math:`\boldsymbol{x_*}`\ .

Taking a Bayesian perspective, we might build a model that defines a distribution over all
possible functions, :math:`f: \mathcal{X} \rightarrow \mathbb{R}`\ . We can encode our initial
beliefs about our particular problem as a prior over these functions. Given the data,
:math:`\mathcal{D}`\ , and applying Bayes’ rule, we can infer a posterior distribution. In
particular, for any given :math:`\boldsymbol{x_*}` we can calculate or approximate a
predictive distribution over :math:`y_*` under this posterior.

*Gaussian processes* (GPs) are probability distributions over functions for which this
inference task is tractable. They can be seen as a generalisation of the Gaussian probability
distribution to the space of functions: where a multivariate Gaussian *distribution* defines a
distribution over a finite set of random variables, a Gaussian *process* defines a distribution
over an infinite set of random variables (indexed, for example, by the real numbers). GP
domains are not restricted to the real numbers; any space with a dot product is suitable.
Analogously to a multivariate Gaussian distribution, a GP is defined by its mean, :math:`\mu`\ ,
and covariance, :math:`k`\ . However, for a GP these are themselves functions,
:math:`\mu: \mathcal{X} \rightarrow \mathbb{R}` and
:math:`k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}`\ . In all that follows we
assume :math:`\mu(\boldsymbol{x}) = 0` without loss of generality, as we can always shift the
data to accommodate any given mean.

.. samples

.. image:: _static/Images/samples_from_prior.*

.. image:: _static/Images/samples_from_posterior.*

Samples from two Gaussian processes with the same mean and covariance functions. Here
:math:`\mathcal{X} = \mathbb{R}`\ . The prior samples are taken from a Gaussian process
without any data and the posterior samples are taken from a Gaussian process where the data
are shown as black squares. The black dotted line represents the mean of the process and the
gray shaded area covers twice the standard deviation at each input, :math:`x`\ . The coloured
lines are samples from the process, or more accurately, samples at a finite number of inputs,
:math:`x`\ , joined by lines.

The code to generate the above figures:

.. literalinclude:: _static/Code/figure_samples.py
    :language: python
    :lines: 5-


The covariance function, :math:`k`
----------------------------------

Assuming the mean function, :math:`\mu`\ , is 0 everywhere, our GP is defined by two
quantities: the data, :math:`\mathcal{D}`\ , and its covariance function (sometimes referred
to as its *kernel*), :math:`k`\ . The data are fixed, so our modelling problem is exactly that
of choosing a suitable covariance function. Given different problems we certainly wish to
specify different priors over possible functions.
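To make the covariance function's role concrete, here is a minimal, self-contained sketch of
evaluating one covariance function and drawing prior and posterior samples at a grid of
inputs. It is not the code behind the figures above; the squared exponential covariance, its
unit length scale and the illustrative data points are assumptions made purely for this
example, and only ``numpy`` is used.

.. code-block:: python

    import numpy as np

    def squared_exponential(x1, x2, length_scale=1.0):
        """Squared exponential covariance between two sets of scalar inputs."""
        diff = x1[:, None] - x2[None, :]
        return np.exp(-0.5 * (diff / length_scale) ** 2)

    # Inputs at which we want to inspect the process.
    x_star = np.linspace(-5.0, 5.0, 100)
    K_ss = squared_exponential(x_star, x_star)
    jitter = 1e-8 * np.eye(len(x_star))  # small diagonal term for numerical stability

    # Prior samples: a zero-mean multivariate Gaussian whose covariance matrix
    # is the covariance function evaluated at every pair of inputs.
    prior_samples = np.random.multivariate_normal(
        np.zeros(len(x_star)), K_ss + jitter, size=3)

    # Posterior samples given some illustrative noise-free observations,
    # using the standard Gaussian conditioning formulas.
    x_data = np.array([-4.0, -1.5, 0.0, 2.0])
    y_data = np.array([-1.0, 0.5, 1.0, -0.5])
    K = squared_exponential(x_data, x_data) + 1e-8 * np.eye(len(x_data))
    K_s = squared_exponential(x_data, x_star)
    K_inv = np.linalg.inv(K)
    posterior_mean = K_s.T @ K_inv @ y_data
    posterior_cov = K_ss - K_s.T @ K_inv @ K_s
    posterior_samples = np.random.multivariate_normal(
        posterior_mean, posterior_cov + jitter, size=3)

In practice the explicit inverse would be replaced by a Cholesky solve, but the structure is
the same: once the mean is fixed at zero, the covariance function alone determines how the
samples behave.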
Fortunately we have a large library of possible covariance functions available, each of which
represents a different prior on the space of functions.

.. covariance-examples

.. image:: _static/Images/covariance_function_se.*

.. image:: _static/Images/covariance_function_matern_52.*

.. image:: _static/Images/covariance_function_se_long_length.*

.. image:: _static/Images/covariance_function_matern_32.*

.. image:: _static/Images/covariance_function_rq.*

.. image:: _static/Images/covariance_function_periodic.*

Samples drawn from GPs with the same data and different covariance functions. Typical samples
from the posterior of GPs with different covariance functions have different characteristics.
The periodic covariance function’s primary characteristic is self-explanatory. The other
covariance functions affect the smoothness of the samples in different ways.

The code to generate the above figures:

.. literalinclude:: _static/Code/figure_covariance_examples.py
    :language: python
    :lines: 5-


Combining covariance functions and noisy data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Furthermore, the point-wise product and sum of covariance functions are themselves covariance
functions. In this way we can combine simple covariance functions to represent more
complicated beliefs we have about our functions.

Normally we are modelling a system where we do not actually have access to the target values,
:math:`y`\ , but only noisy versions of them, :math:`y + \epsilon`\ . If we assume
:math:`\epsilon` has a Gaussian distribution with variance :math:`\sigma_n^2`\ , we can
incorporate this noise term into our covariance function. This requires that our noisy GP’s
covariance function, :math:`k_{\textrm{noise}}(x_1, x_2)`\ , is aware of whether :math:`x_1`
and :math:`x_2` are the same input, as we may have two noisy measurements at the same point
in :math:`\mathcal{X}`\ .

.. math::

    k_{\textrm{noise}}(x_1, x_2) = k(x_1, x_2) + \delta(x_1 = x_2) \, \sigma_n^2

.. image:: _static/Images/noise_low.*

.. image:: _static/Images/noise_mid.*

.. image:: _static/Images/noise_high.*

GP predictions with varying levels of noise. The covariance function is a squared exponential
with additive noise of levels 0.0001, 0.1 and 1.

The code to generate the above figures:

.. literalinclude:: _static/Code/figure_noise.py
    :language: python
    :lines: 5-


Learning covariance function parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most of the commonly used covariance functions are parameterised. The parameters can be fixed
if we are confident in our understanding of the problem. Alternatively, we can treat them as
hyper-parameters in our Bayesian inference task and optimise them using a technique such as
maximum likelihood estimation, typically implemented with a gradient-based optimiser such as
conjugate gradients.

.. image:: _static/Images/learning_first_guess.*

.. image:: _static/Images/learning_learnt.*

The effects of learning covariance function hyper-parameters. The predictions in the second
figure seem to fit the data more accurately.

The code to generate the above figures:

.. literalinclude:: _static/Code/figure_learning.py
    :language: python
    :lines: 5-
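To show the mechanics of the maximum likelihood approach, here is a minimal sketch that learns
a length scale and a noise variance by minimising the negative log marginal likelihood. It is
independent of the figure scripts above; the squared exponential covariance, the toy data and
the initial guess are all assumptions made only for this example, and ``scipy`` is used for
the optimisation.

.. code-block:: python

    import numpy as np
    from scipy.optimize import minimize

    def squared_exponential(x1, x2, length_scale):
        """Squared exponential covariance between two sets of scalar inputs."""
        diff = x1[:, None] - x2[None, :]
        return np.exp(-0.5 * (diff / length_scale) ** 2)

    # Illustrative noisy observations (not the data used in the figures).
    x_data = np.array([-4.0, -2.5, -1.0, 0.0, 1.5, 3.0])
    y_data = np.sin(x_data) + 0.1 * np.random.randn(len(x_data))

    def negative_log_marginal_likelihood(log_params):
        """Negative log marginal likelihood under a squared exponential
        covariance with additive Gaussian noise. The hyper-parameters are
        optimised in log space so that they stay positive."""
        length_scale, noise_variance = np.exp(log_params)
        K = squared_exponential(x_data, x_data, length_scale)
        K += noise_variance * np.eye(len(x_data))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_data))  # K^{-1} y
        return (0.5 * y_data @ alpha                 # data fit term
                + np.sum(np.log(np.diag(L)))         # 0.5 * log |K|
                + 0.5 * len(x_data) * np.log(2.0 * np.pi))

    # Start from an arbitrary first guess and optimise the hyper-parameters.
    result = minimize(negative_log_marginal_likelihood,
                      x0=np.log([1.0, 0.1]), method='Nelder-Mead')
    length_scale, noise_variance = np.exp(result.x)
    print('learnt length scale:', length_scale)
    print('learnt noise variance:', noise_variance)

A gradient-based optimiser such as conjugate gradients would normally be used together with
the analytical derivatives of the marginal likelihood; the derivative-free optimiser here just
keeps the example short.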