:orphan:

..
    _href from docs/source/index.rst

******************
Gaussian Processes
******************

GP Basics
=========

Sometimes an unknown parameter or variable in a model is not a scalar value or
a fixed-length vector, but a *function*.  A Gaussian process (GP) can be used
as a prior probability distribution whose support is over the space of
continuous functions.  A GP prior on the function :math:`f(x)` is usually written,

.. math::

  f(x) \sim \mathcal{GP}(m(x), \, k(x, x')) \,.

The function values are modeled as a draw from a multivariate normal
distribution that is parameterized by the mean function, :math:`m(x)`, and the
covariance function, :math:`k(x, x')`.  Gaussian processes are a convenient
choice as priors over functions due to the marginalization and conditioning
properties of the multivariate normal distribution.  Usually, the marginal
distribution over :math:`f(x)` is evaluated during the inference step.  The
conditional distribution is then used for predicting the function values
:math:`f(x_*)` at new points, :math:`x_*`.

The joint distribution of :math:`f(x)` and :math:`f(x_*)` is multivariate
normal,

.. math::

  \begin{bmatrix} f(x) \\ f(x_*) \\ \end{bmatrix} \sim
  \text{N}\left(
    \begin{bmatrix} m(x)   \\ m(x_*)   \\ \end{bmatrix} \,,
    \begin{bmatrix} k(x, x')   & k(x, x_*)     \\
                    k(x_*, x)  & k(x_*, x_*')  \\ \end{bmatrix}
  \right) \,.

Starting from the joint distribution, one obtains the marginal distribution
of :math:`f(x)` as :math:`\text{N}(m(x),\, k(x, x'))`.  The conditional
distribution is

.. math::

  f(x_*) \mid f(x) \sim \text{N}\left( k(x_*, x) k(x, x)^{-1} [f(x) - m(x)] + m(x_*) ,\,
    k(x_*, x_*) - k(x_*, x) k(x, x)^{-1} k(x, x_*) \right) \,.

.. note::

  For more information on GPs, check out the book `Gaussian Processes for
  Machine Learning <http://www.gaussianprocess.org/gpml/>`_ by Rasmussen &
  Williams, or `this introduction <https://www.ics.uci.edu/~welling/teaching/KernelsICS273B/gpB.pdf>`_
  by David MacKay.

PyMC3 is a great environment for working with fully Bayesian Gaussian process
models.  GPs in PyMC3 have a clear syntax and are highly composable; many
predefined covariance functions (or kernels) and mean functions are included,
along with several GP implementations.  GPs are treated as distributions that
can be used within larger or hierarchical models, not just as standalone
regression models.

Mean and covariance functions
=============================

Those who have used the GPy or GPflow Python packages will find the syntax for
constructing mean and covariance functions somewhat familiar.  When first
instantiated, the mean and covariance functions are parameterized, but not
given their inputs yet.  The covariance functions must additionally be provided
with the number of dimensions of the input matrix, and a list that indexes
which of those dimensions they are to operate on.  The reason for this design
is so that covariance functions can be constructed that are combinations of
other covariance functions.

For example, to construct an exponentiated quadratic covariance function that
operates on the second and third columns of a three-column matrix representing
three predictor variables::

    ls = [2, 5]  # the lengthscales
    cov_func = pm.gp.cov.ExpQuad(input_dim=3, ls=ls, active_dims=[1, 2])

Here the :code:`ls`, or lengthscale, parameter is two dimensional, allowing the
second and third dimensions to have different lengthscales.  We have to specify
both :code:`input_dim`, the total number of columns of :code:`X`, and
:code:`active_dims`, which of those columns or dimensions the covariance
function will act on, because :code:`cov_func` hasn't actually seen the
input data yet.  The :code:`active_dims` argument is optional, and defaults to
all columns of the matrix of inputs.
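
For instance, if :code:`active_dims` is omitted, the covariance function acts
on every column of the input.  The following sketch (with made-up lengthscale
values) is therefore equivalent to setting :code:`active_dims=[0, 1, 2]`::

    # acts on all three input columns by default
    cov_func = pm.gp.cov.ExpQuad(input_dim=3, ls=[1.0, 2.0, 5.0])
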

Covariance functions in PyMC3 closely follow the algebraic rules for kernels,
which allows users to combine covariance functions into new ones, for example
(see the combined sketch after this list):

- The sum of two covariance functions is also a covariance function::

    cov_func = pm.gp.cov.ExpQuad(...) + pm.gp.cov.ExpQuad(...)

- The product of two covariance functions is also a covariance function::

    cov_func = pm.gp.cov.ExpQuad(...) * pm.gp.cov.Periodic(...)

- The product (or sum) of a covariance function with a scalar is a
  covariance function::

    cov_func = eta**2 * pm.gp.cov.Matern32(...)

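
Putting these rules together, a covariance function for, say, a smooth trend
plus a periodic component might be sketched as follows (the priors and
hyperparameter choices here are purely illustrative)::

    with pm.Model() as model:
        eta = pm.HalfCauchy("eta", beta=2)
        ell = pm.Gamma("ell", alpha=2, beta=1)
        period = pm.Gamma("period", alpha=2, beta=1)

        # sum of a smooth trend kernel and a scaled periodic kernel
        cov_trend = pm.gp.cov.ExpQuad(1, ls=ell)
        cov_seasonal = eta**2 * pm.gp.cov.Periodic(1, period=period, ls=ell)
        cov_func = cov_trend + cov_seasonal
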
After a covariance function is defined, it is evaluated by calling
:code:`cov_func(x, x)` (and a mean function by calling :code:`mean_func(x)`).
Since PyMC3 is built on top of Theano, it is relatively easy to define and
experiment with non-standard covariance and mean functions.  For more
information, check out the tutorial on covariance functions.
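
Because the result of :code:`cov_func(x, x)` is a Theano tensor, it can be
evaluated into a NumPy array for inspection.  A small sketch (the inputs here
are made up)::

    import numpy as np
    import pymc3 as pm

    # five 1-D input points, arranged as a column vector
    X = np.linspace(0, 1, 5)[:, None]

    cov_func = pm.gp.cov.ExpQuad(input_dim=1, ls=0.5)

    # calling the covariance function returns a Theano tensor;
    # .eval() turns it into a (5, 5) NumPy covariance matrix
    K = cov_func(X, X).eval()
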


GP Implementations
==================

PyMC3 includes several GP implementations, including marginal and latent
variable models, as well as some fast approximations.  Their usage follows a
similar pattern:  First, a GP is instantiated with a mean function and a
covariance function.  Then, GP objects can be added together, allowing for
function characteristics to be carefully modeled and separated.  Finally, one
of the `prior`, `marginal_likelihood` or `conditional` methods is called on the
GP object to actually construct the PyMC3 random variable that represents the
function prior.

Using :code:`gp.Latent` as an example, the syntax to first specify the GP
is::

    gp = pm.gp.Latent(mean_func, cov_func)

The first argument is the mean function and the second is the covariance
function.  We've made the GP object, but we haven't made clear which function
it is to be a prior for, what the inputs are, or what parameters it will be
conditioned on.
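
For instance, the mean and covariance functions might be built first and then
passed in (the specific choices below are only illustrative)::

    # a constant (zero) mean and a Matern 5/2 covariance over one input column
    mean_func = pm.gp.mean.Constant(c=0.0)
    cov_func = pm.gp.cov.Matern52(input_dim=1, ls=1.0)

    gp = pm.gp.Latent(mean_func=mean_func, cov_func=cov_func)
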

.. note::

  The :code:`gp.Marginal` class and similar don't have a :code:`prior` method.
  Instead they have a :code:`marginal_likelihood` method that is used similarly,
  but has additional required arguments, such as the observed data, noise,
  or other quantities, depending on the implementation.  See the notebooks for
  examples.  The :code:`conditional` method works similarly.
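
  As a rough sketch of that pattern, a :code:`gp.Marginal` model over observed
  data might look like the following (the data :code:`X`, :code:`y` and the
  noise prior are hypothetical)::

      sigma = pm.HalfNormal("sigma", sigma=1)
      gp = pm.gp.Marginal(cov_func=cov_func)
      y_ = gp.marginal_likelihood("y", X=X, y=y, noise=sigma)
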

Calling the `prior` method will create a PyMC3 random variable that represents
the latent function :math:`f(x) = \mathbf{f}`::

    f = gp.prior("f", X)

:code:`f` is a random variable that can be used within a PyMC3 model like any
other type of random variable.  The first argument is the name of the random
variable representing the function we are placing the prior over.
The second argument is the inputs to the function that the prior is over,
:code:`X`.  The inputs are usually known and present in the data, but they can
also be PyMC3 random variables.  If the inputs are a Theano tensor or a
PyMC3 random variable, the :code:`shape` needs to be given.
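
Putting the pieces together, a minimal :code:`gp.Latent` prior might be
sketched like this (the lengthscale prior and the input array :code:`X` are
placeholders)::

    with pm.Model() as model:
        ell = pm.Gamma("ell", alpha=2, beta=1)
        cov_func = pm.gp.cov.ExpQuad(1, ls=ell)

        gp = pm.gp.Latent(cov_func=cov_func)

        # X is an (n, 1) array of inputs; "f" names the latent function values
        f = gp.prior("f", X=X)
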

Usually at this point, inference is performed on the model.  The
:code:`conditional` method creates the conditional, or predictive,
distribution over the latent function at arbitrary :math:`x_*` input points,
:math:`f(x_*)`.  To construct the conditional distribution we write::

    f_star = gp.conditional("f_star", X_star)
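
After inference, draws of :math:`f(x_*)` can be generated from the conditional.
Assuming a trace has already been obtained with :code:`pm.sample`, one way to do
this is with :code:`pm.sample_posterior_predictive` (a sketch, with hypothetical
variable names)::

    with model:
        pred_samples = pm.sample_posterior_predictive(trace, var_names=["f_star"])
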

Additive GPs
============

The GP implementation in PyMC3 is constructed so that it is easy to define
additive GPs and sample from individual GP components.  We can write::

    gp1 = pm.gp.Marginal(mean_func1, cov_func1)
    gp2 = pm.gp.Marginal(mean_func2, cov_func2)
    gp3 = gp1 + gp2

The GP objects have to be of the same type; a :code:`gp.Marginal` cannot
be added to a :code:`gp.Latent`.

Consider two independent GP-distributed functions, :math:`f_1(x) \sim
\mathcal{GP}\left(m_1(x),\, k_1(x, x')\right)` and :math:`f_2(x) \sim
\mathcal{GP}\left( m_2(x),\, k_2(x, x')\right)`.  The joint distribution of
:math:`f_1,\, f_1^*,\, f_2,\, f_2^*,\, f_1 + f_2` and :math:`f_1^* + f_2^*` is

.. math::

  \begin{bmatrix} f_1 \\ f_1^* \\ f_2 \\ f_2^*
               \\ f_1 + f_2    \\ f_1^* + f_2^* \end{bmatrix} \sim
  \text{N}\left(
    \begin{bmatrix} m_1 \\ m_1^* \\ m_2 \\ m_2^* \\
                    m_1 + m_2    \\ m_1^* + m_2^*   \\ \end{bmatrix} \,,\,
    \begin{bmatrix}
      K_1       &  K_1^*     &  0         &  0         & K_1                 & K_1^*               \\
      K_1^{*^T} &  K_1^{**}  &  0         &  0         & K_1^{*^T}           & K_1^{**}            \\
      0         &  0         &  K_2       &  K_2^*     & K_2                 & K_2^{*}             \\
      0         &  0         &  K_2^{*^T} &  K_2^{**}  & K_2^{*^T}           & K_2^{**}            \\
      K_1       &  K_1^{*}   &  K_2       &  K_2^{*}   & K_1 + K_2           & K_1^{*} + K_2^{*}   \\
      K_1^{*^T} &  K_1^{**}  &  K_2^{*^T} &  K_2^{**}  & K_1^{*^T}+K_2^{*^T} & K_1^{**}+K_2^{**}
    \end{bmatrix}
  \right) \,.

Using the joint distribution to obtain the conditional distribution of :math:`f_1^*`
with the contribution due to :math:`f_1 + f_2` factored out, we get

.. math::

  f_1^* \mid f_1 + f_2 \sim \text{N}\left(
    m_1^* + K_1^{*^T}(K_1 + K_2)^{-1}\left[f_1 + f_2 - m_1 - m_2\right] \,,\,
    K_1^{**} - K_1^{*^T}(K_1 + K_2)^{-1}K_1^* \right) \,.

These equations show how to break down GP models into individual components to see how each
contributes to the data.  For more information, check out `David Duvenaud's PhD
thesis <https://www.cs.toronto.edu/~duvenaud/thesis.pdf>`_.

The GP objects in PyMC3 keep track of these marginals automatically.  The
following code sketch shows how to define the conditional distribution of
:math:`f_2^*`.  We use `gp.Marginal` in the example, but the same works for
other implementations.  The first block fits the GP prior.  We denote
:math:`f_1 + f_2` as just :math:`f` for brevity::

    with pm.Model() as model:
        gp1 = pm.gp.Marginal(mean_func1, cov_func1)
        gp2 = pm.gp.Marginal(mean_func2, cov_func2)

        # gp represents f1 + f2.
        gp = gp1 + gp2

        f = gp.marginal_likelihood("f", X, y, noise)

        trace = pm.sample(1000)

To construct the conditional distribution of :code:`gp1` or :code:`gp2`, we
also need to include the additional arguments, :code:`X`, :code:`y`, and
:code:`noise`::

    with model:
        # conditional distributions of f1 and f2
        f1_star = gp1.conditional("f1_star", X_star,
                                  given={"X": X, "y": y, "noise": noise, "gp": gp})
        f2_star = gp2.conditional("f2_star", X_star,
                                  given={"X": X, "y": y, "noise": noise, "gp": gp})

        # conditional of f1 + f2, `given` not required
        f_star = gp.conditional("f_star", X_star)

This second block produces the conditional distributions.  Notice that extra
arguments are required for the conditionals of :math:`f_1` and :math:`f_2`, but not
for :math:`f`.  This is because those arguments are cached when
:code:`.marginal_likelihood` is called on :code:`gp`.

.. note::
  When constructing conditionals, the additional arguments :code:`X`, :code:`y`,
  :code:`noise` and :code:`gp` must be provided as a dict called :code:`given`!

Since the :code:`marginal_likelihood` methods of :code:`gp1` and :code:`gp2` weren't called,
their conditionals need to be provided with the required inputs.  In the same
fashion as the prior, :code:`f_star`, :code:`f1_star` and :code:`f2_star` are random
variables that can now be used like any other random variable in PyMC3.
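
For example, posterior predictive draws of the total and of each component can
be generated from the trace (a sketch; the variable names match the blocks
above)::

    with model:
        pred = pm.sample_posterior_predictive(
            trace, var_names=["f_star", "f1_star", "f2_star"]
        )
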

Check the notebooks for detailed demonstrations of the usage of GP functionality
in PyMC3.