.. _metropolis_hastings:

The Metropolis-Hastings Algorithm
---------------------------------

| **Markov chain.** Considering a :math:`\sigma`-algebra :math:`\cA` on
  :math:`\Omega`, a Markov chain is a process
  :math:`{(X_k)}_{k\in\Nset}` such that

  .. math::

     \begin{aligned}
         \forall{}(A,x_0,\ldots,x_{k-1})\in\cA\times\Omega^k
         \quad \Prob{X_k\in A \,|\, X_0=x_0, \ldots, X_{k-1}=x_{k-1}}
         = \Prob{X_k\in A \,|\, X_{k-1}=x_{k-1}}.
       \end{aligned}

An example is the *random walk* for which
:math:`X_k = X_{k-1} + \varepsilon_k` where the steps
:math:`\varepsilon_k` are independent and identically distributed.
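
Such a chain is simulated in a few lines; a minimal sketch with Gaussian
steps (the step distribution, starting point and chain length are
arbitrary illustrative choices):

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    n_steps = 1000
    # i.i.d. Gaussian steps; the Markov property holds because X_k depends
    # on the past only through X_{k-1}
    eps = rng.normal(size=n_steps)
    # X_0 = 0, then X_k = X_{k-1} + eps_k
    x = np.concatenate(([0.0], np.cumsum(eps)))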

| **Transition kernel.** A transition kernel on :math:`(\Omega, \cA)` is
  a mapping :math:`K: \Omega \times \cA \rightarrow [0, 1]` such that

-  :math:`\forall{}A\in\cA \quad K(., A)` is measurable;

-  | :math:`\forall{}x\in\Omega \quad K(x, .)` is a probability
     distribution on :math:`(\Omega, \cA)`.

The kernel :math:`K` has density :math:`k` if
:math:`\forall(x,A)\in\Omega\times\cA \quad K(x, A) = \displaystyle\int_A \: k(x, y) \mbox{d}y`.
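
For instance, a Gaussian kernel centered at the current point has
density :math:`k(x, y) = \varphi\left((y-x)/\sigma\right)/\sigma`; the
sketch below (assuming SciPy is available, with an arbitrary
:math:`\sigma`) evaluates :math:`K(x, A)` for an interval :math:`A` by
integrating this density:

.. code-block:: python

    from scipy.stats import norm

    sigma = 1.0  # arbitrary kernel width

    def k(x, y):
        # density of the kernel: a Gaussian centered at the current point x
        return norm.pdf(y, loc=x, scale=sigma)

    def K(x, a, b):
        # K(x, A) for A = [a, b], the integral of k(x, .) over A
        return norm.cdf(b, loc=x, scale=sigma) - norm.cdf(a, loc=x, scale=sigma)

    print(K(0.0, -1.0, 1.0))  # probability that the next state lands in [-1, 1]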

| :math:`{(X_k)}_{k\in\Nset}` is a homogeneous Markov Chain of
  transition :math:`K` if
  :math:`\forall(A,x)\in\cA\times\Omega \quad \Prob{X_k\in{}A | X_{k-1}=x} = K(x, A)`.
| **Some Notations.** Let :math:`{(X_k)}_{k\in\Nset}` be a homogeneous
  Markov Chain of transition :math:`K` on :math:`(\Omega, \cA)` with
  initial distribution :math:`\nu` (that is :math:`X_0 \sim \nu`):

-  :math:`K_\nu` denotes the probability distribution of the Markov
   Chain :math:`{(X_k)}_{k\in\Nset}`;

-  :math:`\nu{}K^k` denotes the probability distribution of :math:`X_k`
   (:math:`X_k \sim \nu{}K^k`);

-  | :math:`K^k` denotes the mapping defined by
     :math:`K^k(x,A) = \Prob{X_k\in{}A|X_0=x}` for all
     :math:`(x,A)\in\Omega\times\cA`.

| **Total variation convergence.** A Markov Chain of distribution
  :math:`K_\nu` is said to converge in total variation distance towards
  the distribution :math:`t` if

  .. math::

     \begin{aligned}
         \lim_{k\to+\infty} \sup_{A\in\cA} \left|
         \nu{}K^k(A) - t(A)
         \right| = 0.
       \end{aligned}

Then the notation used here is :math:`\nu{}K^k \rightarrow_{TV} t`.
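
On a finite state space the supremum over :math:`A` equals half the
:math:`L^1` distance between the probability vectors, so the
convergence can be watched numerically; a sketch with an arbitrary
two-state chain:

.. code-block:: python

    import numpy as np

    K = np.array([[0.9, 0.1],
                  [0.2, 0.8]])            # arbitrary transition matrix
    nu = np.array([1.0, 0.0])             # initial distribution
    t = np.array([2.0 / 3.0, 1.0 / 3.0])  # stationary distribution of K

    dist = nu
    for k in range(1, 21):
        dist = dist @ K                    # nu K^k
        tv = 0.5 * np.abs(dist - t).sum()  # total variation distance
        print(k, tv)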

| **Some interesting properties.** Let :math:`t` be a (target)
  distribution on :math:`(\Omega, \cA)`, then a transition kernel
  :math:`K` is said to be:

-  :math:`t`-invariant if :math:`t{}K = t`;

-  :math:`t`-irreducible if, :math:`\forall(x,A)\in\Omega\times\cA` such
   that :math:`t(A)>0`, :math:`\exists{}k\in\Nset^* \quad {}K^k(x, A) > 0`
   holds.
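
Both properties are easy to verify on a finite state space, where
:math:`t` and :math:`K` become a probability vector and a stochastic
matrix; a small sketch with an arbitrary two-state kernel:

.. code-block:: python

    import numpy as np

    K = np.array([[0.9, 0.1],
                  [0.2, 0.8]])            # arbitrary transition matrix
    t = np.array([2.0 / 3.0, 1.0 / 3.0])  # its stationary distribution

    print(np.allclose(t @ K, t))  # t-invariance: tK = t
    print((K > 0).all())          # t-irreducibility: every set A with
                                  # t(A) > 0 is reached in one step here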

Markov Chain Monte-Carlo techniques make it possible to sample and
integrate according to a distribution :math:`t` which is only known up
to a multiplicative constant. This situation is common in Bayesian
statistics where the “target” distribution, the posterior
:math:`t(\vect{\theta})=\pi(\vect{\theta} | \mat{y})`, is proportional
to the product of prior and likelihood: see equation :eq:`postdistribution`.
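
In code, this simply means the target can be supplied as any function
proportional to the posterior density; a hypothetical sketch (model,
data and names invented for illustration) with a flat prior and a
Gaussian likelihood:

.. code-block:: python

    import numpy as np

    y = np.array([0.8, 1.1, 0.9])  # hypothetical observations

    def log_t(theta):
        # unnormalized log-posterior: flat prior plus Gaussian log-likelihood;
        # the normalizing constant (the evidence) is never needed
        return -0.5 * np.sum((y - theta) ** 2)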

In particular, given a “target” distribution :math:`t` and a
:math:`t`-irreducible transition kernel :math:`Q`, the
Metropolis-Hastings algorithm produces a Markov chain
:math:`{(X_k)}_{k\in\Nset}` of distribution :math:`K_\nu` with the
following properties:

-  the transition kernel of the Markov chain is :math:`t`-invariant;

-  :math:`\nu{}K^k \rightarrow_{TV} t`;

-  the Markov chain satisfies the *ergodic theorem*: let :math:`\phi` be
   a real-valued function such that
   :math:`\mathbb{E}_{X\sim{}t}\left[ |\phi(X)| \right] <+\infty`; then, whatever the
   initial distribution :math:`\nu` is:

   .. math::

      \begin{aligned}
            \displaystyle\frac{1}{n} \sum_{k=1}^n \: \phi(X_k) \tendto{n}{+\infty} \mathbb{E}_{X\sim{}t}\left[ \phi(X) \right]
            \mbox{ almost surely}.
          \end{aligned}

In that sense, simulating :math:`{(X_k)}_{k\in\Nset}` amounts to
sampling according to :math:`t` and can be used to integrate with
respect to the probability measure :math:`t`. Let us remark that the
ergodic theorem implies that
:math:`\displaystyle\frac{1}{n} \sum_{k=1}^n \: \fcar{A}{X_k} \tendto{n}{+\infty} \ProbCond{X\sim{}t}{X\in{}A}` almost surely.
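
In code, once a chain has been simulated, both quantities are plain
averages along the trajectory; a sketch where ``chain`` is a
placeholder array standing for the states :math:`X_1,\ldots,X_n`, and
:math:`\phi` and :math:`A` are arbitrary choices:

.. code-block:: python

    import numpy as np

    # placeholder for a simulated chain X_1, ..., X_n
    chain = np.random.default_rng(1).normal(size=10_000)

    phi = np.square                       # any t-integrable function
    mean_estimate = phi(chain).mean()     # ~ E_t[phi(X)]
    prob_estimate = (chain > 1.0).mean()  # ~ P_t(X in A) with A = (1, +inf)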

By abuse of notation, :math:`t(x)` represents, in the remainder of
this section, a function of :math:`x` which is proportional to the PDF
of the target distribution :math:`t`. Given a transition kernel
:math:`Q` of density :math:`q`, the scheme of the Metropolis-Hastings
algorithm is the following (lower case letters are used hereafter for
both random variables and realizations, as usual in the Bayesian
literature):

0)
    Draw :math:`x_0 \sim \nu` and set :math:`k = 1`.

1)
    Draw a candidate for :math:`x_k` according to the given transition
    kernel :math:`Q`: :math:`\tilde{x} \sim Q(x_{k-1}, .)`.

2)
    Compute the ratio
    :math:`\rho = \displaystyle\frac{t(\tilde{x})/q(x_{k-1},\tilde{x})} {t(x_{k-1})/q(\tilde{x},x_{k-1})}`.

3)
    Draw :math:`u \sim \cU([0, 1])`; if :math:`u \leq \rho` then set
    :math:`x_k = \tilde{x}`, otherwise set :math:`x_k = x_{k-1}`.

4)
    Set :math:`k=k+1` and go back to 1).
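
These steps translate almost verbatim into code; a minimal NumPy/SciPy
sketch (target, proposal and tuning constants are arbitrary
illustrative choices; since the random-walk proposal used here is
symmetric, the :math:`q` terms actually cancel in :math:`\rho`, but
they are kept to mirror step 2 exactly):

.. code-block:: python

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def t(x):
        # target known up to a constant: an unnormalized Gaussian mixture
        return np.exp(-0.5 * (x - 2.0) ** 2) + np.exp(-0.5 * (x + 2.0) ** 2)

    sigma = 1.0  # arbitrary proposal width

    def q(x, y):
        # density of the proposal Q(x, .): a Gaussian centered at x
        return norm.pdf(y, loc=x, scale=sigma)

    # step 0): draw the initial state (here a fixed nu = delta_0) and iterate
    x = 0.0
    chain = []
    for _ in range(10_000):
        x_tilde = rng.normal(loc=x, scale=sigma)                     # step 1)
        rho = (t(x_tilde) / q(x, x_tilde)) / (t(x) / q(x_tilde, x))  # step 2)
        if rng.uniform() <= rho:                                     # step 3)
            x = x_tilde
        chain.append(x)                                              # step 4)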

Of course, if :math:`t` is replaced by a different function of :math:`x`
proportional to it, the algorithm remains unchanged, since :math:`t`
only enters through the ratio
:math:`\frac{t(\tilde{x})}{t(x_{k-1})}`. Moreover, if :math:`Q` proposes
candidates in a uniform manner (constant density :math:`q`), the
candidate :math:`\tilde{x}` is accepted according to a ratio
:math:`\rho` which reduces to the “natural” ratio
:math:`\frac{t(\tilde{x})}{t(x_{k-1})}` of PDFs. The introduction of
:math:`q` in the ratio :math:`\rho` corrects for the bias of a
non-uniform proposal of candidates which would otherwise favor some
areas of :math:`\Omega`.

The :math:`t`-invariance is ensured by the symmetry of the expression of
:math:`\rho` (:math:`t`-reversibility).
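
More precisely, writing
:math:`\alpha(x, y) = \min\left(1, \frac{t(y)q(y,x)}{t(x)q(x,y)}\right)`
for the acceptance probability of step 3), the accepted moves satisfy
the detailed balance condition

.. math::

   \begin{aligned}
       t(x) \, q(x, y) \, \alpha(x, y)
       = \min\left( t(x) q(x, y), t(y) q(y, x) \right)
       = t(y) \, q(y, x) \, \alpha(y, x),
     \end{aligned}

which, once integrated over :math:`x`, yields the :math:`t`-invariance
:math:`t{}K = t`.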

In practice, :math:`Q` is specified as a random walk
(:math:`\exists{}q_{RW}` such that :math:`q(x,y)=q_{RW}(x-y)`), as
independent sampling (:math:`\exists{}q_{IS}` such that
:math:`q(x,y)=q_{IS}(y)`), or as a mixture of random walk and
independent sampling.
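
A sketch of the two pure cases as proposal densities (the Gaussian
shapes and widths are arbitrary):

.. code-block:: python

    from scipy.stats import norm

    def q_rw(x, y):
        # random walk: the density only depends on the increment y - x
        return norm.pdf(y - x, scale=1.0)

    def q_is(x, y):
        # independent sampling: the density ignores the current state x
        return norm.pdf(y, loc=0.0, scale=3.0)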

| The important property the practitioner has to keep in mind when
  choosing the transition kernel :math:`Q` is the
  :math:`t`-irreducibility. Moreover, for efficient convergence,
  :math:`Q` has to be chosen so as to explore the whole support of
  :math:`t` quickly without driving the acceptance rate (the proportion
  of accepted candidates :math:`\tilde{x}`) too low. It is usually
  recommended that this rate be about :math:`0.2`, but such a rate is
  neither a guarantee of efficiency nor a substitute for a convergence
  diagnosis.
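
As an illustration, the sketch below (target and step size arbitrary;
the overly large step is chosen on purpose) shows how a too-wide random
walk collapses the acceptance rate:

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    def t(x):
        # standard Gaussian target, known up to a constant
        return np.exp(-0.5 * x ** 2)

    sigma = 5.0  # deliberately large steps: most candidates are rejected
    x, accepted, n = 0.0, 0, 10_000
    for _ in range(n):
        x_tilde = x + rng.normal(scale=sigma)
        # symmetric proposal, so q cancels and rho = t(x_tilde) / t(x)
        if rng.uniform() <= t(x_tilde) / t(x):
            x, accepted = x_tilde, accepted + 1

    print(accepted / n)  # acceptance rate, vs. the ~0.2 rule of thumb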

.. topic:: API:

    - See :class:`~openturns.RandomWalkMetropolisHastings`

.. topic:: Examples:

    - See :doc:`/auto_calibration/bayesian_calibration/plot_bayesian_calibration`

.. topic:: References:

    - Robert, C.P. and Casella, G. (2004). *Monte Carlo Statistical Methods* (Second Edition), Springer.
    - Meyn, S. and Tweedie, R.L. (2009). *Markov Chains and Stochastic Stability* (Second Edition), Cambridge University Press.