.. _metropolis_hastings:

The Metropolis-Hastings Algorithm
---------------------------------

| **Markov chain.** Considering a :math:`\sigma`-algebra :math:`\cA` on
  :math:`\Omega`, a Markov chain is a process
  :math:`{(X_k)}_{k\in\Nset}` such that

  .. math::

     \begin{aligned}
       \forall{}(A,x_0,\ldots,x_{k-1})\in\cA\times\Omega^k
       \quad \Prob{X_k\in A \,|\, X_0=x_0, \ldots, X_{k-1}=x_{k-1}}
       = \Prob{X_k\in A \,|\, X_{k-1}=x_{k-1}}.
     \end{aligned}

An example is the *random walk* for which
:math:`X_k = X_{k-1} + \varepsilon_k` where the steps
:math:`\varepsilon_k` are independent and identically distributed.

| **Transition kernel.** A transition kernel on :math:`(\Omega, \cA)` is
  a mapping :math:`K: \Omega \times \cA \rightarrow [0, 1]` such that

- :math:`\forall{}A\in\cA \quad K(., A)` is measurable;

- | :math:`\forall{}x\in\Omega \quad K(x, .)` is a probability
    distribution on :math:`(\Omega, \cA)`.

The kernel :math:`K` has density :math:`k` if
:math:`\forall(x,A)\in\Omega\times\cA \quad K(x, A) = \displaystyle\int_A \: k(x, y) \, \mbox{d}y`.

| :math:`{(X_k)}_{k\in\Nset}` is a homogeneous Markov chain of
  transition :math:`K` if
  :math:`\forall(A,x)\in\cA\times\Omega \quad \Prob{X_k\in{}A | X_{k-1}=x} = K(x, A)`.
| **Some notations.** Let :math:`{(X_k)}_{k\in\Nset}` be a homogeneous
  Markov chain of transition :math:`K` on :math:`(\Omega, \cA)` with
  initial distribution :math:`\nu` (that is, :math:`X_0 \sim \nu`):

- :math:`K_\nu` denotes the probability distribution of the Markov
  chain :math:`{(X_k)}_{k\in\Nset}`;

- :math:`\nu{}K^k` denotes the probability distribution of :math:`X_k`
  (:math:`X_k \sim \nu{}K^k`);

- | :math:`K^k` denotes the mapping defined by
    :math:`K^k(x,A) = \Prob{X_k\in{}A|X_0=x}` for all
    :math:`(x,A)\in\Omega\times\cA`.
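As a quick illustration of the random walk example, here is a minimal Python sketch; the choice of standard Gaussian steps (and of :math:`X_0 = 0`) is an arbitrary assumption for the illustration:

```python
import numpy as np

def random_walk(n_steps, rng=None):
    """Simulate X_k = X_{k-1} + eps_k with i.i.d. standard Gaussian steps.

    The next state depends on the past only through the current state,
    so the process is a Markov chain.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    steps = rng.standard_normal(n_steps)              # i.i.d. increments eps_k
    return np.concatenate(([0.0], np.cumsum(steps)))  # X_0 = 0, then partial sums

path = random_walk(1000)  # trajectory of X_0, ..., X_1000
```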
| **Total variation convergence.** A Markov chain of distribution
  :math:`K_\nu` is said to converge in total variation distance towards
  the distribution :math:`t` if

  .. math::

     \begin{aligned}
       \lim_{k\to+\infty} \sup_{A\in\cA} \left|
       \nu{}K^k(A) - t(A)
       \right| = 0.
     \end{aligned}

Then the notation used here is :math:`\nu{}K^k \rightarrow_{TV} t`.

| **Some interesting properties.** Let :math:`t` be a (target)
  distribution on :math:`(\Omega, \cA)`; then a transition kernel
  :math:`K` is said to be:

- :math:`t`-invariant if :math:`t{}K = t`;

- :math:`t`-irreducible if, :math:`\forall(A,x)\in\cA\times\Omega` such
  that :math:`t(A)>0`, :math:`\exists{}k\in\Nset^* \quad K^k(x, A) > 0`
  holds.

Markov chain Monte Carlo techniques allow one to sample from, and
integrate with respect to, a distribution :math:`t` which is only known
up to a multiplicative constant. This situation is common in Bayesian
statistics where the “target” distribution, the posterior
:math:`t(\vect{\theta})=\pi(\vect{\theta} | \mat{y})`, is proportional
to the product of prior and likelihood: see equation
:eq:`postdistribution`.

In particular, given a “target” distribution :math:`t` and a
:math:`t`-irreducible transition kernel :math:`Q`, the
Metropolis-Hastings algorithm produces a Markov chain
:math:`{(X_k)}_{k\in\Nset}` of distribution :math:`K_\nu` with the
following properties:

- the transition kernel of the Markov chain is :math:`t`-invariant;

- :math:`\nu{}K^k \rightarrow_{TV} t`;

- the Markov chain satisfies the *ergodic theorem*: let :math:`\phi` be
  a real-valued function such that
  :math:`\mathbb{E}_{X\sim{}t}\left[ |\phi(X)| \right] <+\infty`; then,
  whatever the initial distribution :math:`\nu` is:

  .. math::

     \begin{aligned}
       \displaystyle\frac{1}{n} \sum_{k=1}^n \: \phi(X_k) \tendto{k}{+\infty} \mathbb{E}_{X\sim{}t}\left[ \phi(X) \right]
       \mbox{ almost surely}.
     \end{aligned}

In that sense, simulating :math:`{(X_k)}_{k\in\Nset}` amounts to
sampling according to :math:`t` and can be used to integrate with
respect to the probability measure :math:`t`. Let us remark that the
ergodic theorem implies that
:math:`\displaystyle\frac{1}{n} \sum_{k=1}^n \: \fcar{A}{X_k} \tendto{k}{+\infty} \ProbCond{X\sim{}t}{X\in{}A}` almost surely.

With a slight abuse of notation, :math:`t(x)` represents, in the
remainder of this section, a function of :math:`x` which is proportional
to the PDF of the target distribution :math:`t`. Given a transition
kernel :math:`Q` of density :math:`q`, the scheme of the
Metropolis-Hastings algorithm is the following (as is usual in the
Bayesian literature, lower case letters are used hereafter for both
random variables and realizations):

0)
   Draw :math:`x_0 \sim \nu` and set :math:`k = 1`.

1)
   Draw a candidate for :math:`x_k` according to the given transition
   kernel :math:`Q`: :math:`\tilde{x} \sim Q(x_{k-1}, .)`.

2)
   Compute the ratio
   :math:`\rho = \displaystyle\frac{t(\tilde{x})/q(x_{k-1},\tilde{x})} {t(x_{k-1})/q(\tilde{x},x_{k-1})}`.

3)
   Draw :math:`u \sim \cU([0, 1])`; if :math:`u \leq \rho`, then set
   :math:`x_k = \tilde{x}`, otherwise set :math:`x_k = x_{k-1}`.

4)
   Set :math:`k=k+1` and go back to 1).

Of course, if :math:`t` is replaced by a different function of :math:`x`
which is proportional to it, the algorithm is unchanged, since :math:`t`
enters it only through the ratio
:math:`\frac{t(\tilde{x})}{t(x_{k-1})}`. Moreover, if :math:`Q` proposes
candidates in a uniform manner (constant density :math:`q`), the
candidate :math:`\tilde{x}` is accepted according to a ratio
:math:`\rho` which reduces to the “natural” ratio of PDF values
:math:`\frac{t(\tilde{x})}{t(x_{k-1})}`.
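The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the OpenTURNS implementation: it assumes a symmetric Gaussian random-walk proposal (so the density :math:`q` cancels in :math:`\rho`) and takes, as an example, a target known only up to a constant, :math:`t(x) \propto e^{-x^2/2}`:

```python
import numpy as np

def metropolis_hastings(t, x0, n, step=1.0, rng=None):
    """Metropolis-Hastings with a symmetric Gaussian random-walk proposal.

    t: a function proportional to the target PDF (normalization not needed).
    Returns the chain x_0, ..., x_n.
    """
    rng = np.random.default_rng(42) if rng is None else rng
    x = np.empty(n + 1)
    x[0] = x0                                                # step 0: set x_0
    for k in range(1, n + 1):
        candidate = x[k - 1] + step * rng.standard_normal()  # step 1: draw x~
        rho = t(candidate) / t(x[k - 1])                     # step 2: q symmetric, cancels
        if rng.uniform() <= rho:                             # step 3: accept the candidate...
            x[k] = candidate
        else:                                                # ...or repeat the current state
            x[k] = x[k - 1]
    return x

# Unnormalized standard Gaussian target: t(x) proportional to exp(-x^2 / 2)
chain = metropolis_hastings(lambda x: np.exp(-0.5 * x * x), x0=0.0, n=20000)
```

By the ergodic theorem, empirical averages over `chain` approximate expectations under the (here standard Gaussian) target, e.g. `chain.mean()` is close to 0.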
The introduction of :math:`q` in the ratio :math:`\rho` corrects for the
bias that a non-uniform proposal of candidates would otherwise
introduce by favoring some areas of :math:`\Omega`.

The :math:`t`-invariance is ensured by the symmetry of the expression of
:math:`\rho` (:math:`t`-reversibility).

In practice, :math:`Q` is specified either as a random walk
(:math:`\exists{}q_{RW}` such that :math:`q(x,y)=q_{RW}(x-y)`), as
independent sampling (:math:`\exists{}q_{IS}` such that
:math:`q(x,y)=q_{IS}(y)`), or as a mixture of random walk and
independent sampling.

| The important property the practitioner has to keep in mind when
  choosing the transition kernel :math:`Q` is
  :math:`t`-irreducibility. Moreover, for efficient convergence,
  :math:`Q` has to be chosen so as to explore the whole support of
  :math:`t` quickly without leading to a too small acceptance rate (the
  proportion of accepted candidates :math:`\tilde{x}`). It is usually
  recommended that this rate be about :math:`0.2`, but such a value is
  neither a guarantee of efficiency nor a substitute for convergence
  diagnostics.

.. topic:: API:

   - See :class:`~openturns.RandomWalkMetropolisHastings`

.. topic:: Examples:

   - See :doc:`/auto_calibration/bayesian_calibration/plot_bayesian_calibration`

.. topic:: References:

   - Robert, C.P. and Casella, G. (2004). *Monte Carlo Statistical Methods* (Second Edition), Springer.
   - Meyn, S. and Tweedie, R.L. (2009). *Markov Chains and Stochastic Stability* (Second Edition), Cambridge University Press.