Gumbel statistics are often used to estimate the statistical
significance of local alignment scores.

The Gumbel distribution is the so-called Type I extreme value
distribution (EVD). It occurs so frequently in sequence analysis
applications, compared to the type II (Fr\'{e}chet) and type III
(Weibull) extreme value distributions, that ``Gumbel'' and ``EVD'' are
often used interchangeably in bioinformatics. Easel has a separate
module, the \eslmod{gev} module, that implements the generalized
extreme value distribution.

Karlin/Altschul statistics are a special case of the Gumbel
distribution that applies to the scores of ungapped local alignments
between infinitely long random sequences. Empirically, Karlin/Altschul
statistics also apply reasonably well to the more useful case of
gapped alignment of finite-length sequences. Karlin/Altschul
statistics predict how the Gumbel's two parameters depend on the
lengths of the query and target sequences. In the case of ungapped
alignments, Karlin/Altschul statistics allow the Gumbel parameters to
be estimated directly, without the need for a compute-intensive
simulation.

\subsection{The gumbel API}

The \eslmod{gumbel} API consists of the following functions:

\vspace{0.5em}
\begin{center}
\begin{tabular}{ll}\hline
 \multicolumn{2}{c}{\textbf{evaluating densities and distributions:}}\\
\ccode{esl\_gumbel\_pdf()}     & Returns the probability density, $P(S=x)$.\\
\ccode{esl\_gumbel\_logpdf()}  & Returns the log of the pdf, $\log P(S=x)$.\\
\ccode{esl\_gumbel\_cdf()}     & Returns the cumulative probability distribution, $P(S \leq x)$.\\
\ccode{esl\_gumbel\_logcdf()}  & Returns the log of the cdf, $\log P(S \leq x)$.\\
\ccode{esl\_gumbel\_surv()}    & Returns the right tail mass, 1-cdf, $P(S > x)$.\\
\ccode{esl\_gumbel\_logsurv()} & Returns the log of 1-cdf, $\log P(S > x)$.\\
 \multicolumn{2}{c}{\textbf{sampling:}}\\
\ccode{esl\_gumbel\_Sample()}  & Returns a Gumbel-distributed random sample.\\
 \multicolumn{2}{c}{\textbf{maximum a posteriori parameter fitting:}}\\
\ccode{esl\_gumbel\_FitComplete()}    & Estimates $\mu,\lambda$ from complete data.\\
\ccode{esl\_gumbel\_FitCompleteLoc()} & Estimates $\mu$ from complete data when $\lambda$ is known.\\
\ccode{esl\_gumbel\_FitCensored()}    & Estimates $\mu,\lambda$ from censored data.\\
\ccode{esl\_gumbel\_FitCensoredLoc()} & Estimates $\mu$ from censored data when $\lambda$ is known.\\
\ccode{esl\_gumbel\_FitTruncated()}   & Estimates $\mu,\lambda$ from truncated data.\\\hline
\end{tabular}
\end{center}
\vspace{0.5em}

The Gumbel distribution depends on two parameters, $\mu$ and
$\lambda$. When $\mu$ and $\lambda$ are known, the statistical
significance (P-value) of a single score $x$ is $P(S>x)$, obtained by
a call to \ccode{esl\_gumbel\_surv()}. The E-value for obtaining that
score or better in searching a database of $N$ sequences is just
$N P(S>x)$.

When $\mu$ and $\lambda$ are unknown, they are estimated from scores
obtained from comparisons of simulated random data. (Analytical
solutions for $\mu$ and $\lambda$ are only available in the case of
ungapped sequence alignments.) The \ccode{esl\_gumbel\_Fit*()} functions
provide maximum likelihood parameter fitting routines for different
types of data.
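
As a quick illustration of the P-value and E-value calculation described
above, the following minimal C sketch scores a single hypothetical hit when
$\mu$ and $\lambda$ are already known. It assumes that
\ccode{esl\_gumbel\_surv()} takes the score followed by $\mu$ and $\lambda$;
the score, parameters, and database size are arbitrary illustrative values.

\begin{verbatim}
  #include <stdio.h>
  #include "easel.h"
  #include "esl_gumbel.h"

  int
  main(void)
  {
    double mu     = -20.0;       /* known Gumbel location parameter */
    double lambda = 0.4;         /* known Gumbel scale parameter    */
    double x      = 10.0;        /* a hypothetical alignment score  */
    double N      = 1000000.;    /* hypothetical database size      */
    double pval, eval;

    pval = esl_gumbel_surv(x, mu, lambda);  /* P-value: P(S > x)    */
    eval = N * pval;                        /* E-value: N P(S > x)  */

    printf("P-value = %g\nE-value = %g\n", pval, eval);
    return 0;
  }
\end{verbatim}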
\subsection{Example of using the gumbel API}

The following example samples 10,000 data points from a Gumbel distribution
with $\mu=-20$, $\lambda=0.4$; reports the minimum and maximum samples, and
the probability mass to the left of the minimum and to the right of the
maximum (both of which should be about $\frac{1}{10000}$, since we took
10,000 samples); and then fits the simulated data to a Gumbel and
reports the fitted $\mu$ and $\lambda$:

\input{cexcerpts/gumbel_example}

\subsection{Gumbel densities}

The probability density function (pdf) and the cumulative distribution
function (cdf) of the extreme value distribution are:

\begin{equation}
P(x) = \lambda \exp \left[ -\lambda (x - \mu) - e^{- \lambda (x - \mu)} \right]
\label{eqn:gumbel_density}
\end{equation}

\begin{equation}
P(S < x) = \exp \left[ -e^{-\lambda(x - \mu)} \right]
\label{eqn:gumbel_distribution}
\end{equation}

The extreme value density and distribution functions for $\mu = 0$ and
$\lambda = 1.0$ are shown below.

\begin{center}
\includegraphics[width=3in]{figures/evd_basic}
\end{center}

The $\mu$ and $\lambda$ parameters are {\em location} and {\em scale}
parameters, respectively:

\centerline{
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/evd_location}
\end{minipage}
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/evd_scale}
\end{minipage}
}

For more details, a classic reference is \citep{Lawless82}. Gumbel
distributions can have their long tail to the right or to the
left. The form given here is for the long tail to the right. This is
the form that arises when the extreme value is a maximum, such as when
our score is the maximum over the individual scores of all possible
alignments. The equations in \citep{Lawless82} are for extremal
minima; use $(x - u) = -(x - \mu)$ and $b = 1 / \lambda$ to convert
Lawless' notation to the notation used here.

\subsection{Fitting Gumbel distributions to observed data}

Given a set of $n$ observed samples $\mathbf{x}$, we may want to
estimate the $\mu$ and $\lambda$ parameters.

One might try linear regression, fitting a straight line to a $\log
\log$ transformation of the $P(S < x)$ histogram, which has slope
$-\lambda$ and $x$-intercept $\mu$:

\begin{equation}
\log \left[ -\log P(S<x) \right] = -\lambda x + \lambda \mu
\end{equation}

However, the linear regression method is undesirable because it is
sensitive to outliers. The following table shows the \% error for
estimating $\hat{\mu}$ and $\hat{\lambda}$ from 500 simulated complete
datasets, sampled from a Gumbel with $\mu = -20.0$ and $\lambda =
0.4$, for four different dataset sizes:

\begin{center}
\begin{tabular}{lrrrr} \hline
                             & \multicolumn{4}{c}{\# of samples}\\
                             & 100  & 1000 & 10,000 & 100,000 \\
\% error in $\hat{\mu}$      & 2\%  & 1\%  & 0.9\%  & 0.9\% \\
max error in $\hat{\mu}$     & 24\% & 13\% & 10\%   & 10\% \\
\% error in $\hat{\lambda}$  & 12\% & 7\%  & 5\%    & 3\% \\
max error in $\hat{\lambda}$ & 49\% & 33\% & 25\%   & 20\% \\ \hline
\end{tabular}
\end{center}

A better rough estimate of $\hat{\mu}$ and $\hat{\lambda}$ can be
obtained from the sample mean $m$ and variance $s^2$ of the observed
data \citep{Evans00}:\footnote{All simulation data are generated by
the \eslmod{gumbel} module's stats driver.
The only exception is the
linear regression fit data, which come from an old version of HMMER.}

\begin{eqnarray*}
  \hat{\lambda} & = & \frac{\pi}{\sqrt{6s^2}}\\
  \hat{\mu}     & = & m - \frac{0.57722}{\hat{\lambda}}
\end{eqnarray*}

The mean/variance method is more accurate than linear regression, as
shown by the following simulation results:

\begin{center}
\begin{tabular}{lrrrr} \hline
                             & \multicolumn{4}{c}{\# of samples}\\
                             & 100  & 1000  & 10,000 & 100,000 \\
\% error in $\hat{\mu}$      & 1\%  & 0.3\% & 0.1\%  & 0.03\% \\
max error in $\hat{\mu}$     & 5\%  & 1\%   & 0.4\%  & 0.1\% \\
\% error in $\hat{\lambda}$  & 9\%  & 3\%   & 0.8\%  & 0.3\% \\
max error in $\hat{\lambda}$ & 40\% & 12\%  & 3\%    & 0.9\% \\ \hline
\end{tabular}
\end{center}

Still, the mean/variance method is not as accurate as maximum
likelihood estimation (especially for $\lambda$). Also, it requires
complete data, whereas we also need to solve problems where we fit to
\emph{truncated} or \emph{censored} data. Easel's main estimation
methods are therefore maximum likelihood methods.

\subsubsection{Maximum likelihood estimation, complete data}

Given $n$ samples $x_1..x_n$ from some distribution that depends on
parameters $\theta$, we want to find maximum likelihood parameter
estimates $\hat{\theta}$ that maximize the log likelihood:

\[
 \hat{\theta} = \argmax_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta)
\]

These are also \emph{maximum a posteriori} parameter estimates, if we
assume a uniform prior $P(\theta)$.

Specifically, for samples $x_i$ drawn from an extreme value
distribution, the log likelihood to optimize is:

\begin{equation}
\log L(\lambda, \mu) = n \log \lambda - \sum_{i=1}^{n} \lambda(x_i -
\mu) - \sum_{i=1}^{n} e^{-\lambda(x_i - \mu)}
\label{eqn:gumbel_logL}
\end{equation}

This objective function is differentiable with respect to $\mu$ and
$\lambda$:

\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
n \lambda - \lambda \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)}
\label{eqn:mupartial}\\%
\frac{\partial \log L}{\partial \lambda} & = &
\frac{n}{\lambda} - \sum_{i=1}^{n} (x_i - \mu) +
\sum_{i=1}^{n} (x_i - \mu) e^{-\lambda (x_i - \mu)}
\label{eqn:lambdapartial}
\end{eqnarray}

The maximum likelihood estimates $\hat{\lambda}$ and $\hat{\mu}$ are
the solutions to $\frac{\partial \log L}{\partial \mu} = 0$ and
$\frac{\partial \log L}{\partial \lambda} = 0$. Lawless
\citep{Lawless82} gives a useful trick here that lets us solve both of
these simultaneously. When (\ref{eqn:mupartial}) is set to zero, it
can be used to express $\hat{\mu}$ in terms of $\hat{\lambda}$:

\begin{eqnarray}
e^{-\hat{\lambda} \hat{\mu}} & = & \frac{1}{n} \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}
\label{eqn:substitute}\\
\hat{\mu} & = & - \frac{1}{\hat{\lambda}}
            \log \left[ \frac{1}{n} \sum_{i=1}^{n} e^{-\hat{\lambda} x_i} \right]
\label{eqn:solvemu}
\end{eqnarray}

Substituting (\ref{eqn:substitute}) into (\ref{eqn:lambdapartial})
gives us an equation for solving $\hat{\lambda}$ in terms of the
$x_i$'s:

\begin{eqnarray}
\frac{1}{\hat{\lambda}} - \frac{1}{n} \sum_{i=1}^{n} x_i +
\frac{\sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i}}
     {\sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
& = & 0
\label{eqn:newtontarget}
\end{eqnarray}

This is our target function.
We could solve it readily enough (by
bisection search, for example) and obtain $\hat{\lambda}$. We can
solve it even faster using the Newton/Raphson algorithm, because the
target function is differentiable with respect to $\hat{\lambda}$:

\begin{equation}
\frac{df}{d\hat{\lambda}} =
\frac{\left( \sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i} \right)^2 }
     {\left( \sum_{i=1}^{n} e^{-\hat{\lambda} x_i} \right)^2 }
-
\frac{\sum_{i=1}^{n} x_i^2 e^{-\hat{\lambda} x_i}}
     {\sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
-
\frac{1}{\hat{\lambda}^2}
\label{eqn:newtonderivative}
\end{equation}

Now, the key equations are (\ref{eqn:solvemu}),
(\ref{eqn:newtontarget}), and (\ref{eqn:newtonderivative}). In
summary, the inference procedure is the following:

\begin{itemize}
\item Guess an initial $\hat{\lambda}$ (using the mean/variance
      method, for example, but any reasonable guess works).
\item Use Newton/Raphson iterations to find the $\hat{\lambda}$ that satisfies
      (\ref{eqn:newtontarget}):
      \begin{itemize}
       \item Calculate the target function $f$ and
             its first derivative $f'$ at $\hat{\lambda}$, using
             (\ref{eqn:newtontarget}) to calculate $f$ and
             (\ref{eqn:newtonderivative}) to calculate $f'$.
       \item If $f$ is within some absolute tolerance of zero
             (e.g., $10^{-6}$), stop; we have found $\hat{\lambda}$.
       \item Else, estimate a new $\hat{\lambda} = \hat{\lambda} - \frac{f}{f'}$,
             and do another iteration.
      \end{itemize}
\item Plug $\hat{\lambda}$ into (\ref{eqn:solvemu}) to get $\hat{\mu}$.
\end{itemize}

This algorithm is implemented in \ccode{esl\_gumbel\_FitComplete()}. An
auxiliary function, \ccode{lawless416()}, calculates the target
function and its derivative (equations (\ref{eqn:newtontarget}) and
(\ref{eqn:newtonderivative})) given the current estimate of
$\hat{\lambda}$. The name comes from Lawless' equation 4.1.6, the
target function \citep{Lawless82}.

The accuracy of fitting to simulated data (generated with $\mu=-20$
and $\lambda=0.4$), collated over 500 simulations, is shown in the
following table:

\begin{center}
\begin{tabular}{lrrrr} \hline
                             & \multicolumn{4}{c}{\# of samples}\\
                             & 100  & 1000  & 10,000 & 100,000 \\
\% error in $\hat{\mu}$      & 1\%  & 0.3\% & 0.1\%  & 0.03\% \\
max error in $\hat{\mu}$     & 4\%  & 2\%   & 0.5\%  & 0.1\% \\
\% error in $\hat{\lambda}$  & 6\%  & 2\%   & 0.6\%  & 0.2\% \\
max error in $\hat{\lambda}$ & 36\% & 9\%   & 2\%    & 0.8\% \\ \hline
\end{tabular}
\end{center}

This is in accord with theoretical expectation. The distribution of
$\frac{\lambda}{\hat{\lambda}}$ is approximately normal with mean 1 and
standard error $\frac{0.78}{\sqrt{N}}$ \citep{Lawless82,Altschul01}.

% Altschul says \frac{\hat{\lambda}}{\lambda}, actually, but I believe
% that's wrong. xref J1/46.

\subsubsection{Maximum likelihood fitting to censored data}

A \emph{censored} data problem is one in which we have $N$ samples, but we
only observe the values of a subset of $n$ samples $x_1..x_n$ that are
greater than or equal to some cutoff $\phi$. The remaining $z = N-n$
samples are \emph{censored}, and for these we only know that $x <
\phi$. $x_1..x_n$, $n$, $\phi$, and $z$ are all known in a censored
data problem.
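
As a concrete illustration (and a sketch of how the censored-fit
simulations below might be set up), the following C fragment builds a
censored dataset by sampling a complete dataset and discarding everything
below the cutoff, keeping only the count $z$. It assumes that
\ccode{esl\_gumbel\_Sample()} takes a random number generator (an
\ccode{ESL\_RANDOMNESS} from \ccode{esl\_random.h}) along with $\mu$ and
$\lambda$; the seed, sample size, and cutoff are arbitrary illustrative
choices.

\begin{verbatim}
  #include "easel.h"
  #include "esl_random.h"
  #include "esl_gumbel.h"

  int
  main(void)
  {
    ESL_RANDOMNESS *rng = esl_randomness_Create(42);
    double mu = -20.0, lambda = 0.4;  /* parameters of the sampled Gumbel    */
    double phi = -20.0;               /* censoring cutoff                    */
    int    N   = 10000;               /* total number of samples             */
    double x[10000];                  /* the n observed samples, x_i >= phi  */
    int    n   = 0;                   /* number of observed samples          */
    int    z   = 0;                   /* number of censored samples          */
    int    i;
    double s;

    for (i = 0; i < N; i++)
      {
        s = esl_gumbel_Sample(rng, mu, lambda);
        if (s >= phi) x[n++] = s;
        else          z++;
      }
    /* x[0..n-1], n, z, and phi are now the knowns of a censored data problem. */

    esl_randomness_Destroy(rng);
    return 0;
  }
\end{verbatim}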
To estimate maximum likelihood parameters $\hat{\theta}$ for some
distribution from censored data \citep{Gelman95}, we seek the
parameters that maximize the log likelihood:

\[
 \hat{\theta} = \argmax_{\theta} z \log P(x<\phi \mid \theta)
              + \sum_{i=1}^n \log P(x_i \mid \theta)
\]

Specifically, when fitting a Gumbel distribution, the log likelihood
to optimize is:

\begin{equation}
 \log L(\lambda, \mu) =
   n \log \lambda
   - z e^{-\lambda(\phi - \mu)}
   - \sum_{i=1}^{n} \lambda(x_i - \mu)
   - \sum_{i=1}^{n} e^{-\lambda(x_i - \mu)}
\label{eqn:censor_logL}
\end{equation}

To optimize this, we follow a procedure similar to the one used for
complete data \citep{Lawless82}. The log likelihood is differentiable
with respect to $\lambda$ and $\mu$:

\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
n \lambda
- z \lambda e^{-\lambda (\phi - \mu)}
- \lambda \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)}
\label{eqn:censor_dmu}
\\%
\frac{\partial \log L}{\partial \lambda} & = &
\frac{n}{\lambda}
+ z (\phi - \mu) e^{-\lambda (\phi - \mu)}
- \sum_{i=1}^{n} (x_i - \mu)
+ \sum_{i=1}^{n} (x_i - \mu) e^{-\lambda (x_i - \mu)}
\label{eqn:censor_dlambda}
\end{eqnarray}

Setting (\ref{eqn:censor_dmu}) to zero and solving for $\hat{\mu}$ in
terms of $\hat{\lambda}$ gives:

\begin{equation}
\hat{\mu} = - \frac{1}{\hat{\lambda}}
            \log \left[ \frac{1}{n}
                 \left( z e^{-\hat{\lambda} \phi}
                      + \sum_{i=1}^{n} e^{-\hat{\lambda} x_i} \right)
                \right]
\label{eqn:censor_solvemu}
\end{equation}

Substituting (\ref{eqn:censor_solvemu}) into
(\ref{eqn:censor_dlambda}) gives the target equation:

\begin{equation}
\frac{1}{\hat{\lambda}}
- \frac{1}{n} \sum_{i=1}^{n} x_i +
\frac{z \phi e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i}}
     {z e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
 = 0
\label{eqn:censor_newtontarget}
\end{equation}

To use Newton/Raphson root finding (instead of a slower bisection
search) we also need the first derivative of this target equation with
respect to $\hat{\lambda}$:

\begin{equation}
\frac{df}{d\hat{\lambda}} =
\frac{\left(
       z \phi e^{-\hat{\lambda} \phi}
       + \sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i}
      \right)^2 }
     {\left(
       z e^{-\hat{\lambda} \phi}
       + \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}
      \right)^2 }
-
\frac{z \phi^2 e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} x_i^2 e^{-\hat{\lambda} x_i}}
     {z e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
-
\frac{1}{\hat{\lambda}^2}
\label{eqn:censor_newtonderiv}
\end{equation}

In summary: given $n$ observed samples $x_1..x_n$ from a total sample
of $N$ samples, $z = N-n$ of which were censored because they have
values $< \phi$, we solve for maximum likelihood estimates
$\hat{\lambda}$ and $\hat{\mu}$ using the same procedure we used for
complete data, with equations (\ref{eqn:censor_solvemu}),
(\ref{eqn:censor_newtontarget}), and (\ref{eqn:censor_newtonderiv}) in
place of equations (\ref{eqn:solvemu}), (\ref{eqn:newtontarget}), and
(\ref{eqn:newtonderivative}). (As a sanity check, setting $z = 0$
recovers the complete-data equations.) Easel implements this procedure in
\ccode{esl\_gumbel\_FitCensored()}. The target function
(\ref{eqn:censor_newtontarget}) and its derivative
(\ref{eqn:censor_newtonderiv}) are implemented in the auxiliary
function \ccode{lawless422()} \citep{Lawless82}.
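
Picking up the sketch from the previous subsubsection, the censored fit
itself is then a single call. The argument order shown here (data, $n$,
$z$, $\phi$, then the returned $\hat{\mu}$ and $\hat{\lambda}$) is an
assumption to be checked against \ccode{esl\_gumbel.h}:

\begin{verbatim}
  /* x[0..n-1] holds the n observed samples (all >= phi);
   * z samples below phi were censored.                       */
  double est_mu, est_lambda;

  esl_gumbel_FitCensored(x, n, z, phi, &est_mu, &est_lambda);
  /* est_mu, est_lambda now hold the ML estimates of mu, lambda. */
\end{verbatim}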
Results on 500 simulated datasets with $\mu = -20, \lambda = 0.4$,
censored at $\phi = -20$ -- the expected peak of the histogram; that
is, a censored fit only to the right tail, which contains about 63\%
of the samples:

\begin{center}
\begin{tabular}{lrrrr} \hline
                             & \multicolumn{4}{c}{\# samples in EVD histogram}\\
                             & 100  & 1000  & 10,000 & 100,000 \\
\% error in $\hat{\mu}$      & 1\%  & 0.4\% & 0.1\%  & 0.04\% \\
max error in $\hat{\mu}$     & 5\%  & 2\%   & 0.5\%  & 0.2\% \\
\% error in $\hat{\lambda}$  & 9\%  & 3\%   & 0.9\%  & 0.3\% \\
max error in $\hat{\lambda}$ & 33\% & 11\%  & 3\%    & 1\% \\ \hline
\end{tabular}
\end{center}

\subsubsection{Maximum likelihood fitting to truncated data}

A \emph{truncated} dataset is one in which we only observe $n$ samples
$x_i$; an \emph{unknown} number $z$ of samples less than some threshold
$\phi$ went unobserved. Thus, only the right tail of $n$ samples $x_i
\geq \phi$ is observed. In a truncated dataset, $x_1..x_n$, $n$, and
$\phi$ are known, but $z$ is unknown.

Solving a truncated data problem motivates a Bayesian approach,
because we need to integrate out (marginalize) the nuisance parameter
$z$, and to do this, we have to specify a prior distribution
$P(z)$. Gelman \emph{et al.} describe a general Bayesian framework for
thinking about various types of missing data problems, including
censored and truncated data \citep{Gelman95}.

In short, to obtain maximum likelihood parameters $\hat{\theta}$ for
some distribution, given truncated data, we seek the parameters that
maximize the log likelihood:

\begin{equation}
 \hat{\theta} = \argmax_\theta -n \log P(x \geq \phi \mid \theta)
              + \sum_{i=1}^n \log P(x_i \mid \theta).
\label{eqn:truncated_objective}
\end{equation}

\textbf{Detour: derivation of the truncated data likelihood}

The derivation of the above equation may not be immediately obvious.
The presence of the $-n \log P(x \geq \phi \mid \theta)$ term may be
counterintuitive, as opposed to the more intuitive $z \log P(x < \phi \mid
\theta)$ term that accounts for the missing data in a censored data
problem. Gelman \emph{et al.} actually don't even show the equation in
their book; I obtained it from an exercise solution on their web site.
To convince you (and to remind me) of its correctness, a sketch of the
derivation follows.

We start with the same likelihood equation that arises in censored
data for a \emph{known} total number of samples $N$ (where $N=n+z$),
but since $N$ is unknown, we need to sum over all possible $N$
from $n$ to $\infty$:

\begin{eqnarray*}
  P(\mathbf{x} \mid \theta, \phi) & = &
  \sum_{N=n}^{\infty} P(\mathbf{x} \mid \theta, \phi, N) P(N)\\
  & = &
  \prod_{i=1}^n P(x_i \mid \theta)
  \left[
    \sum_{N=n}^\infty {N \choose n} P(x < \phi \mid \theta)^{N-n} P(N)
  \right]\\
\end{eqnarray*}

The $\prod_{i=1}^n P(x_i \mid \theta)$ term is straightforward; the sum
in brackets is our problem. The trick is to rearrange it so we can treat
it as a convergent negative binomial series:

\[
  (1-p)^{-a} = 1 + ap + \frac{a(a+1)}{2!} p^2 +
  \frac{a(a+1)(a+2)}{3!} p^3 + \cdots
\]

To get the sum into the form of this series, Gelman \emph{et al.}
suggest using an informative prior $P(N) = \frac{1}{N}$, an apparently
unmotivated choice that happens to make the sum collapse nicely:

\begin{eqnarray*}
 \lefteqn{\sum_{N=n}^\infty {N \choose n} P(x < \phi \mid \theta)^{N-n} P(N)}\\
 &=& P(N=n)
     + (n+1) P(x < \phi \mid \theta) P(N=n+1)
     + \frac{(n+1)(n+2)}{2!} P(x < \phi \mid \theta)^2 P(N=n+2) + \cdots\\
 &=& \frac{1}{n} \left[
      1
      + n P(x < \phi \mid \theta)
      + \frac{n(n+1)}{2!} P(x < \phi \mid \theta)^2
      + \frac{n(n+1)(n+2)}{3!} P(x < \phi \mid \theta)^3 + \cdots \right]\\
 &=& \frac{1}{n} (1 - P(x < \phi \mid \theta))^{-n}\\
 &=& \frac{1}{n} P(x \geq \phi \mid \theta)^{-n}\\
\end{eqnarray*}

The $\frac{1}{n}$ is a constant, so we drop it from the likelihood
equation we'll maximize. Putting this term back together with the
probability of the observed data and taking the log, we obtain the log
likelihood in equation (\ref{eqn:truncated_objective}).

Alternatively, we might choose an uninformative improper uniform prior
$P(N) \propto 1$. This gives a log likelihood that differs only by a
term of $n+1$ versus $n$:

\begin{equation}
 \hat{\theta} = \argmax_\theta -(n+1) \log P(x \geq \phi \mid \theta)
              + \sum_{i=1}^n \log P(x_i \mid \theta).
\end{equation}

However, empirically, this form is ineffective, at least for fitting
Gumbels. The $\frac{1}{N}$ prior performs much better, probably
because it constrains the solutions to favor smaller, finite, more
reasonable choices of $N$.

\textbf{Back to fitting a truncated Gumbel}

For the specific case of fitting a truncated Gumbel, the log
likelihood (\ref{eqn:truncated_objective}) to optimize is:

\begin{equation}
 \log L(\lambda, \mu) =
   n \log \lambda
   - \sum_{i=1}^{n} \lambda(x_i - \mu)
   - \sum_{i=1}^{n} e^{-\lambda(x_i - \mu)}
   - n \log \left( 1 - \exp \left( -e^{-\lambda(\phi - \mu)} \right) \right)
\label{eqn:truncated_logL}
\end{equation}

This is differentiable with respect to $\lambda$ and $\mu$, but it's
not going to reduce to the clean root-finding problem that we obtained
for complete data or censored data. Instead we're left with a
numerical optimization problem. We can use standard numerical
optimization code, such as steepest descent or conjugate gradient
descent. There's just one hitch: these algorithms assume unconstrained
parameters, but $\lambda$ is constrained to values $>0$. We do a
change of variables, using the transformation $\lambda = e^w$, so we
can optimize the unconstrained parameter $w = \log \lambda$ instead of
optimizing $\lambda$ directly. The necessary partial derivatives are
then:

\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
n \lambda
- \lambda \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)}
- \frac{n \lambda \exp \left[ -\lambda (\phi - \mu) - e^{- \lambda (\phi - \mu)} \right]}
       {1 - \exp(-e^{-\lambda(\phi - \mu)})}
\label{eqn:truncated_dmu}
\\%
\frac{\partial \log L}{\partial w} & = &
n
- \sum_{i=1}^{n} \lambda(x_i - \mu)
+ \sum_{i=1}^{n} \lambda(x_i - \mu) e^{-\lambda (x_i - \mu)}
+ \frac{n\lambda (\phi-\mu) \exp \left[ -\lambda (\phi - \mu) - e^{- \lambda (\phi - \mu)} \right]}
       {1 - \exp(-e^{-\lambda(\phi - \mu)})}
\label{eqn:truncated_dw}
\end{eqnarray}

This optimization is carried out by \ccode{esl\_gumbel\_FitTruncated()}.
The likelihood (\ref{eqn:truncated_logL}) is implemented in
\ccode{tevd\_func()}, and the derivatives (\ref{eqn:truncated_dmu}) and
(\ref{eqn:truncated_dw}) are implemented in \ccode{tevd\_dfunc()}.
\ccode{esl\_gumbel\_FitTruncated()} simply sets up the problem and passes
it all off to a conjugate gradient descent optimizer.

Results on 500 simulated datasets with $\mu = -20, \lambda = 0.4$,
truncated at $\phi = -20$ (leaving the right tail, containing about
63\% of the samples):

\begin{center}
\begin{tabular}{lrrrr} \hline
                             & \multicolumn{4}{c}{\# samples}\\
                             & 100   & 1000 & 10,000 & 100,000 \\
\% error in $\hat{\mu}$      & 13\%  & 2\%  & 0.8\%  & 0.3\% \\
max error in $\hat{\mu}$     & 260\% & 42\% & 3\%    & 1\% \\
\% error in $\hat{\lambda}$  & 15\%  & 5\%  & 2\%    & 0.6\% \\
max error in $\hat{\lambda}$ & 68\%  & 18\% & 6\%    & 2\% \\ \hline
\end{tabular}
\end{center}

Fitting truncated Gumbel distributions is difficult, requiring much
more data than fitting complete or censored data. The problem is that
the right tail becomes a scale-free exponential when $\phi \gg \mu$,
and $\mu$ becomes undetermined. Fits become very inaccurate as $\phi$
grows larger than $\mu$, and for sufficiently large $\phi$, the
numerical optimizer will fail completely.
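
For completeness, and with that caveat in mind, a truncated fit follows the
same pattern as the censored fit sketched earlier, except that $z$ is not
passed because it is unknown. Again, the exact argument order is an
assumption to be checked against \ccode{esl\_gumbel.h}:

\begin{verbatim}
  /* x[0..n-1] holds the n observed samples (all >= phi);
   * an unknown number of samples below phi went unobserved. */
  double est_mu, est_lambda;

  esl_gumbel_FitTruncated(x, n, phi, &est_mu, &est_lambda);
  /* est_mu, est_lambda now hold the fitted mu, lambda.       */
\end{verbatim}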