
Gumbel statistics are often used to estimate the statistical
significance of local alignment scores.

The Gumbel distribution is the so-called Type I extreme value
distribution (EVD). It occurs so frequently in sequence analysis
applications, compared to the type II (Fr\'{e}chet) and type III
(Weibull) extreme value distributions, that ``Gumbel'' and ``EVD'' are
often used interchangeably in bioinformatics. Easel has a separate
module, the \eslmod{gev} module, that implements the generalized
extreme value distribution.

Karlin/Altschul statistics are a special case of the Gumbel
distribution that apply to the scores of ungapped local alignments
between infinitely long random sequences. Empirically, Karlin/Altschul
statistics also apply reasonably well to the more useful case of
gapped alignment of finite-length sequences. Karlin/Altschul
statistics predict how the Gumbel's two parameters depend on the
length of the query and target sequences. In the case of ungapped
alignments, Karlin/Altschul statistics allow the Gumbel parameters to
be estimated directly, without the need for a compute-intensive
simulation.

\subsection{The gumbel API}

The \eslmod{gumbel} API consists of the following functions:

\vspace{0.5em}
\begin{center}
\begin{tabular}{ll}\hline
    \multicolumn{2}{c}{\textbf{evaluating densities and distributions:}}\\
\ccode{esl\_gumbel\_pdf()}     & Returns the probability density, $P(S=x)$.\\
\ccode{esl\_gumbel\_logpdf()}  & Returns the log of the pdf, $\log P(S=x)$.\\
\ccode{esl\_gumbel\_cdf()}     & Returns the cumulative probability distribution, $P(S \leq x)$.\\
\ccode{esl\_gumbel\_logcdf()}  & Returns the log of the cdf, $\log P(S \leq x)$.\\
\ccode{esl\_gumbel\_surv()}    & Returns the right tail mass, 1-cdf, $P(S > x)$.\\
\ccode{esl\_gumbel\_logsurv()} & Returns the log of 1-cdf, $\log P(S > x)$.\\
    \multicolumn{2}{c}{\textbf{sampling:}}\\
\ccode{esl\_gumbel\_Sample()}  & Returns a Gumbel-distributed random sample.\\
    \multicolumn{2}{c}{\textbf{maximum likelihood parameter fitting:}}\\
\ccode{esl\_gumbel\_FitComplete()} & Estimates $\mu,\lambda$ from complete data.\\
\ccode{esl\_gumbel\_FitCompleteLoc()} & Estimates $\mu$ when $\lambda$ is known.\\
\ccode{esl\_gumbel\_FitCensored()} & Estimates $\mu,\lambda$ from censored data.\\
\ccode{esl\_gumbel\_FitCensoredLoc()} & Estimates $\mu$ when $\lambda$ is known.\\
\ccode{esl\_gumbel\_FitTruncated()}& Estimates $\mu,\lambda$ from truncated data.\\\hline
\end{tabular}
\end{center}
\vspace{0.5em}

The Gumbel distribution depends on two parameters, $\mu$ and
$\lambda$. When $\mu$ and $\lambda$ are known, the statistical
significance (P-value) of a single score $x$ is $P(S>x)$, obtained by
a call to \ccode{esl\_gumbel\_surv()}.  The E-value for obtaining that
score or better in searching a database of $N$ sequences is just
$NP(S>x)$.
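
For concreteness, here is a minimal sketch of that calculation in C.
The parameter values, score, and database size are arbitrary
illustrative numbers, and the argument order $(x, \mu, \lambda)$ for
\ccode{esl\_gumbel\_surv()} is assumed to follow the pattern of Easel's
other distribution routines:

\begin{verbatim}
#include <stdio.h>
#include "easel.h"
#include "esl_gumbel.h"

int
main(void)
{
  double mu     = -20.0;   /* Gumbel location parameter (illustrative) */
  double lambda = 0.4;     /* Gumbel scale parameter (illustrative)    */
  double x      = 12.3;    /* an observed score                        */
  double N      = 1e6;     /* number of sequences searched             */

  double pval = esl_gumbel_surv(x, mu, lambda);  /* P(S > x)           */
  double eval = N * pval;                        /* E-value = N P(S>x) */

  printf("P-value = %g   E-value = %g\n", pval, eval);
  return 0;
}
\end{verbatim}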

When $\mu$ and $\lambda$ are unknown, they are estimated from scores
obtained from comparisons of simulated random data. (Analytical
solutions for $\mu$ and $\lambda$ are only available in the case of
ungapped sequence alignments.)  The \ccode{esl\_gumbel\_Fit*()} functions
provide maximum likelihood parameter fitting routines for different
types of data.

\subsection{Example of using the gumbel API}

The example below samples 10,000 data points from a Gumbel
distribution with $\mu=-20$, $\lambda=0.4$; reports the minimum and
maximum samples, along with the probability mass to the left of the
minimum and to the right of the maximum (each of which should be about
$\frac{1}{10000}$, since we took 10,000 samples); and then fits a
Gumbel to the simulated data and reports the fitted $\mu$ and
$\lambda$:

\input{cexcerpts/gumbel_example}


\subsection{Gumbel densities}

The probability density function (pdf) and the cumulative distribution
function (cdf) of the extreme value distribution are:

\begin{equation}
P(x) = \lambda \exp \left[ -\lambda (x - \mu) - e^{- \lambda (x - \mu)} \right]
\label{eqn:gumbel_density}
\end{equation}

\begin{equation}
P(S < x) = \exp \left[ -e^{-\lambda(x - \mu)} \right]
\label{eqn:gumbel_distribution}
\end{equation}
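
These formulas transcribe directly into code. The following is a
standalone sketch (independent of Easel's own
\ccode{esl\_gumbel\_pdf()} and \ccode{esl\_gumbel\_cdf()}):

\begin{verbatim}
#include <math.h>

/* Gumbel density: lambda * exp(-y - exp(-y)), y = lambda*(x - mu). */
static double
gumbel_pdf(double x, double mu, double lambda)
{
  double y = lambda * (x - mu);
  return lambda * exp(-y - exp(-y));
}

/* Gumbel cumulative distribution: exp(-exp(-lambda*(x - mu))). */
static double
gumbel_cdf(double x, double mu, double lambda)
{
  return exp(-exp(-lambda * (x - mu)));
}
\end{verbatim}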

The extreme value density and distribution functions for $\mu = 0$ and
$\lambda = 1.0$ are shown below.

\begin{center}
\includegraphics[width=3in]{figures/evd_basic}
\end{center}

The $\mu$ and $\lambda$ parameters are {\em location} and {\em scale}
parameters, respectively:

\centerline{
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/evd_location}
\end{minipage}
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/evd_scale}
\end{minipage}
}

For more details, a classic reference is \citep{Lawless82}.  Gumbel
distributions can have their long tail to the right or to the
left. The form given here is for the long tail to the right.  This is
the form that arises when the extreme value is a maximum, such as when
our score is the maximum over the individual scores of all possible
alignments. The equations in \citep{Lawless82} are for extremal
minima; use $(x - u) = -(x - \mu)$ and $b = 1 / \lambda$ to convert
Lawless' notation to the notation used here.


\subsection{Fitting Gumbel distributions to observed data}

Given a set of $n$ observed samples $\mathbf{x}$, we may want to
estimate the $\mu$ and $\lambda$ parameters.

One might try to use linear regression to fit to a $\log \log$
transformation of the $P(S < x)$ histogram, which gives a straight
line with slope $-\lambda$ and $x$ intercept $\mu$:

\begin{equation}
\log \left[ -\log P(S<x) \right] = -\lambda x + \lambda \mu
\end{equation}
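
This transformation follows directly from
(\ref{eqn:gumbel_distribution}) by taking logs twice:

\begin{eqnarray*}
  \log P(S<x)                      & = & -e^{-\lambda(x - \mu)} \\
  \log \left[ -\log P(S<x) \right] & = & -\lambda (x - \mu) \;=\; -\lambda x + \lambda \mu
\end{eqnarray*}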

However, the linear regression method is undesirable because it is
sensitive to outliers. The following table shows the error in
estimating $\hat{\mu}$ and $\hat{\lambda}$, collated over 500
simulated complete datasets sampled from a Gumbel with $\mu = -20.0$
and $\lambda = 0.4$, for four different dataset sizes:

\begin{center}
\begin{tabular}{lrrrr} \hline
                              & \multicolumn{4}{c}{\# of samples}\\
                              & 100 & 1000  & 10,000 & 100,000 \\
\% error in $\hat{\mu}$       &  2\%&   1\% & 0.9\%  &  0.9\%  \\
max error in $\hat{\mu}$      & 24\%&  13\% &  10\%  &   10\%  \\
\% error in $\hat{\lambda}$   & 12\%&   7\% &   5\%  &    3\%  \\
max error in $\hat{\lambda}$  & 49\%&  33\% &  25\%  &   20\%  \\ \hline
\end{tabular}
\end{center}



A better rough estimate of $\hat{\mu}$ and $\hat{\lambda}$ can be
obtained from the sample mean $m$ and variance $s^2$ of the observed
data \citep{Evans00}:\footnote{All simulation data are generated by
the \eslmod{gumbel} module's stats driver. The only exception is the
linear regression fit data, which come from an old version of HMMER.}

\begin{eqnarray*}
  \hat{\lambda} & = & \frac{\pi}{\sqrt{6s^2}}\\
  \hat{\mu}     & = & m - \frac{0.57722}{\hat{\lambda}}
\end{eqnarray*}
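
A sketch of this moment-based estimate in C (a standalone
illustration, not Easel's code; 0.57722 is Euler's constant
$\gamma$):

\begin{verbatim}
#include <math.h>

/* Rough Gumbel parameter estimates from the sample mean and
 * (unbiased) sample variance, using the formulas above.       */
static void
gumbel_fit_by_moments(const double *x, int n,
                      double *ret_mu, double *ret_lambda)
{
  const double pi = 3.141592653589793;
  double mean = 0.0, var = 0.0;
  int    i;

  for (i = 0; i < n; i++) mean += x[i];
  mean /= n;
  for (i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
  var /= (n - 1);

  *ret_lambda = pi / sqrt(6.0 * var);
  *ret_mu     = mean - 0.57722 / (*ret_lambda);
}
\end{verbatim}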

The mean/variance method is more accurate than linear regression, as
shown by the following simulation results:

\begin{center}
\begin{tabular}{lrrrr} \hline
                              & \multicolumn{4}{c}{\# of samples}\\
                              & 100 & 1000  & 10,000 & 100,000 \\
\% error in $\hat{\mu}$       &  1\%& 0.3\% &  0.1\% & 0.03\%  \\
max error in $\hat{\mu}$      &  5\%&   1\% &  0.4\% &  0.1\%  \\
\% error in $\hat{\lambda}$   &  9\%&   3\% &  0.8\% &  0.3\%  \\
max error in $\hat{\lambda}$  & 40\%&  12\% &    3\% &  0.9\%  \\ \hline
\end{tabular}
\end{center}

Still, the mean/variance method is not as accurate as maximum
likelihood estimation (especially for $\lambda$). Also, it requires
complete data, whereas we also need to solve problems where we fit to
\emph{truncated} or \emph{censored} data. Easel's main estimation
methods are therefore maximum likelihood methods.

\subsubsection{Maximum likelihood estimation, complete data}

Given $n$ samples $x_1..x_n$ from some distribution that depends on
parameters $\theta$, we want maximum likelihood parameter estimates
$\hat{\theta}$ that maximize the log likelihood:

\[
   \hat{\theta} = \argmax_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta)
\]

These are also \emph{maximum a posteriori} parameter estimates, if we
assume a uniform prior $P(\theta)$.

Specifically, for samples $x_i$ drawn from an extreme value
distribution, the log likelihood to optimize is:

\begin{equation}
\log L(\lambda, \mu) = n \log \lambda - \sum_{i=1}^{n} \lambda(x_i -
\mu) - \sum_{i=1}^{n} e^{-\lambda(x_i - \mu)}
\label{eqn:gumbel_logL}
\end{equation}

This objective function is differentiable with respect to $\mu$ and
$\lambda$:

\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
n \lambda - \lambda \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)}
\label{eqn:mupartial}\\
\frac{\partial \log L}{\partial \lambda} & = &
\frac{n}{\lambda} - \sum_{i=1}^{n} (x_i - \mu) +
\sum_{i=1}^{n} (x_i - \mu) e^{-\lambda (x_i - \mu)}
\label{eqn:lambdapartial}
\end{eqnarray}

The maximum likelihood estimates $\hat{\lambda}$ and $\hat{\mu}$ are
the solutions to $\frac{\partial \log L}{\partial \mu} = 0$ and
$\frac{\partial \log L}{\partial \lambda} = 0$. Lawless
\citep{Lawless82} gives a useful trick here that lets us solve both of
these simultaneously. When (\ref{eqn:mupartial}) is set to zero, it
can be used to get $\hat{\mu}$ in terms of $\hat{\lambda}$:

\begin{eqnarray}
e^{-\hat{\lambda} \hat{\mu}} & = & \frac{1}{n} \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}
\label{eqn:substitute}\\
\hat{\mu} & = & - \frac{1}{\hat{\lambda}}
	\log \left[ \frac{1}{n} \sum_{i=1}^{n} e^{-\hat{\lambda} x_i} \right]
\label{eqn:solvemu}
\end{eqnarray}

Substituting (\ref{eqn:substitute}) into (\ref{eqn:lambdapartial})
gives us an equation for solving $\hat{\lambda}$ in terms of the
$x_i$'s:

\begin{eqnarray}
\frac{1}{\hat{\lambda}} - \frac{1}{n} \sum_{i=1}^{n} x_i +
\frac{\sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i}}
     {\sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
& = & 0
\label{eqn:newtontarget}
\end{eqnarray}

This is our target function, $f(\hat{\lambda})$. We could solve it
readily enough (by bisection search, for example) and obtain
$\hat{\lambda}$. We can solve it even faster using the Newton/Raphson
algorithm, because the target function is differentiable with respect
to $\hat{\lambda}$:

\begin{equation}
\frac{df}{d\hat{\lambda}} =
\frac{\left( \sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i} \right)^2 }
     {\left( \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}     \right)^2 }
-
\frac{\sum_{i=1}^{n} x_i^2 e^{-\hat{\lambda} x_i}}
     {\sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
-
\frac{1}{\hat{\lambda}^2}
\label{eqn:newtonderivative}
\end{equation}

Now, the key equations are (\ref{eqn:solvemu}),
(\ref{eqn:newtontarget}), and (\ref{eqn:newtonderivative}). In
summary, the inference procedure is the following (a code sketch
follows the list):

\begin{itemize}
\item Guess an initial $\hat{\lambda}$ (using the mean/variance
      method, for example, but any reasonable guess works).
\item Use Newton/Raphson iterations to find the $\hat{\lambda}$ that satisfies
      (\ref{eqn:newtontarget}):
      \begin{itemize}
      \item Calculate the target function $f$ and its first derivative
            $f'$ at $\hat{\lambda}$, using (\ref{eqn:newtontarget}) to
            calculate $f$ and (\ref{eqn:newtonderivative}) to calculate $f'$.
      \item If $f$ is within some absolute tolerance of zero
            (e.g., $10^{-6}$), stop; we have found $\hat{\lambda}$.
      \item Else, estimate a new $\hat{\lambda} = \hat{\lambda} - \frac{f}{f'}$,
            and do another iteration.
      \end{itemize}
\item Plug $\hat{\lambda}$ into (\ref{eqn:solvemu}) to get $\hat{\mu}$.
\end{itemize}
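
The following is a minimal C sketch of this procedure, transcribing
(\ref{eqn:newtontarget}), (\ref{eqn:newtonderivative}), and
(\ref{eqn:solvemu}) directly. It is an illustration only, not a
substitute for \ccode{esl\_gumbel\_FitComplete()}:

\begin{verbatim}
#include <math.h>

/* Evaluate the target function f (eqn newtontarget) and its
 * derivative df/dlambda (eqn newtonderivative) at a trial lambda. */
static void
target_fdf(const double *x, int n, double lambda,
           double *ret_f, double *ret_df)
{
  double xsum = 0., esum = 0., xesum = 0., xxesum = 0.;
  int    i;

  for (i = 0; i < n; i++) {
    double e = exp(-lambda * x[i]);
    xsum   += x[i];
    esum   += e;
    xesum  += x[i] * e;
    xxesum += x[i] * x[i] * e;
  }
  *ret_f  = 1./lambda - xsum/n + xesum/esum;
  *ret_df = (xesum*xesum)/(esum*esum) - xxesum/esum
            - 1./(lambda*lambda);
}

/* Newton/Raphson for lambda-hat, then mu-hat from eqn solvemu. */
static void
fit_gumbel_complete(const double *x, int n,
                    double *ret_mu, double *ret_lambda)
{
  double lambda = 0.2;            /* any reasonable initial guess */
  double f, df, esum = 0.;
  int    i, iter;

  for (iter = 0; iter < 100; iter++) {
    target_fdf(x, n, lambda, &f, &df);
    if (fabs(f) < 1e-6) break;    /* converged                    */
    lambda -= f / df;             /* Newton/Raphson step          */
  }
  for (i = 0; i < n; i++) esum += exp(-lambda * x[i]);
  *ret_mu     = -log(esum / n) / lambda;
  *ret_lambda = lambda;
}
\end{verbatim}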

This algorithm is implemented in \ccode{esl\_gumbel\_FitComplete()}.  An
auxiliary function, \ccode{lawless416()}, calculates the target
function and its derivative (equations (\ref{eqn:newtontarget}) and
(\ref{eqn:newtonderivative})) given the current estimate of
$\hat{\lambda}$.  The name comes from Lawless' equation 4.1.6, the
target function \citep{Lawless82}.

The accuracy of fitting to simulated data (generated with $\mu=-20$
and $\lambda=0.4$), collated over 500 simulations, is shown in the
following table:

\begin{center}
\begin{tabular}{lrrrr} \hline
                              & \multicolumn{4}{c}{\# of samples}\\
                              & 100 & 1000  & 10,000 & 100,000 \\
\% error in $\hat{\mu}$       &  1\%& 0.3\% &  0.1\% & 0.03\%  \\
max error in $\hat{\mu}$      &  4\%&   2\% &  0.5\% &  0.1\%  \\
\% error in $\hat{\lambda}$   &  6\%&   2\% &  0.6\% &  0.2\%  \\
max error in $\hat{\lambda}$  & 36\%&   9\% &    2\% &  0.8\%  \\ \hline
\end{tabular}
\end{center}

This is in accord with theoretical expectation. The distribution of
$\frac{\lambda}{\hat{\lambda}}$ is approximately normal with mean 1 and
standard error $\frac{0.78}{\sqrt{N}}$ \citep{Lawless82,Altschul01}.
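
For example, at the smallest sample size in the table the predicted
spread is

\[
   \frac{0.78}{\sqrt{100}} = 0.078 \approx 8\%,
\]

the same order as the observed 6\% average error in $\hat{\lambda}$.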

% Altschul says \frac{\hat{\lambda}}{\lambda}, actually, but I believe
% that's wrong. xref J1/46.


\subsubsection{Maximum likelihood fitting to censored data}

A \emph{censored} data problem is one in which we have $N$ samples, but
we only observe the values of a subset of $n$ samples $x_1..x_n$ that
are greater than or equal to some cutoff $\phi$. The remaining $z = N-n$
samples are \emph{censored}, and for these we only know that $x <
\phi$.  In a censored data problem, $x_1..x_n$, $n$, $\phi$, and $z$
are all known.

To estimate maximum likelihood parameters $\hat{\theta}$ for some
distribution from censored data \citep{Gelman95}, the log likelihood
to maximize is:

\[
  \hat{\theta} = \argmax_{\theta} z \log P(x<\phi \mid \theta)
                         + \sum_{i=1}^n \log P(x_i \mid \theta)
\]

Specifically, when fitting a Gumbel distribution, the log likelihood
to optimize is:

\begin{equation}
  \log L(\lambda, \mu) =
    n \log \lambda
     - z e^{-\lambda(\phi - \mu)}
     - \sum_{i=1}^{n} \lambda(x_i - \mu)
     - \sum_{i=1}^{n} e^{-\lambda(x_i - \mu)}
\label{eqn:censor_logL}
\end{equation}

To optimize this, we follow a procedure similar to the one used for
complete data \citep{Lawless82}. The log likelihood is differentiable
with respect to $\lambda$ and $\mu$:

\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
n \lambda
- z \lambda e^{-\lambda (\phi - \mu)}
- \lambda \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)}
\label{eqn:censor_dmu}
\\%
\frac{\partial \log L}{\partial \lambda} & = &
\frac{n}{\lambda}
+ z (\phi - \mu) e^{-\lambda (\phi - \mu)}
- \sum_{i=1}^{n} (x_i - \mu)
+ \sum_{i=1}^{n} (x_i - \mu) e^{-\lambda (x_i - \mu)}
\label{eqn:censor_dlambda}
\end{eqnarray}

Setting (\ref{eqn:censor_dmu}) to zero and solving for $\hat{\mu}$ in
terms of $\hat{\lambda}$ gives:

\begin{equation}
\hat{\mu}  =  - \frac{1}{\hat{\lambda}}
	\log \left[ \frac{1}{n}
	\left( z e^{-\hat{\lambda} \phi}
               + \sum_{i=1}^{n} e^{-\hat{\lambda} x_i} \right)
	\right]
\label{eqn:censor_solvemu}
\end{equation}

Substituting (\ref{eqn:censor_solvemu}) into
(\ref{eqn:censor_dlambda}) gives the target equation:

\begin{equation}
\frac{1}{\hat{\lambda}}
- \frac{1}{n} \sum_{i=1}^{n} x_i +
\frac{z \phi e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i}}
     {z e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
 =  0
\label{eqn:censor_newtontarget}
\end{equation}

To use Newton/Raphson root finding (instead of a slower bisection
search) we also need the first derivative of this target function with
respect to $\hat{\lambda}$:

\begin{equation}
\frac{df}{d\hat{\lambda}} =
\frac{\left(
        z \phi e^{-\hat{\lambda} \phi}
        + \sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i}
       \right)^2 }
     {\left(
        z e^{-\hat{\lambda} \phi}
        + \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}
       \right)^2 }
-
\frac{z \phi^2 e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} x_i^2 e^{-\hat{\lambda} x_i}}
     {z  e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
-
\frac{1}{\hat{\lambda}^2}
\label{eqn:censor_newtonderiv}
\end{equation}

In summary: given $n$ observed samples $x_1..x_n$ from a total sample
of $N$ samples, $z = N-n$ of which were censored because they have
values $< \phi$, we solve for maximum likelihood estimates
$\hat{\lambda}$ and $\hat{\mu}$ using the same procedure we used for
complete data, with equations (\ref{eqn:censor_solvemu}),
(\ref{eqn:censor_newtontarget}), and (\ref{eqn:censor_newtonderiv}) in
place of equations (\ref{eqn:solvemu}), (\ref{eqn:newtontarget}), and
(\ref{eqn:newtonderivative}). Easel implements this procedure in
\ccode{esl\_gumbel\_FitCensored()}.  The target function
(\ref{eqn:censor_newtontarget}) and its derivative
(\ref{eqn:censor_newtonderiv}) are implemented in the auxiliary
function \ccode{lawless422()} \citep{Lawless82}.
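
A minimal usage sketch: sample a Gumbel, censor it at $\phi$, and fit
the right tail. (This is an illustration only; the argument order
shown for \ccode{esl\_gumbel\_FitCensored()} is an assumption, so
check \ccode{esl\_gumbel.h} for the exact signature.)

\begin{verbatim}
#include <stdio.h>
#include "easel.h"
#include "esl_random.h"
#include "esl_gumbel.h"

int
main(void)
{
  ESL_RANDOMNESS *rng = esl_randomness_Create(42);
  double mu = -20.0, lambda = 0.4, phi = -20.0;
  double xv[10000];
  int    n = 0, z = 0, i;
  double est_mu, est_lambda;

  /* Sample 10,000 variates; keep those >= phi, count the rest. */
  for (i = 0; i < 10000; i++) {
    double x = esl_gumbel_Sample(rng, mu, lambda);
    if (x >= phi) xv[n++] = x;
    else          z++;
  }

  /* Censored fit: observed scores, n, z, phi -> mu-hat, lambda-hat. */
  esl_gumbel_FitCensored(xv, n, z, phi, &est_mu, &est_lambda);
  printf("mu-hat = %f  lambda-hat = %f\n", est_mu, est_lambda);

  esl_randomness_Destroy(rng);
  return 0;
}
\end{verbatim}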

Results on 500 simulated datasets with $\mu = -20, \lambda = 0.4$,
censored at $\phi = -20$ (the expected peak of the histogram); that
is, a censored fit only to the right tail, which contains about 63\%
of the samples:

\begin{center}
\begin{tabular}{lrrrr} \hline
                              & \multicolumn{4}{c}{\# samples in EVD histogram}\\
                              & 100 & 1000  & 10,000 & 100,000 \\
\% error in $\hat{\mu}$       &  1\%& 0.4\% &  0.1\% &  0.04\%  \\
max error in $\hat{\mu}$      &  5\%&   2\% &  0.5\% &  0.2\%  \\
\% error in $\hat{\lambda}$   &  9\%&   3\% &  0.9\% &  0.3\%  \\
max error in $\hat{\lambda}$  & 33\%&  11\% &    3\% &    1\%  \\ \hline
\end{tabular}
\end{center}

\subsubsection{Maximum likelihood fitting to truncated data}

In a \emph{truncated} dataset, we observe only $n$ samples $x_i$, and
an \emph{unknown} number $z$ of samples below some threshold $\phi$
went unobserved. Thus only the right tail of $n$ samples $x_i \geq
\phi$ is observed. In a truncated dataset, $x_1..x_n$, $n$, and
$\phi$ are known, but $z$ is unknown.

Solving a truncated data problem motivates a Bayesian approach,
because we need to integrate out (marginalize) the nuisance parameter
$z$, and to do this, we have to specify a prior distribution $P(z)$.
Gelman \emph{et al.} describe a general Bayesian framework for
thinking about various types of missing data problems, including
censored and truncated data \citep{Gelman95}.

In short, to obtain maximum likelihood parameters $\hat{\theta}$ for
some distribution, given truncated data, the log likelihood we wish to
maximize is:

\begin{equation}
  \hat{\theta} = \argmax_\theta -n \log P(x \geq \phi \mid \theta)
                   + \sum_{i=1}^n \log P(x_i \mid \theta).
\label{eqn:truncated_objective}
\end{equation}

\textbf{Detour: derivation of the truncated data likelihood}

The derivation of the above equation may not be immediately obvious.
The presence of the $n \log P(x \geq \phi \mid \theta)$ term may be
counterintuitive, compared to the more intuitive $z \log P(x < \phi \mid
\theta)$ term that accounts for the missing data in a censored data
problem. Gelman \emph{et al.} actually don't even show the equation in
their book; I obtained it from an exercise solution on their web site.
To convince you (and to remind me) of its correctness, a sketch of the
derivation follows.

We start with the same likelihood equation that arises in censored
data for a \emph{known} total number of samples $N$ (where $N=n+z$),
but since $N$ is unknown, we need to sum over all possible $N$
from $n$ to $\infty$:

\begin{eqnarray*}
   P(\mathbf{x} \mid \theta, \phi) & = &
    \sum_{N=n}^{\infty}   P(\mathbf{x} \mid \theta, \phi, N) P(N)\\
   & = &
    \prod_{i=1}^n P(x_i \mid \theta)
    \left[
      \sum_{N=n}^\infty {N \choose n} P(x < \phi \mid \theta)^{N-n} P(N)
    \right]\\
\end{eqnarray*}

The $\prod_{i=1}^n P(x_i \mid \theta)$ factor is straightforward; the
sum is our problem. The trick is to rearrange it so we can treat it as
a convergent negative binomial series:

\[
   (1-p)^{-a} = 1 + ap + \frac{a(a+1)}{2!} p^2 +
   \frac{a(a+1)(a+2)}{3!} p^3 + \ldots
\]

To get the sum into the form of this series, Gelman \emph{et al.}
suggest using an informative prior $P(N) = \frac{1}{N}$, an apparently
unmotivated choice that happens to make the sum collapse nicely:

\begin{eqnarray*}
 & = & P(N=n)
    + (n+1) P(x < \phi \mid \theta) P(N=n+1)
    + \frac{(n+1)(n+2)}{2!} P(x < \phi \mid \theta)^2 P(N=n+2) + \ldots\\
 & = & \frac{1}{n} \left[
      1
      + n P(x < \phi \mid \theta)
      + \frac{n(n+1)}{2!} P(x < \phi \mid \theta)^2
      + \frac{n(n+1)(n+2)}{3!} P(x < \phi \mid \theta)^3 + \ldots \right]\\
 & = & \frac{1}{n} (1 - P(x < \phi \mid \theta))^{-n}\\
 & = & \frac{1}{n} P(x \geq \phi \mid \theta)^{-n}\\
\end{eqnarray*}

The $\frac{1}{n}$ is a constant, so we drop it from the likelihood
equation we'll maximize. Putting this term back together with the
probability of the observed data and taking the log, we obtain the log
likelihood in equation (\ref{eqn:truncated_objective}).
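
As a quick numerical sanity check on this collapse (not part of
Easel), the partial sums of $\sum_{N=n}^{\infty} {N \choose n} P(x <
\phi \mid \theta)^{N-n} \frac{1}{N}$ can be compared to the closed
form $\frac{1}{n} P(x \geq \phi \mid \theta)^{-n}$ for an arbitrary
choice of $n$ and $p = P(x < \phi \mid \theta)$:

\begin{verbatim}
#include <stdio.h>
#include <math.h>

int
main(void)
{
  int    n    = 5;          /* arbitrary example values          */
  double p    = 0.4;        /* p = P(x < phi | theta)            */
  double term = 1.0 / n;    /* N = n term: C(n,n) p^0 / n        */
  double sum  = term;
  int    k;

  /* successive terms: t_{k+1} = t_k * p * (n+k)/(k+1) */
  for (k = 0; k < 10000; k++) {
    term *= p * (n + k) / (k + 1.0);
    sum  += term;
  }
  printf("series = %f   1/n (1-p)^-n = %f\n",
         sum, pow(1.0 - p, -(double) n) / n);
  return 0;
}
\end{verbatim}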

Alternatively, we might choose an uninformative, improper uniform prior
$P(N) \propto 1$. This gives a log likelihood that differs only in
having $n+1$ in place of $n$:

\begin{equation}
  \hat{\theta} = \argmax_\theta -(n+1) \log P(x \geq \phi \mid \theta)
                   + \sum_{i=1}^n \log P(x_i \mid \theta).
\end{equation}

However, empirically, this form is ineffective, at least for fitting
Gumbels. The $\frac{1}{N}$ prior performs much better, probably
because it constrains the solutions to favor smaller, finite, more
reasonable choices of $N$.


\textbf{Back to fitting a truncated Gumbel}

For the specific case of fitting a truncated Gumbel, the log
likelihood (\ref{eqn:truncated_objective}) to optimize is:

\begin{equation}
  \log L(\lambda, \mu) =
     n \log \lambda
     - \sum_{i=1}^{n} \lambda(x_i - \mu)
     - \sum_{i=1}^{n} e^{-\lambda(x_i - \mu)}
     - n \log (1 - \exp(-e^{-\lambda(\phi - \mu)}))
\label{eqn:truncated_logL}
\end{equation}

This is differentiable with respect to $\lambda$ and $\mu$, but it's
not going to reduce to the clean root-finding problem that we obtained
for complete data or censored data. Instead we're left with a
numerical optimization problem. We can use standard numerical
optimization code, such as steepest descent or conjugate gradient
descent. There's just one hitch: these algorithms assume unconstrained
parameters, but $\lambda$ is constrained to values $>0$. We do a
change of variables, using the transformation $\lambda = e^w$, so we
can optimize the unconstrained parameter $w = \log \lambda$ instead of
optimizing $\lambda$ directly. Since $\frac{\partial \log L}{\partial w}
= \lambda \frac{\partial \log L}{\partial \lambda}$, the necessary
partial derivatives are:

\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
n \lambda
- \lambda \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)}
- \frac{n \lambda \exp \left[ -\lambda (\phi - \mu) - e^{- \lambda (\phi - \mu)} \right]}
       {1 - \exp(-e^{-\lambda(\phi - \mu)})}
\label{eqn:truncated_dmu}
\\%
\frac{\partial \log L}{\partial w} & = &
n
- \sum_{i=1}^{n} \lambda(x_i - \mu)
+ \sum_{i=1}^{n} \lambda(x_i - \mu) e^{-\lambda (x_i - \mu)}
+ \frac{n\lambda (\phi-\mu) \exp \left[ -\lambda (\phi - \mu) - e^{- \lambda (\phi - \mu)} \right]}
       {1 - \exp(-e^{-\lambda(\phi - \mu)})}
\label{eqn:truncated_dw}
\end{eqnarray}

This optimization is carried out by \ccode{esl\_gumbel\_FitTruncated()}.
The likelihood (\ref{eqn:truncated_logL}) is implemented in
\ccode{tevd\_func()}, and the derivatives (\ref{eqn:truncated_dmu}) and
(\ref{eqn:truncated_dw}) are implemented in \ccode{tevd\_dfunc()}.
\ccode{esl\_gumbel\_FitTruncated()} simply sets up the problem and passes
it all off to a conjugate gradient descent optimizer.
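
As with the censored case, a minimal usage sketch: sample a Gumbel,
discard everything below $\phi$ without counting it, and fit the
remaining tail. (Again an illustration only; the argument order shown
for \ccode{esl\_gumbel\_FitTruncated()} is an assumption, so check
\ccode{esl\_gumbel.h}.)

\begin{verbatim}
#include <stdio.h>
#include "easel.h"
#include "esl_random.h"
#include "esl_gumbel.h"

int
main(void)
{
  ESL_RANDOMNESS *rng = esl_randomness_Create(42);
  double mu = -20.0, lambda = 0.4, phi = -20.0;
  double xv[10000];
  int    n = 0, i;
  double est_mu, est_lambda;

  /* Keep only samples >= phi; unlike the censored case, the
   * number of discarded samples is not recorded.              */
  for (i = 0; i < 10000; i++) {
    double x = esl_gumbel_Sample(rng, mu, lambda);
    if (x >= phi) xv[n++] = x;
  }

  esl_gumbel_FitTruncated(xv, n, phi, &est_mu, &est_lambda);
  printf("mu-hat = %f  lambda-hat = %f\n", est_mu, est_lambda);

  esl_randomness_Destroy(rng);
  return 0;
}
\end{verbatim}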

Results on 500 simulated datasets with $\mu = -20, \lambda = 0.4$,
truncated at $\phi = -20$ (leaving the right tail, containing about
63\% of the samples):

\begin{center}
\begin{tabular}{lrrrr} \hline
                              & \multicolumn{4}{c}{\# of samples}\\
                              & 100 & 1000  & 10,000 & 100,000 \\
\% error in $\hat{\mu}$       & 13\%&   2\% &  0.8\% &  0.3\%  \\
max error in $\hat{\mu}$      &260\%&  42\% &    3\% &    1\%  \\
\% error in $\hat{\lambda}$   & 15\%&   5\% &    2\% &  0.6\%  \\
max error in $\hat{\lambda}$  & 68\%&  18\% &    6\% &    2\%  \\ \hline
\end{tabular}
\end{center}

Fitting truncated Gumbel distributions is difficult, requiring much
more data than fitting complete or censored data. The problem is that
the right tail becomes a scale-free exponential when $\phi \gg \mu$,
and $\mu$ becomes undetermined. Fits become very inaccurate as $\phi$
gets larger than $\mu$, and for sufficiently large $\phi$, the
numerical optimizer will completely fail.