The generalized extreme value distribution (GEV) includes all three
types of extreme value distributions: Type I (Gumbel), Type II
(Fr\'{e}chet), and Type III (Weibull). Empirically, the scores of some
sequence alignment algorithms appear to follow GEV distributions. The
\eslmod{gev} module is used in estimating the statistical significance
of such scores.

Most local sequence alignment scores follow the Gumbel distribution.
Easel's \eslmod{gumbel} module applies specifically to the Gumbel. The
\eslmod{gev} module is used for Type II or Type III extreme value
distributions, or for determining which of the three types of
distribution a dataset best fits.

\subsection{The gev API}

The \eslmod{gev} API consists of the following functions:

\vspace{0.5em}
\begin{center}
\begin{tabular}{ll}\hline
    \multicolumn{2}{c}{\textbf{evaluating densities and distributions:}}\\
\ccode{esl\_gev\_pdf()}     & Returns the probability density, $P(S=x)$.\\
\ccode{esl\_gev\_logpdf()}  & Returns the log of the pdf, $\log P(S=x)$.\\
\ccode{esl\_gev\_cdf()}     & Returns the cumulative probability distribution, $P(S \leq x)$.\\
\ccode{esl\_gev\_logcdf()}  & Returns the log of the cdf, $\log P(S \leq x)$.\\
\ccode{esl\_gev\_surv()}    & Returns the right tail mass, 1-cdf, $P(S > x)$.\\
\ccode{esl\_gev\_logsurv()} & Returns the log of 1-cdf, $\log P(S > x)$.\\
    \multicolumn{2}{c}{\textbf{sampling:}}\\
\ccode{esl\_gev\_Sample()}  & Returns a GEV-distributed random sample.\\
    \multicolumn{2}{c}{\textbf{maximum likelihood parameter fitting:}}\\
\ccode{esl\_gev\_FitComplete()} & Estimates GEV parameters from complete data.\\
\end{tabular}
\end{center}
\vspace{0.5em}

The GEV distribution depends on three parameters, $\mu$, $\lambda$,
and $\alpha$. When these parameters are known, the statistical
significance (P-value) of a single score $x$ is $P(S>x)$, obtained by
a call to \ccode{esl\_gev\_surv()}.  The E-value for obtaining that
score or better in searching a database of $N$ sequences is just
$NP(S>x)$.
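
For example, here is a minimal sketch of a P-value/E-value calculation.
The parameter and score values are arbitrary placeholders, and the
argument order is assumed to follow Easel's usual convention of the
query value first, then the distribution parameters; check
\ccode{esl\_gev.h} for the exact signature.

\begin{verbatim}
#include <stdio.h>

#include "easel.h"
#include "esl_gev.h"

int
main(void)
{
  /* Hypothetical GEV parameters, e.g. previously fitted to scores
   * of simulated random comparisons.
   */
  double mu     = -20.0;
  double lambda = 0.4;
  double alpha  = 0.1;
  double x      = 15.0;       /* an observed score (arbitrary)            */
  double N      = 1000000.;   /* number of sequences searched (arbitrary) */
  double pval, eval;

  pval = esl_gev_surv(x, mu, lambda, alpha);   /* P-value: P(S > x)   */
  eval = N * pval;                             /* E-value: N P(S > x) */

  printf("P-value = %g\nE-value = %g\n", pval, eval);
  return 0;
}
\end{verbatim}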

When the parameters are unknown, they can be estimated from scores
obtained from comparisons of simulated random data. The
\ccode{esl\_gev\_FitComplete()} function performs maximum likelihood
parameter fitting \citep{Coles01}.

\subsection{Example of using the gev API}

Below is a code example that samples 10,000 data points from a
Fr\'{e}chet distribution with $\mu=-20$, $\lambda=0.4$, $\alpha=0.1$;
reports the minimum and maximum samples, and the probability mass to
the left of the minimum and to the right of the maximum (both of which
should be about $\frac{1}{10000}$, since we took 10,000 samples); and
then fits a GEV to those simulated data and reports the fitted
$\mu$, $\lambda$, and $\alpha$:

\input{cexcerpts/gev_example}

\subsection{GEV densities}

The probability density function (pdf) and the cumulative distribution
function (cdf) of the generalized extreme value distribution are
\citep{Coles01}:

\begin{eqnarray}
P(X=x) & = & \lambda \left[ 1 + \alpha \lambda (x - \mu) \right]^{-\frac{\alpha+1}{\alpha}}
       \exp \left\{ - \left[ 1 + \alpha \lambda (x - \mu)
       \right]^{-\frac{1}{\alpha}} \right\}
\label{eqn:gev_density}
\\%
P(X \leq x) & = & \exp \left\{ - \left[ 1 +
     \alpha\lambda(x-\mu) \right]^{-\frac{1}{\alpha}} \right\}
\label{eqn:gev_distribution}
\end{eqnarray}

The parameters $\mu$, $\lambda$, and $\alpha$ are location, scale, and
shape parameters, respectively, with $-\infty < \mu < \infty$, $0 <
\lambda < \infty$, and $-\infty < \alpha < \infty$.

The Type II (Fr\'{e}chet) distribution corresponds to $\alpha > 0$,
and the Type III (Weibull) distribution corresponds to $\alpha < 0$.
The Type I (Gumbel) distribution arises in the limit $\alpha
\rightarrow 0$. At values of $\alpha \simeq 0$, Easel's GEV functions
use the Gumbel limit case rather than dividing by zero and failing.
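
Concretely, as $\alpha \rightarrow 0$, $\left[ 1 + \alpha\lambda(x-\mu)
\right]^{-\frac{1}{\alpha}} \rightarrow e^{-\lambda(x-\mu)}$, so the
GEV distribution function reduces to the familiar Gumbel form:

\[
P(X \leq x) \rightarrow \exp \left\{ - e^{-\lambda(x-\mu)} \right\}.
\]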

Technically the GEV is only defined for values of $x$ such that $1 +
\alpha \lambda (x - \mu) > 0$. However, Easel's functions return
sensible values outside this domain, such as 0 for nonexistent
densities.
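
Explicitly, since $\lambda > 0$, the condition $1 + \alpha\lambda(x-\mu) > 0$
bounds the support as follows:

\[
\begin{array}{ll}
x > \mu - \frac{1}{\alpha\lambda} & \mbox{(Fr\'{e}chet, $\alpha > 0$)} \\
x < \mu - \frac{1}{\alpha\lambda} & \mbox{(Weibull, $\alpha < 0$)} \\
\mbox{all real } x                & \mbox{(Gumbel limit, $\alpha \rightarrow 0$)} \\
\end{array}
\]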

Generalized extreme value densities for $\mu = 0$ and $\lambda = 1$
are shown below (left) for three settings of $\alpha$: $\alpha = 0$
(Gumbel), $\alpha = 0.1$ (Fr\'{e}chet), and $\alpha = -0.1$
(Weibull). The figure on the right shows the log densities, which more
clearly show how, relative to the exponential right tail of the
Gumbel, the Fr\'{e}chet's tail is longer, and the Weibull's tail is
shorter.

\centerline{
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/gev_density}
\end{minipage}
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/gev_logdensity}
\end{minipage}
}

For more details, see the excellent description in \citep{Coles01}.
Easel's $\{ \mu, \lambda, \alpha \}$ notation differs from the $\{
\mu, \sigma, \xi \}$ parameterization used by Coles. Use $\lambda =
\frac{1}{\sigma}$ and $\alpha = \xi$ to translate.
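
In Coles' notation, the GEV distribution function is written

\[
P(X \leq x) = \exp \left\{ - \left[ 1 + \xi \left( \frac{x - \mu}{\sigma}
   \right) \right]^{-\frac{1}{\xi}} \right\},
\]

which matches the form above after substituting $\sigma = \frac{1}{\lambda}$
and $\xi = \alpha$.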

\subsection{Fitting GEV distributions to observed data}

Easel fits GEVs by maximum likelihood estimation, numerically
optimizing the log likelihood function using first derivative
information and conjugate gradient descent. See the \eslmod{gumbel}
chapter for a more general introduction to maximum likelihood fitting.

\subsubsection{Maximum likelihood estimation, complete data}

The function \ccode{esl\_gev\_FitComplete()} uses gradient information
to find parameters that optimize the likelihood function, using the
conjugate gradient descent code in the \eslmod{minimizer} module.
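
A minimal sketch of the calling pattern follows. The \ccode{scores}
array and its values are hypothetical stand-ins for scores collected
from simulated random comparisons, and the exact argument order of
\ccode{esl\_gev\_FitComplete()} should be checked against
\ccode{esl\_gev.h}.

\begin{verbatim}
#include <stdio.h>

#include "easel.h"
#include "esl_gev.h"

int
main(void)
{
  /* Stand-in data; a real application would collect thousands of
   * scores from comparisons of simulated random sequences.
   */
  double scores[] = { -18.7, -15.2, -19.9, -12.4, -17.0,
                      -16.1, -20.3, -14.8, -19.1, -13.5 };
  int    n        = sizeof(scores) / sizeof(scores[0]);
  double mu, lambda, alpha;

  /* ML fit of all three GEV parameters to complete (uncensored) data */
  if (esl_gev_FitComplete(scores, n, &mu, &lambda, &alpha) != eslOK)
    esl_fatal("GEV fit failed");

  printf("mu = %g  lambda = %g  alpha = %g\n", mu, lambda, alpha);
  return 0;
}
\end{verbatim}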

Given $n$ samples $x_1..x_n$, we want to find maximum likelihood
parameter estimates $\{ \hat{\mu}, \hat{\lambda}, \hat{\alpha} \}$
that maximize the log likelihood:

\begin{equation}
\log L(\lambda, \mu, \alpha) = n \log \lambda
       - \frac{\alpha+1}{\alpha}
           \sum_{i=1}^{n} \log\left[1+ \alpha\lambda(x_i - \mu) \right]
       - \sum_{i=1}^{n} \left[ 1 + \alpha\lambda (x_i - \mu) \right]^{-\frac{1}{\alpha}}
\label{eqn:gev_logL}
\end{equation}

The $\left[ 1 + \alpha\lambda (x_i - \mu) \right]^{-\frac{1}{\alpha}}$
term can be rewritten in a more conveniently differentiable form as
$\exp \left\{ -\frac{1}{\alpha} \log \left[ 1 + \alpha\lambda (x_i - \mu)
\right] \right\}$.

Since the $\lambda$ parameter is constrained to $\lambda > 0$ but the
numerical optimizer expects unconstrained parameters, we use a change
of variables $\lambda = e^w$ and optimize an unconstrained value $w$.
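
Because $\lambda = e^w$, $\frac{d\lambda}{dw} = e^w = \lambda$, so the
derivative with respect to $w$ is obtained from the derivative with
respect to $\lambda$ by the chain rule:

\[
\frac{\partial \log L}{\partial w} =
  \lambda \, \frac{\partial \log L}{\partial \lambda}.
\]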

The gradient of the log likelihood with respect to $\mu$, $w$, and
$\alpha$ is:

%% xref: STL9/118-120
\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
  \sum_{i=1}^n \frac{\lambda (\alpha+1)}{1+\alpha\lambda(x_i-\mu)}
 -\sum_{i=1}^n \lambda \exp
    \left\{ -\frac{\alpha+1}{\alpha} \log
          \left[1+\alpha\lambda(x_i-\mu)\right] \right\}
\label{eqn:gev_mupartial}
\\%
\frac{\partial \log L}{\partial w} & = &
  n - \sum_{i=1}^{n} \frac{\lambda (\alpha+1) (x_i - \mu)}
                          {1 + \alpha \lambda (x_i - \mu)}
  + \sum_{i=1}^n \lambda (x_i - \mu)
         \exp \left\{ -\frac{\alpha+1}{\alpha} \log
          \left[1+\alpha\lambda(x_i-\mu)\right] \right\}
\label{eqn:gev_wpartial}
\\%
\frac{\partial \log L}{\partial \alpha} & = &
   \sum_{i=1}^n \left\{
      \begin{array}{l}
      - \frac{\alpha+1}{\alpha} \frac{\lambda(x_i-\mu)}
                                  {1 +\alpha\lambda(x_i-\mu)}\\
      + \frac{1}{\alpha^2} \log \left[ 1 + \alpha\lambda(x_i - \mu) \right]\\
      + \frac{1}{\alpha} \frac{\lambda(x_i-\mu)}
                          {1 +\alpha\lambda(x_i-\mu)}
      e^{-\frac{1}{\alpha} \log\left[ 1 + \alpha\lambda(x_i - \mu) \right]}\\
     -  \frac{1}{\alpha^2} \log \left[ 1 + \alpha\lambda(x_i - \mu) \right]
      e^{-\frac{1}{\alpha} \log\left[ 1 + \alpha\lambda(x_i - \mu)
         \right]}
     \end{array}
     \right.
\label{eqn:gev_alphapartial}
\end{eqnarray}

When $|\alpha\lambda(x_i - \mu)|$ approaches $0$, the GEV approximates
a Gumbel distribution and these equations can be simplified using the
approximation $\log(1+a) \simeq a$.
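
For example, applying that approximation to equation~\ref{eqn:gev_logL}
and letting $\alpha \rightarrow 0$ recovers the Gumbel log likelihood:

\[
\log L \simeq n \log \lambda
   - \lambda \sum_{i=1}^{n} (x_i - \mu)
   - \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)} .
\]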