The generalized extreme value distribution (GEV) includes all three
types of extreme value distributions: Type I (Gumbel), Type II
(Fr\'{e}chet), and Type III (Weibull). Empirically, the scores of some
sequence alignment algorithms appear to follow GEV distributions. The
\eslmod{gev} module is used in estimating the statistical significance
of such scores.

Most local sequence alignment scores follow the Gumbel distribution,
and Easel's \eslmod{gumbel} module applies specifically to the Gumbel.
The \eslmod{gev} module is used for Type II or Type III extreme value
distributions, or for determining which of the three types of
distribution a dataset best fits.

\subsection{The gev API}

The \eslmod{gev} API consists of the following functions:

\vspace{0.5em}
\begin{center}
\begin{tabular}{ll}\hline
 \multicolumn{2}{c}{\textbf{evaluating densities and distributions:}}\\
\ccode{esl\_gev\_pdf()}     & Returns the probability density, $P(S=x)$.\\
\ccode{esl\_gev\_logpdf()}  & Returns the log of the pdf, $\log P(S=x)$.\\
\ccode{esl\_gev\_cdf()}     & Returns the cumulative probability distribution, $P(S \leq x)$.\\
\ccode{esl\_gev\_logcdf()}  & Returns the log of the cdf, $\log P(S \leq x)$.\\
\ccode{esl\_gev\_surv()}    & Returns the right tail mass, 1-cdf, $P(S > x)$.\\
\ccode{esl\_gev\_logsurv()} & Returns the log of 1-cdf, $\log P(S > x)$.\\
 \multicolumn{2}{c}{\textbf{sampling:}}\\
\ccode{esl\_gev\_Sample()}  & Returns a GEV-distributed random sample.\\
 \multicolumn{2}{c}{\textbf{maximum likelihood parameter fitting:}}\\
\ccode{esl\_gev\_FitComplete()} & Estimates GEV parameters from complete data.\\
\end{tabular}
\end{center}
\vspace{0.5em}

The GEV distribution depends on three parameters, $\mu$, $\lambda$,
and $\alpha$. When these parameters are known, the statistical
significance (P-value) of a single score $x$ is $P(S>x)$, obtained by
a call to \ccode{esl\_gev\_surv()}.
The E-value for obtaining that score or better in searching a database
of $N$ sequences is just $NP(S>x)$.

When the parameters are unknown, they can be estimated from scores
obtained from comparisons of simulated random data. The
\ccode{esl\_gev\_FitComplete()} function performs maximum likelihood
parameter fitting \citep{Coles01}.

\subsection{Example of using the gev API}

Below is a code example that samples 10,000 data points from a
Fr\'{e}chet distribution with $\mu=-20$, $\lambda=0.4$, $\alpha=0.1$;
reports the minimum and maximum samples, and the probability mass to
the left of the minimum and to the right of the maximum (both of which
should be about $\frac{1}{10000}$, since we took 10,000 samples); and
then fits those simulated data to a Gumbel and reports the fitted
$\mu$ and $\lambda$:

\input{cexcerpts/gev_example}

\subsection{GEV densities}

The probability density function (pdf) and the cumulative distribution
function (cdf) of the generalized extreme value distribution are
\citep{Coles01}:

\begin{eqnarray}
P(X=x) & = & \lambda \left[ 1 + \alpha \lambda (x - \mu) \right]^{-\frac{\alpha+1}{\alpha}}
             \exp \left\{ - \left[ 1 + \alpha \lambda (x - \mu)
             \right]^{-\frac{1}{\alpha}} \right\}
\label{eqn:gev_density}
\\%
P(X \leq x) & = & \exp \left\{ - \left[ 1 +
   \alpha\lambda(x-\mu) \right]^{-\frac{1}{\alpha}} \right\}
\label{eqn:gev_distribution}
\end{eqnarray}

The parameters $\mu$, $\lambda$, and $\alpha$ are location, scale, and
shape parameters, respectively, with $-\infty < \mu < \infty$, $0 <
\lambda < \infty$, and $-\infty < \alpha < \infty$.

The Type II (Fr\'{e}chet) distribution corresponds to $\alpha > 0$,
and the Type III (Weibull) distribution corresponds to $\alpha < 0$.
The Type I (Gumbel) distribution arises in the limit $\alpha
\rightarrow 0$.
At values $\alpha \simeq 0$, Easel's GEV functions revert to the
Gumbel limit case, rather than dividing by zero and failing.

Technically, the GEV is only defined for values of $x$ such that $1 +
\alpha \lambda (x - \mu) > 0$. However, Easel's functions return
sensible values outside this domain, such as 0 for nonexistent
densities.

Generalized extreme value densities for $\mu = 0$ and $\lambda = 1$
are shown below (left) for three settings of $\alpha$: $\alpha = 0$
(Gumbel), $\alpha = 0.1$ (Fr\'{e}chet), and $\alpha = -0.1$
(Weibull). The figure on the right shows the log densities, which more
clearly show how, relative to the exponential right tail of the
Gumbel, the Fr\'{e}chet's tail is longer and the Weibull's tail is
shorter.

\centerline{
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/gev_density}
\end{minipage}
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/gev_logdensity}
\end{minipage}
}

For more details, see the excellent description in \citep{Coles01}.
Easel's $\{ \mu, \lambda, \alpha \}$ notation differs from the $\{
\mu, \sigma, \xi \}$ parameterization used by Coles. Use $\lambda =
\frac{1}{\sigma}$ and $\alpha = \xi$ to translate.

\subsection{Fitting GEV distributions to observed data}

Easel fits GEVs by maximum likelihood estimation, numerically
optimizing the log likelihood function using first derivative
information and conjugate gradient descent. See the \eslmod{gumbel}
chapter for a more general introduction to maximum likelihood fitting.

\subsubsection{Maximum likelihood estimation, complete data}

The function \ccode{esl\_gev\_FitComplete()} uses gradient information
to find parameters that optimize the likelihood function, using the
conjugate gradient descent code in the \eslmod{minimizer} module.
Given $n$ samples $x_1..x_n$, we seek maximum likelihood parameter
estimates $\{ \hat{\mu}, \hat{\lambda}, \hat{\alpha} \}$ that maximize
the log likelihood:

\begin{equation}
\log L(\lambda, \mu, \alpha) =  n \log \lambda
     - \frac{\alpha+1}{\alpha}
        \sum_{i=1}^{n} \log\left[1+ \alpha\lambda(x_i - \mu) \right]
     - \sum_{i=1}^{n} \left[ 1 + \alpha\lambda (x_i - \mu) \right]^{-\frac{1}{\alpha}}
\label{eqn:gev_logL}
\end{equation}

The $\left[ 1 + \alpha\lambda (x_i - \mu) \right]^{-\frac{1}{\alpha}}$
term can be rewritten in a more conveniently differentiable form as
$\exp \left\{ -\frac{1}{\alpha} \log \left[ 1 + \alpha\lambda (x_i - \mu)
\right] \right\}$.

Since the $\lambda$ parameter is constrained to $\lambda > 0$ but the
numerical optimizer expects unconstrained parameters, we use a change
of variables $\lambda = e^w$ and optimize an unconstrained value $w$.

The gradient of the log likelihood with respect to $\mu$, $w$, and
$\alpha$ is:

%% xref: STL9/118-120
\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
   \sum_{i=1}^n \frac{\lambda (\alpha+1)}{1+\alpha\lambda(x_i-\mu)}
  -\sum_{i=1}^n \lambda \exp
     \left\{ -\frac{\alpha+1}{\alpha} \log
       \left[1+\alpha\lambda(x_i-\mu)\right] \right\}
\label{eqn:gev_mupartial}
\\%
\frac{\partial \log L}{\partial w} & = &
  n - \sum_{i=1}^{n} \frac{\lambda (\alpha+1) (x_i - \mu)}
                          {1 + \alpha \lambda (x_i - \mu)}
    + \sum_{i=1}^n \lambda (x_i - \mu)
      \exp \left\{ -\frac{\alpha+1}{\alpha} \log
        \left[1+\alpha\lambda(x_i-\mu)\right] \right\}
\label{eqn:gev_wpartial}
\\%
\frac{\partial \log L}{\partial \alpha} & = &
  \sum_{i=1}^n \left\{
  \begin{array}{l}
  - \frac{\alpha+1}{\alpha} \frac{\lambda(x_i-\mu)}
                                 {1 +\alpha\lambda(x_i-\mu)}\\
  + \frac{1}{\alpha^2} \log \left[ 1 + \alpha\lambda(x_i - \mu) \right]\\
  + \frac{1}{\alpha} \frac{\lambda(x_i-\mu)}
                          {1 +\alpha\lambda(x_i-\mu)}
     e^{-\frac{1}{\alpha} \log\left[ 1 + \alpha\lambda(x_i - \mu) \right]}\\
  - \frac{1}{\alpha^2} \log \left[ 1 + \alpha\lambda(x_i - \mu) \right]
     e^{-\frac{1}{\alpha} \log\left[ 1 + \alpha\lambda(x_i - \mu)
     \right]}
  \end{array}
  \right.
\label{eqn:gev_alphapartial}
\end{eqnarray}

When $|\alpha\lambda(x_i - \mu)|$ approaches $0$, the GEV approximates
a Gumbel distribution, and these equations can be simplified using the
approximation $\log(1+a) \simeq a$.