1 2The \eslmod{distance} module implements routines for inferring 3mutational distances between pairs of aligned sequences, and for 4constructing distance matrices from multiple sequence alignments. 5 6The API for the \eslmod{distance} module is summarized in 7Table~\ref{tbl:distance_api}. 8 9\begin{table}[hbp] 10\begin{center} 11{\small 12\begin{tabular}{|ll|}\hline 13 \apisubhead{Pairwise distances for aligned text sequences}\\ 14\hyperlink{func:esl_dst_CPairId()}{\ccode{esl\_dst\_CPairId()}} & Pairwise identity of two aligned text strings.\\ 15\hyperlink{func:esl_dst_CJukesCantor()}{\ccode{esl\_dst\_CJukesCantor()}} & Jukes-Cantor distance for two aligned strings.\\ 16 \apisubhead{Pairwise distances for aligned digital seqs}\\ 17\hyperlink{func:esl_dst_XPairId()}{\ccode{esl\_dst\_XPairId()}} & Pairwise identity of two aligned digital seqs.\\ 18\hyperlink{func:esl_dst_XJukesCantor()}{\ccode{esl\_dst\_XJukesCantor()}} & Jukes-Cantor distance for two aligned digitized seqs.\\ 19 \apisubhead{Distance matrices for aligned text sequences}\\ 20\hyperlink{func:esl_dst_CPairIdMx()}{\ccode{esl\_dst\_CPairIdMx()}} & NxN identity matrix for N aligned text sequences. \\ 21\hyperlink{func:esl_dst_CDiffMx()}{\ccode{esl\_dst\_CDiffMx()}} & NxN difference matrix for N aligned text sequences.\\ 22\hyperlink{func:esl_dst_CJukesCantorMx()}{\ccode{esl\_dst\_CJukesCantorMx()}} & NxN Jukes/Cantor distance matrix for N aligned text seqs.\\ 23 \apisubhead{Distance matrices for aligned digital sequences}\\ 24\hyperlink{func:esl_dst_XPairIdMx()}{\ccode{esl\_dst\_XPairIdMx()}} & NxN identity matrix for N aligned digital seqs.\\ 25\hyperlink{func:esl_dst_XDiffMx()}{\ccode{esl\_dst\_XDiffMx()}} & NxN difference matrix for N aligned digital seqs.\\ 26\hyperlink{func:esl_dst_XJukesCantorMx()}{\ccode{esl\_dst\_XJukesCantorMx()}} & NxN Jukes/Cantor distance matrix for N aligned digital seqs.\\ 27\hline 28\end{tabular} 29} 30\end{center} 31\caption{The \eslmod{distance} API.} 32\label{tbl:distance_api} 33\end{table} 34 35 36\subsection{Example of using the distance API} 37 38The example code below opens a multiple sequence alignment file and 39reads an alignment from it, then uses one of the routines from the 40\eslmod{distance} module to calculate a fractional identity matrix 41from it. The example then finds the average, minimum, and maximum of 42the values in the identity matrix. 43 44\input{cexcerpts/distance_example} 45 46\subsection{Definition of pairwise identity and pairwise difference} 47 48Given a pairwise sequence alignment of length $L$, between two 49sequences of $n_1$ and $n_2$ residues ($n_1 \leq L$, $n_2 \leq L$), 50where the $L$ aligned symbol pairs are classified and counted as 51$c_{\mbox{ident}}$ identities, $c_{\mbox{mismat}}$ mismatches, and 52$c_{\mbox{indel}}$ pairs that have a gap symbol in either or both 53sequences ($c_{\mbox{ident}} + c_{\mbox{mismat}} + c_{\mbox{indel}} = 54L$), \esldef{pairwise sequence identity} is defined as: 55 56\[ 57 \mbox{pid} = \frac{c_{\mbox{ident}}}{\mbox{MIN}(n_1, n_2)}, 58\] 59 60and \esldef{pairwise sequence difference} is defined as 61\[ 62 \mbox{diff} = 1 - \mbox{pid} = \frac{\mbox{MIN}(n_1,n_2) - c_{\mbox{ident}}}{\mbox{MIN}(n_1, n_2)}. 63\] 64 65Both pid and diff range from 0 to 1. 66 67In the unusual case where $\mbox{MIN}(n_1,n_2)=0$ -- that is, one of 68the aligned sequences consists entirely of gaps -- the percent 69identity $0/0$ is defined as 0. The calculation is robust against 70length 0 sequences, which do arise in real applications. (Not just in 71bad input, either. For example, this arises when dealing with subsets 72of the columns of a multiple alignment.) 73 74There are many ways that pairwise identity might be calculated, 75because there are a variety of choices for the denominator. In Easel, 76identity calculations are used primarily in \emph{ad hoc} sequence 77weight calculations for multiple sequence alignments, as part of 78profile HMM or profile SCFG construction. Multiple alignments will 79often contain short sequence fragments. We want to deal robustly with 80cases where two short fragments may have little overlap, or none at 81all. The most obvious calculation of pairwise identity, 82$c_{\mbox{ident}} / c_{\mbox{ident}} + c_{\mbox{mismat}}$, is not 83robust, because alignments with few aligned residues (either because 84they are highly gappy, or they are partially overlapping fragments) 85may receive artifactually high identities. Other definitions, 86$c_{\mbox{ident}} / L$ or $c_{\mbox{ident}} / \mbox{MEAN}(n_1, n_2)$ 87or $c_{\mbox{ident}} / \mbox{MAX}(n_1, n_2)$ are also not robust, 88sharing the disadvantage that good alignments of fragments to longer 89sequences would be scored as artifactually low identities. 90 91 92\subsection{Generalized Jukes-Cantor distances} 93 94The Jukes-Cantor model of DNA sequence evolution assumes that all 95substitutions occur at the same rate $\alpha$ 96\citep{JukesCantor69}. It is a reversible, multiplicative evolutionary 97model. It implies equiprobable stationary probabilities. The 98\esldef{Jukes/Cantor distance} is the maximum likelihood estimate of 99the number of substitutions per site that have occurred between the 100two sequences, correcting for multiple substitutions that may have 101occurred the same site. Given an ungapped pairwise alignment of length 102$L$ consisting of $c_{\mbox{ident}}$ identities and 103$c_{\mbox{mismat}}$ mismatches (observed substitutions) 104($c_{\mbox{ident}} + c_{\mbox{mismat}} = L$, the fractional observed 105difference $D$ is defined as 106 107\[ 108 D = \frac{c_{\mbox{mismat}}}{c_{\mbox{ident}} + c_{\mbox{mismat}}}, 109\] 110 111and the Jukes-Cantor distance $d$ is defined in terms of $D$ as: 112 113\[ 114 d = -\frac{3}{4} \log \left( 1 - \frac{4}{3} D \right) 115\] 116 117The Jukes/Cantor model does not allow insertions or deletions. When 118calculating ``Jukes/Cantor distances'' for gapped alignments, gap 119symbols are simply ignored, and the same calculations above are 120applied. 121 122The Jukes-Cantor model readily generalizes from the four-letter DNA 123alphabet to any alphabet size $K$, using the same definition of 124observed fractional difference $D$. A \esldef{generalized Jukes-Cantor 125distance} is: 126 127\[ 128 d = -\frac{K-1}{K} \log \left( 1 - \frac{K}{K-1} D \right). 129\] 130 131The large-sample variance of this estimate $d$ is: 132 133\[ 134 \sigma^2 = e^\frac{2Kd}{K-1} \frac{D(1-D)}{L'} 135\] 136 137where $L'$ is the total number of columns counted, exclusive of gaps, 138$L' = c_{\mbox{ident}} + c_{\mbox{mismat}}$. 139 140If the observed $D \geq \frac{K-1}{K}$, the maximum likelihood 141Jukes-Cantor distance is infinity, as is the variance. In this case, 142both $d$ and $V$ are returned as \ccode{HUGE\_VAL}. 143 144 145