1
2The \eslmod{distance} module implements routines for inferring
3mutational distances between pairs of aligned sequences, and for
4constructing distance matrices from multiple sequence alignments.
5
6The API for the \eslmod{distance} module is summarized in
7Table~\ref{tbl:distance_api}.
8
9\begin{table}[hbp]
10\begin{center}
11{\small
12\begin{tabular}{|ll|}\hline
13   \apisubhead{Pairwise distances for aligned text sequences}\\
14\hyperlink{func:esl_dst_CPairId()}{\ccode{esl\_dst\_CPairId()}} & Pairwise identity of two aligned text strings.\\
15\hyperlink{func:esl_dst_CJukesCantor()}{\ccode{esl\_dst\_CJukesCantor()}} & Jukes-Cantor distance for two aligned strings.\\
16   \apisubhead{Pairwise distances for aligned digital seqs}\\
17\hyperlink{func:esl_dst_XPairId()}{\ccode{esl\_dst\_XPairId()}} & Pairwise identity of two aligned digital seqs.\\
18\hyperlink{func:esl_dst_XJukesCantor()}{\ccode{esl\_dst\_XJukesCantor()}} & Jukes-Cantor distance for two aligned digitized seqs.\\
19   \apisubhead{Distance matrices for aligned text sequences}\\
20\hyperlink{func:esl_dst_CPairIdMx()}{\ccode{esl\_dst\_CPairIdMx()}} & NxN identity matrix for N aligned text sequences. \\
21\hyperlink{func:esl_dst_CDiffMx()}{\ccode{esl\_dst\_CDiffMx()}} & NxN difference matrix for N aligned text sequences.\\
22\hyperlink{func:esl_dst_CJukesCantorMx()}{\ccode{esl\_dst\_CJukesCantorMx()}} & NxN Jukes/Cantor distance matrix for N aligned text seqs.\\
23   \apisubhead{Distance matrices for aligned digital sequences}\\
24\hyperlink{func:esl_dst_XPairIdMx()}{\ccode{esl\_dst\_XPairIdMx()}} & NxN identity matrix for N aligned digital seqs.\\
25\hyperlink{func:esl_dst_XDiffMx()}{\ccode{esl\_dst\_XDiffMx()}} & NxN difference matrix for N aligned digital seqs.\\
26\hyperlink{func:esl_dst_XJukesCantorMx()}{\ccode{esl\_dst\_XJukesCantorMx()}} & NxN Jukes/Cantor distance matrix for N aligned digital seqs.\\
27\hline
28\end{tabular}
29}
30\end{center}
31\caption{The \eslmod{distance} API.}
32\label{tbl:distance_api}
33\end{table}
34
35
36\subsection{Example of using the distance API}
37
38The example code below opens a multiple sequence alignment file and
39reads an alignment from it, then uses one of the routines from the
40\eslmod{distance} module to calculate a fractional identity matrix
41from it. The example then finds the average, minimum, and maximum of
42the values in the identity matrix.
43
44\input{cexcerpts/distance_example}
45
46\subsection{Definition of pairwise identity and pairwise difference}
47
48Given a pairwise sequence alignment of length $L$, between two
49sequences of $n_1$ and $n_2$ residues ($n_1 \leq L$, $n_2 \leq L$),
50where the $L$ aligned symbol pairs are classified and counted as
51$c_{\mbox{ident}}$ identities, $c_{\mbox{mismat}}$ mismatches, and
52$c_{\mbox{indel}}$ pairs that have a gap symbol in either or both
53sequences ($c_{\mbox{ident}} + c_{\mbox{mismat}} + c_{\mbox{indel}} =
54L$), \esldef{pairwise sequence identity} is defined as:
55
56\[
57   \mbox{pid} = \frac{c_{\mbox{ident}}}{\mbox{MIN}(n_1, n_2)},
58\]
59
60and \esldef{pairwise sequence difference} is defined as
61\[
62   \mbox{diff} = 1 - \mbox{pid} = \frac{\mbox{MIN}(n_1,n_2) - c_{\mbox{ident}}}{\mbox{MIN}(n_1, n_2)}.
63\]
64
65Both pid and diff range from 0 to 1.
66
67In the unusual case where $\mbox{MIN}(n_1,n_2)=0$ -- that is, one of
68the aligned sequences consists entirely of gaps -- the percent
69identity $0/0$ is defined as 0. The calculation is robust against
70length 0 sequences, which do arise in real applications. (Not just in
71bad input, either. For example, this arises when dealing with subsets
72of the columns of a multiple alignment.)
73
74There are many ways that pairwise identity might be calculated,
75because there are a variety of choices for the denominator. In Easel,
76identity calculations are used primarily in \emph{ad hoc} sequence
77weight calculations for multiple sequence alignments, as part of
78profile HMM or profile SCFG construction. Multiple alignments will
79often contain short sequence fragments. We want to deal robustly with
80cases where two short fragments may have little overlap, or none at
81all. The most obvious calculation of pairwise identity,
82$c_{\mbox{ident}} / c_{\mbox{ident}} + c_{\mbox{mismat}}$, is not
83robust, because alignments with few aligned residues (either because
84they are highly gappy, or they are partially overlapping fragments)
85may receive artifactually high identities. Other definitions,
86$c_{\mbox{ident}} / L$ or $c_{\mbox{ident}} / \mbox{MEAN}(n_1, n_2)$
87or $c_{\mbox{ident}} / \mbox{MAX}(n_1, n_2)$ are also not robust,
88sharing the disadvantage that good alignments of fragments to longer
89sequences would be scored as artifactually low identities.
90
91
92\subsection{Generalized Jukes-Cantor distances}
93
94The Jukes-Cantor model of DNA sequence evolution assumes that all
95substitutions occur at the same rate $\alpha$
96\citep{JukesCantor69}. It is a reversible, multiplicative evolutionary
97model. It implies equiprobable stationary probabilities. The
98\esldef{Jukes/Cantor distance} is the maximum likelihood estimate of
99the number of substitutions per site that have occurred between the
100two sequences, correcting for multiple substitutions that may have
101occurred the same site. Given an ungapped pairwise alignment of length
102$L$ consisting of $c_{\mbox{ident}}$ identities and
103$c_{\mbox{mismat}}$ mismatches (observed substitutions)
104($c_{\mbox{ident}} + c_{\mbox{mismat}} = L$, the fractional observed
105difference $D$ is defined as
106
107\[
108  D = \frac{c_{\mbox{mismat}}}{c_{\mbox{ident}} + c_{\mbox{mismat}}},
109\]
110
111and the Jukes-Cantor distance $d$ is defined in terms of $D$ as:
112
113\[
114  d = -\frac{3}{4} \log \left( 1 - \frac{4}{3} D \right)
115\]
116
117The Jukes/Cantor model does not allow insertions or deletions.  When
118calculating ``Jukes/Cantor distances'' for gapped alignments, gap
119symbols are simply ignored, and the same calculations above are
120applied.
121
122The Jukes-Cantor model readily generalizes from the four-letter DNA
123alphabet to any alphabet size $K$, using the same definition of
124observed fractional difference $D$. A \esldef{generalized Jukes-Cantor
125distance} is:
126
127\[
128  d = -\frac{K-1}{K} \log \left( 1 - \frac{K}{K-1} D \right).
129\]
130
131The large-sample variance of this estimate $d$ is:
132
133\[
134   \sigma^2 = e^\frac{2Kd}{K-1} \frac{D(1-D)}{L'}
135\]
136
137where $L'$ is the total number of columns counted, exclusive of gaps,
138$L' = c_{\mbox{ident}} + c_{\mbox{mismat}}$.
139
140If the observed $D \geq \frac{K-1}{K}$, the maximum likelihood
141Jukes-Cantor distance is infinity, as is the variance. In this case,
142both $d$ and $V$ are returned as \ccode{HUGE\_VAL}.
143
144
145