1# esl-msaweight
2
3The `msaweight` module implements different _ad hoc_ sequence
4weighting algorithms for compensating for overrepresentation in a
5multiple sequence alignment.
6
7The default weighting scheme in HMMER and Infernal is position-based
8(PB) weighting
9[[Henikoff & Henikoff, 1994]](https://www.ncbi.nlm.nih.gov/pubmed/7966282).
10Easel also provides tree-based GSC weights
11[[Gerstein, Sonnhammer, Chothia, 1994]](https://www.ncbi.nlm.nih.gov/pubmed/8120887)
12and filtering-based BLOSUM weights
13[[Henikoff & Henikoff, 1992]](https://www.ncbi.nlm.nih.gov/pubmed/1438297).
14All three weighting schemes result in similar performance in terms of
15sensitivity/specificity of the model parameterizations they
16produce. We tend to default to PB weights because they're the most
17efficient to compute.
18
19
20## Position-based weights
21
22The original Henikoff & Henikoff PB scheme was defined for _ungapped_
23alignments. Let $x_{ij}$ represent the residue in aligned sequence $i$
24and column $j$; let $c_j(a)$ be the observed count of residue $a$ in
25column $j$; and let $r_j$ be the number of different residues in
26column $j$ (the number of nonzero $c_j(a)$ in column $j$). Then the
27weight on sequence $i$ is set to:
28
29\begin{equation}
30    w_i = \frac{N}{L} \sum_{j=1}^{L} \frac{1}{r_j c_{j}(x_{ij})}
31\end{equation}
32
33The total weight $\sum_i w_i$ is $N$.
34
35What's this rule trying to do?  If the alignment consisted of only a
36single column, PB weighting is trying to assign weights that make all
37the weighted frequencies for the observed residues _equal_. For
38example, if we observed 10 A's, 4 C's, and 1 T in 15 sequences of
39length 1, the A sequences would each get weight 0.5, the C sequences
40get weight 1.25, and the T sequence gets weight 5.0.  The weighted
41counts would then be 5, 5, and 5, resulting in weighted frequencies of
420.33.  PB weighting is a simple rule in the spirit of a maximum
43entropy approach, without the iterative optimization that a true
44maximum entropy approach requires
45[[Krogh & Mitchison, 1995]](https://www.ncbi.nlm.nih.gov/pubmed/7584440).
46When averaged over all columns, sequences that use less abundant
47residues in each column (i.e. more dissimilar sequences) will get
48higher weights.
49
50In the limit of very deep alignments ($N \rightarrow \infty$),
51eventually every residue will be observed in every column ($r_j
52\rightarrow 20$) but PB weights still work fine. I mention this
53because every few years I wake up with a nightmare that PB weights
54break on deep alignments. They don't.
55
56We need it to work for _gapped_ sequence alignments.  We extend the
57Henikoff scheme to gapped sequence alignments by ignoring gaps and
58renormalizing by raw (unaligned) sequence lengths.  Let function
59$f(a)$ be either the PB weight increment or 0 depending on whether $a$
60is a residue or a gap:
61
62\begin{equation}
63f(a) = \left\{ \begin{array}{r@{\quad}l}
64                 \frac{1}{r_j c_{j}(a)} & \mbox{if } a \mbox{ is a residue} \\
65                 0                      & \mbox{if } a \mbox{ is a gap}\\
66                \end{array}
67       \right.
68\end{equation}
69
70Define a sequence weight $w'_i$ as an average of these terms over all
71non-gap residues in sequence $i$ (analogous to the ungapped PB
72weights):
73
74\begin{equation}
75    w'_i = \frac{1}{\ell_i} \sum_{j=1}^{L} f(x_{ij})
76\end{equation}
77
78where $\ell_i$ is the raw (unaligned) length of sequence $i$ in
79residues. (If $\ell_i = 0$ set $w'_i = 0$; yes, horribly, sequences
80with no residues in them do appear in people's alignments.) Unlike in
81the ungapped case, the sum of the $w'_i$ isn't a constant, so
82finally renormalize so they sum to $N$:
83
84\begin{equation}
85   w_i = N \frac{w'_i}{\sum_k w'_k}
86\end{equation}
87
88(If pathologically all $w'_i = 0$, set $w_i = 0$ for all $i$.)
89
90(Alternatively, one could use the original PB equation and treat gaps
91as symbols. The resulting weights would be different. It is unclear in
92principle which scheme should be favored, since we haven't defined a
93principled rationale for what "correct" weights would look like for
94gapped sequence alignments. I don't think it matters enough to worry
95about it.)
96
97Given a multiple sequence alignment of $N$ sequences and $L$ columns
98(requiring $O(NL)$ space itself), computation of PB weights requires
99$O(NL)$ time and $O(K)$ space for alphabet size $K$ (4 or 20).
100
101## BLOSUM weights
102
103The BLOSUM weighting scheme:
104  * clusters sequences by single-linkage clustering at a given %
105    identity threshold;
106  * evenly divides a total weight of 1 per cluster across all
107    the sequences in the cluster;
108  * renormalizes so $\sum_i w_i = N$.
109
110Pairwise identity is calculated by `esl_dst_*PairId()` using the
111minimum unaligned length of the two sequences as the denominator:
112
113\begin{equation}
114     \mathrm{pid}(ij) = \frac{\mbox{number of identical aligned pairs for } x_i, x_j}
115	                         { \mbox{MIN}( \ell_i, \ell_j ) }
116\end{equation}
117
118Single linkage clustering (using the efficient algorithm in
119`esl_cluster_SingleLinkage()` requires $O(N)$ space and worst case
120$O(LN^2)$ time, but typically requires about $O(LN \log N)$ time.
121
122
123## GSC tree weights
124
125GSC tree weights
126[[Gerstein, Sonnhammer, Chothia, 1994]](https://www.ncbi.nlm.nih.gov/pubmed/8120887)
127require $O(N^2)$ extra space for an all-vs-all pairwise distance
128matrix and $O(LN^2 + N^3)$ time for constructing the distance matrix
129followed by a UPGMA tree construction.
130
131
132
133
134
135
136
137