1# esl-msaweight 2 3The `msaweight` module implements different _ad hoc_ sequence 4weighting algorithms for compensating for overrepresentation in a 5multiple sequence alignment. 6 7The default weighting scheme in HMMER and Infernal is position-based 8(PB) weighting 9[[Henikoff & Henikoff, 1994]](https://www.ncbi.nlm.nih.gov/pubmed/7966282). 10Easel also provides tree-based GSC weights 11[[Gerstein, Sonnhammer, Chothia, 1994]](https://www.ncbi.nlm.nih.gov/pubmed/8120887) 12and filtering-based BLOSUM weights 13[[Henikoff & Henikoff, 1992]](https://www.ncbi.nlm.nih.gov/pubmed/1438297). 14All three weighting schemes result in similar performance in terms of 15sensitivity/specificity of the model parameterizations they 16produce. We tend to default to PB weights because they're the most 17efficient to compute. 18 19 20## Position-based weights 21 22The original Henikoff & Henikoff PB scheme was defined for _ungapped_ 23alignments. Let $x_{ij}$ represent the residue in aligned sequence $i$ 24and column $j$; let $c_j(a)$ be the observed count of residue $a$ in 25column $j$; and let $r_j$ be the number of different residues in 26column $j$ (the number of nonzero $c_j(a)$ in column $j$). Then the 27weight on sequence $i$ is set to: 28 29\begin{equation} 30 w_i = \frac{N}{L} \sum_{j=1}^{L} \frac{1}{r_j c_{j}(x_{ij})} 31\end{equation} 32 33The total weight $\sum_i w_i$ is $N$. 34 35What's this rule trying to do? If the alignment consisted of only a 36single column, PB weighting is trying to assign weights that make all 37the weighted frequencies for the observed residues _equal_. For 38example, if we observed 10 A's, 4 C's, and 1 T in 15 sequences of 39length 1, the A sequences would each get weight 0.5, the C sequences 40get weight 1.25, and the T sequence gets weight 5.0. The weighted 41counts would then be 5, 5, and 5, resulting in weighted frequencies of 420.33. PB weighting is a simple rule in the spirit of a maximum 43entropy approach, without the iterative optimization that a true 44maximum entropy approach requires 45[[Krogh & Mitchison, 1995]](https://www.ncbi.nlm.nih.gov/pubmed/7584440). 46When averaged over all columns, sequences that use less abundant 47residues in each column (i.e. more dissimilar sequences) will get 48higher weights. 49 50In the limit of very deep alignments ($N \rightarrow \infty$), 51eventually every residue will be observed in every column ($r_j 52\rightarrow 20$) but PB weights still work fine. I mention this 53because every few years I wake up with a nightmare that PB weights 54break on deep alignments. They don't. 55 56We need it to work for _gapped_ sequence alignments. We extend the 57Henikoff scheme to gapped sequence alignments by ignoring gaps and 58renormalizing by raw (unaligned) sequence lengths. Let function 59$f(a)$ be either the PB weight increment or 0 depending on whether $a$ 60is a residue or a gap: 61 62\begin{equation} 63f(a) = \left\{ \begin{array}{r@{\quad}l} 64 \frac{1}{r_j c_{j}(a)} & \mbox{if } a \mbox{ is a residue} \\ 65 0 & \mbox{if } a \mbox{ is a gap}\\ 66 \end{array} 67 \right. 68\end{equation} 69 70Define a sequence weight $w'_i$ as an average of these terms over all 71non-gap residues in sequence $i$ (analogous to the ungapped PB 72weights): 73 74\begin{equation} 75 w'_i = \frac{1}{\ell_i} \sum_{j=1}^{L} f(x_{ij}) 76\end{equation} 77 78where $\ell_i$ is the raw (unaligned) length of sequence $i$ in 79residues. (If $\ell_i = 0$ set $w'_i = 0$; yes, horribly, sequences 80with no residues in them do appear in people's alignments.) Unlike in 81the ungapped case, the sum of the $w'_i$ isn't a constant, so 82finally renormalize so they sum to $N$: 83 84\begin{equation} 85 w_i = N \frac{w'_i}{\sum_k w'_k} 86\end{equation} 87 88(If pathologically all $w'_i = 0$, set $w_i = 0$ for all $i$.) 89 90(Alternatively, one could use the original PB equation and treat gaps 91as symbols. The resulting weights would be different. It is unclear in 92principle which scheme should be favored, since we haven't defined a 93principled rationale for what "correct" weights would look like for 94gapped sequence alignments. I don't think it matters enough to worry 95about it.) 96 97Given a multiple sequence alignment of $N$ sequences and $L$ columns 98(requiring $O(NL)$ space itself), computation of PB weights requires 99$O(NL)$ time and $O(K)$ space for alphabet size $K$ (4 or 20). 100 101## BLOSUM weights 102 103The BLOSUM weighting scheme: 104 * clusters sequences by single-linkage clustering at a given % 105 identity threshold; 106 * evenly divides a total weight of 1 per cluster across all 107 the sequences in the cluster; 108 * renormalizes so $\sum_i w_i = N$. 109 110Pairwise identity is calculated by `esl_dst_*PairId()` using the 111minimum unaligned length of the two sequences as the denominator: 112 113\begin{equation} 114 \mathrm{pid}(ij) = \frac{\mbox{number of identical aligned pairs for } x_i, x_j} 115 { \mbox{MIN}( \ell_i, \ell_j ) } 116\end{equation} 117 118Single linkage clustering (using the efficient algorithm in 119`esl_cluster_SingleLinkage()` requires $O(N)$ space and worst case 120$O(LN^2)$ time, but typically requires about $O(LN \log N)$ time. 121 122 123## GSC tree weights 124 125GSC tree weights 126[[Gerstein, Sonnhammer, Chothia, 1994]](https://www.ncbi.nlm.nih.gov/pubmed/8120887) 127require $O(N^2)$ extra space for an all-vs-all pairwise distance 128matrix and $O(LN^2 + N^3)$ time for constructing the distance matrix 129followed by a UPGMA tree construction. 130 131 132 133 134 135 136 137