# esl_varint : variable-length binary encodings for integers

HMMER4 introduced compressed representations of sequence alignments
called "magic capsules". One of the compression techniques in a magic
capsule is run-length encoding, with integer run lengths encoded using
variable-length binary codewords (_varint codes_) that can be as short
as a single bit per integer.

The general idea of a varint code is that small integer values are
often more frequent than large ones, and more frequent values can be
encoded by shorter codewords. Because a value $v$ can be arbitrarily
large, it isn't immediately apparent that Huffman codes apply: Huffman
coding assumes a finite set of symbols. However, if we assume that the
values $v$ follow some particular decreasing distribution (geometric,
for example), Solomon Golomb showed that one can create a code that is
the optimal Huffman code for values drawn from that distribution.

Golomb codes [Golomb, 1966] are among the best known varint codes, but
many other varint codes exist. Each varint code implies a probability
distribution over the possible values $v$; the best code for a given
problem is the one that best matches the actual distribution of your
$v$'s. Golomb codes, for example, are optimal for geometric
distributions.

The Easel `esl_varint` module provides routines for several different
varint codes: Golomb-Rice codes, exponential Golomb codes, Google
varint codes, and Elias delta codes. HMMER4 magic capsules use
exponential Golomb codes.



# Explanation of different varint codes

The varint codes are explained below starting with **_truncated binary
codes_**, which are not themselves varint codes per se, but are a
technique used by **_Golomb codes_**.
Golomb codes take a parameter $m$
that controls the shape of the assumed geometric distribution (so we
sometimes refer to them as _Golomb-m_ codes), and they get fiddly to
encode and decode when $m$ isn't a power of 2. If we restrict Golomb
codes to powers of two, $m = 2^k$, we get a simpler set of schemes
called the **_Golomb-Rice-k_** codes. Golomb-Rice codes consist of
groups of $2^k$ codewords for each different length in bits - for
$k=2$, for example, 4 values $v=0..3$ encoded in 3 bit codewords, 4
values $v=4..7$ encoded in 4 bit codewords, 4 values $v=8..11$ encoded
in 5 bit codewords, and so on. The **_exponential Golomb_** code has
group sizes that grow as a power of two, instead of being constant -
$v=0$ is encoded in 1 bit, the 2 values $v=1,2$ are encoded in 3 bits,
4 values $v=3..6$ are encoded in 5 bits, and 8 values $v=7..14$ are
encoded in 7 bit codewords. The **_generalized exponential
Golomb-$k$_** codes have group sizes that start at $2^k$ instead of 1.

Finally, two other codes that we implement are **_Elias delta codes_**
[Elias, 1975] and **_Google varint codes_**.

Golomb codes are only here by way of explanation; Easel does not
implement them. Easel implements Golomb-Rice codes in
`esl_varint_rice*`, exponential Golomb codes in `esl_varint_expgol*`,
Elias delta codes in `esl_varint_delta*`, and Google varint codes in
`esl_varint_google*`.



## Truncated binary codes

A truncated binary code is a canonical Huffman encoding for $n$
uniformly distributed values $v = 0..n-1$. A truncated binary code is
used to encode the remainder $r$ piece in a Golomb code. By itself, it
does not encode an arbitrarily large integer $v$.

The idea is that when $n$ is a power of two, the optimal code is
simply to encode $v$ in its $k = \log_2 n$ bit binary representation;
but when $n$ is not a power of two, in an optimal Huffman encoding, we
use $k+1$ bits for some values and $k$ bits for others.
Let $k = \lfloor \log_2 n \rfloor$; $v$ will be encoded by either $k$
or $k+1$ bits. Let $u = 2^{k+1} - n$; $u$ is the number of codewords
that are $k$ bits long, and the other $n-u$ are $k+1$ bits long. For
$v < u$, encode the $k$-bit binary representation of $v$; for $v \geq
u$, encode the $k+1$ bit binary representation of $v+u$.

To decode: read $k$ bits to obtain $v'$. If $v' < u$, $v = v'$; else,
read one more bit onto $v'$ and $v = v'-u$.

For example, for $n=10$, $k=3$ and $u=6$:

| $v$ | length | code |
|-----|--------|-----:|
| 0   | 3      |  000 |
| 1   | 3      |  001 |
| 2   | 3      |  010 |
| 3   | 3      |  011 |
| 4   | 3      |  100 |
| 5   | 3      |  101 |
| 6   | 4      | 1100 |
| 7   | 4      | 1101 |
| 8   | 4      | 1110 |
| 9   | 4      | 1111 |



## Golomb-_m_ codes

* **Encodes:** integer $v$; $v \geq 0$.
* **Argument:** $m$; $m > 0$.

Divide $v$ by $m$, obtaining quotient $q$ and remainder $r$. Encode
$q$ in unary (i.e. $q$ 1's followed by a 0), then append the truncated
binary encoding of $r$.

For example, 0..9 in a Golomb-3 code:

| $v$ | length |   code |
|-----|--------|-------:|
| 0   | 2      |    0 0 |
| 1   | 3      |   0 10 |
| 2   | 3      |   0 11 |
| 3   | 3      |   10 0 |
| 4   | 4      |  10 10 |
| 5   | 4      |  10 11 |
| 6   | 4      |  110 0 |
| 7   | 5      | 110 10 |
| 8   | 5      | 110 11 |
| 9   | 5      | 1110 0 |

Golomb codes are optimal for geometric distributions. For a geometric
run length distribution with extension probability $p \geq 0.5$,
choose $m = \mathrm{Round}(\frac{-1}{\log_2 p})$.



## Golomb-Rice-_k_ codes

* **Encodes:** integer $v$; $v \geq 0$
* **Argument:** $k$; $k \geq 0$
* **Length:** $1 + k + \lfloor v / 2^k \rfloor$
* **Max $v$ in $b$ bits:** $v < 2^k (b-k)$

Divide $v$ by $2^k$, obtaining quotient $q$ and remainder $r$. Encode
$q$ in unary ($q$ 1's followed by a 0) and append the $k$-bit binary
representation of $r$. (Golomb-Rice-0 is unary coding: $v+1$ bits to
store value $v$.)
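The truncated binary, Golomb-$m$, and Golomb-Rice encoders described
above can be sketched in a few lines of Python (an illustration only,
not Easel's C implementation or its `esl_varint_rice*` API; the
function names here are my own):

```python
def truncated_binary(r, n):
    """Truncated binary codeword for r in 0..n-1, as a bit string."""
    k = n.bit_length() - 1           # k = floor(log2 n)
    u = (1 << (k + 1)) - n           # u codewords are k bits long
    if r < u:
        return format(r, f'0{k}b') if k > 0 else ''
    return format(r + u, f'0{k+1}b') # the rest are k+1 bits, encoding r+u

def golomb_encode(v, m):
    """Golomb-m codeword: quotient in unary, remainder in truncated binary."""
    q, r = divmod(v, m)
    return '1' * q + '0' + truncated_binary(r, m)

def rice_encode(v, k):
    """Golomb-Rice-k codeword: Golomb with m = 2^k, remainder in plain k bits."""
    q, r = divmod(v, 1 << k)
    return '1' * q + '0' + (format(r, f'0{k}b') if k > 0 else '')
```

For instance, `golomb_encode(4, 3)` gives `'1010'` and
`rice_encode(9, 2)` gives `'11001'`, matching the tables in this
section.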
For example, 0..9 in a Golomb-Rice-2 code:

| $v$ | length |   code |
|-----|--------|-------:|
| 0   | 3      |   0 00 |
| 1   | 3      |   0 01 |
| 2   | 3      |   0 10 |
| 3   | 3      |   0 11 |
| 4   | 4      |  10 00 |
| 5   | 4      |  10 01 |
| 6   | 4      |  10 10 |
| 7   | 4      |  10 11 |
| 8   | 5      | 110 00 |
| 9   | 5      | 110 01 |

Golomb-Rice codes are the simpler subset of Golomb codes where the $m$
parameter is a power of two ($m = 2^k$), so the remainder $r$ is
always encoded by exactly $k$ bits and does not need the fiddly
truncated binary encoding.



## Exponential-Golomb code

* **Encodes:** integer $v$; $v \geq 0$
* **Length:** $2 \lfloor \log_2 (v+1) \rfloor + 1$
* **Max $v$ in $b$ bits:** $v < 2^{\lfloor \frac{b-1}{2} \rfloor + 1} - 1$

The number of leading 0's is one less than the number of bits in the
encoded integer, and we encode $v+1$ instead of $v$ so that all
integers, including 0, have a leading 1 bit.

To encode: let $w$ be the minimum width of the binary representation
of $v+1$, i.e. $\lfloor \log_2 (v+1) \rfloor + 1$. Set $w-1$ leading
bits to 0, followed by the $w$ bits of the binary representation of
$v+1$.

For example:

| $v$ | length |     code |
|-----|--------|---------:|
| 0   | 1      |        1 |
| 1   | 3      |     0 10 |
| 2   | 3      |     0 11 |
| 3   | 5      |   00 100 |
| 4   | 5      |   00 101 |
| 5   | 5      |   00 110 |
| 6   | 5      |   00 111 |
| 7   | 7      | 000 1000 |
| 8   | 7      | 000 1001 |
| 9   | 7      | 000 1010 |

An _Elias gamma code_ for positive nonzero integers $v > 0$ is
identical to the exp-Golomb code for $v-1$: i.e. an Elias gamma code
just directly encodes $v$ instead of $v+1$. (Yes, this can be a little
confusing.)

An exp-Golomb code is exp-Golomb-0 in the generalized exp-Golomb codes
below. Easel's implementation does not distinguish them; to get an
exponential Golomb code, use exp-Golomb-0.
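The exp-Golomb encoding rule can be sketched as follows (an
illustrative Python fragment, not Easel's `esl_varint_expgol*` C API):
with $k=0$ this produces the plain exponential Golomb code tabulated
above, and $k>0$ gives the generalized exp-Golomb-$k$ codes.

```python
def expgol_encode(v, k=0):
    """Generalized exponential Golomb-k codeword for v >= 0, as a bit string.
    k=0 is the plain exponential Golomb code."""
    q, r = divmod(v, 1 << k)                   # quotient and k-bit remainder
    w = (q + 1).bit_length()                   # width of binary q+1
    code = '0' * (w - 1) + format(q + 1, 'b')  # w-1 leading 0's, then q+1 in w bits
    if k > 0:
        code += format(r, f'0{k}b')            # append remainder in exactly k bits
    return code
```

For instance, `expgol_encode(9)` gives `'0001010'`, matching the table
above.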
The largest integer $v$ that can be encoded in a `uint64_t` bitfield
is $2^{32}-2$; in a `uint32_t`, $2^{16}-2$. The limited range of a
32-bit exp-Golomb encoding is why the Easel implementation is all in
`uint64_t`.



## Generalized exponential-Golomb-_k_ codes

* **Encodes:** integer $v$; $v \geq 0$
* **Argument:** $k$, the width of the binary representation of the remainder $r$; $k \geq 0$
* **Length:** $k + 2 \lfloor \log_2( \lfloor v / 2^k \rfloor + 1) \rfloor + 1$
* **Max $v$ in $b$ bits:** $v < 2^k \left( 2^{\lfloor \frac{b - k - 1}{2} \rfloor + 1} - 1 \right)$

Divide integer $v$ by $2^k$ to obtain quotient $q$ and remainder $r$.
Encode $q$ using an exp-Golomb code (above), followed by the $k$-bit
binary representation of $r$.

For example, the exp-Golomb-2 code:

| $v$ | length |   code |
|-----|--------|-------:|
| 0   | 3      |   1 00 |
| 1   | 3      |   1 01 |
| 2   | 3      |   1 10 |
| 3   | 3      |   1 11 |
| 4   | 5      | 010 00 |
| 5   | 5      | 010 01 |
| 6   | 5      | 010 10 |
| 7   | 5      | 010 11 |
| 8   | 5      | 011 00 |
| 9   | 5      | 011 01 |

Exponential Golomb codes were introduced by [Teuhola, 1978], although
the details of his encoding differ from the form given here.



## Elias delta code

* **Encodes:** integer $v$; $v \geq 1$
* **Code length:** $\lfloor \log_2 v \rfloor + 2 \lfloor \log_2 (\lfloor \log_2 v \rfloor + 1) \rfloor + 1$

Let $a = \lfloor \log_2 v \rfloor$, one less than the width of the
binary representation of $v$ ($2^a$ is the highest power of two in
$v$). Let $b = \lfloor \log_2 (a+1) \rfloor$, one less than the width
of binary $a+1$. The code consists of three parts: $b$ leading 0's,
followed by the binary representation of $a+1$ in $b+1$ bits, followed
by the representation of $v \mod 2^a$ in $a$ bits.

In other words, encode $a+1$ in Elias gamma code, followed by the
binary representation of $v$ without its leading (highest-order) 1
(i.e.
rightmost $a$ bits of $v$).

For example, 1..10 in Elias delta code:

| value | length |       code |
|-------|--------|-----------:|
| 1     | 1      |          1 |
| 2     | 4      |     0 10 0 |
| 3     | 4      |     0 10 1 |
| 4     | 5      |    0 11 00 |
| 5     | 5      |    0 11 01 |
| 6     | 5      |    0 11 10 |
| 7     | 5      |    0 11 11 |
| 8     | 8      | 00 100 000 |
| 9     | 8      | 00 100 001 |
| 10    | 8      | 00 100 010 |

Elias codes, including the Elias gamma and delta codes, were
introduced in [Elias, 1975].


## Google varint-_k_ codes

* **Encodes:** integer $v$; $v \geq 0$
* **Argument:** $k$, the width of each code group in bits; $k \geq 2$
* **Code length:** $k (1 + \lfloor \log_{2^{k-1}} v \rfloor)$ for $v \geq 1$; $v=0$ takes a single $k$-bit group
* **Max $v$ in $b$ bits:** $v < 2^{(k-1) \lfloor \frac{b}{k} \rfloor}$

Integer $v$ is encoded in base $2^{k-1}$ digits. Each digit is
represented by a $k$-bit code group consisting of a one-bit
continuation flag followed by the $k-1$ bits of the binary
representation of the digit. Code groups are ordered with the least
significant digit first. The continuation flag is 1 if another code
group follows.

For example, 0..9 encoded in a google-2 code:

| $v$ | length |        code |
|-----|--------|------------:|
| 0   | 2      |          00 |
| 1   | 2      |          01 |
| 2   | 4      |       10 01 |
| 3   | 4      |       11 01 |
| 4   | 6      |    10 10 01 |
| 5   | 6      |    11 10 01 |
| 6   | 6      |    10 11 01 |
| 7   | 6      |    11 11 01 |
| 8   | 8      | 10 10 10 01 |
| 9   | 8      | 11 10 10 01 |

[Google Protobuf](https://developers.google.com/protocol-buffers/docs/encoding#varints)
uses a google-8 code (i.e. one byte per code group, base 128). This is
actually the only Google varint code; they want it to be byte-aligned
for efficiency reasons. Allowing the generalization to any $k$ bits is
my own thing.



# Additional design notes

* Codes can be user input (for example, in HMMER zigars), so varint
  decoders need to treat bad codes as normal errors, not exceptions.
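To illustrate that design note, here is a Python sketch of a
google-$k$ decoder (my own illustration, not Easel's
`esl_varint_google*` C API) that reports a truncated codeword as an
ordinary error value instead of raising an exception:

```python
def google_decode(bits, k, start=0):
    """Decode one google-k codeword from bit string `bits` at offset `start`.
    Returns (value, next_offset), or (None, start) if the input ends before
    the codeword does -- a normal error, not an exception, since codes may
    come from untrusted user input."""
    v, shift, pos = 0, 0, start
    while True:
        group = bits[pos:pos + k]
        if len(group) < k:            # truncated codeword: report, don't raise
            return None, start
        digit = int(group[1:], 2)     # k-1 bit digit after the continuation flag
        v |= digit << shift           # groups carry least significant digit first
        shift += k - 1
        pos += k
        if group[0] == '0':           # continuation flag 0: this was the last group
            return v, pos
```

For instance, `google_decode('11101001', 2)` returns `(9, 8)`
(matching the google-2 table above), while a dangling continuation
such as `google_decode('10', 2)` returns `(None, 0)`.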
# References

Elias, P. Universal codeword sets and representations of the
integers. IEEE Trans Inform Theory 21:194-203, 1975.

Golomb, SW. Run-length encodings. IEEE Trans Inform Theory 12:399-401,
1966.

Teuhola, J. A compression method for clustered bit-vectors.
Information Processing Letters 7:308-311, 1978.