# esl_varint : variable-length binary encodings for integers

HMMER4 introduced compressed representations of sequence alignments
called "magic capsules". One of the compression techniques in a magic
capsule is run-length encoding, with integer run lengths encoded using
variable-length binary codewords (_varint codes_) that can be as short
as a single bit per integer.

The general idea of a varint code is that you're often in the
situation where small integer values $v$ are more frequent than large
ones, and more frequent values can be encoded by shorter codewords.
Because $v$ can be arbitrarily big, it isn't immediately apparent that
we can apply Huffman codes, which assume a finite set of symbols.
However, if we assume that the values $v$ follow some particular
decreasing distribution (geometric, for example), Solomon Golomb
showed that one can create a code that's the optimal Huffman code for
a set of values drawn from that distribution.

Golomb codes [Golomb, 1966] are one of the best known varint codes,
but many other varint codes exist. Each different varint code implies
a probability distribution over the possible values $v$. The best
code for a given problem is the one that best matches the actual
distribution of your $v$'s. Golomb codes are optimal for geometric
distributions, for example.

The Easel `esl_varint` module provides routines for several different
varint codes, including Golomb codes, Golomb-Rice codes, exponential
Golomb codes, Google varint codes, and Elias delta codes. HMMER4 magic
capsules use exponential Golomb codes.



# Explanation of different varint codes

The varint codes are explained below starting with the **_truncated
binary code_**, which is not itself a varint code per se, but is a
technique used by **_Golomb codes_**. Golomb codes take a parameter
$m$ that controls the shape of the geometric distribution (so we
sometimes refer to them as _Golomb-m_ codes), and they get fiddly to
encode and decode when $m$ isn't a power of 2. If we restrict Golomb
codes to powers of two, $m = 2^k$, we get a simpler set of schemes
called the **_Golomb-Rice-k_** codes. Golomb-Rice codes consist of
groups of $2^k$ codewords for each different length in bits. For
example, in a Golomb-Rice-2 code, the 4 values $v=0..3$ are encoded in
3 bit codewords, the 4 values $v=4..7$ in 4 bit codewords, the 4
values $v=8..11$ in 5 bit codewords, and so on. The **_exponential
Golomb_** code has group sizes that grow as a power of two instead of
being constant: $v=0$ is encoded in 1 bit, the 2 values $v=1,2$ in 3
bits, the 4 values $v=3..6$ in 5 bits, and the 8 values $v=7..14$ in 7
bit codewords. The **_generalized exponential Golomb-$k$_** codes have
group sizes that start at $2^k$ instead of 1.

Finally, two other codes that we implement are **_Elias delta codes_**
[Elias, 1975] and **_Google varint codes_**.

Golomb codes are only here by way of explanation; Easel does not
implement them. Easel implements Golomb-Rice codes in
`esl_varint_rice*`, exponential Golomb codes in `esl_varint_expgol*`,
Elias delta codes in `esl_varint_delta*`, and Google varint codes in
`esl_varint_google*`.



## Truncated binary codes

A truncated binary code is a canonical Huffman encoding for $n$
uniformly distributed values $v = 0..n-1$. A truncated binary code is
used to encode the remainder $r$ piece in a Golomb code. By itself, it
does not encode an arbitrarily large integer $v$.

The idea is that when $n$ is a power of two, the optimal code is
simply to encode $v$ in its $k = \log_2 n$ bit binary representation;
but when $n$ is not a power of two, in an optimal Huffman encoding, we
use $k+1$ bits for some values but $k$ for others.

Let $k = \lfloor \log_2 n \rfloor$; $v$ will be encoded by either $k$
or $k+1$ bits.  Let $u = 2^{k+1} - n$; $u$ is the number of codewords
that are $k$ bits long, and $n-u$ are $k+1$ bits long.  For $v < u$,
encode the $k$-bit representation of $v$; for $v \geq u$, encode the
$k+1$ bit representation of $v+u$.

To decode: read $k$ bits to obtain $v'$. If $v' < u$, $v = v'$; else,
read one more bit onto $v'$ and $v = v'-u$.

For example, for $n=10$, $k=3$ and $u=6$:

| $v$    | length |  code         |
|--------|--------|--------------:|
| 0      | 3      |         000   |
| 1      | 3      |         001   |
| 2      | 3      |         010   |
| 3      | 3      |         011   |
| 4      | 3      |         100   |
| 5      | 3      |         101   |
| 6      | 4      |        1100   |
| 7      | 4      |        1101   |
| 8      | 4      |        1110   |
| 9      | 4      |        1111   |



## Golomb-_m_ codes

* **Encodes:** integer $v$; $v \geq 0$.
* **Argument:** $m$; $m > 0$.

Divide $v$ by $m$, obtaining quotient $q$ and remainder $r$. Encode
$q$ in unary (i.e. $q$ 1's followed by 0), then append the truncated
binary encoding of $r$.

For example, 0..9 in a Golomb-3 code:

| $v$ | length | code |
|-----|--------|-----:|
| 0 | 2  |    0  0    |
| 1 | 3  |    0 10    |
| 2 | 3  |    0 11    |
| 3 | 3  |   10  0    |
| 4 | 4  |   10 10    |
| 5 | 4  |   10 11    |
| 6 | 4  |  110  0    |
| 7 | 5  |  110 10    |
| 8 | 5  |  110 11    |
| 9 | 5  | 1110  0    |

Golomb codes are optimal for geometric distributions. For a geometric
run length distribution with extension probability $p \geq 0.5$, choose
$m = \mathrm{Round}(\frac{-1}{\log_2 p} )$.
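
Since Easel doesn't implement Golomb-$m$ codes, this is purely for
explanation, but the encoder is a one-liner on top of the illustrative
`truncbin_encode()` sketched above (again an illustration, not an
Easel function):

```c
/* Golomb-m encoding of v >= 0, m > 0, reusing truncbin_encode() above.
 * The unary-coded quotient occupies the high-order bits of the codeword;
 * assumes the total codeword length fits in 64 bits.
 */
static void
golomb_encode(uint64_t v, uint64_t m, uint64_t *code, int *len)
{
  uint64_t q = v / m, r = v % m;
  uint64_t rcode;
  int      rlen;

  truncbin_encode(r, m, &rcode, &rlen);

  *len  = (int) q + 1 + rlen;                          /* unary q, stop 0, remainder */
  *code = ((((1ULL << q) - 1) << 1) << rlen) | rcode;  /* q 1's, one 0, then r       */
}
```

For example, with $m=3$, $v=7$ gives $q=2$, $r=1$, and the codeword
`110 10` (5 bits), matching the table above.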



## Golomb-Rice-_k_ codes

* **Encodes:** integer $v$, $v \geq 0$
* **Argument:** $k$, $k \geq 0$
* **Length:**  $1 + k + \lfloor v / 2^k \rfloor$
* **Max $v$ in $b$ bits:**  $v < 2^k (b-k)$

Divide $v$ by $2^k$, obtaining quotient $q$ and remainder $r$. Encode
$q$ in unary ($q$ 1's followed by 0) and append the $k$-bit binary
representation of $r$. (Golomb-Rice-0 is unary coding: $v+1$ bits to
store value $v$.)

For example, 0..9 in a Golomb-Rice-2 code:

| $v$   | length | code |
|-------|----|-------:|
|  0    | 3  |   0 00 |
|  1    | 3  |   0 01 |
|  2    | 3  |   0 10 |
|  3    | 3  |   0 11 |
|  4    | 4  |  10 00 |
|  5    | 4  |  10 01 |
|  6    | 4  |  10 10 |
|  7    | 4  |  10 11 |
|  8    | 5  | 110 00 |
|  9    | 5  | 110 01 |

Golomb-Rice codes are the simpler subset of Golomb codes where the $m$
parameter is a power of two ($m = 2^k$), so the remainder $r$ is
always encoded by $k$ bits and does not need the fiddly truncated
binary encoding.
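
The Golomb-Rice case is simple enough that both directions fit in a
few lines. This is an illustrative sketch, not the `esl_varint_rice*`
interface; it assumes codewords (and the bit string being decoded) sit
right-justified in the low-order bits of a `uint64_t`:

```c
#include <stdint.h>

/* Golomb-Rice-k encoding of v >= 0, k >= 0.
 * Codeword returned right-justified in *code, its length in bits in *len.
 */
static void
rice_encode(uint64_t v, int k, uint64_t *code, int *len)
{
  uint64_t q = v >> k;                 /* quotient  v / 2^k   */
  uint64_t r = v & ((1ULL << k) - 1);  /* remainder v mod 2^k */

  *len  = (int) q + 1 + k;
  *code = ((((1ULL << q) - 1) << 1) << k) | r;   /* q 1's, a 0, then k bits of r */
}

/* Decode one Golomb-Rice-k codeword from the most significant end of a
 * bit string held in the low-order <nbits> bits of <bits>. Returns v and
 * sets *used to the number of bits consumed. Assumes a well-formed code.
 */
static uint64_t
rice_decode(uint64_t bits, int nbits, int k, int *used)
{
  uint64_t q = 0;
  int      i = nbits - 1;

  while (i >= 0 && ((bits >> i) & 1)) { q++; i--; }   /* count leading 1's   */
  i--;                                                /* skip the 0 stop bit */
  *used = (int) q + 1 + k;
  return (q << k) | ((bits >> (i - k + 1)) & ((1ULL << k) - 1));
}
```

For example, encoding $v=9$ with $k=2$ gives `110 01` (5 bits), and
decoding those 5 bits returns 9.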


## Exponential-Golomb code

* **Encodes:** integer $v$; $v \geq 0$
* **Length:** $2 \lfloor \log_2 (v+1) \rfloor + 1$
* **Max $v$ in $b$ bits:** $v < 2^{ \lfloor \frac{b-1}{2} \rfloor + 1} - 1$

The number of leading 0's tells the decoder how many bits the rest of
the codeword has: $z$ leading 0's are followed by the $(z+1)$-bit
binary representation of $v+1$. We encode $v+1$ instead of $v$ so that
all integers, including 0, have a leading 1 bit.

To encode: let $w$ be the minimum width of the binary representation
of $v+1$: i.e. $\lfloor \log_2 (v+1) \rfloor + 1$. Set $w-1$ leading
bits to 0, followed by the $w$ bits of the binary representation of
$v+1$.

For example:

| $v$ | length | code |
|---|---|------------:|
| 0 | 1 |        1 |
| 1 | 3 |     0 10 |
| 2 | 3 |     0 11 |
| 3 | 5 |   00 100 |
| 4 | 5 |   00 101 |
| 5 | 5 |   00 110 |
| 6 | 5 |   00 111 |
| 7 | 7 | 000 1000 |
| 8 | 7 | 000 1001 |
| 9 | 7 | 000 1010 |

An _Elias gamma code_ for positive nonzero integers $v > 0$ is
identical to the Exp-Golomb code for $v-1$: i.e. an Elias gamma code
just directly encodes $v$ instead of $v+1$. (Yes, this can be a little
confusing.)

An exp-Golomb code is exp-Golomb-0 in the generalized exp-Golomb codes
below. Easel's implementation does not distinguish them; to get an
exponential Golomb code, use exp-Golomb-0.

The largest integer $v$ that can be encoded in a `uint64_t` bitfield
is $2^{32}-2$; in a `uint32_t`, $2^{16}-2$. The limited range of a
32-bit exp-Golomb encoding is why the Easel implementation is all in
`uint64_t`.
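
Since this is the code HMMER4 magic capsules use, here's a minimal
encode/decode sketch. The function names and the convention of
returning the codeword right-justified in a `uint64_t` are assumptions
for illustration, not necessarily the `esl_varint_expgol*` interface:

```c
#include <stdint.h>

/* Exp-Golomb encoding of v >= 0 (v < 2^32 - 1, so the codeword fits in
 * 64 bits). Codeword returned right-justified in *code, length in *len.
 */
static void
expgol_encode(uint64_t v, uint64_t *code, int *len)
{
  int w = 0;
  while ((v+1) >> w) w++;   /* w = width of binary representation of v+1 */

  *code = v + 1;            /* the w-1 leading 0's are implicit in *len  */
  *len  = 2*w - 1;
}

/* Decode one exp-Golomb codeword from the most significant end of a bit
 * string held in the low-order <nbits> bits of <bits>. Returns v and sets
 * *used to the number of bits consumed. Assumes a well-formed codeword.
 */
static uint64_t
expgol_decode(uint64_t bits, int nbits, int *used)
{
  int i = nbits - 1, z = 0;

  while (i >= 0 && !((bits >> i) & 1)) { z++; i--; }  /* count leading 0's */
  *used = 2*z + 1;
  return ((bits >> (i - z)) & ((1ULL << (z+1)) - 1)) - 1;  /* z+1 bits of v+1 */
}
```

For example, `expgol_encode(9, ...)` gives `000 1010` with length 7,
matching the table above, and decoding those 7 bits returns 9.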




## Generalized exponential-Golomb-_k_ codes

* **Encodes:** integer $v$; $v \geq 0$
* **Argument:** $k$, the width of the binary representation of the remainder $r$; $k \geq 0$
* **Length:** $k + 2 \lfloor \log_2( \lfloor v / 2^k \rfloor + 1) \rfloor + 1$
* **Max $v$ in $b$ bits:** $v < 2^k \left( 2^{\lfloor \frac{b - k - 1}{2} \rfloor + 1} - 1 \right)$

Divide integer $v$ by $2^k$ to obtain quotient $q$ and remainder $r$.
Encode $q$ using an Exp-Golomb code (above), followed by the $k$ bit
binary representation of $r$.

For example, the Exp-Golomb-2 code:

| $v$ | length | code |
|-----|--------|-----:|
|  0    | 3      |    1 00|
|  1    | 3      |    1 01|
|  2    | 3      |    1 10|
|  3    | 3      |    1 11|
|  4    | 5      |  010 00|
|  5    | 5      |  010 01|
|  6    | 5      |  010 10|
|  7    | 5      |  010 11|
|  8    | 5      |  011 00|
|  9    | 5      |  011 01|

Exponential Golomb codes were introduced by [Teuhola, 1978], although
his encoding differs in detail.
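
The generalization is a thin wrapper around the illustrative
`expgol_encode()` sketched above (same caveats: names and conventions
are assumptions, not necessarily the Easel API):

```c
/* Generalized exp-Golomb-k encoding of v >= 0, reusing expgol_encode()
 * above. The k-bit remainder occupies the low-order bits of the codeword.
 * With k = 0 this reduces to the plain exp-Golomb code.
 */
static void
expgolk_encode(uint64_t v, int k, uint64_t *code, int *len)
{
  uint64_t q = v >> k;                 /* quotient  v / 2^k   */
  uint64_t r = v & ((1ULL << k) - 1);  /* remainder v mod 2^k */
  uint64_t qcode;
  int      qlen;

  expgol_encode(q, &qcode, &qlen);     /* exp-Golomb code for the quotient */
  *code = (qcode << k) | r;
  *len  = qlen + k;
}
```

With $k=2$, $v=9$ gives $q=2$, $r=1$ and the codeword `011 01`
(5 bits), matching the table above.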




## Elias delta code

* **Encodes:** integer $v$; $v \geq 1$
* **Code length:** $\lfloor \log_2 v \rfloor + 2 \lfloor \log_2 (\lfloor \log_2 v \rfloor + 1) \rfloor + 1$

Let $a = \lfloor \log_2 v \rfloor$, one less than the width of the
binary representation of $v$ ($2^a$ is the highest power of 2 in $v$).
Let $b = \lfloor \log_2 (a+1) \rfloor$, one less than the width of
binary $a+1$. The code consists of three parts: $b$ leading 0's,
followed by the binary representation of $a+1$ in $b+1$ bits, followed
by the representation of $v \mod 2^a$ in $a$ bits.

In other words, encode $a+1$ in an Elias gamma code (equivalently, $a$
in an exp-Golomb code), followed by the binary representation of $v$
without its leading (highest-order) 1 bit (i.e. the rightmost $a$ bits
of $v$).

For example, 1..10 in Elias delta code:

| value | length | code  |
|----|-----|------------:|
|  1 |   1 |          1  |
|  2 |   4 |     0 10 0  |
|  3 |   4 |     0 10 1  |
|  4 |   5 |     0 11 00 |
|  5 |   5 |     0 11 01 |
|  6 |   5 |     0 11 10 |
|  7 |   5 |     0 11 11 |
|  8 |   8 |  00 100 000 |
|  9 |   8 |  00 100 001 |
| 10 |   8 |  00 100 010 |

Elias codes including the Elias gamma and delta codes were introduced
in [Elias, 1975].
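
A minimal encoder sketch, following the three-part construction above
(illustrative only, not necessarily the `esl_varint_delta*` interface):

```c
#include <stdint.h>

/* Elias delta encoding of v >= 1. Codeword returned right-justified in
 * *code, its length in bits in *len. Assumes v is small enough that the
 * codeword fits in 64 bits (roughly v < 2^54).
 */
static void
delta_encode(uint64_t v, uint64_t *code, int *len)
{
  int a = 0, b = 0;

  while (v     >> (a+1)) a++;   /* a = floor(log2 v)     */
  while ((a+1) >> (b+1)) b++;   /* b = floor(log2 (a+1)) */

  /* b leading 0's (implicit in *len), a+1 in b+1 bits, low-order a bits of v */
  *code = ((uint64_t)(a+1) << a) | (v & ((1ULL << a) - 1));
  *len  = b + (b+1) + a;
}
```

For example, `delta_encode(10, ...)` gives `00 100 010` (8 bits),
matching the table above.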


## Google varint-_k_ codes

* **Encodes:** integer $v$; $v \geq 0$
* **Argument:** $k$, the width of each code group in bits; $k \geq 2$
* **Code length:** $k (1 + \lfloor \log_{2^{k-1}} v \rfloor)$ for $v \geq 1$; a single $k$-bit group for $v=0$
* **Max $v$ in $b$ bits:** $v < 2^{(k-1) \lfloor \frac{b}{k} \rfloor}$

Integer $v$ is encoded in base $2^{k-1}$ digits. Each digit is
represented by a $k$-bit code group consisting of a one-bit
continuation flag followed by the $k-1$ bits of the binary
representation of the digit. Code groups are ordered with the least
significant digit first. The continuation flag is 1 if another code
group follows.

For example, 0..9 encoded in a google-2 code:

| $v$    | length |  code         |
|--------|--------|--------------:|
| 0      | 2      |          00   |
| 1      | 2      |          01   |
| 2      | 4      |       10 01   |
| 3      | 4      |       11 01   |
| 4      | 6      |    10 10 01   |
| 5      | 6      |    11 10 01   |
| 6      | 6      |    10 11 01   |
| 7      | 6      |    11 11 01   |
| 8      | 8      | 10 10 10 01   |
| 9      | 8      | 11 10 10 01   |

[Google Protobuf](https://developers.google.com/protocol-buffers/docs/encoding#varints)
uses a google-8 code (i.e. one byte per code group, base 128). This is
actually the only Google varint code; they want it to be byte-aligned
for efficiency reasons. Allowing the generalization to any $k$ bits is
my own thing.
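
An encoder sketch follows; the packing convention puts the first
(least significant digit) group in the high-order bits of the returned
codeword, so reading its bits left to right reproduces the table.
Names are illustrative, not necessarily the `esl_varint_google*`
interface:

```c
#include <stdint.h>

/* google-k encoding of v >= 0, k >= 2; assumes the codeword fits in 64
 * bits. Codeword returned right-justified in *code, its length in *len.
 */
static void
google_encode(uint64_t v, int k, uint64_t *code, int *len)
{
  uint64_t base = 1ULL << (k-1);   /* each code group holds one base-2^(k-1) digit */

  *code = 0;
  *len  = 0;
  do {
    uint64_t group = v % base;     /* next digit, least significant first       */
    v /= base;
    if (v > 0) group |= base;      /* continuation flag: another group follows  */
    *code = (*code << k) | group;  /* append this group after the ones so far   */
    *len += k;
  } while (v > 0);
}
```

For example, `google_encode(9, 2, ...)` produces `11 10 10 01`
(8 bits), matching the table, and `google_encode(300, 8, ...)`
produces `10101100 00000010`, the standard protobuf varint encoding of
300.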


# Additional design notes

* Codes can be user input (for example, in HMMER zigars), so varint decoders
  need to treat bad codes as normal errors, not exceptions.



# References

Elias, P. Universal codeword sets and representations of the
integers. IEEE Trans Inform Theory 21:194-203, 1975.

Golomb, SW. Run-length encodings. IEEE Trans Inform Theory 12:399-401,
1966.

Teuhola, J. A compression method for clustered
bit-vectors. Information Processing Letters 7:308-311, 1978.
