# esl_varint : variable-length binary encodings for integers

HMMER4 introduced compressed representations of sequence alignments
called "magic capsules". One of the compression techniques in a magic
capsule is run-length encoding, with integer run lengths encoded using
variable-length binary codewords (_varint codes_) that can be as short
as a single bit per integer.

The general idea of a varint code is that you're often in the
situation where small integer values $v$ are more frequent than large
ones, and more frequent values can be encoded by shorter codewords.
Because $v$ can be arbitrarily big, it isn't immediately apparent that
we can apply Huffman codes, which assume a finite set of symbols.
However, if we assume that the values $v$ follow some particular
decreasing distribution (geometric, for example), Solomon Golomb
showed that one can create a code that's the optimal Huffman code for
a set of values drawn from that distribution.

Golomb codes [Golomb, 1966] are one of the best known varint codes,
but many other varint codes exist. Each different varint code implies
a probability distribution over the possible values $v$. The best
code for a given problem is the one that best matches the actual
distribution of your $v$'s. Golomb codes are optimal for geometric
distributions, for example.

The Easel `esl_varint` module provides routines for several different
varint codes, including Golomb codes, Golomb-Rice codes, exponential
Golomb codes, Google varint codes, and Elias delta codes. HMMER4 magic
capsules use exponential Golomb codes.



# Explanation of different varint codes

The varint codes are explained below starting with the **_truncated
binary code_**, which is not itself a varint code per se, but is a
technique used by **_Golomb codes_**. Golomb codes take a parameter
$m$ that controls the shape of the geometric distribution (so we
sometimes refer to them as _Golomb-m_ codes), and they get fiddly to
encode and decode when $m$ isn't a power of 2. If we restrict Golomb
codes to powers of two, $m = 2^k$, we get a simpler set of schemes
called the **_Golomb-Rice-k_** codes. Golomb-Rice codes consist of
groups of $2^k$ codewords for each different length in bits. For
example, in a Golomb-Rice-2 code, the 4 values $v=0..3$ are encoded in
3 bit codewords, the 4 values $v=4..7$ in 4 bit codewords, the 4
values $v=8..11$ in 5 bit codewords, and so on. The **_exponential
Golomb_** code has group sizes that grow as a power of two instead of
being constant: $v=0$ is encoded in 1 bit, the 2 values $v=1,2$ in 3
bits, the 4 values $v=3..6$ in 5 bits, and the 8 values $v=7..14$ in 7
bit codewords. The **_generalized exponential Golomb-$k$_** codes have
group sizes that start at $2^k$ instead of 1.

Finally, two other codes that we implement are **_Elias delta codes_**
[Elias, 1975] and **_Google varint codes_**.

Golomb codes are only here by way of explanation; Easel does not
implement them. Easel implements Golomb-Rice codes in
`esl_varint_rice*`, exponential Golomb codes in `esl_varint_expgol*`,
Elias delta codes in `esl_varint_delta*`, and Google varint codes in
`esl_varint_google*`.



## Truncated binary codes

A truncated binary code is a canonical Huffman encoding for $n$
uniformly distributed values $v = 0..n-1$. A truncated binary code is
used to encode the remainder $r$ piece in a Golomb code. By itself, it
does not encode an arbitrarily large integer $v$.

The idea is that when $n$ is a power of two, the optimal code is
simply to encode $v$ in its $k = \log_2 n$ bit binary representation;
but when $n$ is not a power of two, in an optimal Huffman encoding, we
use $k+1$ bits for some values but $k$ for others.

Let $k = \lfloor \log_2 n \rfloor$; $v$ will be encoded by either $k$
or $k+1$ bits.  Let $u = 2^{k+1} - n$; $u$ is the number of codewords
that are $k$ bits long, and $n-u$ are $k+1$ bits long.  For $v < u$,
encode the $k$-bit representation of $v$; for $v \geq u$, encode the
$k+1$ bit representation of $v+u$.

To decode: read $k$ bits to obtain $v'$. If $v' < u$, $v = v'$; else,
read one more bit onto $v'$ and $v = v'-u$.

For example, for $n=10$, $k=3$ and $u=6$:

| $v$    | length |  code         |
|--------|--------|--------------:|
| 0      | 3      |         000   |
| 1      | 3      |         001   |
| 2      | 3      |         010   |
| 3      | 3      |         011   |
| 4      | 3      |         100   |
| 5      | 3      |         101   |
| 6      | 4      |        1100   |
| 7      | 4      |        1101   |
| 8      | 4      |        1110   |
| 9      | 4      |        1111   |



## Golomb-_m_ codes

* **Encodes:** integer $v$; $v \geq 0$.
* **Argument:** $m$; $m > 0$.

Divide $v$ by $m$, obtaining quotient $q$ and remainder $r$. Encode
$q$ in unary (i.e. $q$ 1's followed by 0), then append the truncated
binary encoding of $r$.

For example, 0..9 in a Golomb-3 code:

| $v$ | length | code |
|-----|--------|-----:|
| 0 | 2  |    0  0    |
| 1 | 3  |    0 10    |
| 2 | 3  |    0 11    |
| 3 | 3  |   10  0    |
| 4 | 4  |   10 10    |
| 5 | 4  |   10 11    |
| 6 | 4  |  110  0    |
| 7 | 5  |  110 10    |
| 8 | 5  |  110 11    |
| 9 | 5  | 1110  0    |

Golomb codes are optimal for geometric distributions. For a geometric
run length distribution with extension probability $p \geq 0.5$, choose
$m = \mathrm{Round}(\frac{-1}{\log_2 p} )$.
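
Since Easel doesn't implement Golomb-$m$ codes, this is purely for
explanation, but the encoder is a one-liner on top of the illustrative
`truncbin_encode()` sketched above (again an illustration, not an
Easel function):

```c
/* Golomb-m encoding of v >= 0, m > 0, reusing truncbin_encode() above.
 * The unary-coded quotient occupies the high-order bits of the codeword;
 * assumes the total codeword length fits in 64 bits.
 */
static void
golomb_encode(uint64_t v, uint64_t m, uint64_t *code, int *len)
{
  uint64_t q = v / m, r = v % m;
  uint64_t rcode;
  int      rlen;

  truncbin_encode(r, m, &rcode, &rlen);

  *len  = (int) q + 1 + rlen;                          /* unary q, stop 0, remainder */
  *code = ((((1ULL << q) - 1) << 1) << rlen) | rcode;  /* q 1's, one 0, then r       */
}
```

For example, with $m=3$, $v=7$ gives $q=2$, $r=1$, and the codeword
`110 10` (5 bits), matching the table above.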



## Golomb-Rice-_k_ codes

* **Encodes:** integer $v$, $v \geq 0$
* **Argument:** $k$, $k \geq 0$
* **Length:**  $1 + k + \lfloor v / 2^k \rfloor$
* **Max $v$ in $b$ bits:**  $v < 2^k (b-k)$

Divide $v$ by $2^k$, obtaining quotient $q$ and remainder $r$. Encode
$q$ in unary ($q$ 1's followed by 0) and append the $k$-bit binary
representation of $r$. (Golomb-Rice-0 is unary coding: $v+1$ bits to
store value $v$.)

For example, 0..9 in a Golomb-Rice-2 code:

| $v$   | length | code |
|-------|----|-------:|
|  0    | 3  |   0 00 |
|  1    | 3  |   0 01 |
|  2    | 3  |   0 10 |
|  3    | 3  |   0 11 |
|  4    | 4  |  10 00 |
|  5    | 4  |  10 01 |
|  6    | 4  |  10 10 |
|  7    | 4  |  10 11 |
|  8    | 5  | 110 00 |
|  9    | 5  | 110 01 |

Golomb-Rice codes are the simpler subset of Golomb codes where the $m$
parameter is a power of two ($m = 2^k$), so the remainder $r$ is
always encoded by $k$ bits and does not need the fiddly truncated
binary encoding.
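
The Golomb-Rice case is simple enough that both directions fit in a
few lines. This is an illustrative sketch, not the `esl_varint_rice*`
interface; it assumes codewords (and the bit string being decoded) sit
right-justified in the low-order bits of a `uint64_t`:

```c
#include <stdint.h>

/* Golomb-Rice-k encoding of v >= 0, k >= 0.
 * Codeword returned right-justified in *code, its length in bits in *len.
 */
static void
rice_encode(uint64_t v, int k, uint64_t *code, int *len)
{
  uint64_t q = v >> k;                 /* quotient  v / 2^k   */
  uint64_t r = v & ((1ULL << k) - 1);  /* remainder v mod 2^k */

  *len  = (int) q + 1 + k;
  *code = ((((1ULL << q) - 1) << 1) << k) | r;   /* q 1's, a 0, then k bits of r */
}

/* Decode one Golomb-Rice-k codeword from the most significant end of a
 * bit string held in the low-order <nbits> bits of <bits>. Returns v and
 * sets *used to the number of bits consumed. Assumes a well-formed code.
 */
static uint64_t
rice_decode(uint64_t bits, int nbits, int k, int *used)
{
  uint64_t q = 0;
  int      i = nbits - 1;

  while (i >= 0 && ((bits >> i) & 1)) { q++; i--; }   /* count leading 1's   */
  i--;                                                /* skip the 0 stop bit */
  *used = (int) q + 1 + k;
  return (q << k) | ((bits >> (i - k + 1)) & ((1ULL << k) - 1));
}
```

For example, encoding $v=9$ with $k=2$ gives `110 01` (5 bits), and
decoding those 5 bits returns 9.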


## Exponential-Golomb code

* **Encodes:** integer $v$; $v \geq 0$
* **Length:** $2 \lfloor \log_2 (v+1) \rfloor + 1$
* **Max $v$ in $b$ bits:** $v < 2^{ \lfloor \frac{b-1}{2} \rfloor + 1} - 1$

The number of leading 0's tells the decoder how many bits the rest of
the codeword has: $z$ leading 0's are followed by the $(z+1)$-bit
binary representation of $v+1$. We encode $v+1$ instead of $v$ so that
all integers, including 0, have a leading 1 bit.

To encode: let $w$ be the minimum width of the binary representation
of $v+1$: i.e. $\lfloor \log_2 (v+1) \rfloor + 1$. Set $w-1$ leading
bits to 0, followed by the $w$ bits of the binary representation of
$v+1$.

For example:

| $v$ | length | code |
|---|---|------------:|
| 0 | 1 |        1 |
| 1 | 3 |     0 10 |
| 2 | 3 |     0 11 |
| 3 | 5 |   00 100 |
| 4 | 5 |   00 101 |
| 5 | 5 |   00 110 |
| 6 | 5 |   00 111 |
| 7 | 7 | 000 1000 |
| 8 | 7 | 000 1001 |
| 9 | 7 | 000 1010 |

An _Elias gamma code_ for positive nonzero integers $v > 0$ is
identical to the Exp-Golomb code for $v-1$: i.e. an Elias gamma code
just directly encodes $v$ instead of $v+1$. (Yes, this can be a little
confusing.)

An exp-Golomb code is exp-Golomb-0 in the generalized exp-Golomb codes
below. Easel's implementation does not distinguish them; to get an
exponential Golomb code, use exp-Golomb-0.

The largest integer $v$ that can be encoded in a `uint64_t` bitfield
is $2^{32}-2$; in a `uint32_t`, $2^{16}-2$. The limited range of a
32-bit exp-Golomb encoding is why the Easel implementation is all in
`uint64_t`.
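
Since this is the code HMMER4 magic capsules use, here's a minimal
encode/decode sketch. The function names and the convention of
returning the codeword right-justified in a `uint64_t` are assumptions
for illustration, not necessarily the `esl_varint_expgol*` interface:

```c
#include <stdint.h>

/* Exp-Golomb encoding of v >= 0 (v < 2^32 - 1, so the codeword fits in
 * 64 bits). Codeword returned right-justified in *code, length in *len.
 */
static void
expgol_encode(uint64_t v, uint64_t *code, int *len)
{
  int w = 0;
  while ((v+1) >> w) w++;   /* w = width of binary representation of v+1 */

  *code = v + 1;            /* the w-1 leading 0's are implicit in *len  */
  *len  = 2*w - 1;
}

/* Decode one exp-Golomb codeword from the most significant end of a bit
 * string held in the low-order <nbits> bits of <bits>. Returns v and sets
 * *used to the number of bits consumed. Assumes a well-formed codeword.
 */
static uint64_t
expgol_decode(uint64_t bits, int nbits, int *used)
{
  int i = nbits - 1, z = 0;

  while (i >= 0 && !((bits >> i) & 1)) { z++; i--; }  /* count leading 0's */
  *used = 2*z + 1;
  return ((bits >> (i - z)) & ((1ULL << (z+1)) - 1)) - 1;  /* z+1 bits of v+1 */
}
```

For example, `expgol_encode(9, ...)` gives `000 1010` with length 7,
matching the table above, and decoding those 7 bits returns 9.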




## Generalized exponential-Golomb-_k_ codes

* **Encodes:** integer $v$; $v \geq 0$
* **Argument:** $k$, the width of the binary representation of the remainder $r$; $k \geq 0$
* **Length:** $k + 2 \lfloor \log_2( \lfloor v / 2^k \rfloor + 1) \rfloor + 1$
* **Max $v$ in $b$ bits:** $v < 2^k \left( 2^{\lfloor \frac{b - k - 1}{2} \rfloor + 1} - 1 \right)$

Divide integer $v$ by $2^k$ to obtain quotient $q$ and remainder $r$.
Encode $q$ using an Exp-Golomb code (above), followed by the $k$ bit
binary representation of $r$.

For example, the Exp-Golomb-2 code:

| $v$ | length | code |
|-----|--------|-----:|
|  0    | 3      |    1 00|
|  1    | 3      |    1 01|
|  2    | 3      |    1 10|
|  3    | 3      |    1 11|
|  4    | 5      |  010 00|
|  5    | 5      |  010 01|
|  6    | 5      |  010 10|
|  7    | 5      |  010 11|
|  8    | 5      |  011 00|
|  9    | 5      |  011 01|

Exponential Golomb codes were introduced by [Teuhola, 1978], although
his encoding differs in detail.
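
The generalization is a thin wrapper around the illustrative
`expgol_encode()` sketched above (same caveats: names and conventions
are assumptions, not necessarily the Easel API):

```c
/* Generalized exp-Golomb-k encoding of v >= 0, reusing expgol_encode()
 * above. The k-bit remainder occupies the low-order bits of the codeword.
 * With k = 0 this reduces to the plain exp-Golomb code.
 */
static void
expgolk_encode(uint64_t v, int k, uint64_t *code, int *len)
{
  uint64_t q = v >> k;                 /* quotient  v / 2^k   */
  uint64_t r = v & ((1ULL << k) - 1);  /* remainder v mod 2^k */
  uint64_t qcode;
  int      qlen;

  expgol_encode(q, &qcode, &qlen);     /* exp-Golomb code for the quotient */
  *code = (qcode << k) | r;
  *len  = qlen + k;
}
```

With $k=2$, $v=9$ gives $q=2$, $r=1$ and the codeword `011 01`
(5 bits), matching the table above.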




## Elias delta code

* **Encodes:** integer $v$; $v \geq 1$
* **Code length:** $\lfloor \log_2 v \rfloor + 2 \lfloor \log_2 (\lfloor \log_2 v \rfloor + 1) \rfloor + 1$

Let $a = \lfloor \log_2 v \rfloor$, one less than the width of the
binary representation of $v$ ($2^a$ is the highest power of 2 in $v$).
Let $b = \lfloor \log_2 (a+1) \rfloor$, one less than the width of
binary $a+1$. The code consists of three parts: $b$ leading 0's,
followed by the binary representation of $a+1$ in $b+1$ bits, followed
by the representation of $v \mod 2^a$ in $a$ bits.

In other words, encode $a+1$ in an Elias gamma code (equivalently, $a$
in an exp-Golomb code), followed by the binary representation of $v$
without its leading (highest-order) 1 bit (i.e. the rightmost $a$ bits
of $v$).

For example, 1..10 in Elias delta code:

| value | length | code  |
|----|-----|------------:|
|  1 |   1 |          1  |
|  2 |   4 |     0 10 0  |
|  3 |   4 |     0 10 1  |
|  4 |   5 |     0 11 00 |
|  5 |   5 |     0 11 01 |
|  6 |   5 |     0 11 10 |
|  7 |   5 |     0 11 11 |
|  8 |   8 |  00 100 000 |
|  9 |   8 |  00 100 001 |
| 10 |   8 |  00 100 010 |

Elias codes including the Elias gamma and delta codes were introduced
in [Elias, 1975].
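
A minimal encoder sketch, following the three-part construction above
(illustrative only, not necessarily the `esl_varint_delta*` interface):

```c
#include <stdint.h>

/* Elias delta encoding of v >= 1. Codeword returned right-justified in
 * *code, its length in bits in *len. Assumes v is small enough that the
 * codeword fits in 64 bits (roughly v < 2^54).
 */
static void
delta_encode(uint64_t v, uint64_t *code, int *len)
{
  int a = 0, b = 0;

  while (v     >> (a+1)) a++;   /* a = floor(log2 v)     */
  while ((a+1) >> (b+1)) b++;   /* b = floor(log2 (a+1)) */

  /* b leading 0's (implicit in *len), a+1 in b+1 bits, low-order a bits of v */
  *code = ((uint64_t)(a+1) << a) | (v & ((1ULL << a) - 1));
  *len  = b + (b+1) + a;
}
```

For example, `delta_encode(10, ...)` gives `00 100 010` (8 bits),
matching the table above.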


## Google varint-_k_ codes

* **Encodes:** integer $v$; $v \geq 0$
* **Argument:** $k$, the width of each code group in bits; $k \geq 2$
* **Code length:** $k (1 + \lfloor \log_{2^{k-1}} v \rfloor)$ for $v \geq 1$; a single $k$-bit group for $v=0$
* **Max $v$ in $b$ bits:** $v < 2^{(k-1) \lfloor \frac{b}{k} \rfloor}$

Integer $v$ is encoded in base $2^{k-1}$ digits. Each digit is
represented by a $k$-bit code group consisting of a one-bit
continuation flag followed by the $k-1$ bits of the binary
representation of the digit. Code groups are ordered with the least
significant digit first. The continuation flag is 1 if another code
group follows.

For example, 0..9 encoded in a google-2 code:

| $v$    | length |  code         |
|--------|--------|--------------:|
| 0      | 2      |          00   |
| 1      | 2      |          01   |
| 2      | 4      |       10 01   |
| 3      | 4      |       11 01   |
| 4      | 6      |    10 10 01   |
| 5      | 6      |    11 10 01   |
| 6      | 6      |    10 11 01   |
| 7      | 6      |    11 11 01   |
| 8      | 8      | 10 10 10 01   |
| 9      | 8      | 11 10 10 01   |

[Google Protobuf](https://developers.google.com/protocol-buffers/docs/encoding#varints)
uses a google-8 code (i.e. one byte per code group, base 128). This is
actually the only Google varint code; they want it to be byte-aligned
for efficiency reasons. Allowing the generalization to any $k$ bits is
my own thing.
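
An encoder sketch follows; the packing convention puts the first
(least significant digit) group in the high-order bits of the returned
codeword, so reading its bits left to right reproduces the table.
Names are illustrative, not necessarily the `esl_varint_google*`
interface:

```c
#include <stdint.h>

/* google-k encoding of v >= 0, k >= 2; assumes the codeword fits in 64
 * bits. Codeword returned right-justified in *code, its length in *len.
 */
static void
google_encode(uint64_t v, int k, uint64_t *code, int *len)
{
  uint64_t base = 1ULL << (k-1);   /* each code group holds one base-2^(k-1) digit */

  *code = 0;
  *len  = 0;
  do {
    uint64_t group = v % base;     /* next digit, least significant first       */
    v /= base;
    if (v > 0) group |= base;      /* continuation flag: another group follows  */
    *code = (*code << k) | group;  /* append this group after the ones so far   */
    *len += k;
  } while (v > 0);
}
```

For example, `google_encode(9, 2, ...)` produces `11 10 10 01`
(8 bits), matching the table, and `google_encode(300, 8, ...)`
produces `10101100 00000010`, the standard protobuf varint encoding of
300.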


# Additional design notes

* Codes can be user input (for example, in HMMER zigars), so varint decoders
  need to treat bad codes as normal errors, not exceptions.



# References

Elias, P. Universal codeword sets and representations of the
integers. IEEE Trans Inform Theory 21:194-203, 1975.

Golomb, SW. Run-length encodings. IEEE Trans Inform Theory 12:399-401,
1966.

Teuhola, J. A compression method for clustered
bit-vectors. Information Processing Letters 7:308-311, 1978.
