# Adler-32

Adler-32 is a checksum algorithm that hashes byte sequences to 32-bit values.
It is named after its inventor, Mark Adler, who also co-invented the Gzip and
Zlib compressed file formats. Amongst other differences, Gzip uses CRC-32 as
its checksum and Zlib uses Adler-32.

The algorithm, described in [RFC 1950](https://www.ietf.org/rfc/rfc1950.txt),
is simple. Conceptually, there are two unsigned integers `s1` and `s2` of
infinite precision, initialized to `1` and `0`. These two accumulators are
updated for every input byte `src[i]`. At the end of the loop, `s1` is `1` plus
the sum of all source bytes and `s2` is the sum of all (intermediate and final)
`s1` values:

    var s1 = 1;
    var s2 = 0;
    for_each i in_the_range_of src {
        s1 = s1 + src[i];
        s2 = s2 + s1;
    }
    return ((s2 % 65521) << 16) | (s1 % 65521);

The final `uint32_t` hash value is composed of two 16-bit values: `(s1 %
65521)` in the low 16 bits and `(s2 % 65521)` in the high 16 bits. `65521` is
the largest prime number less than `(1 << 16)`.

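As an illustration only, the conceptual algorithm can be sketched in Go with
`math/big` standing in for the infinite-precision integers. The function name
`adler32Unbounded` is hypothetical, and this is not how a practical
implementation would be written; it simply mirrors the pseudocode above, with
the modulo applied once at the end.

```go
package main

import (
	"fmt"
	"math/big"
)

// adler32Unbounded (a hypothetical name) follows the conceptual algorithm
// literally: s1 and s2 are arbitrary-precision integers and the modulo is
// applied only once, after the loop.
func adler32Unbounded(src []byte) uint32 {
	s1 := big.NewInt(1)
	s2 := big.NewInt(0)
	for _, b := range src {
		s1.Add(s1, big.NewInt(int64(b))) // s1 = s1 + src[i]
		s2.Add(s2, s1)                   // s2 = s2 + s1
	}
	mod := big.NewInt(65521)
	lo := new(big.Int).Mod(s1, mod).Uint64() // low 16 bits
	hi := new(big.Int).Mod(s2, mod).Uint64() // high 16 bits
	return uint32(hi<<16 | lo)
}

func main() {
	fmt.Printf("0x%08X\n", adler32Unbounded([]byte("Hi\n"))) // prints 0x01B700BC
}
```
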
Infinite precision arithmetic requires arbitrarily large amounts of memory. In
practice, computing the Adler-32 hash instead uses a `uint32_t` typed `s1` and
`s2`, modifying the algorithm to be conscious of overflow inside the loop:

    uint32_t s1 = 1;
    uint32_t s2 = 0;
    for_each i in_the_range_of src {
        s1 = (s1 + src[i]) % 65521;
        s2 = (s2 + s1)     % 65521;
    }
    return (s2 << 16) | s1;

The loop can be split into two levels, so that the relatively expensive modulo
operation can be hoisted out of the inner loop:

    uint32_t s1 = 1;
    uint32_t s2 = 0;
    for_each_sub_slice s of_length_up_to M partitioning src {
        for_each i in_the_range_of s {
            s1 = s1 + s[i];
            s2 = s2 + s1;
        }
        s1 = s1 % 65521;
        s2 = s2 % 65521;
    }
    return (s2 << 16) | s1;

We just need to find the largest `M` such that the inner loop cannot overflow.
The worst case scenario is that `s1` and `s2` both start the inner loop at
`65520` and every subsequent `src[i]` byte equals `0xFF`. In that scenario `s2`
grows fastest: after `n` bytes it equals
`65520 * (n + 1) + 255 * n * (n + 1) / 2`, which must not exceed the largest
`uint32_t` value. A simple [computation](https://play.golang.org/p/wdx6BPDs2-R)
finds that the largest non-overflowing `M` is 5552.

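The search for `M` can be sketched in a few lines of Go. This is an
independent illustration, not a copy of the linked program, and `largestM` is a
hypothetical name. It simulates the worst case in 64-bit arithmetic and stops
at the first byte that would push `s2` past the `uint32` range.

```go
package main

import (
	"fmt"
	"math"
)

// largestM (a hypothetical name) simulates the worst case: both accumulators
// start at 65520 and every input byte is 0xFF. It returns how many bytes can
// be processed before s2 would no longer fit in a uint32.
func largestM() int {
	s1, s2 := uint64(65520), uint64(65520)
	m := 0
	for {
		s1 += 0xFF
		s2 += s1
		if s2 > math.MaxUint32 {
			return m // the (m+1)th byte would overflow
		}
		m++
	}
}

func main() {
	fmt.Println(largestM()) // prints 5552
}
```
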
In a happy coincidence, 5552 is an exact multiple of 16, which often works well
with loop unrolling and with SIMD alignment.


## Comparison with CRC-32

Adler-32 is a very simple hashing algorithm. While its output is nominally a
`uint32_t` value, it isn't uniformly distributed across the entire `uint32_t`
range. The `[65521, 65535]` range of each 16-bit half of an Adler-32 checksum
is never touched.

While neither Adler-32 nor CRC-32 is a cryptographic hash function, there is
still a stark difference in the patterns (or lack thereof) in their hash values
of the `N`-byte string consisting entirely of zeroes, as [this Go
program](https://play.golang.org/p/SkPVp0tBnDl) shows:

    N  Adler-32    CRC-32      Input
    0  0x00000001  0x00000000  ""
    1  0x00010001  0xD202EF8D  "\x00"
    2  0x00020001  0x41D912FF  "\x00\x00"
    3  0x00030001  0xFF41D912  "\x00\x00\x00"
    4  0x00040001  0x2144DF1C  "\x00\x00\x00\x00"
    5  0x00050001  0xC622F71D  "\x00\x00\x00\x00\x00"
    6  0x00060001  0xB1C2A1A3  "\x00\x00\x00\x00\x00\x00"
    7  0x00070001  0x9D6CDF7E  "\x00\x00\x00\x00\x00\x00\x00"

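A table of this shape can be regenerated with Go's standard `hash/adler32` and
`hash/crc32` packages. The sketch below is not the linked program; it assumes
the CRC-32 column uses the IEEE polynomial (`crc32.ChecksumIEEE`), and the
helper name `zeroHashes` is hypothetical.

```go
package main

import (
	"fmt"
	"hash/adler32"
	"hash/crc32"
)

// zeroHashes (a hypothetical name) returns the Adler-32 and CRC-32 (IEEE)
// checksums of a string of n zero bytes.
func zeroHashes(n int) (uint32, uint32) {
	src := make([]byte, n)
	return adler32.Checksum(src), crc32.ChecksumIEEE(src)
}

func main() {
	fmt.Println("N  Adler-32    CRC-32      Input")
	for n := 0; n < 8; n++ {
		a, c := zeroHashes(n)
		// %q renders the zero bytes as escaped "\x00" sequences.
		fmt.Printf("%d  0x%08X  0x%08X  %q\n", n, a, c, string(make([]byte, n)))
	}
}
```
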
Adler-32 is a simpler algorithm than CRC-32. At the time Adler-32 was invented,
it had noticeably higher throughput. With modern SIMD implementations, that
performance difference has largely disappeared.


## Worked Example

A worked example for calculating the Adler-32 hash of the three-byte input
"Hi\n", starting from the initial state `(s1 = 1)` and `(s2 = 0)`:

    src[i]  ((s2 << 16) | s1)
    ----    0x00000001
    0x48    0x00490049
    0x69    0x00FB00B2
    0x0A    0x01B700BC
