1# LZW (Lempel Ziv Welch) Compression
2
3LZW is a general purpose compression algorithm, not specific to GIF, PDF, TIFF
4or even to graphics. In practice, there are two incompatible implementations,
5LSB (Least Significant Bits) and MSB (Most Significant Bits) first.
6
7The GIF format uses LSB first. The literal width can vary between 2 and 8 bits,
8inclusive. There is no EarlyChange option.
9
10The PDF and TIFF formats use MSB first. The literal width is fixed at 8 bits.
11There is an EarlyChange option that PDF sometimes uses (see Table 3.7 in the
12[PDF 1.4
13spec](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf))
14and TIFF always uses.
15
16
17# Codes
18
19An LZW encoding is a stream of codes, each code emitting zero or more bytes.
20There are four types of codes: literals, copies, clear and end.
21
22Literal codes emit exactly one byte. For a given literal width L, there are (1
23<< L) literal codes. For example, if L is at least 7, then the 0x41 code emits
24the ASCII 'A' byte. If L is exactly 7, the highest literal code is 0x7F.
25
26There is exactly one clear code and one end code, whose encoded values are
27right after the literal codes: CLEAR = ((1 << L) + 0) and END = ((1 << L) + 1).
28For 8 bit wide literals, CLEAR is 0x100 and END is 0x101. The clear code is
29discussed below. The end code means no more data. Both emit zero bytes.
30
31Copy codes (with encoded values ranging from (END + 1) to 0xFFF, or
32equivalently, up to 4095 or up to 12 bits) emit two or more bytes, bytes that
33were previously emitted in the decoded stream. This is essentially a lookup in
34a key-value table. For example, the table could state that the 0x200 code emits
35"foo" and the 0x300 code emits "bar".
36
37Valid copy codes are in the range [END + 1, N], where N starts at END (i.e. the
38table's key range starts empty) and grows over time, up to a maximum value of
390xFFF. A clear code resets N to its starting value, END. In effect, a clear
40code clears the key-value table.
41
42There is no explicit way to "increase N" or "add this key-value to the table".
43Instead, as long as the table isn't full (i.e. that N < 0xFFF), every literal
44or copy code, other than the first one in the stream or the first one after a
45clear code, implicitly adds a key-value pair (with a key of N) and then
46increments N by 1.
47
48The first literal or copy code in the stream (or after a clear code) merely
49increments N to (END + 1) without adding a key-value pair. In practice, it is
50equivalent to have that first code set a fake key-value pair for N == END,
51provided that table lookup algorithm ensures that "END means an end code" takes
52precendence.
53
54
55## Codes Example
56
57The value in each key-value pair consists of a prefix (being 1 or more bytes)
58and a suffix (exactly 1 byte). The prefix equals the output emitted by the
59previous code: the one that was processed immediately before the current code.
60The suffix is the first byte of the output emitted by the current code.
61
62For example, if a literal T code (which emits "T") is followed by a literal O
63(which emits "O"), then the value in the implicit key-value pair is the entire
64"T" concatenated with the first byte of "O", so the value is "TO".
65
66For example, if a copy 0x104 code (which emits "BE") is followed by a copy
670x106 code (which emits "OR"), then the value is "BEO".
68
69For example, if a copy 0x10E code (which emits "TOBE") is followed by a literal
70Y (which emits "Y"), then the value is "TOBEY".
71
72It is possible for a copy code to equal N, the very key to be implicitly added
73to the table. The prefix is, once again, the output emitted by the previous
74code. The suffix byte is, once again, the first byte of the output emitted by
75the current code, which in this case is the first byte of the prefix.
76
77For example, if the previous code emitted "OTX", then a copy code of N would
78both emit "OTXO" and add a (key=N, value="OXTO") pair to the table.
79
80One possible encoding of "TOBEORNOTTOBEORTOBEORNOTXOTXOTXOOTXOOOTXOOOTOBEY" is
81in the table below. With dots to punctuate each code's emitted output, that is
82"T.O.B.E.O.R.N.O.T.TO.BE.OR.TOB.EO.RN.OT.X.OTX.OTXO.OTXOO.OTXOOO.TOBE.Y".
83
84```
85 Code   Emits    Key    Value   Pre1+Suf1  LM1  /Q  %Q   PreQ+SufQ
86    T       T
87    O       O  0x102       TO      T O       1   0   1      - TO.
88    B       B  0x103       OB      O B       1   0   1      - OB.
89    E       E  0x104       BE      B E       1   0   1      - BE.
90    O       O  0x105       EO      E O       1   0   1      - EO.
91    R       R  0x106       OR      O R       1   0   1      - OR.
92    N       N  0x107       RN      R N       1   0   1      - RN.
93    O       O  0x108       NO      N O       1   0   1      - NO.
94    T       T  0x109       OT      O T       1   0   1      - OT.
950x102      TO  0x10A       TT      T T       1   0   1      - TT.
960x104      BE  0x10B      TOB  0x102 B       2   0   2      - TOB
970x106      OR  0x10C      BEO  0x104 O       2   0   2      - BEO
980x10B     TOB  0x10D      ORT  0x106 T       2   0   2      - ORT
990x105      EO  0x10E     TOBE  0x10B E       3   1   0  0x10B E..
1000x107      RN  0x10F      EOR  0x105 R       2   0   2      - EOR
1010x109      OT  0x110      RNO  0x107 O       2   0   2      - RNO
102    X       X  0x111      OTX  0x109 X       2   0   2      - OTX
1030x111     OTX  0x112       XO      X O       1   0   1      - XO.
1040x113    OTXO  0x113     OTXO  0x111 O       3   1   0  0x111 O..
1050x114   OTXOO  0x114    OTXOO  0x113 O       4   1   1  0x111 OO.
1060x115  OTXOOO  0x115   OTXOOO  0x114 O       5   1   2  0x111 OOO
1070x10E    TOBE  0x116  OTXOOOT  0x115 T       6   2   0  0x115 T..
108    Y       Y  0x117    TOBEY  0x10E Y       4   1   1  0x10B EY.
1090x101   -end-
110```
111
112See `script/print-lzw-example.go` for the code that generated that table,
113including three implementations of a simplified core of the LZW algorithm.
114
115The first four columns (Code, Emits, Key, Value) are discussed above. The
116remaining columns are discussed below.
117
118
119## Prefix+Suffix Representation
120
121A naive implementation of the key-value table, with the complete value stored
122contiguously for each key, would have quadratic worst case memory requirements,
123as each successive value can be up to 1 byte longer than the previous longest.
124
125In contrast, the Pre1+Suf1 columns are a compact (3 bytes per key-value pair
126for a uint16 prefix and uint8 suffix) representation of the values. Each value
127is the concatenation of a variable length prefix (stored as a table key) and a
1281-byte suffix. For example, the value for key 0x10C is "BE" + "O", and "BE" is
129the value for the key 0x104.
130
131Note that the Pre1 column is exactly the same as the Code column, shifted
132vertically by one row. Note also that the Suf1 column is exactly the same as
133the final byte of the Value column.
134
135To reconstruct a code's value, start with the suffix and work backwards,
136walking the prefix chain producing one byte at a time. Either reconstruct the
137value in an intermediate buffer and then once fully reconstructed (and its
138length is known) copy it to the output buffer, or walk the chain twice (the
139first pass calculates the value length, the second pass reconstructs the value
140directly in the output buffer, skipping a copy) or store each value's length
141along with its prefix and suffix (an additional uint16 stored per key-value
142pair) and reconstruct the value directly in the output buffer in one pass.
143
144Some of the computations are slightly easier, avoiding multiple "+1"s and "-1"s
145in the program, if we store the "length minus 1" of each value, not the length.
146This is the LM1 column. Note that the LM1 of all literal codes are 0.
147
148
149## Multi-Byte Suffixes
150
151It can be faster (albeit with higher memory requirements) to store up-to-Q-byte
152suffixes instead of 1-byte suffixes, for some positive integer Q: the quantum
153of bytes per read or write. In effect, changing Q (usually at compile time, not
154at run time) is a classic performance versus memory trade-off.
155
156On modern CPUs that can do unaligned 32 or 64 bit loads and stores, a Q of 4 or
1578 can perform very well. To keep this worked example short and simple, we use a
158Q of 3, meaning that we store 1, 2 or 3 byte suffixes per key, and the prefix
159length is a multiple of 3. For example:
160
161  - The value "BE" is the "" prefix and the 2-byte "BE" suffix.
162  - The value "TOB" is the "" prefix and the 3-byte "TOB" suffix.
163  - The value "OTXOOOT" is the "OTXOOO" prefix and the 1-byte "T" suffix.
164
165The number of steps (or equivalently, Q-byte chunks) in the prefix chain is
166(LM1 / Q), which is the "/Q" column in the example above.
167
168The suffix length is ((LM1 % Q) + 1). The first part, (LM1 % Q), is the "%Q"
169column in the example above. When Q is 8, the "/Q" and "%Q" operations are
170simple bitwise operations, such as ">> 3" or "& 7". When Q is 1 (a degenerate
171case), the "%Q" column is all zeroes and all suffixes are 1 byte long, which is
172unsurprisingly equivalent to the 1-byte suffix representation described above.
173
174When copying the suffix to a buffer (either an intermediate buffer or a final
175buffer), it is often unnecessary to copy *exactly* suffix length bytes,
176provided that there is enough destination buffer space for all Q bytes. Even if
177we write excess bytes, decoding subsequent codes will overwrite the excess, or
178else we'll hit an end code and the excess will be ignored. Again, on modern
179CPUs, it can be faster to write *at least 7* bytes (i.e. *exactly 8*) than to
180write *exactly 7*.
181
182Even so, the (LM1 % Q) value is still necessary to calculate where to start
183writing those Q-byte chunk. After that, reconstructing the value proceeds in
184the same way as in the 1-byte suffix algorithm. It just involves larger chunks.
185For example, the value for the key 0x116 is the suffix "T" preceded by the
186value for 0x115, which is the suffix "OOO" preceded by the value for 0x111,
187which is the suffix "OTX" with no prefix: the value is "OTX"+"OOO"+"T".
188
189The PreQ column is more complicated than the Pre1 column. It is no longer just
190a vertically shifted Code column, and can now contain "no prefix" values (or
191equivalently, the "/Q" column is 0). When inserting a new key-value pair, we
192only set the N'th prefix equal to the previously seen code (the Code column
193shifted by one row) if the suffix for that previous code was a full Q bytes (or
194equivalently, if the current row's "%Q" value is 0). Otherwise, we copy the
195previous code's prefix and suffix, and then extend the suffix with one more
196byte. For example, the PreQ+SufQ value for the 0x117 key is based on the
197PreQ+SufQ value (the same PreQ, an extended SufQ) of the previously seen code,
198namely 0x10E.
199
200
201# Varying Bit Widths
202
203The description above has discussed an LZW stream as a stream of codes. In the
204actual wire format, those codes are packed into a stream of bytes, and a code
205can straddle byte boundaries.
206
207Specifically, the next code in the stream takes B bits, where B is the smallest
208number of that bits that can distinguish all valid codes: those codes in the
209range [0, N]. If N is 0x101, B is 9. If N is 0x3FF, B is 10 bits. If N is
2100x400, B is 11. Remember that each successive code increments N, up to a
211maximum N of 0xFFF, or 12 bits.
212
213The literal width can also vary in the range [2..8] for LZW as used by GIF. In
214that case, B starts off as 1 more than the literal width. For example, with a
215literal width of 2, the four literal codes are 0x0, 0x1, 0x2 and 0x3, which are
216followed by a clear code of 0x4 and an end code of 0x5, so N is 0x5 and B is 3.
217
218
219## Early Change
220
221Unfortunately, some implementations misinterpreted the algorithm, and
222incremented B *before* instead of *after* emitting a bit-packed code for N ==
223((1 << B) - 1). Accordingly, LZW as used in TIFF and sometimes in PDF
224(depending on the EarlyChange bit) increment B one iteration earlier. The
225maximum value of B is still 12.
226
227This misinterpretation is unfixable due to backwards compatibility,
228
229
230## LSB and MSB
231
232The bits are then packed into bytes in a predetermined ordering. For GIF, the
233bits are packed Least Significant Bits first. For PDF and TIFF, Most
234Significant Bits first. With a 8 bit literal width (and therefore a 9 bit
235initial code width), the two literal codes T (0x54) and O (0x4F) followed by an
236end code 0x101 would be encoded as 27 bits, or 4 bytes (with 5 padding bits,
237which are ignored but are conventionally set to zero).
238
239In LSB first, it would be four bytes, 0x54 0x9E 0x04 0x04:
240
241    offset  xoffset ASCII   hex     binary
242    000000  0x0000  T       0x54    0b_0101_0100
243    000001  0x0001  .       0x9E    0b_1001_1110
244    000002  0x0002  .       0x04    0b_0000_0100
245    000003  0x0003  .       0x04    0b_0000_0100
246
247These four bytes contain three 9-bit codes:
248
249    binary                              width   code
250    0b_...._...._...._...0_0101_0100     9      0x0054
251    0b_...._...._...._..00_1001_111.     9      0x004F
252    0b_...._...._...._.100_0000_01..     9      0x0101
253
254In MSB first, it would again be four bytes, 0x2A 0x13 0xE0 0x20:
255
256    offset  xoffset ASCII   hex     binary
257    000000  0x0000  *       0x2A    0b_0010_1010
258    000001  0x0001  .       0x13    0b_0001_0011
259    000002  0x0002  .       0xE0    0b_1110_0000
260    000003  0x0003          0x20    0b_0010_0000
261
262Again, these four bytes contain three 9-bit codes:
263
264    binary                              width   code
265    0b_0010_1010_0..._...._...._....     9      0x0054
266    0b_.001_0011_11.._...._...._....     9      0x004F
267    0b_..10_0000_001._...._...._....     9      0x0101
268
269See `std/gif/README.md` for a complete worked example of LSB-first encoded LZW
270data, including a varying B width.
271
272
273# More Wire Format Examples
274
275See `test/data/artificial/gif-*.commentary.txt`
276