1# LZW (Lempel Ziv Welch) Compression 2 3LZW is a general purpose compression algorithm, not specific to GIF, PDF, TIFF 4or even to graphics. In practice, there are two incompatible implementations, 5LSB (Least Significant Bits) and MSB (Most Significant Bits) first. 6 7The GIF format uses LSB first. The literal width can vary between 2 and 8 bits, 8inclusive. There is no EarlyChange option. 9 10The PDF and TIFF formats use MSB first. The literal width is fixed at 8 bits. 11There is an EarlyChange option that PDF sometimes uses (see Table 3.7 in the 12[PDF 1.4 13spec](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf)) 14and TIFF always uses. 15 16 17# Codes 18 19An LZW encoding is a stream of codes, each code emitting zero or more bytes. 20There are four types of codes: literals, copies, clear and end. 21 22Literal codes emit exactly one byte. For a given literal width L, there are (1 23<< L) literal codes. For example, if L is at least 7, then the 0x41 code emits 24the ASCII 'A' byte. If L is exactly 7, the highest literal code is 0x7F. 25 26There is exactly one clear code and one end code, whose encoded values are 27right after the literal codes: CLEAR = ((1 << L) + 0) and END = ((1 << L) + 1). 28For 8 bit wide literals, CLEAR is 0x100 and END is 0x101. The clear code is 29discussed below. The end code means no more data. Both emit zero bytes. 30 31Copy codes (with encoded values ranging from (END + 1) to 0xFFF, or 32equivalently, up to 4095 or up to 12 bits) emit two or more bytes, bytes that 33were previously emitted in the decoded stream. This is essentially a lookup in 34a key-value table. For example, the table could state that the 0x200 code emits 35"foo" and the 0x300 code emits "bar". 36 37Valid copy codes are in the range [END + 1, N], where N starts at END (i.e. the 38table's key range starts empty) and grows over time, up to a maximum value of 390xFFF. A clear code resets N to its starting value, END. In effect, a clear 40code clears the key-value table. 41 42There is no explicit way to "increase N" or "add this key-value to the table". 43Instead, as long as the table isn't full (i.e. that N < 0xFFF), every literal 44or copy code, other than the first one in the stream or the first one after a 45clear code, implicitly adds a key-value pair (with a key of N) and then 46increments N by 1. 47 48The first literal or copy code in the stream (or after a clear code) merely 49increments N to (END + 1) without adding a key-value pair. In practice, it is 50equivalent to have that first code set a fake key-value pair for N == END, 51provided that table lookup algorithm ensures that "END means an end code" takes 52precendence. 53 54 55## Codes Example 56 57The value in each key-value pair consists of a prefix (being 1 or more bytes) 58and a suffix (exactly 1 byte). The prefix equals the output emitted by the 59previous code: the one that was processed immediately before the current code. 60The suffix is the first byte of the output emitted by the current code. 61 62For example, if a literal T code (which emits "T") is followed by a literal O 63(which emits "O"), then the value in the implicit key-value pair is the entire 64"T" concatenated with the first byte of "O", so the value is "TO". 65 66For example, if a copy 0x104 code (which emits "BE") is followed by a copy 670x106 code (which emits "OR"), then the value is "BEO". 68 69For example, if a copy 0x10E code (which emits "TOBE") is followed by a literal 70Y (which emits "Y"), then the value is "TOBEY". 71 72It is possible for a copy code to equal N, the very key to be implicitly added 73to the table. The prefix is, once again, the output emitted by the previous 74code. The suffix byte is, once again, the first byte of the output emitted by 75the current code, which in this case is the first byte of the prefix. 76 77For example, if the previous code emitted "OTX", then a copy code of N would 78both emit "OTXO" and add a (key=N, value="OXTO") pair to the table. 79 80One possible encoding of "TOBEORNOTTOBEORTOBEORNOTXOTXOTXOOTXOOOTXOOOTOBEY" is 81in the table below. With dots to punctuate each code's emitted output, that is 82"T.O.B.E.O.R.N.O.T.TO.BE.OR.TOB.EO.RN.OT.X.OTX.OTXO.OTXOO.OTXOOO.TOBE.Y". 83 84``` 85 Code Emits Key Value Pre1+Suf1 LM1 /Q %Q PreQ+SufQ 86 T T 87 O O 0x102 TO T O 1 0 1 - TO. 88 B B 0x103 OB O B 1 0 1 - OB. 89 E E 0x104 BE B E 1 0 1 - BE. 90 O O 0x105 EO E O 1 0 1 - EO. 91 R R 0x106 OR O R 1 0 1 - OR. 92 N N 0x107 RN R N 1 0 1 - RN. 93 O O 0x108 NO N O 1 0 1 - NO. 94 T T 0x109 OT O T 1 0 1 - OT. 950x102 TO 0x10A TT T T 1 0 1 - TT. 960x104 BE 0x10B TOB 0x102 B 2 0 2 - TOB 970x106 OR 0x10C BEO 0x104 O 2 0 2 - BEO 980x10B TOB 0x10D ORT 0x106 T 2 0 2 - ORT 990x105 EO 0x10E TOBE 0x10B E 3 1 0 0x10B E.. 1000x107 RN 0x10F EOR 0x105 R 2 0 2 - EOR 1010x109 OT 0x110 RNO 0x107 O 2 0 2 - RNO 102 X X 0x111 OTX 0x109 X 2 0 2 - OTX 1030x111 OTX 0x112 XO X O 1 0 1 - XO. 1040x113 OTXO 0x113 OTXO 0x111 O 3 1 0 0x111 O.. 1050x114 OTXOO 0x114 OTXOO 0x113 O 4 1 1 0x111 OO. 1060x115 OTXOOO 0x115 OTXOOO 0x114 O 5 1 2 0x111 OOO 1070x10E TOBE 0x116 OTXOOOT 0x115 T 6 2 0 0x115 T.. 108 Y Y 0x117 TOBEY 0x10E Y 4 1 1 0x10B EY. 1090x101 -end- 110``` 111 112See `script/print-lzw-example.go` for the code that generated that table, 113including three implementations of a simplified core of the LZW algorithm. 114 115The first four columns (Code, Emits, Key, Value) are discussed above. The 116remaining columns are discussed below. 117 118 119## Prefix+Suffix Representation 120 121A naive implementation of the key-value table, with the complete value stored 122contiguously for each key, would have quadratic worst case memory requirements, 123as each successive value can be up to 1 byte longer than the previous longest. 124 125In contrast, the Pre1+Suf1 columns are a compact (3 bytes per key-value pair 126for a uint16 prefix and uint8 suffix) representation of the values. Each value 127is the concatenation of a variable length prefix (stored as a table key) and a 1281-byte suffix. For example, the value for key 0x10C is "BE" + "O", and "BE" is 129the value for the key 0x104. 130 131Note that the Pre1 column is exactly the same as the Code column, shifted 132vertically by one row. Note also that the Suf1 column is exactly the same as 133the final byte of the Value column. 134 135To reconstruct a code's value, start with the suffix and work backwards, 136walking the prefix chain producing one byte at a time. Either reconstruct the 137value in an intermediate buffer and then once fully reconstructed (and its 138length is known) copy it to the output buffer, or walk the chain twice (the 139first pass calculates the value length, the second pass reconstructs the value 140directly in the output buffer, skipping a copy) or store each value's length 141along with its prefix and suffix (an additional uint16 stored per key-value 142pair) and reconstruct the value directly in the output buffer in one pass. 143 144Some of the computations are slightly easier, avoiding multiple "+1"s and "-1"s 145in the program, if we store the "length minus 1" of each value, not the length. 146This is the LM1 column. Note that the LM1 of all literal codes are 0. 147 148 149## Multi-Byte Suffixes 150 151It can be faster (albeit with higher memory requirements) to store up-to-Q-byte 152suffixes instead of 1-byte suffixes, for some positive integer Q: the quantum 153of bytes per read or write. In effect, changing Q (usually at compile time, not 154at run time) is a classic performance versus memory trade-off. 155 156On modern CPUs that can do unaligned 32 or 64 bit loads and stores, a Q of 4 or 1578 can perform very well. To keep this worked example short and simple, we use a 158Q of 3, meaning that we store 1, 2 or 3 byte suffixes per key, and the prefix 159length is a multiple of 3. For example: 160 161 - The value "BE" is the "" prefix and the 2-byte "BE" suffix. 162 - The value "TOB" is the "" prefix and the 3-byte "TOB" suffix. 163 - The value "OTXOOOT" is the "OTXOOO" prefix and the 1-byte "T" suffix. 164 165The number of steps (or equivalently, Q-byte chunks) in the prefix chain is 166(LM1 / Q), which is the "/Q" column in the example above. 167 168The suffix length is ((LM1 % Q) + 1). The first part, (LM1 % Q), is the "%Q" 169column in the example above. When Q is 8, the "/Q" and "%Q" operations are 170simple bitwise operations, such as ">> 3" or "& 7". When Q is 1 (a degenerate 171case), the "%Q" column is all zeroes and all suffixes are 1 byte long, which is 172unsurprisingly equivalent to the 1-byte suffix representation described above. 173 174When copying the suffix to a buffer (either an intermediate buffer or a final 175buffer), it is often unnecessary to copy *exactly* suffix length bytes, 176provided that there is enough destination buffer space for all Q bytes. Even if 177we write excess bytes, decoding subsequent codes will overwrite the excess, or 178else we'll hit an end code and the excess will be ignored. Again, on modern 179CPUs, it can be faster to write *at least 7* bytes (i.e. *exactly 8*) than to 180write *exactly 7*. 181 182Even so, the (LM1 % Q) value is still necessary to calculate where to start 183writing those Q-byte chunk. After that, reconstructing the value proceeds in 184the same way as in the 1-byte suffix algorithm. It just involves larger chunks. 185For example, the value for the key 0x116 is the suffix "T" preceded by the 186value for 0x115, which is the suffix "OOO" preceded by the value for 0x111, 187which is the suffix "OTX" with no prefix: the value is "OTX"+"OOO"+"T". 188 189The PreQ column is more complicated than the Pre1 column. It is no longer just 190a vertically shifted Code column, and can now contain "no prefix" values (or 191equivalently, the "/Q" column is 0). When inserting a new key-value pair, we 192only set the N'th prefix equal to the previously seen code (the Code column 193shifted by one row) if the suffix for that previous code was a full Q bytes (or 194equivalently, if the current row's "%Q" value is 0). Otherwise, we copy the 195previous code's prefix and suffix, and then extend the suffix with one more 196byte. For example, the PreQ+SufQ value for the 0x117 key is based on the 197PreQ+SufQ value (the same PreQ, an extended SufQ) of the previously seen code, 198namely 0x10E. 199 200 201# Varying Bit Widths 202 203The description above has discussed an LZW stream as a stream of codes. In the 204actual wire format, those codes are packed into a stream of bytes, and a code 205can straddle byte boundaries. 206 207Specifically, the next code in the stream takes B bits, where B is the smallest 208number of that bits that can distinguish all valid codes: those codes in the 209range [0, N]. If N is 0x101, B is 9. If N is 0x3FF, B is 10 bits. If N is 2100x400, B is 11. Remember that each successive code increments N, up to a 211maximum N of 0xFFF, or 12 bits. 212 213The literal width can also vary in the range [2..8] for LZW as used by GIF. In 214that case, B starts off as 1 more than the literal width. For example, with a 215literal width of 2, the four literal codes are 0x0, 0x1, 0x2 and 0x3, which are 216followed by a clear code of 0x4 and an end code of 0x5, so N is 0x5 and B is 3. 217 218 219## Early Change 220 221Unfortunately, some implementations misinterpreted the algorithm, and 222incremented B *before* instead of *after* emitting a bit-packed code for N == 223((1 << B) - 1). Accordingly, LZW as used in TIFF and sometimes in PDF 224(depending on the EarlyChange bit) increment B one iteration earlier. The 225maximum value of B is still 12. 226 227This misinterpretation is unfixable due to backwards compatibility, 228 229 230## LSB and MSB 231 232The bits are then packed into bytes in a predetermined ordering. For GIF, the 233bits are packed Least Significant Bits first. For PDF and TIFF, Most 234Significant Bits first. With a 8 bit literal width (and therefore a 9 bit 235initial code width), the two literal codes T (0x54) and O (0x4F) followed by an 236end code 0x101 would be encoded as 27 bits, or 4 bytes (with 5 padding bits, 237which are ignored but are conventionally set to zero). 238 239In LSB first, it would be four bytes, 0x54 0x9E 0x04 0x04: 240 241 offset xoffset ASCII hex binary 242 000000 0x0000 T 0x54 0b_0101_0100 243 000001 0x0001 . 0x9E 0b_1001_1110 244 000002 0x0002 . 0x04 0b_0000_0100 245 000003 0x0003 . 0x04 0b_0000_0100 246 247These four bytes contain three 9-bit codes: 248 249 binary width code 250 0b_...._...._...._...0_0101_0100 9 0x0054 251 0b_...._...._...._..00_1001_111. 9 0x004F 252 0b_...._...._...._.100_0000_01.. 9 0x0101 253 254In MSB first, it would again be four bytes, 0x2A 0x13 0xE0 0x20: 255 256 offset xoffset ASCII hex binary 257 000000 0x0000 * 0x2A 0b_0010_1010 258 000001 0x0001 . 0x13 0b_0001_0011 259 000002 0x0002 . 0xE0 0b_1110_0000 260 000003 0x0003 0x20 0b_0010_0000 261 262Again, these four bytes contain three 9-bit codes: 263 264 binary width code 265 0b_0010_1010_0..._...._...._.... 9 0x0054 266 0b_.001_0011_11.._...._...._.... 9 0x004F 267 0b_..10_0000_001._...._...._.... 9 0x0101 268 269See `std/gif/README.md` for a complete worked example of LSB-first encoded LZW 270data, including a varying B width. 271 272 273# More Wire Format Examples 274 275See `test/data/artificial/gif-*.commentary.txt` 276