[Info-ZIP note, 981119: this file is based on PKWARE's appnote.txt of
 15 February 1996, taking into account PKWARE's revised appnote.txt version
 of 01 September 1998. It has been unofficially corrected and extended by
 Info-ZIP without explicit permission by PKWARE. Although Info-ZIP
 believes the information to be accurate and complete, it is provided
 under a disclaimer similar to the PKWARE disclaimer below, differing
 only in the substitution of "Info-ZIP" for "PKWARE". In other words,
 use this information at your own risk, but we think it's correct.

 Specification info from PKWARE that was obviously wrong has been corrected
 silently (e.g. missing structure fields, wrong numbers).
 As of PKZIPW 2.50, two new incompatibilities have been introduced by PKWARE;
 they are noted below. Note that the "NTFS tag" conflict is currently not
 real; PKZIPW 2.50 actually tags NTFS files as having come from a FAT
 file system, too.]


Disclaimer
----------

Although PKWARE will attempt to supply current and accurate
information relating to its file formats, algorithms, and the
subject programs, the possibility of error can not be eliminated.
PKWARE therefore expressly disclaims any warranty that the
information contained in the associated materials relating to the
subject programs and/or the format of the files created or
accessed by the subject programs and/or the algorithms used by
the subject programs, or any other matter, is current, correct or
accurate as delivered. Any risk of damage due to any possible
inaccurate information is assumed by the user of the information.
Furthermore, the information relating to the subject programs
and/or the file formats created or accessed by the subject
programs and/or the algorithms used by the subject programs is
subject to change without notice.


General Format of a ZIP file
----------------------------

        Files stored in arbitrary order.
        Large zipfiles can span multiple diskette media.

        Overall zipfile format:

          [local file header + file data + data_descriptor] . . .
          [central directory] end of central directory record


  A.  Local file header:

        local file header signature     4 bytes  (0x04034b50)
        version needed to extract       2 bytes
        general purpose bit flag        2 bytes
        compression method              2 bytes
        last mod file time              2 bytes
        last mod file date              2 bytes
        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes
        filename length                 2 bytes
        extra field length              2 bytes

        filename (variable size)
        extra field (variable size)


  B.  Data descriptor:

        data descriptor signature       4 bytes  (0x08074b50)
        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes

      This descriptor exists only if bit 3 of the general
      purpose bit flag is set (see below). It is byte aligned
      and immediately follows the last byte of compressed data.
      This descriptor is used only when it was not possible to
      seek in the output zip file, e.g., when the output zip file
      was standard output or a non-seekable device.

  C.  Central directory structure:

        [file header] . . .  end of central dir record

        File header:

          central file header signature   4 bytes  (0x02014b50)
          version made by                 2 bytes
          version needed to extract       2 bytes
          general purpose bit flag        2 bytes
          compression method              2 bytes
          last mod file time              2 bytes
          last mod file date              2 bytes
          crc-32                          4 bytes
          compressed size                 4 bytes
          uncompressed size               4 bytes
          filename length                 2 bytes
          extra field length              2 bytes
          file comment length             2 bytes
          disk number start               2 bytes
          internal file attributes        2 bytes
          external file attributes        4 bytes
          relative offset of local header 4 bytes

          filename (variable size)
          extra field (variable size)
          file comment (variable size)

        End of central dir record:

          end of central dir signature    4 bytes  (0x06054b50)
          number of this disk             2 bytes
          number of the disk with the
          start of the central directory  2 bytes
          total number of entries in
          the central dir on this disk    2 bytes
          total number of entries in
          the central dir                 2 bytes
          size of the central directory   4 bytes
          offset of start of central
          directory with respect to
          the starting disk number        4 bytes
          zipfile comment length          2 bytes
          zipfile comment (variable size)


  D.  Explanation of fields:

      version made by (2 bytes)

          The upper byte indicates the host system (OS) for the
          file. Software can use this information to determine
          the line record format for text files etc.
          The current mappings are:

          0 - FAT file system (DOS, OS/2, NT) + PKZIPW 2.50 VFAT, NTFS
          1 - Amiga
          2 - VMS (VAX or Alpha AXP)
          3 - Unix
          4 - VM/CMS
          5 - Atari
          6 - HPFS file system (OS/2, NT 3.x)
          7 - Macintosh
          8 - Z-System
          9 - CP/M
          10 - TOPS-20 [supposedly PKZIPW 2.50 NTFS]
          11 - NTFS file system (NT) [used by Info-ZIP, only]
          12 - SMS/QDOS
          13 - Acorn RISC OS
          14 - VFAT file system (Win95, NT) [Info-ZIP reservation, unused]
          15 - MVS
          16 - BeOS (BeBox or PowerMac)
          17 - Tandem
          18 thru 255 - unused

          The lower byte indicates the version number of the
          software used to encode the file. The value/10
          indicates the major version number, and the value
          mod 10 is the minor version number.

      version needed to extract (2 bytes)

          The minimum software version needed to extract the
          file, mapped as above.

      general purpose bit flag: (2 bytes)

          Bit 0: If set, indicates that the file is encrypted.

          (For Method 6 - Imploding)
          Bit 1: If the compression method used was type 6,
                 Imploding, then this bit, if set, indicates
                 an 8K sliding dictionary was used. If clear,
                 then a 4K sliding dictionary was used.
          Bit 2: If the compression method used was type 6,
                 Imploding, then this bit, if set, indicates
                 that 3 Shannon-Fano trees were used to encode the
                 sliding dictionary output. If clear, then 2
                 Shannon-Fano trees were used.

          (For Method 8 - Deflating)
          Bit 2  Bit 1
            0      0    Normal (-en) compression option was used.
            0      1    Maximum (-ex) compression option was used.
            1      0    Fast (-ef) compression option was used.
            1      1    Super Fast (-es) compression option was used.

          Note:  Bits 1 and 2 are undefined if the compression
                 method is any other.

          Bit 3: If this bit is set, the fields crc-32, compressed size
                 and uncompressed size are set to zero in the local
                 header.
                 The correct values are put in the data descriptor
                 immediately following the compressed data. (Note: PKZIP
                 version 2.04g for DOS only recognizes this bit for method 8
                 compression, newer versions of PKZIP recognize this bit
                 for any compression method.)
                [Info-ZIP note: This bit was introduced by PKZIP 2.04 for
                 DOS. In general, this feature can only be reliably used
                 together with compression methods that allow intrinsic
                 detection of the "end-of-compressed-data" condition. From
                 the set of compression methods described in this Zip archive
                 specification, only "deflate" meets this requirement.
                 Especially, the method STORED does not work!
                 The Info-ZIP tools recognize this bit regardless of the
                 compression method; but, they rely on correctly set
                 "compressed size" information in the central directory entry.]

          Bit 5: If this bit is set, this indicates that the file is compressed
                 patched data. (Note: Requires PKZIP version 2.70 or greater)

          The upper three bits are reserved and used internally
          by the software when processing the zipfile. The
          remaining bits are unused.

      compression method: (2 bytes)

          (see accompanying documentation for algorithm
          descriptions)

          0 - The file is stored (no compression)
          1 - The file is Shrunk
          2 - The file is Reduced with compression factor 1
          3 - The file is Reduced with compression factor 2
          4 - The file is Reduced with compression factor 3
          5 - The file is Reduced with compression factor 4
          6 - The file is Imploded
          7 - Reserved for Tokenizing compression algorithm
          8 - The file is Deflated
          9 - Reserved for enhanced Deflating
          10 - PKWARE Data Compression Library Imploding

      date and time fields: (2 bytes each)

          The date and time are encoded in standard MS-DOS format.
          If input came from standard input, the date and time are
          those at which compression was started for this data.
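
          As an illustration of the "standard MS-DOS format" mentioned
          above, the following Python sketch unpacks the two 16-bit fields.
          The bit layout it assumes (time: hours in bits 11-15, minutes in
          bits 5-10, seconds/2 in bits 0-4; date: year-1980 in bits 9-15,
          month in bits 5-8, day in bits 0-4) is the conventional MS-DOS/FAT
          layout and is not spelled out in this document; the function name
          decode_dos_datetime is hypothetical.

```python
def decode_dos_datetime(dos_date, dos_time):
    """Unpack 16-bit MS-DOS date and time words (assumed FAT bit layout).

    Returns (year, month, day, hour, minute, second). No range checking
    is done, so a zeroed field yields month 0 / day 0.
    """
    year = ((dos_date >> 9) & 0x7F) + 1980   # bits 9-15: years since 1980
    month = (dos_date >> 5) & 0x0F           # bits 5-8
    day = dos_date & 0x1F                    # bits 0-4
    hour = (dos_time >> 11) & 0x1F           # bits 11-15
    minute = (dos_time >> 5) & 0x3F          # bits 5-10
    second = (dos_time & 0x1F) * 2           # bits 0-4, 2-second resolution
    return (year, month, day, hour, minute, second)
```

          Because seconds are stored divided by two, odd second values
          cannot be represented in these fields.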

      CRC-32: (4 bytes)

          The CRC-32 algorithm was generously contributed by
          David Schwaderer and can be found in his excellent
          book "C Programmers Guide to NetBIOS" published by
          Howard W. Sams & Co. Inc. The 'magic number' for
          the CRC is 0xdebb20e3. The proper CRC pre and post
          conditioning is used, meaning that the CRC register
          is pre-conditioned with all ones (a starting value
          of 0xffffffff) and the value is post-conditioned by
          taking the one's complement of the CRC residual.
          If bit 3 of the general purpose flag is set, this
          field is set to zero in the local header and the correct
          value is put in the data descriptor and in the central
          directory.

      compressed size: (4 bytes)
      uncompressed size: (4 bytes)

          The size of the file compressed and uncompressed,
          respectively. If bit 3 of the general purpose bit flag
          is set, these fields are set to zero in the local header
          and the correct values are put in the data descriptor and
          in the central directory.

      filename length: (2 bytes)
      extra field length: (2 bytes)
      file comment length: (2 bytes)

          The length of the filename, extra field, and comment
          fields respectively. The combined length of any
          directory record and these three fields should not
          generally exceed 65,535 bytes. If input came from standard
          input, the filename length is set to zero.

          [Info-ZIP note:
           This feature is not yet supported by any PKWARE version of ZIP
           (at least not in PKZIP for DOS and PKZIP for Windows/WinNT).
           The Info-ZIP programs handle standard input differently:
           If input came from standard input, the filename is set to "-"
           (length one).]


      disk number start: (2 bytes)

          The number of the disk on which this file begins.

      internal file attributes: (2 bytes)

          The lowest bit of this field indicates, if set, that
          the file is apparently an ASCII or text file.
          If not set, that the file apparently contains binary data.
          The remaining bits are unused in version 1.0.

      external file attributes: (4 bytes)

          The mapping of the external attributes is
          host-system dependent (see 'version made by'). For
          MS-DOS, the low order byte is the MS-DOS directory
          attribute byte. If input came from standard input, this
          field is set to zero.

      relative offset of local header: (4 bytes)

          This is the offset from the start of the first disk on
          which this file appears, to where the local header should
          be found.

      filename: (Variable)

          The name of the file, with optional relative path.
          The path stored should not contain a drive or
          device letter, or a leading slash. All slashes
          should be forward slashes '/' as opposed to
          backwards slashes '\' for compatibility with Amiga
          and Unix file systems etc. If input came from standard
          input, there is no filename field.
          [Info-ZIP discrepancy:
           If input came from standard input, the file name is set
           to "-" (without the quotes).
           As far as we know, the PKWARE specification for "input from
           stdin" is not supported by PKZIP/PKUNZIP for DOS, OS/2, Windows
           or Windows NT.]

      extra field: (Variable)

          This is for future expansion. If additional information
          needs to be stored in the future, it should be stored
          here. Earlier versions of the software can then safely
          skip this file, and find the next file or header. This
          field will be 0 length in version 1.0.

          In order to allow different programs and different types
          of information to be stored in the 'extra' field in .ZIP
          files, the following structure should be used for all
          programs storing data in this field:

                  header1+data1 + header2+data2 . . .

          Each header should consist of:

                  Header ID - 2 bytes
                  Data Size - 2 bytes

          Note: all fields stored in Intel low-byte/high-byte order.
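
          The header1+data1 + header2+data2 layout above can be walked
          with a loop along the lines of the following Python sketch. The
          helper name iter_extra_blocks is hypothetical, and raising an
          error on a truncated or overlong block is an assumption of this
          sketch, not something the specification prescribes.

```python
import struct

def iter_extra_blocks(extra):
    """Yield (header_id, data) for each block in a raw extra field.

    Each block starts with a 2-byte Header ID and a 2-byte Data Size,
    both little-endian (Intel order), followed by Data Size bytes.
    """
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        pos += 4
        if pos + size > len(extra):
            # Assumption: treat a block that overruns the field as an error.
            raise ValueError("extra field block overruns the field")
        yield header_id, extra[pos:pos + size]
        pos += size
    if pos != len(extra):
        # Assumption: 1-3 leftover bytes cannot form a header; reject them.
        raise ValueError("trailing bytes in extra field")
```

          A reader that only cares about one block type can compare
          header_id against the tag it knows and ignore every other block.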

          The Header ID field indicates the type of data that is in
          the following data block.

          Header ID's of 0 thru 31 are reserved for use by PKWARE.
          The remaining ID's can be used by third party vendors for
          proprietary usage.

          The current Header ID mappings defined by PKWARE are:

          0x0007        AV Info
          0x0009        OS/2 extended attributes      (also Info-ZIP)
          0x000a        PKWARE Win95/WinNT FileTimes  [undocumented!]
          0x000c        PKWARE VAX/VMS                (also Info-ZIP)
          0x000d        PKWARE Unix
          0x000f        Patch Descriptor

          The Header ID mappings defined by Info-ZIP and third parties are:

          0x07c8        Info-ZIP Macintosh (old, J. Lee)
          0x2605        ZipIt Macintosh (first version)
          0x2705        ZipIt Macintosh v 1.3.5 and newer (w/o full filename)
          0x334d        Info-ZIP Macintosh (new, D. Haase's 'Mac3' field )
          0x4341        Acorn/SparkFS (David Pilling)
          0x4453        Windows NT security descriptor (binary ACL)
          0x4704        VM/CMS
          0x470f        MVS
          0x4b46        FWKCS MD5 (third party, see below)
          0x4c41        OS/2 access control list (text ACL)
          0x4d49        Info-ZIP VMS (VAX or Alpha)
          0x5356        AOS/VS (binary ACL)
          0x5455        extended timestamp
          0x5855        Info-ZIP Unix (original; also OS/2, NT, etc.)
          0x6542        BeOS (BeBox, PowerMac, etc.)
          0x756e        ASi Unix
          0x7855        Info-ZIP Unix (new)
          0xfb4a        SMS/QDOS

          The Data Size field indicates the size of the following
          data block. Programs can use this value to skip to the
          next header block, passing over any data blocks that are
          not of interest.

          Note: As stated above, the size of the entire .ZIP file
                header, including the filename, comment, and extra
                field should not exceed 64K in size.

          In case two different programs should appropriate the same
          Header ID value, it is strongly recommended that each
          program place a unique signature of at least two bytes in
          size (and preferably 4 bytes or bigger) at the start of
          each data area. Every program should verify that its
          unique signature is present, in addition to the Header ID
          value being correct, before assuming that it is a block of
          known type.

          In the following descriptions, note that "Short" means two bytes,
          "Long" means four bytes, and "Long-Long" means eight bytes,
          regardless of their native sizes. Unless specifically noted, all
          integer fields should be interpreted as unsigned (non-negative)
          numbers.


         -OS/2 Extended Attributes Extra Field:
          ====================================

          The following is the layout of the OS/2 extended attributes "extra"
          block. (Last Revision 19960922)

          Note: all fields stored in Intel low-byte/high-byte order.

          Local-header version:

          Value         Size        Description
          -----         ----        -----------
  (OS/2)  0x0009        Short       tag for this extra block type
          TSize         Short       total data size for this block
          BSize         Long        uncompressed EA data size
          CType         Short       compression type
          EACRC         Long        CRC value for uncompressed EA data
          (var.)        variable    compressed EA data

          Central-header version:

          Value         Size        Description
          -----         ----        -----------
  (OS/2)  0x0009        Short       tag for this extra block type
          TSize         Short       total data size for this block
          BSize         Long        size of uncompressed local EA data

          The value of CType is interpreted according to the "compression
          method" section above; i.e., 0 for stored, 8 for deflated, etc.

          The OS/2 extended attribute structure (FEA2LIST) is compressed and
          then stored in its entirety within this structure. There will only
          ever be one block of data in the variable-length field.


         -OS/2 Access Control List Extra Field:
          ====================================

          The following is the layout of the OS/2 ACL extra block.
          (Last Revision 19960922)

          Local-header version:

          Value         Size        Description
          -----         ----        -----------
  (ACL)   0x4c41        Short       tag for this extra block type
          TSize         Short       total data size for this block
          BSize         Long        uncompressed ACL data size
          CType         Short       compression type
          EACRC         Long        CRC value for uncompressed ACL data
          (var.)        variable    compressed ACL data

          Central-header version:

          Value         Size        Description
          -----         ----        -----------
  (ACL)   0x4c41        Short       tag for this extra block type
          TSize         Short       total data size for this block
          BSize         Long        size of uncompressed local ACL data

          The value of CType is interpreted according to the "compression
          method" section above; i.e., 0 for stored, 8 for deflated, etc.

          The uncompressed ACL data consist of a text header of the form
          "ACL1:%hX,%hd\n", where the first field is the OS/2 ACCINFO acc_attr
          member and the second is acc_count, followed by acc_count strings
          of the form "%s,%hx\n", where the first field is acl_ugname (user
          group name) and the second acl_access. This block type will be
          extended for other operating systems as needed.


         -Windows NT Security Descriptor Extra Field:
          ==========================================

          The following is the layout of the NT Security Descriptor (another
          type of ACL) extra block. (Last Revision 19960922)

          Local-header version:

          Value         Size        Description
          -----         ----        -----------
  (SD)    0x4453        Short       tag for this extra block type
          TSize         Short       total data size for this block
          BSize         Long        uncompressed SD data size
          Version       Byte        version of uncompressed SD data format
          CType         Short       compression type
          EACRC         Long        CRC value for uncompressed SD data
          (var.)        variable    compressed SD data

          Central-header version:

          Value         Size        Description
          -----         ----        -----------
  (SD)    0x4453        Short       tag for this extra block type
          TSize         Short       total data size for this block
          BSize         Long        size of uncompressed local SD data

          The value of CType is interpreted according to the "compression
          method" section above; i.e., 0 for stored, 8 for deflated, etc.
          Version specifies how the compressed data are to be interpreted
          and allows for future expansion of this extra field type. Currently
          only version 0 is defined.

          For version 0, the compressed data are to be interpreted as a single
          valid Windows NT SECURITY_DESCRIPTOR data structure, in self-relative
          format.


         -PKWARE Win95/WinNT Extra Field:
          ==============================

          The following description covers PKWARE's undocumented
          Windows 95 & Windows NT extra field, introduced with the
          release of PKZIP for Windows 2.50. (Last Revision 19980425)

          This field has a fixed data size of 32 bytes and is only stored
          as a local extra field.

          Value         Size        Description
          -----         ----        -----------
  (WinNT) 0x000a        Short       Tag for this "extra" block type
          TSize         Short       Total Data Size for this block
          Unknwn1       Long        ???? (all 0 ?)
          Unknwn2       Long        ????
          ModTime       Long-Long   64-bit NTFS last-modified filetime
          AccTime       Long-Long   64-bit NTFS last-access filetime
          CreTime       Long-Long   64-bit NTFS creation filetime

          The NTFS filetimes are 64-bit unsigned integers, stored in Intel
          (least significant byte first) byte order. They determine the
          number of 1.0E-07 seconds (1/10th microseconds!) past WinNT "epoch",
          which is "01-Jan-1601 00:00:00 UTC".


         -PKWARE VAX/VMS Extra Field:
          ==========================

          The following is the layout of PKWARE's VAX/VMS attributes "extra"
          block. (Last Revision 12/17/91)

          Note: all fields stored in Intel low-byte/high-byte order.

          Value         Size        Description
          -----         ----        -----------
  (VMS)   0x000c        Short       Tag for this "extra" block type
          TSize         Short       Total Data Size for this block
          CRC           Long        32-bit CRC for remainder of the block
          Tag1          Short       VMS attribute tag value #1
          Size1         Short       Size of attribute #1, in bytes
          (var.)        Size1       Attribute #1 data
          .
          .
          .
          TagN          Short       VMS attribute tag value #N
          SizeN         Short       Size of attribute #N, in bytes
          (var.)        SizeN       Attribute #N data

          Rules:

          1. There will be one or more attributes present, which will
             each be preceded by the above TagX & SizeX values. These
             values are identical to the ATR$C_XXXX and ATR$S_XXXX constants
             which are defined in ATR.H under VMS C. Neither of these values
             will ever be zero.

          2. No word alignment or padding is performed.

          3. A well-behaved PKZIP/VMS program should never produce more than
             one sub-block with the same TagX value. Also, there will never
             be more than one "extra" block of type 0x000c in a particular
             directory record.


         -Info-ZIP VMS Extra Field:
          ========================

          The following is the layout of Info-ZIP's VMS attributes extra
          block for VAX or Alpha AXP. The local-header and central-header
          versions are identical. (Last Revision 19960922)

          Value         Size        Description
          -----         ----        -----------
  (VMS2)  0x4d49        Short       tag for this extra block type
          TSize         Short       total data size for this block
          ID            Long        block ID
          Flags         Short       info bytes
          BSize         Short       uncompressed block size
          Reserved      Long        (reserved)
          (var.)        variable    compressed VMS file-attributes block

          The block ID is one of the following unterminated strings:

                "VFAB"          struct FAB
                "VALL"          struct XABALL
                "VFHC"          struct XABFHC
                "VDAT"          struct XABDAT
                "VRDT"          struct XABRDT
                "VPRO"          struct XABPRO
                "VKEY"          struct XABKEY
                "VMSV"          version (e.g., "V6.1"; truncated at hyphen)
                "VNAM"          reserved

          The lower three bits of Flags indicate the compression method. The
          currently defined methods are:

                0       stored (not compressed)
                1       simple "RLE"
                2       deflated

          The "RLE" method simply replaces zero-valued bytes with zero-valued
          bits and non-zero-valued bytes with a "1" bit followed by the byte
          value.

          The variable-length compressed data contains only the data
          corresponding to the indicated structure or string. Typically
          multiple VMS2 extra fields are present (each with a unique block
          type).


         -Info-ZIP Macintosh Extra Field:
          ==============================

          The following is the layout of the (old) Info-ZIP resource-fork extra
          block for Macintosh. The local-header and central-header versions
          are identical. (Last Revision 19960922)

          Value         Size        Description
          -----         ----        -----------
  (Mac)   0x07c8        Short       tag for this extra block type
          TSize         Short       total data size for this block
          "JLEE"        beLong      extra-field signature
          FInfo         16 bytes    Macintosh FInfo structure
          CrDat         beLong      HParamBlockRec fileParam.ioFlCrDat
          MdDat         beLong      HParamBlockRec fileParam.ioFlMdDat
          Flags         beLong      info bits
          DirID         beLong      HParamBlockRec fileParam.ioDirID
          VolName       28 bytes    volume name (optional)

          All fields but the first two are in native Macintosh format
          (big-endian Motorola order, not little-endian Intel). The least
          significant bit of Flags is 1 if the file is a data fork, 0
          otherwise.
          In addition, if this extra field is present, the filename
          has an extra 'd' or 'r' appended to indicate data fork or resource
          fork. The 28-byte VolName field may be omitted.


         -ZipIt Macintosh Extra Field (long):
          ==================================

          The following is the layout of the ZipIt extra block for Macintosh.
          The local-header and central-header versions are identical.
          (Last Revision 19970130)

          Value         Size        Description
          -----         ----        -----------
  (Mac2)  0x2605        Short       tag for this extra block type
          TSize         Short       total data size for this block
          "ZPIT"        beLong      extra-field signature
          FnLen         Byte        length of FileName
          FileName      variable    full Macintosh filename
          FileType      Byte[4]     four-byte Mac file type string
          Creator       Byte[4]     four-byte Mac creator string


         -ZipIt Macintosh Extra Field (short):
          ===================================

          The following is the layout of a shortened variant of the
          ZipIt extra block for Macintosh (without "full name" entry).
          This variant is used by ZipIt 1.3.5 and newer for entries that
          do not need a "full Mac filename" record.
          The local-header and central-header versions are identical.
          (Last Revision 19980903)

          Value         Size        Description
          -----         ----        -----------
  (Mac2b) 0x2705        Short       tag for this extra block type
          TSize         Short       total data size for this block
          "ZPIT"        beLong      extra-field signature
          FileType      Byte[4]     four-byte Mac file type string
          Creator       Byte[4]     four-byte Mac creator string


         -Info-ZIP Macintosh Extra Field (new):
          ====================================

          The following is the layout of the (new) Info-ZIP extra
          block for Macintosh, designed by Dirk Haase.
          All values are in little-endian.
          (Last Revision 19981005)

          Local-header version:

          Value         Size        Description
          -----         ----        -----------
  (Mac3)  0x334d        Short       tag for this extra block type ("M3")
          TSize         Short       total data size for this block
          BSize         Long        uncompressed finder attribute data size
          Flags         Short       info bits
          fdType        Byte[4]     Type of the File (4-byte string)
          fdCreator     Byte[4]     Creator of the File (4-byte string)
          (CType)       Short       compression type
          (CRC)         Long        CRC value for uncompressed MacOS data
          Attribs       variable    finder attribute data (see below)


          Central-header version:

          Value         Size        Description
          -----         ----        -----------
  (Mac3)  0x334d        Short       tag for this extra block type ("M3")
          TSize         Short       total data size for this block
          BSize         Long        uncompressed finder attribute data size
          Flags         Short       info bits
          fdType        Byte[4]     Type of the File (4-byte string)
          fdCreator     Byte[4]     Creator of the File (4-byte string)

          The third bit of Flags in both headers indicates whether
          the LOCAL extra field is uncompressed (and therefore whether CType
          and CRC are omitted):

          Bits of the Flags:
              bit 0           if set, file is a data fork; otherwise unset
              bit 1           if set, filename will not be changed
              bit 2           if set, Attribs is uncompressed (no CType, CRC)
              bit 3           if set, date and times are in 64 bit
                              if zero date and times are in 32 bit.
              bit 4           if set, timezone offsets fields for the native
                              Mac times are omitted (UTC support deactivated)
              bits 5-15       reserved


          Attributes:

          Attribs is a Mac-specific block of data in little-endian format with
          the following structure (if compressed, uncompress it first):

          Value           Size          Description
          -----           ----          -----------
          fdFlags         Short         Finder Flags
          fdLocation.v    Short         Finder Icon Location
          fdLocation.h    Short         Finder Icon Location
          fdFldr          Short         Folder containing file

          FXInfo          16 bytes      Macintosh FXInfo structure
            FXInfo-Structure:
                fdIconID        Short
                fdUnused[3]     Short       unused but reserved 6 bytes
                fdScript        Byte        Script flag and number
                fdXFlags        Byte        More flag bits
                fdComment       Short       Comment ID
                fdPutAway       Long        Home Dir ID

          FVersNum        Byte          file version number
                                        may not be used by MacOS
          ACUser          Byte          directory access rights

          FlCrDat         ULong         date and time of creation
          FlMdDat         ULong         date and time of last modification
          FlBkDat         ULong         date and time of last backup
            These time numbers are original Mac FileTime values (local time!).
            Currently, date-time width is 32-bit, but future versions may
            support 64-bit times (see flags).

          CrGMTOffs       Long(signed!) difference "local Creat. time - UTC"
          MdGMTOffs       Long(signed!) difference "local Modif. time - UTC"
          BkGMTOffs       Long(signed!) difference "local Backup time - UTC"
            These "local time - UTC" differences (stored in seconds) may be
            used to support timestamp adjustment after inter-timezone transfer.
            These fields are optional; bit 4 of the flags word controls their
            presence.

          Charset         Short         TextEncodingBase (Charset)
                                        valid for the following two fields

          FullPath        variable      Path of the current file.
                                        Zero terminated string (C-String)
                                        Currently coded in the native Charset.

          Comment         variable      Finder Comment of the current file.
                                        Zero terminated string (C-String)
                                        Currently coded in the native Charset.


         -Acorn SparkFS Extra Field:
          =========================

          The following is the layout of David Pilling's SparkFS extra block
          for Acorn RISC OS. The local-header and central-header versions are
          identical. (Last Revision 19960922)

          Value         Size        Description
          -----         ----        -----------
  (Acorn) 0x4341        Short       tag for this extra block type
          TSize         Short       total data size for this block
          "ARC0"        Long        extra-field signature
          LoadAddr      Long        load address or file type
          ExecAddr      Long        exec address
          Attr          Long        file permissions
          Zero          Long        reserved; always zero

          The following bits of Attr are associated with the given file
          permissions:

                bit 0           user-writable ('W')
                bit 1           user-readable ('R')
                bit 2           reserved
                bit 3           locked ('L')
                bit 4           publicly writable ('w')
                bit 5           publicly readable ('r')
                bit 6           reserved
                bit 7           reserved


         -VM/CMS Extra Field:
          ==================

          The following is the layout of the file-attributes extra block for
          VM/CMS. The local-header and central-header versions are
          identical. (Last Revision 19960922)

          Value         Size        Description
          -----         ----        -----------
 (VM/CMS) 0x4704        Short       tag for this extra block type
          TSize         Short       total data size for this block
          flData        variable    file attributes data

          flData is an uncompressed fldata_t struct.


         -MVS Extra Field:
          ===============

          The following is the layout of the file-attributes extra block for
          MVS. The local-header and central-header versions are identical.
          (Last Revision 19960922)

          Value         Size        Description
          -----         ----        -----------
  (MVS)   0x470f        Short       tag for this extra block type
          TSize         Short       total data size for this block
          flData        variable    file attributes data

          flData is an uncompressed fldata_t struct.


         -PKWARE Unix Extra Field:
          ========================

          The following is the layout of PKWARE's Unix "extra" block.
          It was introduced with the release of PKZIP for Unix 2.50.
          Note: all fields are stored in Intel low-byte/high-byte order.
          (Last Revision 19980901)

          This field has a minimum data size of 12 bytes and is only stored
          as a local extra field.

          Value         Size        Description
          -----         ----        -----------
  (Unix0) 0x000d        Short       Tag for this "extra" block type
          TSize         Short       Total Data Size for this block
          AcTime        Long        time of last access (UTC/GMT)
          ModTime       Long        time of last modification (UTC/GMT)
          UID           Short       Unix user ID
          GID           Short       Unix group ID
          (var)         variable    Variable length data field

          The variable length data field will contain file type
          specific data. Currently the only values allowed are
          the original "linked to" file names for hard or symbolic links.

          The fixed part of this field has the same layout as Info-ZIP's
          abandoned "Unix1 timestamps & owner ID info" extra field;
          only the two tag bytes are different.


         -PATCH Descriptor Extra Field:
          ============================

          The following is the layout of the Patch Descriptor "extra"
          block.

          Note: all fields stored in Intel low-byte/high-byte order.

          Value         Size        Description
          -----         ----        -----------
  (Patch) 0x000f        Short       Tag for this "extra" block type
          TSize         Short       Size of the total "extra" block
          Version       Short       Version of the descriptor
          Flags         Long        Actions and reactions (see below)
          OldSize       Long        Size of the file about to be patched
          OldCRC        Long        32-bit CRC of the file about to be patched
          NewSize       Long        Size of the resulting file
          NewCRC        Long        32-bit CRC of the resulting file


          Actions and reactions

          Bits          Description
          ----          ----------------
          0             Use for autodetection
          1             Treat as selfpatch
          2-3           RESERVED
          4-5           Action (see below)
          6-7           RESERVED
          8-9           Reaction (see below) to absent file
          10-11         Reaction (see below) to newer file
          12-13         Reaction (see below) to unknown file
          14-15         RESERVED
          16-31         RESERVED

          Actions

          Action       Value
          ------       -----
          none         0
          add          1
          delete       2
          patch        3

          Reactions

          Reaction     Value
          --------     -----
          ask          0
          skip         1
          ignore       2
          fail         3


         -Extended Timestamp Extra Field:
          ==============================

          The following is the layout of the extended-timestamp extra block.
          (Last Revision 19970118)

          Local-header version:

          Value         Size        Description
          -----         ----        -----------
   (time) 0x5455        Short       tag for this extra block type
          TSize         Short       total data size for this block
          Flags         Byte        info bits
          (ModTime)     Long        time of last modification (UTC/GMT)
          (AcTime)      Long        time of last access (UTC/GMT)
          (CrTime)      Long        time of original creation (UTC/GMT)

          Central-header version:

          Value         Size        Description
          -----         ----        -----------
   (time) 0x5455        Short       tag for this extra block type
          TSize         Short       total data size for this block
          Flags         Byte        info bits (refers to local header!)
          (ModTime)     Long        time of last modification (UTC/GMT)

          The central-header extra field contains the modification time only,
          or no timestamp at all.
          TSize is used to flag its presence or
          absence.  But note:

              If "Flags" indicates that ModTime is present in the local header
              field, it MUST be present in the central header field, too!
              This correspondence is required because the modification time
              value may be used to support trans-timezone freshening and
              updating operations with zip archives.

          The time values are in standard Unix signed-long format, indicating
          the number of seconds since 1 January 1970 00:00:00.  The times
          are relative to Coordinated Universal Time (UTC), also sometimes
          referred to as Greenwich Mean Time (GMT).  To convert to local time,
          the software must know the local timezone offset from UTC/GMT.

          The lower three bits of Flags in both headers indicate which
          timestamps are present in the LOCAL extra field:

                bit 0           if set, modification time is present
                bit 1           if set, access time is present
                bit 2           if set, creation time is present
                bits 3-7        reserved for additional timestamps; not set

          Those times that are present will appear in the order indicated, but
          any combination of times may be omitted.  (Creation time may be
          present without access time, for example.)  TSize should equal
          (1 + 4*(number of set bits in Flags)), as the block is currently
          defined.  Other timestamps may be added in the future.


         -Info-ZIP Unix Extra Field (type 1):
          ==================================

          The following is the layout of the old Info-ZIP extra block for
          Unix.  It has been replaced by the extended-timestamp extra block
          (0x5455) and the Unix type 2 extra block (0x7855).
          (Last Revision 19970118)

          Local-header version:

          Value         Size        Description
          -----         ----        -----------
  (Unix1) 0x5855        Short       tag for this extra block type
          TSize         Short       total data size for this block
          AcTime        Long        time of last access (UTC/GMT)
          ModTime       Long        time of last modification (UTC/GMT)
          UID           Short       Unix user ID
          GID           Short       Unix group ID

          Central-header version:

          Value         Size        Description
          -----         ----        -----------
  (Unix1) 0x5855        Short       tag for this extra block type
          TSize         Short       total data size for this block
          AcTime        Long        time of last access (GMT/UTC)
          ModTime       Long        time of last modification (GMT/UTC)

          The file access and modification times are in standard Unix
          signed-long format, indicating the number of seconds since 1 January
          1970 00:00:00.  The times are relative to Coordinated Universal Time
          (UTC), also sometimes referred to as Greenwich Mean Time (GMT).  To
          convert to local time, the software must know the local timezone
          offset from UTC/GMT.  The modification time may be used by non-Unix
          systems to support inter-timezone freshening and updating of zip
          archives.

          The local-header extra block may optionally contain UID and GID
          info for the file.  The local-header TSize value is the only
          indication of this.  Note that Unix UIDs and GIDs are usually
          specific to a particular machine, and they generally require root
          access to restore.

          This extra field type is obsolete, but it has been in use since
          mid-1994.  Therefore future archiving software should continue to
          support it.  Some guidelines:

              An archive member should either contain the old "Unix1"
              extra field block or the new extra field types "time" and/or
              "Unix2".

              If both the old "Unix1" block type and one or both of the new
              block types "time" and "Unix2" are found, the "Unix1" block
              should be considered invalid and ignored.

              Unarchiving software should recognize both old and new extra
              field block types, but the info from new types overrides the
              old "Unix1" field.

              Archiving software should recognize "Unix1" extra fields for
              timestamp comparison but never create it for updated, freshened
              or new archive members.  When copying existing members to a new
              archive, any "Unix1" extra field blocks should be converted to
              the new "time" and/or "Unix2" types.


         -Info-ZIP Unix Extra Field (type 2):
          ==================================

          The following is the layout of the new Info-ZIP extra block for
          Unix.  (Last Revision 19960922)

          Local-header version:

          Value         Size        Description
          -----         ----        -----------
  (Unix2) 0x7855        Short       tag for this extra block type
          TSize         Short       total data size for this block
          UID           Short       Unix user ID
          GID           Short       Unix group ID

          Central-header version:

          Value         Size        Description
          -----         ----        -----------
  (Unix2) 0x7855        Short       tag for this extra block type
          TSize         Short       total data size for this block

          The data size of the central-header version is zero; it is used
          solely as a flag that UID/GID info is present in the local-header
          extra field.  If additional fields are ever added to the local
          version, the central version may be extended to indicate this.

          Note that Unix UIDs and GIDs are usually specific to a particular
          machine, and they generally require root access to restore.


         -ASi Unix Extra Field:
          ====================

          The following is the layout of the ASi extra block for Unix.  The
          local-header and central-header versions are identical.
          (Last Revision 19960916)

          Value         Size        Description
          -----         ----        -----------
  (Unix3) 0x756e        Short       tag for this extra block type
          TSize         Short       total data size for this block
          CRC           Long        CRC-32 of the remaining data
          Mode          Short       file permissions
          SizDev        Long        symlink'd size OR major/minor dev num
          UID           Short       user ID
          GID           Short       group ID
          (var.)        variable    symbolic link filename

          Mode is the standard Unix st_mode field from struct stat, containing
          user/group/other permissions, setuid/setgid and symlink info, etc.

          If Mode indicates that this file is a symbolic link, SizDev is the
          size of the file to which the link points.  Otherwise, if the file
          is a device, SizDev contains the standard Unix st_rdev field from
          struct stat (includes the major and minor numbers of the device).
          SizDev is undefined in other cases.

          If Mode indicates that the file is a symbolic link, the final field
          will be the name of the file to which the link points.  The
          filename length can be inferred from TSize.

          [Note that TSize may incorrectly refer to the data size not counting
           the CRC; i.e., it may be four bytes too small.]


         -BeOS Extra Field:
          ================

          The following is the layout of the file-attributes extra block for
          BeOS.
          (Last Revision 19970531)

          Local-header version:

          Value         Size        Description
          -----         ----        -----------
   (BeOS) 0x6542        Short       tag for this extra block type
          TSize         Short       total data size for this block
          BSize         Long        uncompressed file attribute data size
          Flags         Byte        info bits
          (CType)       Short       compression type
          (CRC)         Long        CRC value for uncompressed file attribs
          Attribs       variable    file attribute data

          Central-header version:

          Value         Size        Description
          -----         ----        -----------
   (BeOS) 0x6542        Short       tag for this extra block type
          TSize         Short       total data size for this block
          BSize         Long        size of uncompressed local EF block data
          Flags         Byte        info bits

          The least significant bit of Flags in both headers indicates whether
          the LOCAL extra field is uncompressed (and therefore whether CType
          and CRC are omitted):

                bit 0           if set, Attribs is uncompressed (no CType, CRC)
                bits 1-7        reserved; if set, assume error or unknown data

          Currently the only supported compression types are deflated (type 8)
          and stored (type 0); the latter is not used by Info-ZIP's Zip but is
          supported by UnZip.

          Attribs is a BeOS-specific block of data in big-endian format with
          the following structure (if compressed, uncompress it first):

              Value     Size        Description
              -----     ----        -----------
              Name      variable    attribute name (null-terminated string)
              Type      Long        attribute type (32-bit unsigned integer)
              Size      Long Long   data size for this sub-block (64 bits)
              Data      variable    attribute data

          The attribute structure is repeated for every attribute.  The Data
          field may contain anything--text, flags, bitmaps, etc.


         -SMS/QDOS Extra Field:
          ====================

          The following is the layout of the file-attributes extra block for
          SMS/QDOS.  The local-header and central-header versions are identical.
          (Last Revision 19960929)

          Value         Size        Description
          -----         ----        -----------
   (QDOS) 0xfb4a        Short       tag for this extra block type
          TSize         Short       total data size for this block
          LongID        Long        extra-field signature
          (ExtraID)     Long        additional signature/flag bytes
          QDirect       64 bytes    qdirect structure

          LongID may be "QZHD" or "QDOS".  In the latter case, ExtraID will
          be present.  Its first three bytes are "02\0"; the last byte is
          currently undefined.

          QDirect contains the file's uncompressed directory info (qdirect
          struct).  Its elements are in native (big-endian) format:

          d_length      beLong      file length
          d_access      byte        file access type
          d_type        byte        file type
          d_datalen     beLong      data length
          d_reserved    beLong      unused
          d_szname      beShort     size of filename
          d_name        36 bytes    filename
          d_update      beLong      time of last update
          d_refdate     beLong      file version number
          d_backup      beLong      time of last backup (archive date)


         -AOS/VS Extra Field:
          ==================

          The following is the layout of the extra block for Data General
          AOS/VS.  The local-header and central-header versions are identical.
          (Last Revision 19961125)

          Value         Size        Description
          -----         ----        -----------
  (AOSVS) 0x5356        Short       tag for this extra block type
          TSize         Short       total data size for this block
          "FCI\0"       Long        extra-field signature
          Version       Byte        version of AOS/VS extra block (10 = 1.0)
          Fstat         variable    fstat packet
          AclBuf        variable    raw ACL data ($MXACL bytes)

          Fstat contains the file's uncompressed fstat packet, which is one of
          the following:

                normal fstat packet             (P_FSTAT struct)
                DIR/CPD fstat packet            (P_FSTAT_DIR struct)
                unit (device) fstat packet      (P_FSTAT_UNIT struct)
                IPC file fstat packet           (P_FSTAT_IPC struct)

          AclBuf contains the raw ACL data; its length is $MXACL.
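
          Every extra block laid out above starts with the same two Short
          fields, a tag and TSize, stored low byte first.  The following is a
          minimal Python sketch of walking an extra field on that basis and
          picking apart the fixed part of an AOS/VS block.  The names
          iter_extra_blocks and parse_aosvs are hypothetical and not part of
          any existing tool; because the lengths of Fstat and AclBuf depend
          on the packet type and on $MXACL, which are not given here, the
          sketch returns them undivided.

```python
import struct

AOSVS_TAG = 0x5356  # tag value taken from the AOS/VS table above


def iter_extra_blocks(extra):
    """Yield (tag, data) for each block in a raw extra field.

    Assumes every block starts with a little-endian Short tag followed by
    a little-endian Short TSize giving the length of the data that follows.
    """
    pos = 0
    while pos < len(extra):
        if pos + 4 > len(extra):
            raise ValueError("truncated extra block header")
        tag, tsize = struct.unpack_from("<HH", extra, pos)
        pos += 4
        if pos + tsize > len(extra):
            raise ValueError("extra block data runs past the end of the field")
        yield tag, extra[pos:pos + tsize]
        pos += tsize


def parse_aosvs(data):
    """Split an AOS/VS block into its version and the undivided remainder."""
    if data[:4] != b"FCI\0":
        raise ValueError("missing AOS/VS extra-field signature")
    if len(data) < 5:
        raise ValueError("AOS/VS block too short for the Version byte")
    # The Version byte is a decimal-coded number: 10 means version 1.0.
    version = divmod(data[4], 10)
    # Fstat and AclBuf follow; their split is left to the caller (assumption).
    return {"version": version, "rest": data[5:]}
```

          A caller would typically loop over iter_extra_blocks(), compare each
          tag with the values in the tables above, and skip any tag it does
          not recognize.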


         -FWKCS MD5 Extra Field:
          =====================

          The following is the layout of the optional extra block used by the
          FWKCS utility.  There is no local-header version; the following
          applies only to the central header.  (Last Revision 19961207)

          Central-header version:

          Value         Size        Description
          -----         ----        -----------
    (MD5) 0x4b46        Short       tag for this extra block type
          TSize         Short       total data size for this block (19)
          "MD5"         3 bytes     extra-field signature
          MD5hash       16 bytes    128-bit MD5 hash of uncompressed data

          The MD5 hash in this extra block is used to automatically identify
          files independent of their filenames; it is an enhanced
          contents-signature.

          FWKCS provides an option to strip this extra field, if
          present, from a zipfile central directory.  In adding
          this extra field, FWKCS preserves Zipfile Authenticity
          Verification; if stripping this extra field, FWKCS
          preserves all versions of AV through PKZIP version 2.04g.

          ``The MD5 algorithm is being placed in the public domain for review
          and possible adoption as a standard.''  (Ron Rivest, MIT Laboratory
          for Computer Science and RSA Data Security, Inc., April 1992, RFC
          1321, 11.76-77).  FWKCS, and FWKCS Contents_Signature System, are
          trademarks of Frederick W. Kantor.



      file comment: (Variable)

          The comment for this file.

      number of this disk: (2 bytes)

          The number of this disk, which contains central
          directory end record.

      number of the disk with the start of the central directory: (2 bytes)

          The number of the disk on which the central
          directory starts.

      total number of entries in the central dir on this disk: (2 bytes)

          The number of central directory entries on this disk.

      total number of entries in the central dir: (2 bytes)

          The total number of files in the zipfile.


      size of the central directory: (4 bytes)

          The size (in bytes) of the entire central directory.

      offset of start of central directory with respect to
      the starting disk number:  (4 bytes)

          Offset of the start of the central directory on the
          disk on which the central directory starts.

      zipfile comment length: (2 bytes)

          The length of the comment for this zipfile.

      zipfile comment: (Variable)

          The comment for this zipfile.


  D.  General notes:

      1)  All fields unless otherwise noted are unsigned and stored
          in Intel low-byte:high-byte, low-word:high-word order.

      2)  String fields are not null terminated, since the
          length is given explicitly.

      3)  Local headers should not span disk boundaries.  Also, even
          though the central directory can span disk boundaries, no
          single record in the central directory should be split
          across disks.

      4)  The entries in the central directory may not necessarily
          be in the same order that files appear in the zipfile.

UnShrinking - Method 1
----------------------

Shrinking is a Dynamic Ziv-Lempel-Welch compression algorithm
with partial clearing.  The initial code size is 9 bits, and
the maximum code size is 13 bits.  Shrinking differs from
conventional Dynamic Ziv-Lempel-Welch implementations in several
respects:

1)  The code size is controlled by the compressor, and is not
    automatically increased when codes larger than the current
    code size are created (but not necessarily used).  When
    the decompressor encounters the code sequence 256
    (decimal) followed by 1, it should increase the code size
    read from the input stream to the next bit size.
    No blocking of the codes is performed, so the next code at
    the increased size should be read from the input stream
    immediately after where the previous code at the smaller
    bit size was read.  Again, the decompressor should not
    increase the code size used until the sequence 256,1 is
    encountered.

2)  When the table becomes full, total clearing is not
    performed.  Rather, when the compressor emits the code
    sequence 256,2 (decimal), the decompressor should clear
    all leaf nodes from the Ziv-Lempel tree, and continue to
    use the current code size.  The nodes that are cleared
    from the Ziv-Lempel tree are then re-used, with the lowest
    code value re-used first, and the highest code value
    re-used last.  The compressor can emit the sequence 256,2
    at any time.



Expanding - Methods 2-5
-----------------------

The Reducing algorithm is actually a combination of two
distinct algorithms.  The first algorithm compresses repeated
byte sequences, and the second algorithm takes the compressed
stream from the first algorithm and applies a probabilistic
compression method.

The probabilistic compression stores an array of 'follower
sets' S(j), for j=0 to 255, corresponding to each possible
ASCII character.  Each set contains between 0 and 32
characters, to be denoted as S(j)[0],...,S(j)[m], where m<32.
The sets are stored at the beginning of the data area for a
Reduced file, in reverse order, with S(255) first, and S(0)
last.

The sets are encoded as { N(j), S(j)[0],...,S(j)[N(j)-1] },
where N(j) is the size of set S(j).  N(j) can be 0, in which
case the follower set for S(j) is empty.  Each N(j) value is
encoded in 6 bits, followed by N(j) eight bit character values
corresponding to S(j)[0] to S(j)[N(j)-1] respectively.
If N(j) is 0, then no values for S(j) are stored, and the value
for N(j-1) immediately follows.

Immediately after the follower sets, is the compressed data
stream.  The compressed data stream can be interpreted for the
probabilistic decompression as follows:


let Last-Character <- 0.
loop until done
    if the follower set S(Last-Character) is empty then
        read 8 bits from the input stream, and copy this
        value to the output stream.
    otherwise if the follower set S(Last-Character) is non-empty then
        read 1 bit from the input stream.
        if this bit is not zero then
            read 8 bits from the input stream, and copy this
            value to the output stream.
        otherwise if this bit is zero then
            read B(N(Last-Character)) bits from the input
            stream, and assign this value to I.
            Copy the value of S(Last-Character)[I] to the
            output stream.

    assign the last value placed on the output stream to
    Last-Character.
end loop


B(N(j)) is defined as the minimal number of bits required to
encode the value N(j)-1.


The decompressed stream from above can then be expanded to
re-create the original file as follows:


let State <- 0.

loop until done
    read 8 bits from the input stream into C.
    case State of
        0:  if C is not equal to DLE (144 decimal) then
                copy C to the output stream.
            otherwise if C is equal to DLE then
                let State <- 1.

        1:  if C is non-zero then
                let V <- C.
                let Len <- L(V)
                let State <- F(Len).
            otherwise if C is zero then
                copy the value 144 (decimal) to the output stream.
                let State <- 0

        2:  let Len <- Len + C
            let State <- 3.

        3:  move backwards D(V,C) bytes in the output stream
            (if this position is before the start of the output
            stream, then assume that all the data before the
            start of the output stream is filled with zeros).
            copy Len+3 bytes from this position to the output stream.
            let State <- 0.
    end case
end loop


The functions F,L, and D are dependent on the 'compression
factor', 1 through 4, and are defined as follows:

For compression factor 1:
    L(X) equals the lower 7 bits of X.
    F(X) equals 2 if X equals 127 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 1 bit of X) * 256 + Y + 1.
For compression factor 2:
    L(X) equals the lower 6 bits of X.
    F(X) equals 2 if X equals 63 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 2 bits of X) * 256 + Y + 1.
For compression factor 3:
    L(X) equals the lower 5 bits of X.
    F(X) equals 2 if X equals 31 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 3 bits of X) * 256 + Y + 1.
For compression factor 4:
    L(X) equals the lower 4 bits of X.
    F(X) equals 2 if X equals 15 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 4 bits of X) * 256 + Y + 1.


Imploding - Method 6
--------------------

The Imploding algorithm is actually a combination of two distinct
algorithms.  The first algorithm compresses repeated byte
sequences using a sliding dictionary.  The second algorithm is
used to compress the encoding of the sliding dictionary output,
using multiple Shannon-Fano trees.

The Imploding algorithm can use a 4K or 8K sliding dictionary
size.  The dictionary size used can be determined by bit 1 in the
general purpose flag word; a 0 bit indicates a 4K dictionary
while a 1 bit indicates an 8K dictionary.

The Shannon-Fano trees are stored at the start of the compressed
file.  The number of trees stored is defined by bit 2 in the
general purpose flag word; a 0 bit indicates two trees stored, a
1 bit indicates three trees are stored.
If 3 trees are stored,
the first Shannon-Fano tree represents the encoding of the
Literal characters, the second tree represents the encoding of
the Length information, the third represents the encoding of the
Distance information.  When 2 Shannon-Fano trees are stored, the
Length tree is stored first, followed by the Distance tree.

The Literal Shannon-Fano tree, if present, is used to represent
the entire ASCII character set, and contains 256 values.  This
tree is used to compress any data not compressed by the sliding
dictionary algorithm.  When this tree is present, the Minimum
Match Length for the sliding dictionary is 3.  If this tree is
not present, the Minimum Match Length is 2.

The Length Shannon-Fano tree is used to compress the Length part
of the (length,distance) pairs from the sliding dictionary
output.  The Length tree contains 64 values, ranging from the
Minimum Match Length, to 63 plus the Minimum Match Length.

The Distance Shannon-Fano tree is used to compress the Distance
part of the (length,distance) pairs from the sliding dictionary
output.  The Distance tree contains 64 values, ranging from 0 to
63, representing the upper 6 bits of the distance value.  The
distance values themselves will be between 0 and the sliding
dictionary size, either 4K or 8K.

The Shannon-Fano trees themselves are stored in a compressed
format.  The first byte of the tree data represents the number of
bytes of data representing the (compressed) Shannon-Fano tree
minus 1.  The remaining bytes represent the Shannon-Fano tree
data encoded as:

    High 4 bits: Number of values at this bit length + 1. (1 - 16)
    Low  4 bits: Bit Length needed to represent value + 1.
                                                           (1 - 16)

The Shannon-Fano codes can be constructed from the bit lengths
using the following algorithm:

1)  Sort the Bit Lengths in ascending order, while retaining the
    order of the original lengths stored in the file.

2)  Generate the Shannon-Fano trees:

    Code <- 0
    CodeIncrement <- 0
    LastBitLength <- 0
    i <- number of Shannon-Fano codes - 1   (either 255 or 63)

    loop while i >= 0
        Code = Code + CodeIncrement
        if BitLength(i) <> LastBitLength then
            LastBitLength=BitLength(i)
            CodeIncrement = 1 shifted left (16 - LastBitLength)
        ShannonCode(i) = Code
        i <- i - 1
    end loop


3)  Reverse the order of all the bits in the above ShannonCode()
    vector, so that the most significant bit becomes the least
    significant bit.  For example, the value 0x1234 (hex) would
    become 0x2C48 (hex).

4)  Restore the order of Shannon-Fano codes as originally stored
    within the file.

Example:

    This example will show the encoding of a Shannon-Fano tree
    of size 8.  Notice that the actual Shannon-Fano trees used
    for Imploding are either 64 or 256 entries in size.

Example:   0x02, 0x42, 0x01, 0x13

    The first byte indicates 3 values in this table.  Decoding the
    bytes:
            0x42 = 5 codes of 3 bits long
            0x01 = 1 code  of 2 bits long
            0x13 = 2 codes of 4 bits long

    This would generate the original bit length array of:
    (3, 3, 3, 3, 3, 2, 4, 4)

    There are 8 codes in this table for the values 0 thru 7.
    Using the
    algorithm to obtain the Shannon-Fano codes produces:

                                  Reversed     Order     Original
Val  Sorted   Constructed Code     Value      Restored    Length
---  ------   -----------------   --------    --------    ------
0:     2      1100000000000000    11          101           3
1:     3      1010000000000000    101         001           3
2:     3      1000000000000000    001         110           3
3:     3      0110000000000000    110         010           3
4:     3      0100000000000000    010         100           3
5:     3      0010000000000000    100         11            2
6:     4      0001000000000000    1000        1000          4
7:     4      0000000000000000    0000        0000          4


The values in the Val, Order Restored and Original Length columns
now represent the Shannon-Fano encoding tree that can be used for
decoding the Shannon-Fano encoded data.  How to parse the
variable length Shannon-Fano values from the data stream is beyond the
scope of this document.  (See the references listed at the end of
this document for more information.)  However, traditional decoding
schemes used for Huffman variable length decoding, such as the
Greenlaw algorithm, can be successfully applied.

The compressed data stream begins immediately after the
compressed Shannon-Fano data.  The compressed data stream can be
interpreted as follows:

loop until done
    read 1 bit from input stream.

    if this bit is non-zero then       (encoded data is literal data)
        if Literal Shannon-Fano tree is present
            read and decode character using Literal Shannon-Fano tree.
        otherwise
            read 8 bits from input stream.
        copy character to the output stream.
    otherwise                   (encoded data is sliding dictionary match)
        if 8K dictionary size
            read 7 bits for offset Distance (lower 7 bits of offset).
        otherwise
            read 6 bits for offset Distance (lower 6 bits of offset).

        using the Distance Shannon-Fano tree, read and decode the
          upper 6 bits of the Distance value.

        using the Length Shannon-Fano tree, read and decode
          the Length value.

        Length <- Length + Minimum Match Length

        if Length = 63 + Minimum Match Length
            read 8 bits from the input stream,
            add this value to Length.

        move backwards Distance+1 bytes in the output stream, and
        copy Length characters from this position to the output
        stream.  (if this position is before the start of the output
        stream, then assume that all the data before the start of
        the output stream is filled with zeros).
end loop

Tokenizing - Method 7
---------------------

This method is not used by PKZIP.

Deflating - Method 8
--------------------

The Deflate algorithm is similar to the Implode algorithm using
a sliding dictionary of up to 32K with secondary compression
from Huffman/Shannon-Fano codes.

The compressed data is stored in blocks with a header describing
the block and the Huffman codes used in the data block.  The header
format is as follows:

   Bit 0: Last Block bit     This bit is set to 1 if this is the last
                             compressed block in the data.
   Bits 1-2: Block type
      00 (0) - Block is stored - All stored data is byte aligned.
               Skip bits until next byte, then next word = block length,
               followed by the one's complement of the block length word.
               Remaining data in block is the stored data.

      01 (1) - Use fixed Huffman codes for literal and distance codes.
               Lit Code    Bits             Dist Code   Bits
               ---------   ----             ---------   ----
                 0 - 143    8                 0 - 31      5
               144 - 255    9
               256 - 279    7
               280 - 287    8

               Literal codes 286-287 and distance codes 30-31 are never
               used but participate in the Huffman construction.

      10 (2) - Dynamic Huffman codes.  (See expanding Huffman codes)

      11 (3) - Reserved - Flag an "Error in compressed data" if seen.
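
The block header layout above can be sketched in Python as follows.
This is an illustration under stated assumptions, not a reference
decoder: the names BitReader, read_block_header and
fixed_literal_lengths are hypothetical, bits are assumed to be taken
from each byte starting at the least significant bit, and the "word"
holding the stored block length is assumed to be 16 bits, low byte
first.  Only the header and a stored block are handled; the Huffman
block types are merely identified.

```python
class BitReader:
    """Hands out bits from a byte string, least significant bit first."""

    def __init__(self, data):
        self.data = data
        self.pos = 0      # index of the next byte not yet loaded
        self.bitbuf = 0   # bits loaded but not yet consumed
        self.bitcnt = 0   # number of valid bits in bitbuf

    def read(self, nbits):
        while self.bitcnt < nbits:
            if self.pos >= len(self.data):
                raise EOFError("ran out of input")
            self.bitbuf |= self.data[self.pos] << self.bitcnt
            self.pos += 1
            self.bitcnt += 8
        value = self.bitbuf & ((1 << nbits) - 1)
        self.bitbuf >>= nbits
        self.bitcnt -= nbits
        return value

    def align_to_byte(self):
        # Fewer than 8 bits are ever left over, all from the last loaded
        # byte, so dropping them moves to the next byte boundary.
        self.bitbuf = 0
        self.bitcnt = 0


def read_block_header(reader):
    """Read bit 0 (last block) and bits 1-2 (block type) of a block."""
    last = reader.read(1)
    btype = reader.read(2)
    if btype == 3:
        raise ValueError("Error in compressed data: reserved block type")
    info = {"last": bool(last), "type": btype}
    if btype == 0:
        reader.align_to_byte()
        length = reader.read(16)
        check = reader.read(16)
        if check != (~length & 0xFFFF):
            raise ValueError("stored block length fails complement check")
        info["stored"] = reader.data[reader.pos:reader.pos + length]
        reader.pos += length
    return info


def fixed_literal_lengths():
    """Bit lengths of the 288 fixed literal codes, per the table above."""
    return [8] * 144 + [9] * 112 + [7] * 24 + [8] * 8
```

For instance, the bytes 01 05 00 FA FF followed by "hello" describe a
final stored block of five bytes: bit 0 is set, the type bits are 00,
and FFFA is the one's complement of 0005.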

Expanding Huffman Codes
-----------------------
If the data block is stored with dynamic Huffman codes, the Huffman
codes are sent in the following compressed format:

   5 Bits: # of Literal codes sent - 257 (257 - 286)
           All other codes are never sent.
   5 Bits: # of Dist codes - 1           (1 - 32)
   4 Bits: # of Bit Length codes - 4     (4 - 19)

The Huffman codes are sent as bit lengths and the codes are built as
described in the implode algorithm.  The bit lengths themselves are
compressed with Huffman codes.  There are 19 bit length codes:

   0 - 15: Represent bit lengths of 0 - 15
       16: Copy the previous bit length 3 - 6 times.
           The next 2 bits indicate repeat length (0 = 3, ... ,3 = 6)
              Example:  Codes 8, 16 (+2 bits 11), 16 (+2 bits 10) will
                        expand to 12 bit lengths of 8 (1 + 6 + 5)
       17: Repeat a bit length of 0 for 3 - 10 times. (3 bits of length)
       18: Repeat a bit length of 0 for 11 - 138 times (7 bits of length)

The lengths of the bit length codes are sent packed 3 bits per value
(0 - 7) in the following order:

   16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15

The Huffman codes should be built as described in the Implode algorithm
except codes are assigned starting at the shortest bit length, i.e. the
shortest code should be all 0's rather than all 1's.  Also, codes with
a bit length of zero do not participate in the tree construction.  The
codes are then used to decode the bit lengths for the literal and distance
tables.

The bit lengths for the literal tables are sent first with the number
of entries sent described by the 5 bits sent earlier.  There are up
to 286 literal characters; the first 256 represent the respective 8
bit character, code 256 represents the End-Of-Block code, the remaining
29 codes represent copy lengths of 3 thru 258.
There are up to 30
distance codes representing distances from 1 thru 32k as described
below.

                             Length Codes
                             ------------
      Extra             Extra              Extra              Extra
 Code Bits Length  Code Bits Lengths  Code Bits Lengths  Code Bits Length(s)
 ---- ---- ------  ---- ---- -------  ---- ---- -------  ---- ---- ---------
  257   0     3     265   1   11,12    273   3   35-42    281   5  131-162
  258   0     4     266   1   13,14    274   3   43-50    282   5  163-194
  259   0     5     267   1   15,16    275   3   51-58    283   5  195-226
  260   0     6     268   1   17,18    276   3   59-66    284   5  227-257
  261   0     7     269   2   19-22    277   4   67-82    285   0    258
  262   0     8     270   2   23-26    278   4   83-98
  263   0     9     271   2   27-30    279   4   99-114
  264   0    10     272   2   31-34    280   4  115-130

                            Distance Codes
                            --------------
      Extra           Extra             Extra               Extra
 Code Bits Dist  Code Bits  Dist   Code Bits Distance  Code Bits Distance
 ---- ---- ----  ---- ---- ------  ---- ---- --------  ---- ---- -----------
   0   0    1      8   3   17-24    16    7  257-384    24   11  4097-6144
   1   0    2      9   3   25-32    17    7  385-512    25   11  6145-8192
   2   0    3     10   4   33-48    18    8  513-768    26   12  8193-12288
   3   0    4     11   4   49-64    19    8  769-1024   27   12  12289-16384
   4   1   5,6    12   5   65-96    20    9  1025-1536  28   13  16385-24576
   5   1   7,8    13   5   97-128   21    9  1537-2048  29   13  24577-32768
   6   2   9-12   14   6   129-192  22   10  2049-3072
   7   2  13-16   15   6   193-256  23   10  3073-4096

The compressed data stream begins immediately after the
compressed header data.  The compressed data stream can be
interpreted as follows:

do
   read header from input stream.

   if stored block
      skip bits until byte aligned
      read count and 1's complement of count
      copy count bytes data block
   otherwise
      loop until end of block code sent
         decode literal character from input stream
         if literal < 256
            copy character to the output stream
         otherwise
            if literal = end of block
               break from loop
            otherwise
               determine copy length from the literal code and its
               extra bits (see Length Codes above)
               decode distance from input stream

               move backwards distance bytes in the output stream, and
               copy length characters from this position to the output
               stream.
      end loop
while not last block

if data descriptor exists
   skip bits until byte aligned
   check data descriptor signature
   read crc and sizes
endif

Decryption
----------

The encryption used in PKZIP was generously supplied by Roger
Schlafly.  PKWARE is grateful to Mr. Schlafly for his expert
help and advice in the field of data encryption.

PKZIP encrypts the compressed data stream.  Encrypted files must
be decrypted before they can be extracted.

Each encrypted file has an extra 12 bytes stored at the start of
the data area defining the encryption header for that file.  The
encryption header is originally set to random values, and then
itself encrypted, using three 32-bit keys.  The key values are
initialized using the supplied encryption password.  After each byte
is encrypted, the keys are then updated using pseudo-random number
generation techniques in combination with the same CRC-32 algorithm
used in PKZIP and described elsewhere in this document.

The following are the basic steps required to decrypt a file:

1) Initialize the three 32-bit keys with the password.
2) Read and decrypt the 12-byte encryption header, further
   initializing the encryption keys.
3) Read and decrypt the compressed data stream using the
   encryption keys.


Step 1 - Initializing the encryption keys
-----------------------------------------

Key(0) <- 305419896
Key(1) <- 591751049
Key(2) <- 878082192

loop for i <- 0 to length(password)-1
    update_keys(password(i))
end loop


Where update_keys() is defined as:


update_keys(char):
  Key(0) <- crc32(Key(0),char)
  Key(1) <- Key(1) + (Key(0) & 000000ffH)
  Key(1) <- Key(1) * 134775813 + 1
  Key(2) <- crc32(Key(2),Key(1) >> 24)
end update_keys


Where crc32(old_crc,char) is a routine that, given a CRC value and a
character, returns an updated CRC value after applying the CRC-32
algorithm described elsewhere in this document.


Step 2 - Decrypting the encryption header
-----------------------------------------

The purpose of this step is to further initialize the encryption
keys, based on random data, to render a plaintext attack on the
data ineffective.


Read the 12-byte encryption header into Buffer, in locations
Buffer(0) thru Buffer(11).

loop for i <- 0 to 11
    C <- Buffer(i) ^ decrypt_byte()
    update_keys(C)
    Buffer(i) <- C
end loop


Where decrypt_byte() is defined as:


unsigned char decrypt_byte()
    local unsigned short temp
    temp <- Key(2) | 2
    decrypt_byte <- (temp * (temp ^ 1)) >> 8
end decrypt_byte


After the header is decrypted, the last 1 or 2 bytes in Buffer
should be the high-order word/byte of the CRC for the file being
decrypted, stored in Intel low-byte/high-byte order, or the high-order
byte of the file time if bit 3 of the general purpose bit flag is set.
Versions of PKZIP prior to 2.0 used a 2 byte CRC check; a 1 byte CRC
check is used on versions after 2.0.  This can be used to test if the
password supplied is correct or not.
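
Steps 1 and 2 can be sketched in Python as follows.  This is a minimal
sketch under stated assumptions, not PKWARE's or Info-ZIP's code.  The
names ZipKeys, crc32_byte and password_ok are hypothetical.  crc32() is
assumed to be the single table-driven CRC-32 update step (reflected
polynomial 0xEDB88320) with no pre- or post-inversion of the CRC value.
All arithmetic is masked to 32 bits, and temp is masked to 16 bits to
mimic the "unsigned short" declaration.  password_ok implements only the
1 byte CRC check and ignores the file time variant used when bit 3 of
the general purpose bit flag is set.  The encrypt method is not part of
the description above; it is the inverse operation, included only so the
sketch can be exercised without a real archive.

```python
def _make_crc_table():
    # Build the 256-entry CRC-32 table (reflected polynomial 0xEDB88320).
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table


CRC_TABLE = _make_crc_table()


def crc32_byte(old_crc, byte):
    # One CRC-32 update step, without pre- or post-inversion (assumption).
    return CRC_TABLE[(old_crc ^ byte) & 0xFF] ^ (old_crc >> 8)


class ZipKeys:
    """Holds Key(0), Key(1), Key(2) for one file (hypothetical class)."""

    def __init__(self, password):
        # Step 1: fixed starting values, then mix in each password byte.
        self.key = [305419896, 591751049, 878082192]
        for b in password:
            self.update_keys(b)

    def update_keys(self, byte):
        k = self.key
        k[0] = crc32_byte(k[0], byte)
        k[1] = (k[1] + (k[0] & 0xFF)) & 0xFFFFFFFF
        k[1] = (k[1] * 134775813 + 1) & 0xFFFFFFFF
        k[2] = crc32_byte(k[2], k[1] >> 24)

    def decrypt_byte(self):
        temp = (self.key[2] & 0xFFFF) | 2       # 16-bit "unsigned short"
        return ((temp * (temp ^ 1)) >> 8) & 0xFF

    def decrypt(self, data):
        # Used for the 12-byte header (Step 2); keys are updated with the
        # decrypted byte.
        out = bytearray()
        for c in data:
            plain = c ^ self.decrypt_byte()
            self.update_keys(plain)
            out.append(plain)
        return bytes(out)

    def encrypt(self, data):
        # Inverse of decrypt(), for testing only (not described above).
        out = bytearray()
        for plain in data:
            out.append(plain ^ self.decrypt_byte())
            self.update_keys(plain)
        return bytes(out)


def password_ok(password, encrypted_header, file_crc):
    # 1 byte check: the last header byte should equal the high-order
    # byte of the file's CRC.
    keys = ZipKeys(password)
    header = keys.decrypt(encrypted_header[:12])
    return header[11] == (file_crc >> 24) & 0xFF
```

A caller would construct ZipKeys(password) once per file, decrypt the
12-byte header first, and then keep using the same object for the
compressed data that follows, since the keys carry over from the header
to the data stream.
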


Step 3 - Decrypting the compressed data stream
----------------------------------------------

The compressed data stream can be decrypted as follows:


loop until done
    read a character into C
    Temp <- C ^ decrypt_byte()
    update_keys(Temp)
    output Temp
end loop


In addition to the above mentioned contributors to PKZIP and PKUNZIP,
I would like to extend special thanks to Robert Mahoney for suggesting
the extension .ZIP for this software.


References:

    Fiala, Edward R., and Greene, Daniel H., "Data compression with
       finite windows",  Communications of the ACM, Volume 32, Number 4,
       April 1989, pages 490-505.

    Held, Gilbert, "Data Compression, Techniques and Applications,
       Hardware and Software Considerations",
       John Wiley & Sons, 1987.

    Huffman, D.A., "A method for the construction of minimum-redundancy
       codes", Proceedings of the IRE, Volume 40, Number 9, September 1952,
       pages 1098-1101.

    Nelson, Mark, "LZW Data Compression", Dr. Dobbs Journal, Volume 14,
       Number 10, October 1989, pages 29-37.

    Nelson, Mark, "The Data Compression Book",  M&T Books, 1991.

    Storer, James A., "Data Compression, Methods and Theory",
       Computer Science Press, 1988.

    Welch, Terry, "A Technique for High-Performance Data Compression",
       IEEE Computer, Volume 17, Number 6, June 1984, pages 8-19.

    Ziv, J. and Lempel, A., "A universal algorithm for sequential data
       compression", IEEE Transactions on Information Theory,
       Volume 23, Number 3, May 1977, pages 337-343.

    Ziv, J. and Lempel, A., "Compression of individual sequences via
       variable-rate coding", IEEE Transactions on Information Theory,
       Volume 24, Number 5, September 1978, pages 530-536.