<section> <date> 17. December 2002 </date>
<h2> ZIP Format </h2> About Zip Parsing Internals...

<!--border-->

<section>
<h3> ZIP Trailer Block </h3>

<P>
 The general ZIP file format is written sequentially - each file
 being added gets a local file header followed by its (usually
 deflated) data. When all files are written, a central directory is
 written - and this central directory may even span multiple disks.
 Each disk gets a descriptor block that contains a pointer to the
 start of the central directory. This descriptor is always written
 last and therefore we call it the "ZIP File Trailer Block".
</P>
<P>
 Okay, so we know that this ZIP Trailer is always at the end of a zip
 file and that it has a fixed length, and a magic four-byte value at
 its block start. That should make it easy to detect zip files but in
 the real world it is not that easy - it is allowed to add a zip
 archive comment text <em>after</em> the Trailer block. It's rarely
 used these days, but it means that a zip reader must be ready
 to search for the Trailer block starting at the end of the file
 and scanning backwards for the Trailer magic (it's "PK\5\6" btw).
</P>
<P>
 Now that's what the internal function __zip_find_disk_trailer is
 used for. It's somewhat optimized as we try to use the mmap features
 of the underlying operating system. The returned structure is
 called zzip_disk_trailer in the library source code, and we
 actually need only two of its values: u_rootseek and u_rootsize.
 The first can be used to lseek to the start of the central directory
 and the second tells us the byte size of the central directory.
</P>

</section><section>
<h3> ZIP Central Directory </h3>

<P>
 So here we are at the central directory. The disk trailer also
 told us how many entries there are, but reading them is not that
 easy.
 Each directory entry (zzip_root_dirent type) again has a
 magic value up front followed by a few items, but they are all in
 DOS format - consider the timestamps - and at least the size/seek
 values are in little-endian (Intel) byte order. So we might want to
 parse them into a format that is easier to handle in internal code.
</P>
<P>
 That is also needed for another reason - three items in that
 directory entry are the size values of three variable-length fields
 that follow right after the directory entry. That's right, three of
 these. The first variable-length field is the filename of this
 directory entry. In other words, the root directory entry does not
 contain a seek value for where the filename starts; the start of
 the filename is implicitly given by the end address of the
 directory entry.
</P>
<P>
 The size value for the filename simply says how long the
 filename is - however, and more importantly, it allows us to
 compute the start of the next variable-length field, called the
 extra info field. Well, we do not need any value from that extra
 info block (it carries unix-specific information when packed under
 unix) but we can be fairly sure that this field is not empty, so its
 size has to be accounted for. And that was the second
 variable-length field.
</P>
<P>
 There is a third variable-length field however - it's the comment
 field. That was pretty heavily used in the good old DOS days. We are
 not used to it anymore since filenames are generally self-descriptive
 today, but in the DOS days a filename was 8+3 chars maximum - and
 it was the comment field that told users what was in there. It
 turned out that many software archives used zip format as their
 primary distribution format for just that purpose - for being
 able to attach a comment line to each entry.
</P>
<P>
 Now, each of these three variable-length fields has an item in the
 directory entry header telling its size. And after these
 three variable-length fields the next directory entry follows
 right away.
 Yes, again there is no seek value here - we have to take the sum
 of the three field sizes and add that to the end address of the
 directory entry - just to be able to get to the next entry.
</P>

</section><section>
<h3> Internal Directory </h3>

<P>
 Now, the external ZIP format is too complicated. We cut it down
 to the bare minimum we actually need. The fields in the entry
 are parsed into a directly usable format, and from the
 variable-length fields we only keep the filename. Oh, and we ensure
 that the filename gets a trailing null byte, so it can safely be
 passed down into libc routines.
</P>
<P>
 There is another trick by the way - we use the u_rootsize value
 to malloc a block for the internal directory. That ensures that the
 internal root directory entries are in nearby locations, including
 the filenames themselves, which we put in between the
 dirent entries. That's not only similar to the external directory
 format, but when calling readdir and looking for the filename
 matching a zzip_open call, this will ensure the memory is
 fetched in a linear fashion. Modern cpu architectures are able
 to burst through it.
</P>
<P>
 One might think of using a more complicated internal directory
 format - like hash tables or something. However, they all suffer
 from the fact that memory access patterns will be somewhat random,
 which eats a lot of speed. It is hard to predict under what
 circumstances they would give us a benefit, but the problem is
 certainly not far-fetched: there are zzip archives with 13k+ entries.
 In a real filesystem people will not put 13k files into one
 directory, of course - but in the zip central directory all entries
 are listed side by side with their subdirectory paths attached. So,
 if the original subtree had a number of directories, their files
 all end up side by side in the zip's central directory.
</P>

</section><section>
<h3> File Entry </h3>

<P>
 The zip directory entry has one value that is called z_off in the
 zziplib sources - it's the seek value to the start of the actual
 file data, or more correctly it points to the "local file header".
 Each file data block is preceded/followed by a little frame.
 There is not much interesting information in these framing blocks,
 the values are duplicates of the ones found in the zip central
 directory - however, we must skip the local file header (and a
 possible duplicate of filename and extrainfo) to arrive at the
 actual file data.
</P>
<P>
 Once we are at the start of the actual file data, we can finally
 read it. The zziplib library only knows about two choices, defined
 by the value in the z_compr field - a value of "0" means "stored":
 the data has been stored in uncompressed format, so that we can
 just copy it out of the file to the application buffer.
</P>
<P>
 A value of "8" means "deflated", and here we initialize zlib,
 and all file data is decompressed before it is copied to the
 application buffer. Care must be taken here since the size of the
 zlib input data and of the decompressed data may differ
 significantly. The zlib compression does not even obey byte
 boundaries - a single bit may expand to hundreds of bytes. That's
 why each ZZIP_FILE has a decompression buffer attached.
</P>
<P>
 All the other z_compr values are only of historical meaning,
 the infozip unix tools will only create deflated content, and
 the same applies to pkzip 2.x tools. If there is any other
 value than "0" or "8" then zziplib can not decompress it, simple
 as that.
</P>

</section><section>
<h3> ZZIP_DIR / ZZIP_FILE </h3>

<P>
 The ZZIP_DIR internal structure stores a posix handle to the
 zip file, and a pointer to the parsed central directory block.
 One can use readdir/rewinddir to walk each entry in the central
 directory and compare with the filenames attached. And that's
 what will be done at a zzip_open call to find the file entry.
</P>
<P>
 There are a few more fields in the ZZIP_DIR structure, and
 most of these are related to the use of this struct as a
 shared resource. You can use zzip_file_open to walk the
 preparsed central directory and return a new ZZIP_FILE handle
 for that entry.
</P>
<P>
 That ZZIP_FILE handle contains a back pointer to the ZZIP_DIR
 that it was made from - and the back pointer also serves as a flag
 that the ZZIP_FILE handle points to a file within a ZIP file as
 opposed to wrapping a real file in the real directory tree.
 Each ZZIP_FILE will increment a shared counter, so that the
 next dir_close will be deferred until all ZZIP_FILE handles have
 been destroyed.
</P>
<P>
 Another optimization is the cache pointer in the ZZIP_DIR. It is
 quite common to read data entries sequentially, for example when
 the zip directory is scanned for files matching a specific pattern,
 and when a match is seen, that file is opened. However, each
 ZZIP_FILE needs a decompression buffer, and we keep a cache of
 the last one freed so that it can be picked up right away for the
 next zzip_file_open.
</P>
<P>
 Note that when using multiple zzip_open() calls directly, each will
 open and parse a zip directory of its own. That's bloat both in
 terms of memory consumption and execution speed. One should try
 to take advantage of the feature that multiple ZZIP_FILE handles
 can share a common ZZIP_DIR with a common preparsed copy of the
 zip's central directory. That can be done directly by using
 zzip_file_open, which uses a ZZIP_DIR as a factory for ZZIP_FILE
 handles, but zzip_freopen can also be used to reuse the old
 internal central directory, instead of parsing it again.
</P>
<P>
 And while zzip_freopen would release the old ZZIP_FILE handle,
 only reusing the ZZIP_DIR attached, one can use another routine
 called zzip_open_shared directly, which will create a ZZIP_FILE
 from an existing ZZIP_FILE. Oh, and there is no need to worry about
 problems when a filepath given to zzip_freopen() happens to
 be in another place, another directory, another zip archive.
 In that case, the old zzip's internal directory is freed and
 the other's directory is read - the preparsed central directory
 is only reused if that is actually possible.
</P>

</section></section>
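<section>
<h3> Parsing Sketch </h3>

<P>
 The trailer search and the walk over the central directory described
 above can be sketched in C roughly as follows. This is not the
 zziplib implementation: the names find_trailer, walk_directory,
 visit_fn, le16 and le32 are hypothetical, the whole archive is
 assumed to be in one memory buffer (instead of using lseek or mmap),
 multi-disk archives are ignored, and the byte offsets are taken from
 the published ZIP format description rather than from the zziplib
 structs.
</P>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read little-endian ("intel byteorder") values byte by byte. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

#define TRAILER_SIZE 22u /* fixed part of the trailer block */
#define DIRENT_SIZE  46u /* fixed part of a central directory entry */

/* Scan backwards from the end of the buffer for the "PK\5\6" magic.
 * Returns the offset of the trailer, or -1 if there is none. */
static long find_trailer(const unsigned char *buf, size_t len)
{
    if (len < TRAILER_SIZE)
        return -1;
    for (size_t pos = len - TRAILER_SIZE + 1; pos-- > 0; ) {
        if (memcmp(buf + pos, "PK\5\6", 4) == 0)
            return (long)pos;
    }
    return -1;
}

/* Hypothetical callback: receives the (not null-terminated) filename,
 * the compression method and the seek value of the local file header. */
typedef void (*visit_fn)(const char *name, size_t namlen,
                         uint16_t compr, uint32_t offset, void *arg);

/* Walk all central directory entries.
 * Returns the number of entries seen, or -1 on malformed input. */
static int walk_directory(const unsigned char *buf, size_t len,
                          visit_fn visit, void *arg)
{
    long t = find_trailer(buf, len);
    if (t < 0)
        return -1;
    uint16_t entries  = le16(buf + t + 10); /* number of entries */
    uint32_t rootsize = le32(buf + t + 12); /* byte size of directory */
    uint32_t rootseek = le32(buf + t + 16); /* start of directory */
    if (rootseek > len || rootsize > len - rootseek)
        return -1;

    const unsigned char *p = buf + rootseek;
    const unsigned char *end = p + rootsize;
    int n = 0;
    while (n < entries) {
        if ((size_t)(end - p) < DIRENT_SIZE || memcmp(p, "PK\1\2", 4) != 0)
            return -1;
        size_t namlen  = le16(p + 28); /* filename size */
        size_t extras  = le16(p + 30); /* extra info size */
        size_t comment = le16(p + 32); /* comment size */
        if ((size_t)(end - p) - DIRENT_SIZE < namlen + extras + comment)
            return -1;
        if (visit)
            visit((const char *)p + DIRENT_SIZE, namlen,
                  le16(p + 10), le32(p + 42), arg);
        /* No seek value for the next entry: add up the three sizes. */
        p += DIRENT_SIZE + namlen + extras + comment;
        n++;
    }
    return n;
}
```

<P>
 A caller would load the zip file into a buffer, call walk_directory
 with a callback that compares each filename against the one being
 looked for, and then use the reported offset to find the local file
 header. An internal directory like the one described above would
 additionally copy each filename and append the trailing null byte.
</P>
</section>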