# ZAP File Format

## Legend

### Sections

    |========|
    |        | section
    |========|

### Fixed-size fields

    |--------|        |----|        |--|        |-|
    |        | uint64 |    | uint32 |  | uint16 | | uint8
    |--------|        |----|        |--|        |-|

### Varints

    |~~~~~~~~|
    |        | varint(up to uint64)
    |~~~~~~~~|

### Arbitrary-length fields

    |--------...---|
    |              | arbitrary-length field (string, vellum, roaring bitmap)
    |--------...---|

### Chunked data

    [--------]
    [        ]
    [--------]

## Overview

The footer section describes the configuration of a particular ZAP file. The footer format is version-dependent, so the `V` field must be checked before parsing.

            |==================================================|
            | Stored Fields                                    |
            |==================================================|
    |-----> | Stored Fields Index                              |
    |       |==================================================|
    |       | Dictionaries + Postings + DocValues              |
    |       |==================================================|
    | |---> | DocValues Index                                  |
    | |     |==================================================|
    | |     | Fields                                           |
    | |     |==================================================|
    | | |-> | Fields Index                                     |
    | | |   |========|========|========|========|====|====|====|
    | | |   |     D# |     SF |      F |    FDV | CF |  V | CC | (Footer)
    | | |   |========|====|===|====|===|====|===|====|====|====|
    | | |                 |        |        |
    |-+-+-----------------|        |        |
      | |--------------------------|        |
      |-------------------------------------|

     D#. Number of Docs.
     SF. Stored Fields Index Offset.
      F. Field Index Offset.
    FDV. Field DocValue Offset.
     CF. Chunk Factor.
      V. Version.
     CC. CRC32.

## Stored Fields

The Stored Fields Index consists of `D#` consecutive 64-bit unsigned integers: the offsets at which the corresponding Stored Fields Data records are located.
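
The footer layout from the Overview and the index lookup described above can be sketched in Go as follows. This is a minimal illustration, not the reference reader: it assumes all fixed-size integers are big-endian (the byte order is not stated in this document), it does not verify `CC`, and the names `Footer`, `parseFooter`, `storedFieldsOffset`, and `exampleFile` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// footerSize follows the Overview diagram: four uint64 fields and three uint32 fields.
const footerSize = 4*8 + 3*4

// Footer mirrors the footer fields in the order they appear in the diagram.
type Footer struct {
	NumDocs           uint64 // D#
	StoredIndexOffset uint64 // SF
	FieldsIndexOffset uint64 // F
	DocValueOffset    uint64 // FDV
	ChunkFactor       uint32 // CF
	Version           uint32 // V
	CRC               uint32 // CC
}

// parseFooter decodes the trailing footerSize bytes of the file.
// Assumption: big-endian integers. A real reader should inspect Version
// before trusting the remaining fields, since the layout is version-dependent.
func parseFooter(file []byte) (Footer, error) {
	if len(file) < footerSize {
		return Footer{}, errors.New("file is shorter than the footer")
	}
	b := file[len(file)-footerSize:]
	return Footer{
		NumDocs:           binary.BigEndian.Uint64(b[0:8]),
		StoredIndexOffset: binary.BigEndian.Uint64(b[8:16]),
		FieldsIndexOffset: binary.BigEndian.Uint64(b[16:24]),
		DocValueOffset:    binary.BigEndian.Uint64(b[24:32]),
		ChunkFactor:       binary.BigEndian.Uint32(b[32:36]),
		Version:           binary.BigEndian.Uint32(b[36:40]),
		CRC:               binary.BigEndian.Uint32(b[40:44]),
	}, nil
}

// storedFieldsOffset reads entry docNum of the Stored Fields Index,
// which lives at SF + docNum*8.
func storedFieldsOffset(file []byte, f Footer, docNum uint64) (uint64, error) {
	if docNum >= f.NumDocs {
		return 0, errors.New("document number out of range")
	}
	size := uint64(len(file))
	pos := f.StoredIndexOffset + docNum*8
	if pos > size || size-pos < 8 {
		return 0, errors.New("index entry lies outside the file")
	}
	return binary.BigEndian.Uint64(file[pos : pos+8]), nil
}

// exampleFile builds a tiny synthetic buffer (not a valid ZAP segment):
// 5 placeholder bytes, a 2-entry index, and a footer with placeholder values.
func exampleFile() []byte {
	u64 := func(v uint64) []byte {
		b := make([]byte, 8)
		binary.BigEndian.PutUint64(b, v)
		return b
	}
	u32 := func(v uint32) []byte {
		b := make([]byte, 4)
		binary.BigEndian.PutUint32(b, v)
		return b
	}
	file := []byte{'a', 'b', 'c', 'd', 'e'}
	file = append(file, u64(0)...) // doc 0 record offset
	file = append(file, u64(3)...) // doc 1 record offset
	for _, v := range []uint64{2, 5, 21, 21} { // D#, SF, F, FDV
		file = append(file, u64(v)...)
	}
	for _, v := range []uint32{1024, 1, 0} { // CF, V, CC
		file = append(file, u32(v)...)
	}
	return file
}

func main() {
	file := exampleFile()
	footer, err := parseFooter(file)
	if err != nil {
		panic(err)
	}
	off, err := storedFieldsOffset(file, footer, 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(footer.NumDocs, footer.Version, off)
}
```

Reading the footer from the end of the file first is what makes every other section addressable, since `SF`, `F`, and `FDV` are the only entry points into the sections above it.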

    0                              [SF]                           [SF + D# * 8]
    | Stored Fields                  | Stored Fields Index              |
    |================================|==================================|
    |                                |                                  |
    |       |--------------------|   ||--------|--------|. . .|--------||
    |   |-> | Stored Fields Data |   ||    0   |    1   |     | D# - 1 ||
    |   |   |--------------------|   ||--------|----|---|. . .|--------||
    |   |                            |              |                   |
    |===|============================|==============|===================|
        |                                           |
        |-------------------------------------------|

Stored Fields Data is an arbitrary-size record consisting of metadata and [Snappy](https://github.com/golang/snappy)-compressed data.

    Stored Fields Data
    |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|
    |    MDS |    CDS |         MD        |         CD        |
    |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|

    MDS. Metadata size.
    CDS. Compressed data size.
     MD. Metadata.
     CD. Snappy-compressed data.

## Fields

The Fields Index section is located between addresses `F` and `len(file) - len(footer)` and consists of `uint64` values (`F1`, `F2`, ...), which are offsets to records in the Fields section. There are `F# = (len(file) - len(footer) - F) / sizeof(uint64)` fields.

    (...)                           [F]                         [F + F# * 8]
    | Fields                         | Fields Index                   |
    |================================|================================|
    |                                |                                |
    |   |~~~~~~~~|~~~~~~~~|---...---|||--------|--------|...|--------||
    ||->|   Dict | Length |   Name  |||    0   |    1   |   | F# - 1 ||
    ||  |~~~~~~~~|~~~~~~~~|---...---|||--------|----|---|...|--------||
    ||                               |              |                 |
    ||===============================|==============|=================|
     |                                              |
     |----------------------------------------------|

## Dictionaries + Postings

Each field has its own dictionary, encoded in [Vellum](https://github.com/couchbase/vellum) format. A dictionary consists of `(term, offset)` pairs, where `offset` indicates the position of the postings (list of documents) for that term.
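
A field's dictionary is reached through its Fields record: the `Dict` value locates the dictionary, which is itself a varint `Length` followed by the Vellum data. The sketch below shows that path under stated assumptions: varints use Go's `encoding/binary` unsigned varint encoding, `Dict` is an absolute file offset, and each `Length` counts the bytes that follow it. The names `readFieldRecord`, `dictionaryBytes`, and `lengthPrefixed` are hypothetical, and the sample bytes are placeholders rather than real Vellum data.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readUvarint decodes one unsigned varint starting at file[pos] and returns
// the value together with the position of the byte that follows it.
func readUvarint(file []byte, pos uint64) (uint64, uint64, error) {
	if pos >= uint64(len(file)) {
		return 0, 0, errors.New("offset outside the file")
	}
	v, n := binary.Uvarint(file[pos:])
	if n <= 0 {
		return 0, 0, errors.New("malformed varint")
	}
	return v, pos + uint64(n), nil
}

// lengthPrefixed reads a varint length at pos and returns the bytes after it.
func lengthPrefixed(file []byte, pos uint64) ([]byte, error) {
	length, start, err := readUvarint(file, pos)
	if err != nil {
		return nil, err
	}
	if length > uint64(len(file))-start {
		return nil, errors.New("length runs past the end of the file")
	}
	return file[start : start+length], nil
}

// FieldRecord holds the Dict offset and Name of one Fields record.
type FieldRecord struct {
	DictOffset uint64
	Name       string
}

// readFieldRecord decodes the record | Dict | Length | Name | at recOffset.
func readFieldRecord(file []byte, recOffset uint64) (FieldRecord, error) {
	dict, next, err := readUvarint(file, recOffset)
	if err != nil {
		return FieldRecord{}, err
	}
	name, err := lengthPrefixed(file, next)
	if err != nil {
		return FieldRecord{}, err
	}
	return FieldRecord{DictOffset: dict, Name: string(name)}, nil
}

// dictionaryBytes returns the raw Vellum-encoded bytes of the dictionary
// stored as | Length | VELLUM DATA | at dictOffset.
func dictionaryBytes(file []byte, dictOffset uint64) ([]byte, error) {
	return lengthPrefixed(file, dictOffset)
}

func main() {
	// Synthetic buffer: a 3-byte placeholder dictionary at offset 0,
	// followed by a field record (Dict=0, Length=5, Name="title") at offset 4.
	file := []byte{0x03, 0xAA, 0xBB, 0xCC, 0x00, 0x05, 't', 'i', 't', 'l', 'e'}
	rec, err := readFieldRecord(file, 4)
	if err != nil {
		panic(err)
	}
	dict, err := dictionaryBytes(file, rec.DictOffset)
	if err != nil {
		panic(err)
	}
	fmt.Println(rec.Name, len(dict))
}
```

The returned slice is what would be handed to a Vellum reader to resolve a term to its postings offset.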

    |================================================================|- Dictionaries +
    |                                                                |  Postings +
    |                                                                |  DocValues
    |    Freq/Norm (chunked)                                         |
    |    [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]                      |
    | |->[ Freq | Norm (float32 under varint) ]                      |
    | |  [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]                      |
    | |                                                              |
    | |------------------------------------------------------------| |
    |    Location Details (chunked)                                | |
    |    [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~]        | |
    | |->[ Size | Pos | Start | End | Arr# | ArrPos | ... ]        | |
    | |  [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~]        | |
    | |                                                            | |
    | |-----------------|                                          | |
    |     Postings List |                                          | |
    |    |~~~~~~~~|~~~~~|~~|~~~~~~~~|-----------...--|             | |
    | |->|    F/N |     LD | Length | ROARING BITMAP |             | |
    | |  |~~~~~|~~|~~~~~~~~|~~~~~~~~|-----------...--|             | |
    | |        |---------------------------------------------------| |
    | |--------------------------------------|                       |
    |     Dictionary                         |                       |
    |    |~~~~~~~~|--------------------------|-...-|                 |
    | |->| Length | VELLUM DATA : (TERM -> OFFSET) |                 |
    | |  |~~~~~~~~|----------------------------...-|                 |
    | |----|                                                         |
    |======|=========================================================|- DocValues Index
    |      |                                                         |
    |======|=========================================================|- Fields
    |      |                                                         |
    | |~~~~|~~~|~~~~~~~~|---...---|                                  |
    | |   Dict | Length |   Name  |                                  |
    | |~~~~~~~~|~~~~~~~~|---...---|                                  |
    |                                                                |
    |================================================================|

## DocValues

The DocValues Index consists of `F#` pairs of varints, one pair per field. Each pair gives the start and end offsets of that field's DocValues slice.
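
Decoding the index described above amounts to reading `2 * F#` varints in sequence. The following sketch assumes the index begins at the footer's `FDV` offset and that varints use Go's `encoding/binary` unsigned varint encoding; `DocValueRange` and `readDocValuesIndex` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// DocValueRange is the (start, end) pair stored for one field.
type DocValueRange struct {
	Start uint64
	End   uint64
}

// readDocValuesIndex decodes numFields pairs of varints beginning at fdv.
// Assumption: fdv is the FDV footer value and numFields is F#.
func readDocValuesIndex(file []byte, fdv uint64, numFields int) ([]DocValueRange, error) {
	if fdv > uint64(len(file)) {
		return nil, errors.New("FDV lies outside the file")
	}
	buf := file[fdv:]
	var ranges []DocValueRange
	for i := 0; i < numFields; i++ {
		start, n := binary.Uvarint(buf)
		if n <= 0 {
			return nil, fmt.Errorf("field %d: malformed start varint", i)
		}
		buf = buf[n:]
		end, m := binary.Uvarint(buf)
		if m <= 0 {
			return nil, fmt.Errorf("field %d: malformed end varint", i)
		}
		buf = buf[m:]
		ranges = append(ranges, DocValueRange{Start: start, End: end})
	}
	return ranges, nil
}

func main() {
	// Synthetic buffer: one padding byte, then the pairs (2, 10) and (10, 300).
	// 300 needs two varint bytes: 0xAC 0x02.
	file := []byte{0xFF, 0x02, 0x0A, 0x0A, 0xAC, 0x02}
	ranges, err := readDocValuesIndex(file, 1, 2)
	if err != nil {
		panic(err)
	}
	for i, r := range ranges {
		fmt.Printf("field %d: start=%d end=%d\n", i, r.Start, r.End)
	}
}
```

Because the entries are variable-length, the pair for field `k` cannot be located directly; a reader has to walk the index from the beginning, which is why the sketch returns all ranges at once.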

    |================================================================|
    |     |------...--|                                              |
    |  |->| DocValues |<-|                                           |
    |  |  |------...--|  |                                           |
    |==|=================|===========================================|- DocValues Index
    ||~|~~~~~~~~~|~~~~~~~|~~|           |~~~~~~~~~~~~~~|~~~~~~~~~~~~||
    || DV1 START | DV1 STOP | . . . . . | DV(F#) START | DV(F#) END ||
    ||~~~~~~~~~~~|~~~~~~~~~~|           |~~~~~~~~~~~~~~|~~~~~~~~~~~~||
    |================================================================|

DocValues are stored as chunked, Snappy-compressed values for each document and field.

    [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]
    [ Doc# in Chunk | Doc1 | Offset1 | ... | DocN | OffsetN | SNAPPY COMPRESSED DATA ]
    [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]

The last 16 bytes of a DocValues slice describe its chunks.

    |~~~~~~~~~~~~...~|----------------|----------------|
    |  Chunk Sizes   | Chunk Size Arr |     Chunk#     |
    |~~~~~~~~~~~~...~|----------------|----------------|
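
The trailer shown above can be read back to front. The sketch below rests on several assumptions that this document does not spell out: the two fixed-size fields are big-endian `uint64` values, `Chunk Size Arr` is the length in bytes of the varint-encoded `Chunk Sizes` array that immediately precedes it, and `Chunk#` is the number of varints in that array. `readChunkSizes` is a hypothetical name.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// trailerSize is the size of the two fixed-size fields at the end of the slice.
const trailerSize = 16

// readChunkSizes decodes the chunk description at the end of one field's
// DocValues slice. All layout details beyond the diagram are assumptions:
// big-endian integers, and "Chunk Size Arr" being a byte length.
func readChunkSizes(dv []byte) ([]uint64, error) {
	if len(dv) < trailerSize {
		return nil, errors.New("slice is shorter than the 16-byte trailer")
	}
	body := dv[:len(dv)-trailerSize]
	trailer := dv[len(dv)-trailerSize:]
	sizesLen := binary.BigEndian.Uint64(trailer[0:8])   // Chunk Size Arr
	numChunks := binary.BigEndian.Uint64(trailer[8:16]) // Chunk#
	if sizesLen > uint64(len(body)) {
		return nil, errors.New("chunk size array is longer than the slice")
	}
	buf := body[uint64(len(body))-sizesLen:]
	var sizes []uint64
	for i := uint64(0); i < numChunks; i++ {
		v, n := binary.Uvarint(buf)
		if n <= 0 {
			return nil, fmt.Errorf("chunk %d: malformed varint", i)
		}
		sizes = append(sizes, v)
		buf = buf[n:]
	}
	return sizes, nil
}

func main() {
	// Synthetic slice: 3 placeholder chunk bytes, the varints 2 and 1,
	// then Chunk Size Arr = 2 and Chunk# = 2.
	dv := []byte{
		0x00, 0x00, 0x00,
		0x02, 0x01,
		0, 0, 0, 0, 0, 0, 0, 2,
		0, 0, 0, 0, 0, 0, 0, 2,
	}
	sizes, err := readChunkSizes(dv)
	if err != nil {
		panic(err)
	}
	fmt.Println(sizes)
}
```

With the sizes in hand, a reader can compute where each chunk starts by accumulating them from the beginning of the slice, then decode only the chunk that holds the requested document.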