HAMMER2 Freemap Design Notes

Overview

HAMMER2 media is broken down into 2GB zones.  Each 2GB zone contains a
4MB header (64 x 64KB blocks = 0.2% of storage).  The blocks in this
header are reserved for various purposes.  For example, block #0 is
reserved for a volume header in each of the first four zones.  Most of
the remaining 64KB blocks in this header are reserved for use by the
freemap.

The freemap only uses blocks from these reserved areas.  In order to
ensure that any of the four volume headers can be used by the mount code
(in case some are found to be corrupted), each freemap block in the
logical freemap topology iterates through up to 8 copies whose block
numbers are taken from the reserved area:

- Four copies, one for each of the four volume headers which H2 sequences
  through on each flush.  This ensures that a mount from any of the four
  volume headers is handed a consistent freemap topology.

- One copy to ensure that recovery operations during mount do not modify
  the state of the freemap topology pointed to by older volume headers
  which are still valid.  Note that the freemap for volume headers
  indexed after the mount point being recovered may lose freemap
  consistency, so if you choose an older mount point for a RW mount,
  you have to stick with it.

- One copy for live operations.  This allows HAMMER2 to retire the
  related buffer asynchronously in the background (or for the OS to
  retire the buffer cache buffer on its own) prior to the formal
  flush.  The later formal flush then has less work to do.

- The two remaining copies add robustness to the specification.  For
  example, with appropriate feature code added the filesystem can
  tolerate a limited number of bad blocks in the reserved area.

For the moment we use a simple calculation for the freemap block.  In
a later version I would like to mix the blocks up a bit so the blocks
in each set of 8 are not situated near each other.

RW Mount Restrictions

If an older volume header is explicitly selected by the mount code, any
newer (presumably corrupt, since the mount code did not select them)
volume headers will lose freemap consistency as the freemap code rotates
into freemap blocks that might have been used by the topology pointed to
by the newer (but not selected) volume headers.  For a RW mount, this
means that if an older volume header is selected, the newer ones that
were not selected WILL be formally invalidated by the mount code and
cannot be used in a remount attempt.

During normal operation, each filesystem flush rotates to a new volume
header.  A filesystem may have up to four volume headers spread at 2GB
intervals.  Filesystems smaller than ~9GB or so will have fewer volume
headers to rotate through.

Freemap Topology

The freemap topology contains 4 levels of meta-data (blockref arrays),
one of which is embedded in the volume header (so only three real
meta-data levels), plus one level of leaf-data.  Unlike normal files,
which use a variable radix, the freemap topology uses a fixed radix to
simplify the algorithm and to ensure freemap locality to the blocks
under management.

Freemap blocks are allocated from the reserved area in each 2GB zone.
The leafs represent data in the zone.  Higher levels in the freemap
topology cover more area but the physical freemap meta-data blocks
always occur prior to the area being covered.  Thus a HAMMER2
filesystem of almost any size can be formatted and the related freemap
blocks will always exist.

Level 1 - (radix 10 + 21) 64KB representing 2GB.  This is represented
          by a hammer2_bmap_data[1024] array.  Each entry represents
          2MB worth of media storage x 1024 entries to represent 2GB.
          Each entry contains a 128x2 bit bitmap representing 16KB
          of storage in 2 bits (128 x 16KB = 2MB).

Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
Level 5 - (radix 3) blockref x 8 in volume header representing 16EB
          (2^64) (this conveniently eats one 512-byte 'sector' of the
          64KB volume header).

Each level is assigned reserved blocks in the 4MB header of each 2GB
zone.  Since we use block 0 for the volume header, the first freemap
reserved block in the zone begins at block 1.

Freemap copy #0:
        Level 1 uses block 1 (this is the leaf block)
        Level 2 uses block 2
        Level 3 uses block 3
        Level 4 uses block 4

Freemap copy #1:
        Level 1 uses block 5 (this is the leaf block)
        Level 2 uses block 6
        Level 3 uses block 7
        Level 4 uses block 8

... and so forth up to Freemap copy #7, which uses blocks 29, 30, 31,
and 32.
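As a concrete illustration of the simple calculation mentioned above,
the following sketch derives the media offset of the reserved block
backing a given freemap level and copy rotation.  The names and the
helper itself are hypothetical; this is a sketch of the scheme
tabulated above, not the actual HAMMER2 source.

    #include <assert.h>
    #include <stdint.h>

    #define H2_ZONE_BYTES     (2ULL * 1024 * 1024 * 1024) /* 2GB zone */
    #define H2_RESBLK_BYTES   (64 * 1024)   /* 64KB reserved blocks */
    #define H2_FREEMAP_COPIES 8             /* copy rotation #0-#7  */
    #define H2_FREEMAP_LEVELS 4             /* levels 1-4 per zone  */

    /*
     * Media offset of the reserved freemap block for a given 2GB zone,
     * copy rotation (0-7), and freemap level (1 = leaf, up to 4).
     * Copy #0 uses blocks 1-4, copy #1 uses blocks 5-8, and so on up
     * to copy #7 using blocks 29-32, matching the table above.
     */
    static uint64_t
    freemap_reserved_offset(uint64_t zone, int copy, int level)
    {
            uint64_t blockno;

            assert(copy >= 0 && copy < H2_FREEMAP_COPIES);
            assert(level >= 1 && level <= H2_FREEMAP_LEVELS);

            blockno = 1 + (uint64_t)copy * H2_FREEMAP_LEVELS +
                      (level - 1);
            return (zone * H2_ZONE_BYTES + blockno * H2_RESBLK_BYTES);
    }

Mixing the blocks up within the reserved area, as suggested above,
would only change the blockno computation.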
Flushing

The freemap does not have to be flushed by fsync/sync, but should
probably be flushed at least once a minute by the normal filesystem
sync.  The reason it does not have to be flushed with fsync is that
freemap recovery is executed on-mount and will use the last fully
flushed freemap TID stored in the volume header to do an incremental
meta-data scan of the H2 filesystem between that TID and the last
flushed TID.  All blocks not found to have been marked allocated will
be marked allocated.  Simple as that.  Since the scan is incremental,
it typically costs very little time.

Freemap Granularity

The freemap granularity is 16KB (radix of 14) but the minimum
allocation radix is 1KB (radix of 10) (and can be in multiples of
1KB with some coding).  1KB inodes can hold up to 512 bytes of direct
data, so tiny files eat exactly 1KB of media storage inclusive of the
inode.

The freemap keeps track of partial allocations in-memory but not
on-media, so even a normal umount will cause partially allocated
blocks to appear fully allocated until some later date when the
bulk scan code defragments them.

Block Selection

Block selection is localized to be near the inode's (or nearby data)
blockref.  The algorithmic complexity of determining locality is not
defined here at the moment.

Freemap Leaf Substructure

* linear - Linear sub-granular allocation offset.  Allows ~1KB granular
           linear allocations.

* class  - Allocation clustering class ((type << 8) | radix).

* avail  - Available space in bytes, currently only used by the layer 1
           leaf.  Used as an allocation clustering aid.

* bitmap - Eight 32 bit words representing ~2MB in 16KB allocation
           chunks at 2 bits per chunk.  The filesystem allocation
           granularity can be smaller (currently ~1KB minimum), and the
           live filesystem caches iterations when allocating multiple
           chunks.  However, on remount any partial allocations out of
           a 64KB allocation block MAY cause the entire 64KB to be
           considered allocated.  Fragmented space can potentially be
           reclaimed and/or relocated by the bulk block free scan.

The 2-bit bitmap fields are assigned as follows:

        00      FREE
        01      POSSIBLY FREE (type 1)
        10      POSSIBLY FREE (type 2)
        11      ALLOCATED
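To make the leaf substructure concrete, here is a hedged C sketch of a
leaf and of the 2-bit chunk-state manipulation.  The field widths are
assumptions inferred from the descriptions above (the authoritative
layout is the hammer2_bmap_data definition in the on-disk header), and
the helper names are hypothetical.

    #include <stdint.h>

    #define H2_BMAP_CHUNKS  128   /* 128 x 16KB chunks = ~2MB/leaf */

    #define H2_CHUNK_FREE   0x0   /* 00 FREE */
    #define H2_CHUNK_PFREE1 0x1   /* 01 POSSIBLY FREE (type 1) */
    #define H2_CHUNK_PFREE2 0x2   /* 10 POSSIBLY FREE (type 2) */
    #define H2_CHUNK_ALLOC  0x3   /* 11 ALLOCATED */

    typedef struct bmap_leaf {
            int32_t  linear;      /* sub-16KB linear alloc offset */
            uint16_t class;       /* (type << 8) | radix */
            uint32_t avail;       /* available bytes, clustering aid */
            uint32_t bitmap[8];   /* 2 bits per 16KB chunk */
    } bmap_leaf_t;

    /* Extract the 2-bit state of a chunk (0-127); 16 chunks/word. */
    static unsigned
    chunk_state(const bmap_leaf_t *bm, int chunk)
    {
            return ((bm->bitmap[chunk >> 4] >>
                    ((chunk & 15) << 1)) & 0x3);
    }

    /* Set the 2-bit state; allocation transitions any state to 11. */
    static void
    chunk_set_state(bmap_leaf_t *bm, int chunk, unsigned state)
    {
            int shift = (chunk & 15) << 1;

            bm->bitmap[chunk >> 4] &= ~(0x3U << shift);
            bm->bitmap[chunk >> 4] |= (state & 0x3U) << shift;
    }

Allocating a 16KB chunk is then chunk_set_state(bm, i, H2_CHUNK_ALLOC),
i.e. the (00, 01, 10) -> (11) transition described below.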
Freemap Metadata Substructure
(Levels 2, 3, 4, and 5)

Freemap layers 2, 3, 4, and 5 operate as arrays of blockrefs but steal
some of the check area (a 24-byte area) for freemap-specific meta-data.
We reserve a few fields to store information which allows the block
allocator to do its work more efficiently.

* bigmask - A mask of radixes available for allocation under this
            blockref.  Typically initialized to -1.

* avail   - Total available space in bytes.

The freemap allocator uses a cylinder-group-like abstraction using
the localized allocation concept first implemented by UFS.  In HAMMER2
there is no such thing as a real cylinder group, nor are there specific
reserved areas for inodes vs data, but we do the next best thing by
roughly typing leafs (each leaf representing ~2MB) to hopefully allow
the drive to employ its zone-cache to make both stat-only and tar-style
bulk accesses efficient (in addition to normal file accesses).

Levels 2, 3, and 4 each contain a blockmap[1024] array (64KB total)
supplying 10 bits of address space.  Level 5 is a blockmap[8] stored in
the volume header, supplying 3 bits of address space.  (Level 1
supplies 10 + 21 bits of address space.)

The Level 1 blockmap is HAMMER2's idea of a 'cylinder group', thus
effectively fixed at multiples of ~2MB or so.

Initial Conditions

newfs_hammer2 does not need to format the freemap.  Instead,
newfs_hammer2 simply leaves the associated top-level indirect blocks
empty and uses the (voldata->allocator_beg) field to allocate space
linearly, then leaves it to the live filesystem to initialize the
freemap as more space gets allocated.

The freemap does NOT use a fixed 5-level radix tree.  It uses the same
blockmap algorithm used for file blocks but restricts any recursion to
specific radix values.  This means that small filesystems will have
much smaller freemap depths.  2 layers (not counting the volume header
as a layer) gets us 16GB, 3 layers gets us 16TB.

How blocks are allocated and freed

The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
contains a linear allocation offset that can keep track of sub-16KB
allocations with certain restrictions.  More random sub-16KB
allocations are tracked in-memory, but will be lost (assumed to be a
full 16KB) if a crash occurs.  Each 16KB chunk is denoted by a 2-bit
pattern 00, 01, 10, or 11.

NOTE! All operations on the freemap occur on the current live version
      of the freemap, including bulkfree operations.

Blocks are allocated by transitioning the 2-bit pattern in the leaf
to 11.  That is, (00, 01, 10) -> (11).

The primary mechanism used to free a block is via the asynchronous
bulkfree scan.  This scans all filesystem meta-data in two major passes
(and potentially multiple sub-passes).

Pass#1 - The first pass figures out which blocks might be freeable.
         The most recently flushed meta-data topology (including all
         four volume headers and all snapshots) is scanned and an
         in-memory copy of the freemap is built from scratch.  Multiple
         sub-scans might be required to break the larger scan up into
         more easily digested pieces based on the amount of memory
         available to hold the temporary freemap.

         Any allocated blocks in the live freemap are then transitioned
         from (11) to either (10) or (01) if the scan found them to not
         be allocated.

         The blocks are still assumed to be allocated at this time and
         any new allocations will transition them back to (11).

Pass#2 - The second pass is required to deal with races against the
         live filesystem while the freemap scan was running.  It also
         allows the freemap scans to run asynchronously from any flush,
         improving concurrency.  However, at least one synchronous
         flush is required between Pass#1 and Pass#2.

         The second pass is a duplicate of the first pass.  The
         meta-data topology is scanned and a freemap is built in-memory
         and then compared against the live freemap.  Instead of
         transitioning from (11) -> (10)/(01), this pass transitions
         from (10)/(01) to (00).

         If a block the scan believes to be free is instead found to be
         (11), no transition occurs, because this could be due to a
         race against the live filesystem.

         This pass will incidentally transition (10)/(01) back to (11)
         if the block was found to be allocated, but it is perfectly
         acceptable for the block to remain in a (10)/(01) state after
         completion.
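The per-chunk state machine the two passes implement can be sketched
as follows, reusing the hypothetical 2-bit state constants from the
leaf sketch earlier.  This is an illustration of the transitions
described above, not the actual bulkfree code; in particular, the
choice between (10) and (01) in Pass#1 is not specified here and
type 1 is used arbitrarily.

    /*
     * live           - the chunk's 2-bit state in the live freemap
     * scan_allocated - whether the in-memory meta-data scan found the
     *                  chunk to be referenced by the topology
     * pass           - 1 or 2
     * Returns the new 2-bit state for the live freemap.
     */
    static unsigned
    bulkfree_transition(unsigned live, int scan_allocated, int pass)
    {
            if (pass == 1) {
                    /* Pass#1: (11) -> (10)/(01) if found to be free */
                    if (live == H2_CHUNK_ALLOC && !scan_allocated)
                            return (H2_CHUNK_PFREE1);
                    return (live);
            }

            /*
             * Pass#2: a live (11) is left alone even if the scan
             * thinks the chunk is free, since this may be a race
             * against the live filesystem.
             */
            if (live == H2_CHUNK_ALLOC)
                    return (live);

            if (live == H2_CHUNK_PFREE1 || live == H2_CHUNK_PFREE2) {
                    /* (10)/(01) -> (00), or incidentally back to (11) */
                    return (scan_allocated ? H2_CHUNK_ALLOC :
                                             H2_CHUNK_FREE);
            }
            return (live);
    }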
NOTE! The meta-data scanning passes must also explicitly scan blocks
      associated with any open files, since these might represent
      open-but-deleted files.  These blocks must not be accidentally
      freed while the system is still using the file.  Again, since
      this is done in two passes it does not have to be synchronized
      against frontend operations.  So in total:

      * Topology under all four volume headers.  This includes all
        PFSs and snapshots.

      * Topology under all open hammer2 files.

The bulkfree operation is expensive but uses a bounded amount of RAM.
The ADVANTAGE of this mechanism is that deletions in the live
filesystem do not have to clean up the freemap and thus do not have to
recurse the topology during the deletion.  In fact, a 'rm -rf'
equivalent of a directory topology can be handled simply by blowing
away the top-level directory inode.  This is instantaneous and thus can
be dangerous, but you always have your snapshots to fall back on.

The DISADVANTAGE is that all meta-data must be scanned.  Twice.  This
can be mitigated by using swapcache(8) to cache the meta-data on a SSD.
It is also mitigated by the fact that you can run the bulkfree scan
less often on very large filesystems, which presumably have a lot of
free space (so the interval is not as big an issue).  In a sense the
operation does scale: it takes longer on larger filesystems but can
also be run less often.

The biggest issue is that *NO* space can be freed up by the live
filesystem without the bulkfree process unless we optimize the case
where data is created and deleted from within a single snapshot.
This is made more difficult by the fact that each flush represents
a fine-grained snapshot (up to four, representing the four volume
headers the flush iterates through).

Snapshots and Replicated Topologies

The bulkfree code maintains information in-memory to the best of its
ability for a multitude of reasons, including attempting to detect
snapshot recursions down block chains which have already been scanned
via some other snapshot.  Without this, a large number of snapshots
can cause a huge multiplication of disk I/O reads (but not writes)
during the topology scan.

Use of Generic indirect-block API

I decided to use the same indirect-block allocation model for the
freemap that normal files use, with a few special cases added to force
specific radix values and to 'allocate' the freemap-related blocks
and indirect blocks via a reserved-block calculation and (obviously)
not via a recursive call to the allocator.

The freemap is defined above as a fixed 5-level scheme (levels 1-5),
but in actual operation the radix tree can be shortcut just as it
is with normal files.  However, unlike normal files, shortcuts will
be forced to use specific radix values in order to guarantee that
reserved block numbers can be trivially calculated.  As the freemap
becomes more fleshed out, the tree on-media will look more and more
like the actual specification.

One advantage of doing things this way is that smaller filesystems
won't actually use a 5-level scheme.  A 16GB filesystem can use 8
blockrefs in the volume header which point directly to layer 1 leaf
blocks.  A 16TB filesystem can be managed with only three levels
(layers 3, 2, and 1 only, with the 8 layer 3 blockrefs stored in the
volume header).  And so forth.
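The shortcut rule lends itself to a trivial depth computation.  The
sketch below is hypothetical and counts the real freemap levels below
the volume header (the text above also counts the volume header's
embedded blockref array as a layer, so its numbers are one higher).

    #include <stdint.h>

    /*
     * Number of real freemap levels (leaf = level 1) needed below the
     * volume header's 8 blockrefs for a filesystem of a given size.
     * One leaf covers 2GB, and each additional level multiplies the
     * coverage by 1024: 8 x 2GB = 16GB, then 16TB, 16PB, 16EB.
     */
    static int
    freemap_levels(uint64_t volume_bytes)
    {
            if (volume_bytes <= (8ULL << 31))  /* 16GB: leafs only */
                    return (1);
            if (volume_bytes <= (8ULL << 41))  /* 16TB: levels 1-2 */
                    return (2);
            if (volume_bytes <= (8ULL << 51))  /* 16PB: levels 1-3 */
                    return (3);
            return (4);                        /* up to 16EB (2^64) */
    }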
At the moment we have no plans to return any of the unused 4MB zone
header space (per 2GB of storage) back to the filesystem for general
use.  There are lots of things we may want to use the reserved areas
for in the future.

Emergency Deletions

All filesystem modifications, including deletions, must allocate blocks
in order to update the main topology all the way to the root.  H2 will
reserve roughly 5% of the available blocks in the filesystem for
deletions in order to allow a system operator to recover from a
filesystem-full condition.

However, due to the snapshot capability as well as the possibility of
fragmentation, it is possible for the administrator to not delete
enough to actually be able to free up blocks.  Once the reserve is used
up, the filesystem can become unwritable.

When this situation occurs the only way to recover is to update blocks
in-place.  Updating blocks in-place will destroy the data on any
related snapshots or otherwise corrupt the snapshots.  Emergency
recovery thus recommends that all related snapshots be destroyed.  You
can choose not to do this, in which case your snapshots might wind up
containing broken links and generate CRC failure messages.

For the moment the spec for dealing with these situations remains
incomplete.