HAMMER2 Freemap Design Notes

Overview

HAMMER2 media is broken down into 2 GByte zones.  Each 2 GByte zone
contains a 4 MByte header (64 x 64K blocks = 0.2% of storage).  The
blocks in this header are reserved for various purposes.  For example,
block #0 is reserved for a volume header.  Most of the remaining
64KB blocks in this header are reserved for use by the freemap.

The freemap only uses blocks from these reserved areas.  In order to
ensure that any of the four volume headers can be used by the mount code
(in case some are found to be corrupted), each freemap block in the
logical freemap topology will iterate through up to 8 copies whose
block numbers are taken from the reserved area.

- Four copies, one for each of the four volume headers which H2 sequences
  through on each flush.  This ensures that a mount from any of the four
  volume headers is handed a consistent freemap topology.

- One copy to ensure that recovery operations during mount do not modify
  the state of the freemap topology pointed to by older volume headers
  which are still valid.  Note that the freemap for volume headers
  indexed after the mount point being recovered may lose freemap
  consistency, so if you choose an older mount point for a RW mount,
  you have to stick with it.

- One copy for live operations.  This allows HAMMER2 to retire the
  related buffer (or for the OS to retire the buffer cache buffer)
  prior to the next flush and also allows the buffers to be flushed
  asynchronously.

- The two remaining copies add robustness to the specification.  For
  example, with appropriate feature code the filesystem can tolerate
  a limited number of bad blocks in the reserved area.

For the moment we use a simple calculation for the freemap block.  In
a later version I would like to mix the blocks up a bit so the blocks
in each set of 8 are not situated near each other.
RW Mount Restrictions

If an older volume header is explicitly selected by the mount code, any
newer (presumably corrupt, since the mount code didn't select them) volume
headers will lose freemap consistency as the freemap code rotates into
freemap blocks that might have been used by the topology pointed to by
the newer (but not selected) volume headers.  For a RW mount, this means
that if an older volume header is selected, the newer ones that were
not selected WILL be formally invalidated by the mount code and cannot
be used in a remount attempt.

During normal operation, each filesystem flush rotates to a new volume
header.  A filesystem may have up to four volume headers spread at 2GB
intervals.  Filesystems smaller than ~9GB or so will have fewer volume
headers to rotate through.

Freemap Topology

The freemap topology contains 4 levels of meta-data (blockref arrays),
one of which is embedded in the volume header (so only three real
meta-data levels), plus one level of leaf-data.  Unlike normal files,
which use a variable radix, the freemap topology uses a fixed radix to
simplify the algorithm and to ensure freemap locality to the blocks
under management.

Freemap blocks are allocated from the reserved area in each 2GB zone.
The leafs represent data in the zone.  Higher levels in the freemap
topology cover more area, but the physical freemap meta-data blocks
always occur prior to the area being covered.  Thus a HAMMER2 filesystem
of almost any size can be formatted and the related freemap blocks
will always exist.

Level 1 - (radix 10 + 21) 64KB representing 2GB.  This is represented
	  by a hammer2_bmap_data[1024] array.  Each entry represents
	  2MB worth of media storage x 1024 entries to represent 2GB.
	  Each entry contains a 128x2 bit bitmap representing 16KB
	  of storage in 2 bits (128 x 16KB = 2MB).
Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
	  (this conveniently eats one 512-byte 'sector' of the 64KB
	  volume header).

Each level is assigned reserved blocks in the 4MB header per 2GB zone.
Since we use block 0 for the volume header, the first freemap reserved
block in the zone begins at block 1.

Freemap copy #0:
	Level 1 uses block 1 (this is the leaf block)
	Level 2 uses block 2
	Level 3 uses block 3
	Level 4 uses block 4

Freemap copy #1:
	Level 1 uses block 5 (this is the leaf block)
	Level 2 uses block 6
	Level 3 uses block 7
	Level 4 uses block 8

... and so forth, up to Freemap copy #7 using blocks 29, 30, 31, and 32.

Flushing

The freemap does not have to be flushed by fsync/sync, but should probably
be flushed at least once a minute by the normal filesystem sync.  The
reason it does not have to be flushed every time is that freemap recovery
is executed on mount and will use the last fully flushed freemap TID
stored in the volume header to do an incremental meta-data scan of the
H2 filesystem between that TID and the last flushed TID.  All blocks not
found to have been marked allocated will be marked allocated.  Simple as
that.

Freemap Granularity

The freemap granularity is 16KB (radix of 14) but the minimum
allocation radix is 1KB (radix of 10) (and can be in multiples of
1KB with some coding).  1KB inodes can hold up to 512 bytes of direct
data, so tiny files eat exactly 1KB of media storage inclusive of the
inode.
The freemap keeps track of partial allocations in-memory but not
on-media, so even a normal umount will cause partially allocated
blocks to appear fully allocated until some later date when the
bulk scan code defragments them.

Block Selection

Block selection is localized to be near the inode's (or nearby data)
blockref.  The algorithmic complexity of determining locality is not
defined here atm.

Freemap Leaf Substructure

* linear - Linear sub-granular allocation offset.  Allows ~1KB granular
	   linear allocations.

* class - Allocation clustering class ((type << 8) | radix).

* avail - Available space in bytes, currently only used by the layer 1
	  leaf.  Used as an allocation clustering aid.

* bitmap - Eight 32-bit words representing ~2MB in 16KB allocation chunks
	   at 2 bits per chunk.  The filesystem allocation granularity
	   can be smaller (currently ~1KB minimum), and the live
	   filesystem caches iterations when allocating multiple chunks.
	   However, on remount any partial allocations out of a 64KB
	   allocation block MAY cause the entire 64KB to be considered
	   allocated.  Fragmented space can potentially be reclaimed
	   and/or relocated by the bulk block free scan.

The 2-bit bitmap fields are assigned as follows:

	00	FREE
	01	POSSIBLY FREE (type 1)
	10	POSSIBLY FREE (type 2)
	11	ALLOCATED

Freemap Metadata Substructure
(Levels 2, 3, 4, and 5)

Freemap levels 2, 3, 4, and 5 operate as arrays of blockrefs but steal
some of the check area (a 24-byte area) for freemap-specific meta-data.
We reserve a few fields to store information which allows the block
allocator to do its work more efficiently.

* bigmask - A mask of radixes available for allocation under this
	    blockref.  Typically initialized to -1.

* avail - Total available space in bytes.
The freemap allocator uses a cylinder-group-like abstraction, using
the localized allocation concept first implemented by UFS.  In HAMMER2
there is no such thing as a real cylinder group, nor are there specific
reserved areas for inodes vs data, but we do the next best thing by
roughly typing leafs (each leaf representing ~2MB) to hopefully allow
the drive to employ its zone-cache to make both stat-only and tar-style
bulk accesses efficient (in addition to normal file accesses).

Levels 2, 3, and 4 each contain an array blockmap[1024] (64KB total),
supplying 10 bits of address space each.  Level 5 is a blockmap[8]
stored in the volume header, supplying 3 bits of address space.
(Level 1 supplies 10 + 21 bits of address space.)

The Level 1 blockmap is HAMMER2's idea of a 'cylinder group', thus
effectively fixed at multiples of ~2MB or so.

Initial Conditions

The freemap is a multi-indirect block structure but there is no real
reason to pre-format it in newfs_hammer2.  Instead, newfs_hammer2 simply
leaves the associated top-level indirect blocks empty and uses the
(voldata->allocator_beg) field to allocate space linearly, then leaves
it to the live filesystem to initialize the freemap as more space gets
allocated.

The freemap does NOT use a fixed 5-level radix tree.  It uses the same
blockmap algorithm used for file blocks but restricts any recursion to
specific radix values.  This means that small filesystems will have much
smaller freemap depths: 2 layers (not counting the volume header as
a layer) gets us 16GB, and 3 layers gets us 16TB.

How blocks are allocated and freed

The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
contains a linear allocation offset that can keep track of sub-16KB
allocations with certain restrictions.
More random sub-16KB allocations
are tracked in-memory, but will be lost (assumed to be a full 16KB) if
a crash occurs.

NOTE! All operations on the freemap occur on the current live version
of the freemap, including bulkfree operations.

Blocks are allocated by transitioning the 2-bit pattern in the leaf
to 11.  That is, (00, 01, 10) -> (11).  This handles races between the
live allocator and the asynchronous bulkfree code.  A live allocation
which occurs while the asynchronous bulkfree process is running will
operate race-free by transitioning the (01) and (10) states back
to (11), which prevents bulkfree from later marking those blocks
FREE (00).

Blocks can be freed several ways, but all block freeing operations
require at least two passes before the related blocks can actually be
reused.

Method #1 - Removal in the filesystem marks the related freemap bitmap
	    POSSIBLY FREE (either 01 or 10).  The asynchronous bulkfree
	    process later determines that the block is actually free and
	    transitions it to FREE (00), or moves it back to
	    ALLOCATED (11).

	    This works whether the blocks can actually be freed or not,
	    so we don't care if the related blocks are part of some other
	    snapshot or not.  bulkfree will figure it out.

Method #2 - Removal in the filesystem ignores the freemap.  The freemap
	    blocks are left alone (typically ALLOCATED (11)).

	    In this situation bulkfree must make extra passes to determine
	    if blocks are possibly free, then transition the leaf bitmap
	    entries to POSSIBLY FREE (01 or 10).  bulkfree cannot directly
	    transition the entries to FREE (00) without another pass.

	    However, this method has numerous advantages, including making
	    snapshot manipulation (including deletions) instantaneous
	    and allowing whole subtrees and/or large files to be rm -rf'd
	    with only a single disk write to update the inode in the
	    best case.
Method #3 - Brute force.  *ALL* freemap bitmap entries are marked
	    POSSIBLY FREE and bulkfree then must do multiple passes
	    (particularly in order to ensure that its memory use remains
	    bounded) to again transition all the freemap bitmap entries
	    to either FREE (00) or ALLOCATED (11).

	    This method can be faster than #2 but wastes a considerable
	    amount of write bandwidth (and SSD wear if the target drive
	    is an SSD).

In all cases the bulkfree code must make a final pass on the filesystem
to do the final transition of POSSIBLY FREE blocks to FREE (00) or
ALLOCATED (11).  Again, races for the FREE (00) transition are handled by
observing whether the bitmap entry was moved to ALLOCATED (11) by the
live system while bulkfree ran asynchronously, and not transitioning the
entry to FREE (00) in that situation.

All bulkfree passes are done on meta-data.  Actual data blocks do not
need to be read unless the media is being verified.  H2 uses method #2
by default, and efficiency depends on how much RAM the system has to
cache scan information.  That said, the bulkfree process is not only
incremental but also interruptible and restartable.  It does not
interfere with live operations other than using disk bandwidth, so
there are several ways to run it, including in the background.

The biggest issue is that *NO* space can be freed up by the live
filesystem without the bulkfree process unless we optimize the case
where data is created and deleted from within a single snapshot.
This is made more difficult by the fact that each flush represents
a fine-grained snapshot (up to four, representing the four volume
headers the flush iterates through).
Snapshots and Replicated Topologies

The bulkfree code maintains information in-memory to the best of its
ability for a multitude of reasons, including attempting to detect
snapshot recursions down block chains which have already been scanned
via some other snapshot.  Without this, a large number of snapshots
can cause a huge multiplication of disk I/O reads (but not writes) during
the topology scan.

Use of Generic indirect-block API

I decided to use the same indirect-block allocation model for the
freemap that normal files use, with a few special cases added to force
specific radix values and to 'allocate' the freemap-related blocks
and indirect blocks via a reserved-block calculation and (obviously)
not via a recursive call to the allocator.

The freemap is defined above as a fixed 5-level scheme (levels 1-5),
but in actual operation the radix tree can be shortcut just as it
is with normal files.  However, unlike normal files, shortcuts will
be forced to use specific radix values in order to guarantee that
reserved block numbers can be trivially calculated.  As the freemap
becomes more fleshed out, the tree on-media will look more and more like
the actual specification.

One advantage of doing things this way is that smaller filesystems
won't actually use a 5-level scheme.  A 16GB filesystem can use 8
blockrefs in the volume header which point directly to layer 1 leaf
blocks.  A 16TB filesystem can be managed with only three levels
(layers 3, 2, and 1 only, where the 8 x layer 3 blockrefs are stored in
the volume header).  And so forth.

At the moment we have no plans to return any of the unused 4MB zone
header space (per 2GB of storage) back to the filesystem for general use.
There are lots of things we may want to use the reserved areas for in
the future.
Emergency Deletions

All filesystem modifications, including deletions, must allocate blocks
in order to update the main topology all the way to the root.  H2 will
reserve roughly 5% of the available blocks in the filesystem for
deletions in order to allow a system operator to recover from a
filesystem-full condition.

Despite this, situations may come up, due to having snapshots, where
deletions eat up available blocks but fail to create freeable space.
When this situation occurs the system operator may be forced to issue
emergency in-place deletions which replace existing blocks rather than
allocating new blocks.  For the moment the spec for dealing with these
situations remains incomplete.