1 2 HAMMER2 Freemap Design Notes 3 4 Overview 5 6 HAMMER2 Media is broken down into 2 GByte zones. Each 2 GByte zone 7 contains a 4 MByte header (64 x 64K blocks = 0.2% of storage). The 8 blocks in this header are reserved for various purposes. For example, 9 block #0 is reserved for a volume header or volume header backup. Most 10 of the 64KB blocks in this header are reserved for use by the freemap. 11 12 The freemap only uses blocks from these reserved areas. In order to 13 ensure that any of the four volume headers can be used for the mount 14 (in case some are found to be corrupted), each freemap block in the 15 logical freemap topology iterates through 6 different copies. The 16 freemap is a 4-layer topology (+1 3-bit layer in the volume header), 17 so 6x4 = 24 of the 64 reserved blocks are dedicated to freemap 18 operations. 19 20 Any given modification of a freemap block that crosses a flush group 21 must cycle to the next copy of the freemap block. Having 6 copies 22 ensures that: 23 24 - Each of the four backup volume headers points to a consistent 25 freemap topology. This eats 4 copies. 26 27 - That recovery operations during mount do not modify the state of the 28 freemap topology pointed to by older volume headers that are still 29 valid. This eats 1 copy. 30 31 - The bulk free scan eats 1 copy to use as spool-off space if the 32 thread hits its ram limits. This copy is not part of the normal 33 rotation. 34 35 - Total is 6 copies. 36 37 However, there is one major restriction: If an older volume header is 38 selected by the mount code, any newer (presumably corrupt since the 39 mount code didn't select it) volume headers will lose freemap consistency 40 as the freemap code rotates into freemap blocks that might have been used 41 by the topology pointed to by the newer (but not selected) backup 42 volume headers. For a RW mount, this means that if an older volume 43 backup is selected, the newer ones that were not selected MUST be 44 formally invalidated and cannot be used in a remount attempt. To 45 mitigate the potential loss of data, any volume headers lost in this 46 manner can be snapshotted and the freemap recovery scan (in a RW mount) 47 can also scan the snapshots to try to ensure that the blocks are marked 48 as allocated. The system operator can then check the snapshot manually. 49 50 During normal operation, each filesystem flush rotates to a new backup 51 volume header (a filesystem has up to four) and retains full consistency 52 for the older volume headers. Each logical freemap block in the topology 53 rotates through the 6 possible versions (on-modify only). 54 55 Freemap Topology 56 57 The freemap topology contains 4 levels of meta-data (blockref arrays), 58 one of which is embedded in the volume header (so only three real 59 meta-data levels), plus one level of leaf-data. Unlike normal files, 60 which use a variable-radix, the freemap topology uses a fixed radix to 61 simplify the algorithm and to ensure freemap locality to the blocks 62 under management. 63 64 Level 1 - (radix 10 + 21) 64KB representing 2GB. This is represented 65 by a hammer2_bmap_data[1024] array. Each entry represents 66 2MB worth of media storage x 1024 entries to represent 2GB. 67 Each entry contains a 128x2 bit bitmap representing 16KB 68 of storage in 2 bits (128 x 16KB = 2MB). 69 70 Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry) 71 Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry) 72 Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry) 73 Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64) 74 (this conveniently eats one 512-byte 'sector' of the 64KB 75 volume header). 76 77 Each level is assign reserved blocks in the 4MB header per 2GB zone. 78 Since we use block 0 for the volume header / volume header backup, 79 our level names above can simply also represent the relative block 80 number. Level 1 uses block 1 through level 4 using block 4. Level 5 81 is stored in the volume header. 82 83 Flushing 84 85 The freemap does not have to be flushed by fsync/sync, but should probably 86 be flushed at least once a minute by the normal filesystem sync. The 87 reason it does not have to be flushed every time is that the freemap 88 recovery (using the last fully flushed freemap TID) will simply do an 89 incremental scan of the main filesystem tree between the freemap TID 90 and the main filesystem tree's TID to ensure that blocks allocated in 91 the interim are properly allocated in the freemap. Simple as that. 92 93 Freemap Granularity 94 95 The freemap granularity is 16KB (radix of 14) but the minimum 96 allocation radix is 1KB (radix of 10) (and can be in multiples of 97 1KB with some coding). 1KB inodes can hold up to 512 bytes of direct 98 data, so tiny files eat exactly 1KB of media storage inclusive of the 99 inode. 100 101 The freemap keeps track of partial allocations in-memory but not 102 on-media, so even a normal umount will cause partially allocated 103 blocks to appear fully allocated until some later date when the 104 bulk scan code defragments it. 105 106 Block Selection 107 108 Block selection is localized to be near the inode's (or nearby data) 109 blockref. The algorithmic complexity of determining locality is not 110 defined here atm. 111 112 Leaf Substructure 113 114 * radix - Clustering radix. All allocations for any given ~2MB zone 115 are always the same size, allowing the filesystem code to 116 cluster buffer cache I/O. 117 118 * bitmap - four 32 bit words representing ~2MB in 16KB allocation chunks 119 at 2 bits per chunk. The filesystem allocation granularity 120 can be smaller (currently ~1KB minimum), and the live 121 filesystem keeps caches iterations when allocating multiple 122 chunks. However, on remount any partial allocations out of 123 a 64KB allocation block causes the entire 64KB to be 124 considered allocated. Fragmented space can potentially be 125 reclaimed and/or relocated by the bulk block free scan. 126 127 The 2-bit bitmap fields are assigned as follows: 128 129 00 FREE 130 01 ARMED for free stage (future use) 131 10 ARMED for free stage (future use) 132 11 ALLOCATED 133 134 It should be noted that in some cases, such as snapshot 135 destruction, H2 does not bother to actually ARM the related 136 blocks (which would take a long time). Instead, the bulk 137 free-scan may have to do a more exhaustive scan. 138 139 Blockref Substructure 140 141 The blockref substructure at each level steals some space from the 142 check code area (a 24-byte area). We only need 4 bytes for the check 143 code icrc. We use some of the remaining space to store information 144 that allows the block allocator to do its work more efficiently. 145 146 * bigmask - A mask of radixes available for allocation under this 147 blockref. Typically initialized to -1. 148 149 * avail - Total available space in bytes. 150 151 The freemap allocator uses a cylinder-group-like abstraction using 152 the localized allocation concept first implemented by UFS. In HAMMER2 153 there is no such thing as a real cylinder group, but we do the next 154 best thing by implementing our layer 1 blockmap representing 2GB. 155 156 Level 2, 3, 4, 5 157 158 Levels 2, 3, and 4 contains an array blockmap[1024] (64KB total), 159 supplying 10 bits of address space each. Level 5 is a blockmap[8] stored 160 in the volume header supplying 3 bits of address space. (level 0 161 supplies 10 + 21 bits of address space). 162 163 The Level1 blockmap is HAMMER2's idea of a 'cylinder group', thus 164 effectively fixed at multiples of ~2MB or so. 165 166 Initial Conditions 167 168 The freemap is a multi-indirect block structure but there is no real 169 reason to pre-format it in newfs_hammer2. Instead, newfs_hammer2 simply 170 leaves the associated top-level indirect blocks empty and uses the 171 (voldata->allocator_beg) field to allocate space linearly, then leaves 172 it to the live filesystem to initialize the freemap as more space gets 173 allocated. 174 175 The freemap does NOT use a fixed 5-level radix tree. It uses the same 176 blockmap algorithm used for file blocks but restricts any recursion to 177 specific radix values. This means that small filesystems will have much 178 smaller freemap depths. 2 layers (and not counting the volume header as 179 a layer) gets us 16GB, 3 layers gets us 16TB. 180 181 How blocks are allocated and freed 182 183 H2 keeps track of sub-16KB allocations in-memory. On a crash/reboot any 184 partial allocations effectively become full 16KB block allocations until 185 the bulk freeing code comes along and fixes it. 2-bit patterns are as 186 follows: 187 188 00 FREE 189 01 ARMED (for free) (future use) 190 10 ARMED (for free) (future use) 191 11 ALLOCATED 192 193 Currently H2 only implements 00 and 11. When a file, topology, or 194 snapshot is deleted H2 simply leaves the blocks marked allocated but 195 records the related freezone/radix(s) in memory. 196 197 At some point a background bulk free-scan will run. This code must 198 scan meta-data and has a limited cache to detect duplicative sub-trees 199 (due to snapshots). It uses the freezone/radix information recorded 200 in memory to reduce the complexity of the scan, find all references to 201 the related blocks in the meta-data, and determines what can actually 202 be freed. Once this determination is made the bulk free-scan sets 203 the related freemap bits to FREE (00). 204 205 An exhaustive free-scan is not usually required during normal operation 206 but is typically run incrementally by cron every so often to ensure, over 207 time, that all freeable blocks are actually freed. This is most useful 208 when maintaining multiple snapshots. 209 210 Use of Generic indirect-block API 211 212 I decided to use the same indirect-block allocation model for the 213 freemap that normal files use, with a few special cases added to force 214 specific radix values and to 'allocate' the freemap-related blocks 215 and indirect blocks via a reserved-block calculation and (obviously) 216 not via a recursive call to the allocator. 217 218 The Freemap is defined above as a fixed 5-level scheme (level 1-5), 219 but in actual operation the radix tree can be shortcut just as it 220 is with normal files. However, shorcuts are forced into the radix 221 values of this specification and reserved blocks are calculated based 222 on the radix level and offset, so as the freemap becomes more fleshed 223 out the tree looks more and more like the specification. 224 225 One advantage of doing things this way is that smaller filesystems 226 won't actually use a 6-level scheme. A 16GB filesystem can use 8 227 blockrefs at layer 5 (in the volume header) that point directly to 228 layer 1. A 16TB filesystem can use 8 blockrefs at layer5 that point 229 to layer 2. And so forth. 230 231 At the moment we have no plans to return any of the unused 4MB zone 232 header space (per 2GB of storage) back to the filesystem for general use. 233 There are lots of things we may want to use the reserved areas for in 234 the future. 235