			HAMMER2 Freemap Design Notes

				Overview

HAMMER2 media is broken down into 2 GByte zones.  Each 2 GByte zone
contains a 4 MByte header (64 x 64K blocks).  The blocks in this header
are reserved for various purposes.  For example, block #0 is used for
the volume header or for a volume header backup.

* It is very important to remember that the Freemap only uses blocks
  from these reserved areas.  Freemap blocks are NOT dynamically
  allocated.

* On mount, the synchronization TID for the main H2 filesystem is
  compared against the synchronization TID of the freemap, and the
  H2 topology is incrementally iterated using mirror_tid to update
  the freemap with any missing information.  This way the freemap flush
  does not need to be synchronized with the normal H2 flush.  This
  can be done very quickly at mount time.

* The freemap is flushed in a manner similar to the normal H2 filesystem,
  but as mentioned above it does not have to be synchronized.  One freemap
  flush could cover several H2 flushes.  A freemap flush is not necessary
  for e.g. a fsync() or sync() to complete successfully.

* The freemap granularity is 64KB (radix 16) but the minimum allocation
  radix is 1KB (radix 10).  1KB inodes can hold up to 512 bytes of
  direct data, so small files eat exactly 1KB of media storage.

* Representation of storage is block-oriented with ~1KB granularity
  in the filesystem topology.  However, H2 also stores freemap locality
  hints in the inode at all levels which specify which freemap zones
  all storage allocations made by the sub-tree are allocated from.  Up
  to four zones may be listed in each inode.  The zones are power-of-2
  sized and similarly aligned, and use a base/radix representation (the
  same as used for blockref->data_off).

  During updates, higher-level inodes may not have a sufficient number
  of entries to represent the storage used at a fine grain.
  In this situation the representations back off to larger radix values.

  Ultimately these representations will be optimized by background scans.
  That is, ultimately storage localization can be optimized bottom-up
  such that it winds up being fairly optimal.  This includes the ability
  to detect when a writable snapshot has differentiated sufficiently to
  warrant a split.  This optimization should NOT attempt to dup common
  data blocks.

* The zone-oriented forward storage references in the inode (the four
  entries) are used by the bulk free-scan to reduce the amount of
  meta-data which must be duplicatively scanned.  More specifically,
  when the sysadmin deletes storage and/or files or even whole directory
  subhierarchies, it is possible for a bulk free-scan to incrementally
  scan the meta-data topology that covers ONLY those areas to determine
  whether it is possible to free up any actual blocks.

* H2 does not require that a rm -rf, snapshot destruction, truncation,
  or any other operation actually modify the freemap.  That is, the
  freemap elements can remain set to ALLOCATED (11).  In fact, it is
  possible to just delete the directory inode and not even recursively
  scan sub-directories.  The related storage will eventually be freed
  by an exhaustive bulk free-scan anyway.

			Freemap Topology

The freemap topology contains 4 levels of meta-data (blockref arrays),
one of which is embedded in the volume header (so only three real
meta-data levels), plus one level of leaf-data.

Level 1 - (radix 10) 64KB blockmap representing 2GB.  There are 1024
	  entries representing ~2MB worth of media storage per entry.

	  Each entry maps 32 x 64KB allocations @ 2 bits per allocation,
	  plus contains additional meta-data which allows H2 to cluster
	  I/O operations.  Each entry locks the allocation granularity
	  (e.g. to 1KB = radix 10 for inodes).
Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
Level 5 - (radix 3) blockref x 8 in the volume header representing 16EB
	  (2^64).  (This conveniently eats one 512-byte 'sector' of the
	  64KB volume header.)

Each level is assigned reserved blocks in the 4MB header of each 2GB
zone.  Since block 0 is used for the volume header / volume header
backup, the level numbers above can simply double as relative block
numbers: levels 1 through 4 use blocks 1 through 4, respectively.
Level 5 is stored in the volume header.

In addition there are FOUR SETS, A, B, C, and D, each containing blocks
for levels 1-4.  HAMMER2 alternates between sets on a block-by-block
basis in order to maintain consistency when updating the freemap.

			Leaf Substructure

* radix  - Clustering radix.  All allocations for any given ~2MB zone
	   are always the same size, allowing the filesystem code to
	   cluster buffer cache I/O.

* bitmap - Four 32-bit words representing ~2MB in 64KB allocation chunks
	   at 2 bits per chunk.  The filesystem allocation granularity
	   can be smaller (currently ~1KB minimum), and the live
	   filesystem caches its iteration state when allocating
	   multiple chunks.  However, on remount any partial allocation
	   out of a 64KB allocation block causes the entire 64KB to be
	   considered allocated.  Fragmented space can potentially be
	   reclaimed and/or relocated by the bulk block free scan.

	   The 2-bit bitmap fields are assigned as follows:

	   00	FREE
	   01	ARMED for free stage (future use)
	   10	ARMED for free stage (future use)
	   11	ALLOCATED

	   It should be noted that in some cases, such as snapshot
	   destruction, H2 does not bother to actually ARM the related
	   blocks (which would take a long time).
	   Instead, the bulk free-scan may have to do a more exhaustive
	   scan.

			Blockref Substructure

The blockref substructure at each level steals some space from the
check code area (a 24-byte area).  We only need 4 bytes for the check
code icrc.  We use some of the remaining space to store information
that allows the block allocator to do its work more efficiently.

* bigmask - A mask of radixes available for allocation under this
	    blockref.  Typically initialized to -1.

* avail   - Total available space in bytes.

The freemap allocator uses a cylinder-group-like abstraction based on
the localized allocation concept first implemented by UFS.  In HAMMER2
there is no such thing as a real cylinder group, but we do the next
best thing by implementing our layer 1 blockmap representing 2GB.

The layer 1 blockmap is an array of 1024 blockrefs, so each blockref
covers 2MB worth of media storage.  HAMMER2's 'cylinder group' concept
thus has a minimum granularity of 2MB.  A typical setting might be e.g.
10MB.

By localizing allocations to cylinder groups based on various bits of
information, HAMMER2 tries to allocate space on the disk while still
leaving some left over for localized expansion, and to reduce
fragmentation at the same time.  Not an easy task, especially
considering the copy-on-write nature of the filesystem.  This part of
the algorithm likely needs a lot of work, but I hope I've laid down a
media format that will not have to be changed down the line to
accommodate better allocation strategies.

			Initial Conditions

The freemap is a multi-indirect block structure, but there is no real
reason to pre-format it in newfs_hammer2.
Instead, newfs_hammer2 simply
leaves the associated top-level indirect blocks empty and uses the
(voldata->allocator_beg) field to allocate space linearly, then leaves
it to the live filesystem to initialize the freemap as more space gets
allocated.

			How blocks are freed

The freemap bit patterns for each 64KB block are as follows:

	00	FREE
	01	ARMED (for free)	(future use)
	10	ARMED (for free)	(future use)
	11	ALLOCATED

Currently H2 only implements 00 and 11.  When a file, topology, or
snapshot is deleted, H2 simply leaves the blocks marked allocated but
records the related freezone/radix(es) in memory.

At some point a background bulk free-scan will run.  This code must
scan meta-data and has a limited cache to detect duplicative sub-trees
(due to snapshots).  It uses the freezone/radix information recorded
in memory to reduce the complexity of the scan, find all references to
the related blocks in the meta-data, and determine what can actually
be freed.  Once this determination is made, the bulk free-scan sets
the related freemap bits to FREE (00).

An exhaustive free-scan is not usually required during normal operation
but is typically run incrementally by cron every so often to ensure,
over time, that all freeable blocks are actually freed.  This is most
useful when maintaining multiple snapshots.

		Use of the Generic Indirect-Block API

I decided to use the same indirect-block allocation model for the
freemap that normal files use, with a few special cases added to force
specific radix values and to 'allocate' the freemap-related blocks
and indirect blocks via a reserved-block calculation and (obviously)
not via a recursive call to the allocator.

The freemap is defined above as a fixed 5-level scheme (levels 1-5),
but in actual operation the radix tree can be shortcut just as it
is with normal files.
However, shortcuts are forced into the radix
values of this specification, and reserved blocks are calculated based
on the radix level and offset, so as the freemap becomes more fleshed
out the tree looks more and more like the specification.

One advantage of doing things this way is that smaller filesystems
won't actually use the full 5-level scheme.  A 16GB filesystem can use
8 blockrefs at level 5 (in the volume header) that point directly to
level 1.  A 16TB filesystem can use 8 blockrefs at level 5 that point
to level 2.  And so forth.

At the moment we have no plans to return any of the unused 4MB zone
header space (per 2GB of storage) back to the filesystem for general
use.  There are lots of things we may want to use the reserved areas
for in the future.