11a7cfe5aSMatthew Dillon 21a7cfe5aSMatthew Dillon HAMMER2 Freemap Design Notes 31a7cfe5aSMatthew Dillon 41a7cfe5aSMatthew Dillon Overview 51a7cfe5aSMatthew Dillon 61a7cfe5aSMatthew Dillon HAMMER2 Media is broken down into 2 GByte zones. Each 2 GByte zone 7a71db85dSMatthew Dillon contains a 4 MByte header (64 x 64K blocks = 0.2% of storage). The 8a71db85dSMatthew Dillon blocks in this header are reserved for various purposes. For example, 93148f677SMatthew Dillon block #0 is reserved for a volume header in the first four zones. Most 103148f677SMatthew Dillon of the remaining 64KB blocks in this header are reserved for use by the 113148f677SMatthew Dillon freemap. 121a7cfe5aSMatthew Dillon 13a71db85dSMatthew Dillon The freemap only uses blocks from these reserved areas. In order to 14bca9f8e6SMatthew Dillon ensure that any of the four volume headers can be used by the mount code 15a71db85dSMatthew Dillon (in case some are found to be corrupted), each freemap block in the 16bca9f8e6SMatthew Dillon logical freemap topology will iterate through up to 8 copies whos 17bca9f8e6SMatthew Dillon block numbers are taken the reserved area. 181a7cfe5aSMatthew Dillon 19bca9f8e6SMatthew Dillon - Four copies, one for each of the four volume headers which H2 sequences 20bca9f8e6SMatthew Dillon through on each flush. This ensures that a mount from any of the four 21bca9f8e6SMatthew Dillon volume headers is handed a consistent freemap topology. 221a7cfe5aSMatthew Dillon 23bca9f8e6SMatthew Dillon - One copy to ensure that recovery operations during mount do not modify 24bca9f8e6SMatthew Dillon the state of the freemap topology pointed to by older volume headers 25bca9f8e6SMatthew Dillon which are still valid. Note that the freemap for volume headers 26bca9f8e6SMatthew Dillon indexed after the mount point being recovered may lose freemap 27bca9f8e6SMatthew Dillon consistency, so if you choose an older mount point for a RW mount, 28bca9f8e6SMatthew Dillon you have to stick with it. 291a7cfe5aSMatthew Dillon 30bca9f8e6SMatthew Dillon - One copy for live operations. This allows HAMMER2 to retire the 313148f677SMatthew Dillon related buffer asynchronously in the background (or for the OS to 323148f677SMatthew Dillon retire the buffer cache buffer on its own) prior to the formal 333148f677SMatthew Dillon flush. The later formal flush then has less work to do. 341a7cfe5aSMatthew Dillon 35bca9f8e6SMatthew Dillon - The two remaining copies add robustness to the specification. For 363148f677SMatthew Dillon example, with appropriate feature code added the filesystem can 373148f677SMatthew Dillon tolerate a limited number of bad blocks in the reserved area. 3893f3933aSMatthew Dillon 39bca9f8e6SMatthew Dillon For the moment we use a simple calculation for the freemap block. In 40bca9f8e6SMatthew Dillon a later version I would like to mix the blocks up a bit so the blocks 41bca9f8e6SMatthew Dillon in each set of 8 are not situated near each other. 4293f3933aSMatthew Dillon 43bca9f8e6SMatthew Dillon RW Mount Restrictions 4493f3933aSMatthew Dillon 45bca9f8e6SMatthew Dillon If an older volume header is explicitly selected by the mount code, any 46bca9f8e6SMatthew Dillon newer (presumably corrupt since the mount code didn't select it) volume 47bca9f8e6SMatthew Dillon headers will lose freemap consistency as the freemap code rotates into 48bca9f8e6SMatthew Dillon freemap blocks that might have been used by the topology pointed to by 49bca9f8e6SMatthew Dillon the newer (but not selected) volume headers. For a RW mount, this means 50bca9f8e6SMatthew Dillon that if an older volume header is selected, the newer ones that were 51bca9f8e6SMatthew Dillon not selected WILL be formally invalidated by the mount code and cannot 52bca9f8e6SMatthew Dillon be used in a remount attempt. 53bca9f8e6SMatthew Dillon 54bca9f8e6SMatthew Dillon During normal operation, each filesystem flush rotates to a new volume 55bca9f8e6SMatthew Dillon header. A filesystem may have up to four volume headers spread at 2GB 56bca9f8e6SMatthew Dillon intervals. Filesystems smaller than ~9GB or so will have fewer volume 57bca9f8e6SMatthew Dillon headers to rotate through. 581a7cfe5aSMatthew Dillon 591a7cfe5aSMatthew Dillon Freemap Topology 601a7cfe5aSMatthew Dillon 6193f3933aSMatthew Dillon The freemap topology contains 4 levels of meta-data (blockref arrays), 6293f3933aSMatthew Dillon one of which is embedded in the volume header (so only three real 63a71db85dSMatthew Dillon meta-data levels), plus one level of leaf-data. Unlike normal files, 64a71db85dSMatthew Dillon which use a variable-radix, the freemap topology uses a fixed radix to 65a71db85dSMatthew Dillon simplify the algorithm and to ensure freemap locality to the blocks 66a71db85dSMatthew Dillon under management. 671a7cfe5aSMatthew Dillon 68bca9f8e6SMatthew Dillon Freemap blocks are allocated from the reserved area in each 2GB zone. 69bca9f8e6SMatthew Dillon The leafs represent data in the zone. Higher levels in the freemap 70bca9f8e6SMatthew Dillon topology will cover more area but the physical freemap meta-data blocks 71bca9f8e6SMatthew Dillon always occur prior to the area being covered. Thus a HAMMER2 filesystem 72bca9f8e6SMatthew Dillon of almost any size can be formatted and the related freemap blocks 73bca9f8e6SMatthew Dillon will always exist. 74bca9f8e6SMatthew Dillon 75e07becf8SMatthew Dillon Level 1 - (radix 10 + 21) 64KB representing 2GB. This is represented 76e07becf8SMatthew Dillon by a hammer2_bmap_data[1024] array. Each entry represents 77e07becf8SMatthew Dillon 2MB worth of media storage x 1024 entries to represent 2GB. 78a71db85dSMatthew Dillon Each entry contains a 128x2 bit bitmap representing 16KB 79e07becf8SMatthew Dillon of storage in 2 bits (128 x 16KB = 2MB). 801a7cfe5aSMatthew Dillon 8193f3933aSMatthew Dillon Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry) 8293f3933aSMatthew Dillon Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry) 8393f3933aSMatthew Dillon Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry) 841a7cfe5aSMatthew Dillon Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64) 851a7cfe5aSMatthew Dillon (this conveniently eats one 512-byte 'sector' of the 64KB 861a7cfe5aSMatthew Dillon volume header). 871a7cfe5aSMatthew Dillon 881a7cfe5aSMatthew Dillon Each level is assign reserved blocks in the 4MB header per 2GB zone. 89bca9f8e6SMatthew Dillon Since we use block 0 for the volume header, the first freemap reserved 90bca9f8e6SMatthew Dillon block in the zone begins at block 1. 91bca9f8e6SMatthew Dillon 92bca9f8e6SMatthew Dillon Freemap copy #0: 93bca9f8e6SMatthew Dillon Level 1 uses block 1 (this is the leaf block) 94bca9f8e6SMatthew Dillon Level 2 uses block 2 95bca9f8e6SMatthew Dillon Level 3 uses block 3 96bca9f8e6SMatthew Dillon Level 4 uses block 4 97bca9f8e6SMatthew Dillon 98bca9f8e6SMatthew Dillon Freemap copy #1: 99bca9f8e6SMatthew Dillon Level 1 uses block 5 (this is the leaf block) 100bca9f8e6SMatthew Dillon Level 2 uses block 6 101bca9f8e6SMatthew Dillon Level 3 uses block 7 102bca9f8e6SMatthew Dillon Level 4 uses block 8 103bca9f8e6SMatthew Dillon 104bca9f8e6SMatthew Dillon ... and so forth up to Freemap copy #7 using blocks 29, 30, 31, and 32. 1051a7cfe5aSMatthew Dillon 106a71db85dSMatthew Dillon Flushing 107a71db85dSMatthew Dillon 108a71db85dSMatthew Dillon The freemap does not have to be flushed by fsync/sync, but should probably 109a71db85dSMatthew Dillon be flushed at least once a minute by the normal filesystem sync. The 1103148f677SMatthew Dillon reason it does not have to be flushed with fsync is that freemap recovery 111bca9f8e6SMatthew Dillon is executed on-mount and will use the last fully flushed freemap TID 112bca9f8e6SMatthew Dillon stored in the volume header to do an incremental meta-data scan of the 113bca9f8e6SMatthew Dillon H2 filesystem between that TID and the last flushed TID. All blocks not 114bca9f8e6SMatthew Dillon found to have been marked allocated will be marked allocated. Simple as 1153148f677SMatthew Dillon that. Since the scan is incremental, this typically costs very little 1163148f677SMatthew Dillon time. 117a71db85dSMatthew Dillon 118a71db85dSMatthew Dillon Freemap Granularity 119a71db85dSMatthew Dillon 120a71db85dSMatthew Dillon The freemap granularity is 16KB (radix of 14) but the minimum 121a71db85dSMatthew Dillon allocation radix is 1KB (radix of 10) (and can be in multiples of 122a71db85dSMatthew Dillon 1KB with some coding). 1KB inodes can hold up to 512 bytes of direct 123a71db85dSMatthew Dillon data, so tiny files eat exactly 1KB of media storage inclusive of the 124a71db85dSMatthew Dillon inode. 125a71db85dSMatthew Dillon 126a71db85dSMatthew Dillon The freemap keeps track of partial allocations in-memory but not 127a71db85dSMatthew Dillon on-media, so even a normal umount will cause partially allocated 128a71db85dSMatthew Dillon blocks to appear fully allocated until some later date when the 129a71db85dSMatthew Dillon bulk scan code defragments it. 130a71db85dSMatthew Dillon 131a71db85dSMatthew Dillon Block Selection 132a71db85dSMatthew Dillon 133a71db85dSMatthew Dillon Block selection is localized to be near the inode's (or nearby data) 134a71db85dSMatthew Dillon blockref. The algorithmic complexity of determining locality is not 135a71db85dSMatthew Dillon defined here atm. 13693f3933aSMatthew Dillon 137bca9f8e6SMatthew Dillon Freemap Leaf Substructure 13893f3933aSMatthew Dillon 139bca9f8e6SMatthew Dillon * linear - Linear sub-granular allocation offset. Allows ~1KB granular 140bca9f8e6SMatthew Dillon linear allocations. 14193f3933aSMatthew Dillon 142bca9f8e6SMatthew Dillon * class - Allocation clustering class ((type << 8) | radix). 143bca9f8e6SMatthew Dillon 144bca9f8e6SMatthew Dillon * avail - Available space in bytes, currently only used by layer 1 leaf. 145bca9f8e6SMatthew Dillon Used as an allocation clustering aid. 146bca9f8e6SMatthew Dillon 147bca9f8e6SMatthew Dillon * bitmap - Eight 32 bit words representing ~2MB in 16KB allocation chunks 14893f3933aSMatthew Dillon at 2 bits per chunk. The filesystem allocation granularity 14993f3933aSMatthew Dillon can be smaller (currently ~1KB minimum), and the live 150bca9f8e6SMatthew Dillon filesystem caches iterations when allocating multiple chunks. 151bca9f8e6SMatthew Dillon However, on remount any partial allocations out of a 64KB 152bca9f8e6SMatthew Dillon allocation block MAY cause the entire 64KB to be considered 153bca9f8e6SMatthew Dillon allocated. Fragmented space can potentially be reclaimed 154bca9f8e6SMatthew Dillon and/or relocated by the bulk block free scan. 15593f3933aSMatthew Dillon 15693f3933aSMatthew Dillon The 2-bit bitmap fields are assigned as follows: 15793f3933aSMatthew Dillon 15893f3933aSMatthew Dillon 00 FREE 159bca9f8e6SMatthew Dillon 01 POSSIBLY FREE (type 1) 160bca9f8e6SMatthew Dillon 10 POSSIBLY FREE (type 2) 16193f3933aSMatthew Dillon 11 ALLOCATED 16293f3933aSMatthew Dillon 163bca9f8e6SMatthew Dillon Freemap Metadata Substructure 164bca9f8e6SMatthew Dillon (Levels 2, 3, 4, and 5) 1651a7cfe5aSMatthew Dillon 166bca9f8e6SMatthew Dillon Freemap layers 2, 3, 4, and 5 operate as arrays of blockrefs but steal 167bca9f8e6SMatthew Dillon some of the check area (a 24-byte area) for freemap-specific meta-data. 168bca9f8e6SMatthew Dillon We reserve a few fields to store information which allows the block 169bca9f8e6SMatthew Dillon allocator to do its work more efficiently. 1701a7cfe5aSMatthew Dillon 17193f3933aSMatthew Dillon * bigmask - A mask of radixes available for allocation under this 17293f3933aSMatthew Dillon blockref. Typically initialized to -1. 1731a7cfe5aSMatthew Dillon 1741a7cfe5aSMatthew Dillon * avail - Total available space in bytes. 1751a7cfe5aSMatthew Dillon 1761a7cfe5aSMatthew Dillon The freemap allocator uses a cylinder-group-like abstraction using 1771a7cfe5aSMatthew Dillon the localized allocation concept first implemented by UFS. In HAMMER2 178bca9f8e6SMatthew Dillon there is no such thing as a real cylinder group, nor are there specific 179bca9f8e6SMatthew Dillon reserved areas for inodes vs data, but we do the next best thing by 180bca9f8e6SMatthew Dillon roughly typing leafs (each leaf representing ~2MB) to hopefully allow 181bca9f8e6SMatthew Dillon the drive to employ its zone-cache to make both stat-only and tar-style 182bca9f8e6SMatthew Dillon bulk accesses efficient (in addition to normal file accesses). 1831a7cfe5aSMatthew Dillon 184a71db85dSMatthew Dillon Levels 2, 3, and 4 contains an array blockmap[1024] (64KB total), 185bca9f8e6SMatthew Dillon supplying 10 bits of address space each. Level 5 is a blockmap[8] 186bca9f8e6SMatthew Dillon stored in the volume header supplying 3 bits of address space. 187bca9f8e6SMatthew Dillon (level 0 supplies 10 + 21 bits of address space). 188a71db85dSMatthew Dillon 189a71db85dSMatthew Dillon The Level1 blockmap is HAMMER2's idea of a 'cylinder group', thus 190a71db85dSMatthew Dillon effectively fixed at multiples of ~2MB or so. 1911a7cfe5aSMatthew Dillon 1921a7cfe5aSMatthew Dillon Initial Conditions 1931a7cfe5aSMatthew Dillon 1943148f677SMatthew Dillon newfs_hammer2 does not need to format the freemap. Instead, newfs_hammer2 1953148f677SMatthew Dillon simply leaves the associated top-level indirect blocks empty and uses 1963148f677SMatthew Dillon the (voldata->allocator_beg) field to allocate space linearly, then 1973148f677SMatthew Dillon leaves it to the live filesystem to initialize the freemap as more space 1983148f677SMatthew Dillon gets allocated. 1991a7cfe5aSMatthew Dillon 200a71db85dSMatthew Dillon The freemap does NOT use a fixed 5-level radix tree. It uses the same 201a71db85dSMatthew Dillon blockmap algorithm used for file blocks but restricts any recursion to 202a71db85dSMatthew Dillon specific radix values. This means that small filesystems will have much 203a71db85dSMatthew Dillon smaller freemap depths. 2 layers (and not counting the volume header as 204a71db85dSMatthew Dillon a layer) gets us 16GB, 3 layers gets us 16TB. 20593f3933aSMatthew Dillon 206a71db85dSMatthew Dillon How blocks are allocated and freed 207a71db85dSMatthew Dillon 208bca9f8e6SMatthew Dillon The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also 209bca9f8e6SMatthew Dillon contains a linear allocation offset that can keep track of sub-16KB 210bca9f8e6SMatthew Dillon allocations with certain restrictions. More random sub-16KB allocations 211bca9f8e6SMatthew Dillon are tracked in-memory, but will be lost (assumed to be a full 16KB) if 2123148f677SMatthew Dillon a crash occurs. Each 16KB chunk is denoted by a 2-bit pattern 00, 01, 10, 2133148f677SMatthew Dillon or 11. 21493f3933aSMatthew Dillon 215bca9f8e6SMatthew Dillon NOTE! All operations on the freemap occur on the current live version 216bca9f8e6SMatthew Dillon of the freemap, including bulkfree operations. 21793f3933aSMatthew Dillon 218bca9f8e6SMatthew Dillon Blocks are allocated by transitioning the 2-bit pattern in the leaf 2193148f677SMatthew Dillon to 11. That is, (00, 01, 10) -> (11). 22093f3933aSMatthew Dillon 2213148f677SMatthew Dillon The primary mechanism used to free a block is via the asynchronous 2223148f677SMatthew Dillon bulkfree scan. This scans all filesystem meta-data in two major passes 2233148f677SMatthew Dillon (and potentially multiple sub-passes). 22493f3933aSMatthew Dillon 2253148f677SMatthew Dillon Pass#1 - The first pass figures which blocks might be freeable. The 2263148f677SMatthew Dillon most recently flushed meta-data topology (including all four 2273148f677SMatthew Dillon volume headers and all snapshots) is scanned and an in-memory 2283148f677SMatthew Dillon copy of the FreeMap is built from scratch. Multiple sub-scans 2293148f677SMatthew Dillon might be required to break the larger scan up into more easily 2303148f677SMatthew Dillon digested pieces based on the amount of memory available to hold 2313148f677SMatthew Dillon the temporary freemap. 232bca9f8e6SMatthew Dillon 2333148f677SMatthew Dillon Any allocated blocks in the live freemap are then transitioned 2343148f677SMatthew Dillon from (11) to either (10) or (01) if after the scan they are 2353148f677SMatthew Dillon found to not be allocated. 236bca9f8e6SMatthew Dillon 2373148f677SMatthew Dillon The blocks are still assumed to be allocated at this time and 2383148f677SMatthew Dillon any new allocations will transition them back to (11). 239bca9f8e6SMatthew Dillon 2403148f677SMatthew Dillon Pass#2 - The second pass is required to deal with races against the 2413148f677SMatthew Dillon live filesystem while the freemap scan was running. It also 2423148f677SMatthew Dillon allows the freemap scans to run asynchronously from any flush, 2433148f677SMatthew Dillon improving concurrency. However, at least one synchronous flush 2443148f677SMatthew Dillon is required between Pass#1 and Pass#2. 245bca9f8e6SMatthew Dillon 2463148f677SMatthew Dillon The second pass is a duplicate of the first pass. The meta-data 2473148f677SMatthew Dillon topology is scanned and a freemap is built in-memory and then 2483148f677SMatthew Dillon compared against the live freemap. Instead transitioning from 2493148f677SMatthew Dillon (11)->(10)/(01) this pass transitions from (10)/(01) to (00). 250bca9f8e6SMatthew Dillon 2513148f677SMatthew Dillon If a block that it thinks is free is (11), no transition occurs 2523148f677SMatthew Dillon because this could be due to a race against the live filesystem. 253bca9f8e6SMatthew Dillon 2543148f677SMatthew Dillon This pass will incidentally transition (10)/(01) back to (11) 2553148f677SMatthew Dillon if the block was found not to be allocated, but it is perfectly 2563148f677SMatthew Dillon acceptable for the block to remain in a (10)/(01) state after 2573148f677SMatthew Dillon completion. 258bca9f8e6SMatthew Dillon 2593148f677SMatthew Dillon NOTE! The meta-data scanning passes must also explicitly scan blocks 2603148f677SMatthew Dillon associated with any open files, since these might represent 261*bbb35c81SSascha Wildner open-but-deleted files. These blocks must not be accidentally freed 2623148f677SMatthew Dillon while the system is still using the file. Again, since this is 2633148f677SMatthew Dillon done in two passes it does not have to be synchronized against 2643148f677SMatthew Dillon frontend operations. So in total: 265bca9f8e6SMatthew Dillon 2663148f677SMatthew Dillon * Topology under all four volume headers. This includes all 2673148f677SMatthew Dillon PFSs and snapshots. 2683148f677SMatthew Dillon 2693148f677SMatthew Dillon * Topology under all open hammer2 files. 2703148f677SMatthew Dillon 2713148f677SMatthew Dillon The Bulk-free operation is expensive but uses a bounded amount of ram. 2723148f677SMatthew Dillon The ADVANTAGE of this mechanism is that deletions in the live filesystem 2733148f677SMatthew Dillon do not have to clean up the freemap and thus do not have to recurse 2743148f677SMatthew Dillon the topology during the deletion. In fact, a 'rm -rf' equivalent of a 2753148f677SMatthew Dillon directory topology can be handled simply by blowing away the top-level 2763148f677SMatthew Dillon directory inode. This is instantanious and thus can be dangerous but 2773148f677SMatthew Dillon you always have your snapshots to fall-back on. 2783148f677SMatthew Dillon 2793148f677SMatthew Dillon The DISADVANTAGE is that all meta-data must be scanned. Twice. This 2803148f677SMatthew Dillon can be mitigated by using swapcache(8) to cache the meta-data on a SSD. 2813148f677SMatthew Dillon This is also mitigated by the fact that you can do the bulkfree scan less 2823148f677SMatthew Dillon often on very large filesystems which presumably have a lot of freespace 2833148f677SMatthew Dillon (so the interval is not as big an issue). In a sense the operation does 2843148f677SMatthew Dillon scale in that it takes longer on larger filesystems but also can be run 2853148f677SMatthew Dillon less often. 286bca9f8e6SMatthew Dillon 287bca9f8e6SMatthew Dillon The biggest issue is that *NO* space can be freed up by the live 288bca9f8e6SMatthew Dillon filesystem without the bulkfree process unless we optimize the case 289bca9f8e6SMatthew Dillon where data is created and deleted from within a single snapshot. 290bca9f8e6SMatthew Dillon This is made more difficult by the fact that each flush represents 291bca9f8e6SMatthew Dillon a fine-grained snapshot (up to four, representing the four volume 292bca9f8e6SMatthew Dillon headers the flush iterates through). 293bca9f8e6SMatthew Dillon 294bca9f8e6SMatthew Dillon Snapshots and Replicated Topologies 295bca9f8e6SMatthew Dillon 296bca9f8e6SMatthew Dillon The bulkfree code maintains information in-memory to the best of its 297bca9f8e6SMatthew Dillon ability for a multitude of reasons, including attempting to detect 298bca9f8e6SMatthew Dillon snapshot recursions down block chains which have already been scanned 299bca9f8e6SMatthew Dillon via some other snapshot. Without this, a large number of snapshots 300bca9f8e6SMatthew Dillon can cause a huge multiplication of disk I/O reads (but not writes) during 301bca9f8e6SMatthew Dillon the topology scan. 3021a7cfe5aSMatthew Dillon 3031a7cfe5aSMatthew Dillon Use of Generic indirect-block API 3041a7cfe5aSMatthew Dillon 3051a7cfe5aSMatthew Dillon I decided to use the same indirect-block allocation model for the 3061a7cfe5aSMatthew Dillon freemap that normal files use, with a few special cases added to force 3071a7cfe5aSMatthew Dillon specific radix values and to 'allocate' the freemap-related blocks 3081a7cfe5aSMatthew Dillon and indirect blocks via a reserved-block calculation and (obviously) 3091a7cfe5aSMatthew Dillon not via a recursive call to the allocator. 3101a7cfe5aSMatthew Dillon 31193f3933aSMatthew Dillon The Freemap is defined above as a fixed 5-level scheme (level 1-5), 3121a7cfe5aSMatthew Dillon but in actual operation the radix tree can be shortcut just as it 313bca9f8e6SMatthew Dillon is with normal files. However, unlike normal files, shorcuts will 314bca9f8e6SMatthew Dillon be forced to use specific radix values in order to guarantee that 315bca9f8e6SMatthew Dillon reserved block numbers can be trivially calculated. As the freemap 316bca9f8e6SMatthew Dillon becomes more fleshed out the tree on-media will look more and more like 317bca9f8e6SMatthew Dillon the actual specification. 3181a7cfe5aSMatthew Dillon 3191a7cfe5aSMatthew Dillon One advantage of doing things this way is that smaller filesystems 320bca9f8e6SMatthew Dillon won't actually use a 5-level scheme. A 16GB filesystem can use 8 321bca9f8e6SMatthew Dillon blockrefs in the volume header which point directly to layer 1 leaf 322bca9f8e6SMatthew Dillon blocks. A 16TB filesystem can be managed with only three levels 323bca9f8e6SMatthew Dillon (layer 3, 2, and 1 only where the 8 x layer 3 blockrefs are stored in 324bca9f8e6SMatthew Dillon the volume header). And so forth. 3251a7cfe5aSMatthew Dillon 3261a7cfe5aSMatthew Dillon At the moment we have no plans to return any of the unused 4MB zone 32793f3933aSMatthew Dillon header space (per 2GB of storage) back to the filesystem for general use. 32893f3933aSMatthew Dillon There are lots of things we may want to use the reserved areas for in 32993f3933aSMatthew Dillon the future. 330bca9f8e6SMatthew Dillon 331bca9f8e6SMatthew Dillon Emergency Deletions 332bca9f8e6SMatthew Dillon 333bca9f8e6SMatthew Dillon All filesystem modifications including deletions must allocate blocks 334bca9f8e6SMatthew Dillon in order to update the main topology all the way to the root. H2 will 335bca9f8e6SMatthew Dillon reserve roughly 5% of the available blocks in the filesystem for 336bca9f8e6SMatthew Dillon deletions in order to allow a system operator to recover from a 337bca9f8e6SMatthew Dillon filesystem full condition. 338bca9f8e6SMatthew Dillon 3393148f677SMatthew Dillon However, due to the snapshot capability as well as the possibility of 3403148f677SMatthew Dillon fragmentation, it is possible for the administrator to not delete enough 3413148f677SMatthew Dillon to actually be able to free up blocks. Once the reserve is used up 3423148f677SMatthew Dillon the filesystem can become unwritable. 3433148f677SMatthew Dillon 3443148f677SMatthew Dillon When this situation occurs the only way to recover is to update blocks 3453148f677SMatthew Dillon in-place. Updating blocks in-place will destroy the data on any 3463148f677SMatthew Dillon related snapshots or otherwise corrupt the snapshots. Emergency recovery 3473148f677SMatthew Dillon thus recommends that all related snapshots be destroyed. You can choose 3483148f677SMatthew Dillon not to do this in which case your snapshots might wind up containing 3493148f677SMatthew Dillon broken links and generate CRC failure messages. 3503148f677SMatthew Dillon 3513148f677SMatthew Dillon For the moment the spec for dealing with these situations remains 3523148f677SMatthew Dillon incomplete. 353