xref: /dragonfly/sys/vfs/hammer2/FREEMAP (revision bbb35c81)
11a7cfe5aSMatthew Dillon
21a7cfe5aSMatthew Dillon			HAMMER2 Freemap Design Notes
31a7cfe5aSMatthew Dillon
41a7cfe5aSMatthew Dillon				Overview
51a7cfe5aSMatthew Dillon
61a7cfe5aSMatthew Dillon    HAMMER2 Media is broken down into 2 GByte zones.  Each 2 GByte zone
7a71db85dSMatthew Dillon    contains a 4 MByte header (64 x 64K blocks = 0.2% of storage).  The
8a71db85dSMatthew Dillon    blocks in this header are reserved for various purposes.  For example,
93148f677SMatthew Dillon    block #0 is reserved for a volume header in the first four zones.  Most
103148f677SMatthew Dillon    of the remaining 64KB blocks in this header are reserved for use by the
113148f677SMatthew Dillon    freemap.
121a7cfe5aSMatthew Dillon
13a71db85dSMatthew Dillon    The freemap only uses blocks from these reserved areas.  In order to
14bca9f8e6SMatthew Dillon    ensure that any of the four volume headers can be used by the mount code
15a71db85dSMatthew Dillon    (in case some are found to be corrupted), each freemap block in the
16bca9f8e6SMatthew Dillon    logical freemap topology will iterate through up to 8 copies whos
17bca9f8e6SMatthew Dillon    block numbers are taken the reserved area.
181a7cfe5aSMatthew Dillon
19bca9f8e6SMatthew Dillon    - Four copies, one for each of the four volume headers which H2 sequences
20bca9f8e6SMatthew Dillon      through on each flush.  This ensures that a mount from any of the four
21bca9f8e6SMatthew Dillon      volume headers is handed a consistent freemap topology.
221a7cfe5aSMatthew Dillon
23bca9f8e6SMatthew Dillon    - One copy to ensure that recovery operations during mount do not modify
24bca9f8e6SMatthew Dillon      the state of the freemap topology pointed to by older volume headers
25bca9f8e6SMatthew Dillon      which are still valid.  Note that the freemap for volume headers
26bca9f8e6SMatthew Dillon      indexed after the mount point being recovered may lose freemap
27bca9f8e6SMatthew Dillon      consistency, so if you choose an older mount point for a RW mount,
28bca9f8e6SMatthew Dillon      you have to stick with it.
291a7cfe5aSMatthew Dillon
30bca9f8e6SMatthew Dillon    - One copy for live operations.  This allows HAMMER2 to retire the
313148f677SMatthew Dillon      related buffer asynchronously in the background (or for the OS to
323148f677SMatthew Dillon      retire the buffer cache buffer on its own) prior to the formal
333148f677SMatthew Dillon      flush.  The later formal flush then has less work to do.
341a7cfe5aSMatthew Dillon
35bca9f8e6SMatthew Dillon    - The two remaining copies add robustness to the specification.  For
363148f677SMatthew Dillon      example, with appropriate feature code added the filesystem can
373148f677SMatthew Dillon      tolerate a limited number of bad blocks in the reserved area.
3893f3933aSMatthew Dillon
39bca9f8e6SMatthew Dillon    For the moment we use a simple calculation for the freemap block.  In
40bca9f8e6SMatthew Dillon    a later version I would like to mix the blocks up a bit so the blocks
41bca9f8e6SMatthew Dillon    in each set of 8 are not situated near each other.
4293f3933aSMatthew Dillon
43bca9f8e6SMatthew Dillon			    RW Mount Restrictions
4493f3933aSMatthew Dillon
45bca9f8e6SMatthew Dillon    If an older volume header is explicitly selected by the mount code, any
46bca9f8e6SMatthew Dillon    newer (presumably corrupt since the mount code didn't select it) volume
47bca9f8e6SMatthew Dillon    headers will lose freemap consistency as the freemap code rotates into
48bca9f8e6SMatthew Dillon    freemap blocks that might have been used by the topology pointed to by
49bca9f8e6SMatthew Dillon    the newer (but not selected) volume headers.  For a RW mount, this means
50bca9f8e6SMatthew Dillon    that if an older volume header is selected, the newer ones that were
51bca9f8e6SMatthew Dillon    not selected WILL be formally invalidated by the mount code and cannot
52bca9f8e6SMatthew Dillon    be used in a remount attempt.
53bca9f8e6SMatthew Dillon
54bca9f8e6SMatthew Dillon    During normal operation, each filesystem flush rotates to a new volume
55bca9f8e6SMatthew Dillon    header.  A filesystem may have up to four volume headers spread at 2GB
56bca9f8e6SMatthew Dillon    intervals.  Filesystems smaller than ~9GB or so will have fewer volume
57bca9f8e6SMatthew Dillon    headers to rotate through.
581a7cfe5aSMatthew Dillon
591a7cfe5aSMatthew Dillon				Freemap Topology
601a7cfe5aSMatthew Dillon
6193f3933aSMatthew Dillon    The freemap topology contains 4 levels of meta-data (blockref arrays),
6293f3933aSMatthew Dillon    one of which is embedded in the volume header (so only three real
63a71db85dSMatthew Dillon    meta-data levels), plus one level of leaf-data.  Unlike normal files,
64a71db85dSMatthew Dillon    which use a variable-radix, the freemap topology uses a fixed radix to
65a71db85dSMatthew Dillon    simplify the algorithm and to ensure freemap locality to the blocks
66a71db85dSMatthew Dillon    under management.
671a7cfe5aSMatthew Dillon
68bca9f8e6SMatthew Dillon    Freemap blocks are allocated from the reserved area in each 2GB zone.
69bca9f8e6SMatthew Dillon    The leafs represent data in the zone.  Higher levels in the freemap
70bca9f8e6SMatthew Dillon    topology will cover more area but the physical freemap meta-data blocks
71bca9f8e6SMatthew Dillon    always occur prior to the area being covered.  Thus a HAMMER2 filesystem
72bca9f8e6SMatthew Dillon    of almost any size can be formatted and the related freemap blocks
73bca9f8e6SMatthew Dillon    will always exist.
74bca9f8e6SMatthew Dillon
75e07becf8SMatthew Dillon    Level 1 - (radix 10 + 21) 64KB representing 2GB.  This is represented
76e07becf8SMatthew Dillon	      by a hammer2_bmap_data[1024] array.  Each entry represents
77e07becf8SMatthew Dillon	      2MB worth of media storage x 1024 entries to represent 2GB.
78a71db85dSMatthew Dillon	      Each entry contains a 128x2 bit bitmap representing 16KB
79e07becf8SMatthew Dillon	      of storage in 2 bits (128 x 16KB = 2MB).
801a7cfe5aSMatthew Dillon
8193f3933aSMatthew Dillon    Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
8293f3933aSMatthew Dillon    Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
8393f3933aSMatthew Dillon    Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
841a7cfe5aSMatthew Dillon    Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
851a7cfe5aSMatthew Dillon	      (this conveniently eats one 512-byte 'sector' of the 64KB
861a7cfe5aSMatthew Dillon	      volume header).
871a7cfe5aSMatthew Dillon
881a7cfe5aSMatthew Dillon    Each level is assign reserved blocks in the 4MB header per 2GB zone.
89bca9f8e6SMatthew Dillon    Since we use block 0 for the volume header, the first freemap reserved
90bca9f8e6SMatthew Dillon    block in the zone begins at block 1.
91bca9f8e6SMatthew Dillon
92bca9f8e6SMatthew Dillon    Freemap copy #0:
93bca9f8e6SMatthew Dillon	Level 1 uses block 1 (this is the leaf block)
94bca9f8e6SMatthew Dillon	Level 2 uses block 2
95bca9f8e6SMatthew Dillon	Level 3 uses block 3
96bca9f8e6SMatthew Dillon	Level 4 uses block 4
97bca9f8e6SMatthew Dillon
98bca9f8e6SMatthew Dillon    Freemap copy #1:
99bca9f8e6SMatthew Dillon	Level 1 uses block 5 (this is the leaf block)
100bca9f8e6SMatthew Dillon	Level 2 uses block 6
101bca9f8e6SMatthew Dillon	Level 3 uses block 7
102bca9f8e6SMatthew Dillon	Level 4 uses block 8
103bca9f8e6SMatthew Dillon
104bca9f8e6SMatthew Dillon    ... and so forth up to Freemap copy #7 using blocks 29, 30, 31, and 32.
1051a7cfe5aSMatthew Dillon
106a71db85dSMatthew Dillon				    Flushing
107a71db85dSMatthew Dillon
108a71db85dSMatthew Dillon    The freemap does not have to be flushed by fsync/sync, but should probably
109a71db85dSMatthew Dillon    be flushed at least once a minute by the normal filesystem sync.  The
1103148f677SMatthew Dillon    reason it does not have to be flushed with fsync is that freemap recovery
111bca9f8e6SMatthew Dillon    is executed on-mount and will use the last fully flushed freemap TID
112bca9f8e6SMatthew Dillon    stored in the volume header to do an incremental meta-data scan of the
113bca9f8e6SMatthew Dillon    H2 filesystem between that TID and the last flushed TID.  All blocks not
114bca9f8e6SMatthew Dillon    found to have been marked allocated will be marked allocated.  Simple as
1153148f677SMatthew Dillon    that.  Since the scan is incremental, this typically costs very little
1163148f677SMatthew Dillon    time.
117a71db85dSMatthew Dillon
118a71db85dSMatthew Dillon				Freemap Granularity
119a71db85dSMatthew Dillon
120a71db85dSMatthew Dillon    The freemap granularity is 16KB (radix of 14) but the minimum
121a71db85dSMatthew Dillon    allocation radix is 1KB (radix of 10) (and can be in multiples of
122a71db85dSMatthew Dillon    1KB with some coding).  1KB inodes can hold up to 512 bytes of direct
123a71db85dSMatthew Dillon    data, so tiny files eat exactly 1KB of media storage inclusive of the
124a71db85dSMatthew Dillon    inode.
125a71db85dSMatthew Dillon
126a71db85dSMatthew Dillon    The freemap keeps track of partial allocations in-memory but not
127a71db85dSMatthew Dillon    on-media, so even a normal umount will cause partially allocated
128a71db85dSMatthew Dillon    blocks to appear fully allocated until some later date when the
129a71db85dSMatthew Dillon    bulk scan code defragments it.
130a71db85dSMatthew Dillon
131a71db85dSMatthew Dillon				 Block Selection
132a71db85dSMatthew Dillon
133a71db85dSMatthew Dillon    Block selection is localized to be near the inode's (or nearby data)
134a71db85dSMatthew Dillon    blockref.  The algorithmic complexity of determining locality is not
135a71db85dSMatthew Dillon    defined here atm.
13693f3933aSMatthew Dillon
137bca9f8e6SMatthew Dillon			     Freemap Leaf Substructure
13893f3933aSMatthew Dillon
139bca9f8e6SMatthew Dillon    * linear - Linear sub-granular allocation offset.  Allows ~1KB granular
140bca9f8e6SMatthew Dillon	       linear allocations.
14193f3933aSMatthew Dillon
142bca9f8e6SMatthew Dillon    * class  - Allocation clustering class ((type << 8) | radix).
143bca9f8e6SMatthew Dillon
144bca9f8e6SMatthew Dillon    * avail  - Available space in bytes, currently only used by layer 1 leaf.
145bca9f8e6SMatthew Dillon	       Used as an allocation clustering aid.
146bca9f8e6SMatthew Dillon
147bca9f8e6SMatthew Dillon    * bitmap - Eight 32 bit words representing ~2MB in 16KB allocation chunks
14893f3933aSMatthew Dillon	       at 2 bits per chunk.  The filesystem allocation granularity
14993f3933aSMatthew Dillon	       can be smaller (currently ~1KB minimum), and the live
150bca9f8e6SMatthew Dillon	       filesystem caches iterations when allocating multiple chunks.
151bca9f8e6SMatthew Dillon	       However, on remount any partial allocations out of a 64KB
152bca9f8e6SMatthew Dillon	       allocation block MAY cause the entire 64KB to be considered
153bca9f8e6SMatthew Dillon	       allocated.  Fragmented space can potentially be reclaimed
154bca9f8e6SMatthew Dillon	       and/or relocated by the bulk block free scan.
15593f3933aSMatthew Dillon
15693f3933aSMatthew Dillon	       The 2-bit bitmap fields are assigned as follows:
15793f3933aSMatthew Dillon
15893f3933aSMatthew Dillon	       00	FREE
159bca9f8e6SMatthew Dillon	       01	POSSIBLY FREE (type 1)
160bca9f8e6SMatthew Dillon	       10	POSSIBLY FREE (type 2)
16193f3933aSMatthew Dillon	       11	ALLOCATED
16293f3933aSMatthew Dillon
163bca9f8e6SMatthew Dillon			  Freemap Metadata Substructure
164bca9f8e6SMatthew Dillon			     (Levels 2, 3, 4, and 5)
1651a7cfe5aSMatthew Dillon
166bca9f8e6SMatthew Dillon    Freemap layers 2, 3, 4, and 5 operate as arrays of blockrefs but steal
167bca9f8e6SMatthew Dillon    some of the check area (a 24-byte area) for freemap-specific meta-data.
168bca9f8e6SMatthew Dillon    We reserve a few fields to store information which allows the block
169bca9f8e6SMatthew Dillon    allocator to do its work more efficiently.
1701a7cfe5aSMatthew Dillon
17193f3933aSMatthew Dillon    * bigmask - A mask of radixes available for allocation under this
17293f3933aSMatthew Dillon		blockref.  Typically initialized to -1.
1731a7cfe5aSMatthew Dillon
1741a7cfe5aSMatthew Dillon    * avail   - Total available space in bytes.
1751a7cfe5aSMatthew Dillon
1761a7cfe5aSMatthew Dillon    The freemap allocator uses a cylinder-group-like abstraction using
1771a7cfe5aSMatthew Dillon    the localized allocation concept first implemented by UFS.  In HAMMER2
178bca9f8e6SMatthew Dillon    there is no such thing as a real cylinder group, nor are there specific
179bca9f8e6SMatthew Dillon    reserved areas for inodes vs data, but we do the next best thing by
180bca9f8e6SMatthew Dillon    roughly typing leafs (each leaf representing ~2MB) to hopefully allow
181bca9f8e6SMatthew Dillon    the drive to employ its zone-cache to make both stat-only and tar-style
182bca9f8e6SMatthew Dillon    bulk accesses efficient (in addition to normal file accesses).
1831a7cfe5aSMatthew Dillon
184a71db85dSMatthew Dillon    Levels 2, 3, and 4 contains an array blockmap[1024] (64KB total),
185bca9f8e6SMatthew Dillon    supplying 10 bits of address space each.  Level 5 is a blockmap[8]
186bca9f8e6SMatthew Dillon    stored in the volume header supplying 3 bits of address space.
187bca9f8e6SMatthew Dillon    (level 0 supplies 10 + 21 bits of address space).
188a71db85dSMatthew Dillon
189a71db85dSMatthew Dillon    The Level1 blockmap is HAMMER2's idea of a 'cylinder group', thus
190a71db85dSMatthew Dillon    effectively fixed at multiples of ~2MB or so.
1911a7cfe5aSMatthew Dillon
1921a7cfe5aSMatthew Dillon			        Initial Conditions
1931a7cfe5aSMatthew Dillon
1943148f677SMatthew Dillon    newfs_hammer2 does not need to format the freemap.  Instead, newfs_hammer2
1953148f677SMatthew Dillon    simply leaves the associated top-level indirect blocks empty and uses
1963148f677SMatthew Dillon    the (voldata->allocator_beg) field to allocate space linearly, then
1973148f677SMatthew Dillon    leaves it to the live filesystem to initialize the freemap as more space
1983148f677SMatthew Dillon    gets allocated.
1991a7cfe5aSMatthew Dillon
200a71db85dSMatthew Dillon    The freemap does NOT use a fixed 5-level radix tree.  It uses the same
201a71db85dSMatthew Dillon    blockmap algorithm used for file blocks but restricts any recursion to
202a71db85dSMatthew Dillon    specific radix values.  This means that small filesystems will have much
203a71db85dSMatthew Dillon    smaller freemap depths.  2 layers (and not counting the volume header as
204a71db85dSMatthew Dillon    a layer) gets us 16GB, 3 layers gets us 16TB.
20593f3933aSMatthew Dillon
206a71db85dSMatthew Dillon			How blocks are allocated and freed
207a71db85dSMatthew Dillon
208bca9f8e6SMatthew Dillon    The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
209bca9f8e6SMatthew Dillon    contains a linear allocation offset that can keep track of sub-16KB
210bca9f8e6SMatthew Dillon    allocations with certain restrictions.  More random sub-16KB allocations
211bca9f8e6SMatthew Dillon    are tracked in-memory, but will be lost (assumed to be a full 16KB) if
2123148f677SMatthew Dillon    a crash occurs.  Each 16KB chunk is denoted by a 2-bit pattern 00, 01, 10,
2133148f677SMatthew Dillon    or 11.
21493f3933aSMatthew Dillon
215bca9f8e6SMatthew Dillon    NOTE!  All operations on the freemap occur on the current live version
216bca9f8e6SMatthew Dillon	   of the freemap, including bulkfree operations.
21793f3933aSMatthew Dillon
218bca9f8e6SMatthew Dillon    Blocks are allocated by transitioning the 2-bit pattern in the leaf
2193148f677SMatthew Dillon    to 11.  That is, (00, 01, 10) -> (11).
22093f3933aSMatthew Dillon
2213148f677SMatthew Dillon    The primary mechanism used to free a block is via the asynchronous
2223148f677SMatthew Dillon    bulkfree scan.  This scans all filesystem meta-data in two major passes
2233148f677SMatthew Dillon    (and potentially multiple sub-passes).
22493f3933aSMatthew Dillon
2253148f677SMatthew Dillon    Pass#1 - The first pass figures which blocks might be freeable.  The
2263148f677SMatthew Dillon	     most recently flushed meta-data topology (including all four
2273148f677SMatthew Dillon	     volume headers and all snapshots) is scanned and an in-memory
2283148f677SMatthew Dillon	     copy of the FreeMap is built from scratch.  Multiple sub-scans
2293148f677SMatthew Dillon	     might be required to break the larger scan up into more easily
2303148f677SMatthew Dillon	     digested pieces based on the amount of memory available to hold
2313148f677SMatthew Dillon	     the temporary freemap.
232bca9f8e6SMatthew Dillon
2333148f677SMatthew Dillon	     Any allocated blocks in the live freemap are then transitioned
2343148f677SMatthew Dillon	     from (11) to either (10) or (01) if after the scan they are
2353148f677SMatthew Dillon	     found to not be allocated.
236bca9f8e6SMatthew Dillon
2373148f677SMatthew Dillon	     The blocks are still assumed to be allocated at this time and
2383148f677SMatthew Dillon	     any new allocations will transition them back to (11).
239bca9f8e6SMatthew Dillon
2403148f677SMatthew Dillon    Pass#2 - The second pass is required to deal with races against the
2413148f677SMatthew Dillon	     live filesystem while the freemap scan was running. It also
2423148f677SMatthew Dillon	     allows the freemap scans to run asynchronously from any flush,
2433148f677SMatthew Dillon	     improving concurrency.  However, at least one synchronous flush
2443148f677SMatthew Dillon	     is required between Pass#1 and Pass#2.
245bca9f8e6SMatthew Dillon
2463148f677SMatthew Dillon	     The second pass is a duplicate of the first pass.  The meta-data
2473148f677SMatthew Dillon	     topology is scanned and a freemap is built in-memory and then
2483148f677SMatthew Dillon	     compared against the live freemap.  Instead transitioning from
2493148f677SMatthew Dillon	     (11)->(10)/(01) this pass transitions from (10)/(01) to (00).
250bca9f8e6SMatthew Dillon
2513148f677SMatthew Dillon	     If a block that it thinks is free is (11), no transition occurs
2523148f677SMatthew Dillon	     because this could be due to a race against the live filesystem.
253bca9f8e6SMatthew Dillon
2543148f677SMatthew Dillon	     This pass will incidentally transition (10)/(01) back to (11)
2553148f677SMatthew Dillon	     if the block was found not to be allocated, but it is perfectly
2563148f677SMatthew Dillon	     acceptable for the block to remain in a (10)/(01) state after
2573148f677SMatthew Dillon	     completion.
258bca9f8e6SMatthew Dillon
2593148f677SMatthew Dillon    NOTE! The meta-data scanning passes must also explicitly scan blocks
2603148f677SMatthew Dillon	  associated with any open files, since these might represent
261*bbb35c81SSascha Wildner	  open-but-deleted files.  These blocks must not be accidentally freed
2623148f677SMatthew Dillon	  while the system is still using the file.  Again, since this is
2633148f677SMatthew Dillon	  done in two passes it does not have to be synchronized against
2643148f677SMatthew Dillon	  frontend operations.  So in total:
265bca9f8e6SMatthew Dillon
2663148f677SMatthew Dillon	  * Topology under all four volume headers.  This includes all
2673148f677SMatthew Dillon	    PFSs and snapshots.
2683148f677SMatthew Dillon
2693148f677SMatthew Dillon	  * Topology under all open hammer2 files.
2703148f677SMatthew Dillon
2713148f677SMatthew Dillon    The Bulk-free operation is expensive but uses a bounded amount of ram.
2723148f677SMatthew Dillon    The ADVANTAGE of this mechanism is that deletions in the live filesystem
2733148f677SMatthew Dillon    do not have to clean up the freemap and thus do not have to recurse
2743148f677SMatthew Dillon    the topology during the deletion.  In fact, a 'rm -rf' equivalent of a
2753148f677SMatthew Dillon    directory topology can be handled simply by blowing away the top-level
2763148f677SMatthew Dillon    directory inode.  This is instantanious and thus can be dangerous but
2773148f677SMatthew Dillon    you always have your snapshots to fall-back on.
2783148f677SMatthew Dillon
2793148f677SMatthew Dillon    The DISADVANTAGE is that all meta-data must be scanned.  Twice.  This
2803148f677SMatthew Dillon    can be mitigated by using swapcache(8) to cache the meta-data on a SSD.
2813148f677SMatthew Dillon    This is also mitigated by the fact that you can do the bulkfree scan less
2823148f677SMatthew Dillon    often on very large filesystems which presumably have a lot of freespace
2833148f677SMatthew Dillon    (so the interval is not as big an issue).  In a sense the operation does
2843148f677SMatthew Dillon    scale in that it takes longer on larger filesystems but also can be run
2853148f677SMatthew Dillon    less often.
286bca9f8e6SMatthew Dillon
287bca9f8e6SMatthew Dillon    The biggest issue is that *NO* space can be freed up by the live
288bca9f8e6SMatthew Dillon    filesystem without the bulkfree process unless we optimize the case
289bca9f8e6SMatthew Dillon    where data is created and deleted from within a single snapshot.
290bca9f8e6SMatthew Dillon    This is made more difficult by the fact that each flush represents
291bca9f8e6SMatthew Dillon    a fine-grained snapshot (up to four, representing the four volume
292bca9f8e6SMatthew Dillon    headers the flush iterates through).
293bca9f8e6SMatthew Dillon
294bca9f8e6SMatthew Dillon		      Snapshots and Replicated Topologies
295bca9f8e6SMatthew Dillon
296bca9f8e6SMatthew Dillon    The bulkfree code maintains information in-memory to the best of its
297bca9f8e6SMatthew Dillon    ability for a multitude of reasons, including attempting to detect
298bca9f8e6SMatthew Dillon    snapshot recursions down block chains which have already been scanned
299bca9f8e6SMatthew Dillon    via some other snapshot.  Without this, a large number of snapshots
300bca9f8e6SMatthew Dillon    can cause a huge multiplication of disk I/O reads (but not writes) during
301bca9f8e6SMatthew Dillon    the topology scan.
3021a7cfe5aSMatthew Dillon
3031a7cfe5aSMatthew Dillon			Use of Generic indirect-block API
3041a7cfe5aSMatthew Dillon
3051a7cfe5aSMatthew Dillon    I decided to use the same indirect-block allocation model for the
3061a7cfe5aSMatthew Dillon    freemap that normal files use, with a few special cases added to force
3071a7cfe5aSMatthew Dillon    specific radix values and to 'allocate' the freemap-related blocks
3081a7cfe5aSMatthew Dillon    and indirect blocks via a reserved-block calculation and (obviously)
3091a7cfe5aSMatthew Dillon    not via a recursive call to the allocator.
3101a7cfe5aSMatthew Dillon
31193f3933aSMatthew Dillon    The Freemap is defined above as a fixed 5-level scheme (level 1-5),
3121a7cfe5aSMatthew Dillon    but in actual operation the radix tree can be shortcut just as it
313bca9f8e6SMatthew Dillon    is with normal files.  However, unlike normal files, shorcuts will
314bca9f8e6SMatthew Dillon    be forced to use specific radix values in order to guarantee that
315bca9f8e6SMatthew Dillon    reserved block numbers can be trivially calculated.  As the freemap
316bca9f8e6SMatthew Dillon    becomes more fleshed out the tree on-media will look more and more like
317bca9f8e6SMatthew Dillon    the actual specification.
3181a7cfe5aSMatthew Dillon
3191a7cfe5aSMatthew Dillon    One advantage of doing things this way is that smaller filesystems
320bca9f8e6SMatthew Dillon    won't actually use a 5-level scheme.  A 16GB filesystem can use 8
321bca9f8e6SMatthew Dillon    blockrefs in the volume header which point directly to layer 1 leaf
322bca9f8e6SMatthew Dillon    blocks.  A 16TB filesystem can be managed with only three levels
323bca9f8e6SMatthew Dillon    (layer 3, 2, and 1 only where the 8 x layer 3 blockrefs are stored in
324bca9f8e6SMatthew Dillon    the volume header).  And so forth.
3251a7cfe5aSMatthew Dillon
3261a7cfe5aSMatthew Dillon    At the moment we have no plans to return any of the unused 4MB zone
32793f3933aSMatthew Dillon    header space (per 2GB of storage) back to the filesystem for general use.
32893f3933aSMatthew Dillon    There are lots of things we may want to use the reserved areas for in
32993f3933aSMatthew Dillon    the future.
330bca9f8e6SMatthew Dillon
331bca9f8e6SMatthew Dillon				Emergency Deletions
332bca9f8e6SMatthew Dillon
333bca9f8e6SMatthew Dillon    All filesystem modifications including deletions must allocate blocks
334bca9f8e6SMatthew Dillon    in order to update the main topology all the way to the root.  H2 will
335bca9f8e6SMatthew Dillon    reserve roughly 5% of the available blocks in the filesystem for
336bca9f8e6SMatthew Dillon    deletions in order to allow a system operator to recover from a
337bca9f8e6SMatthew Dillon    filesystem full condition.
338bca9f8e6SMatthew Dillon
3393148f677SMatthew Dillon    However, due to the snapshot capability as well as the possibility of
3403148f677SMatthew Dillon    fragmentation, it is possible for the administrator to not delete enough
3413148f677SMatthew Dillon    to actually be able to free up blocks.  Once the reserve is used up
3423148f677SMatthew Dillon    the filesystem can become unwritable.
3433148f677SMatthew Dillon
3443148f677SMatthew Dillon    When this situation occurs the only way to recover is to update blocks
3453148f677SMatthew Dillon    in-place.  Updating blocks in-place will destroy the data on any
3463148f677SMatthew Dillon    related snapshots or otherwise corrupt the snapshots.  Emergency recovery
3473148f677SMatthew Dillon    thus recommends that all related snapshots be destroyed.  You can choose
3483148f677SMatthew Dillon    not to do this in which case your snapshots might wind up containing
3493148f677SMatthew Dillon    broken links and generate CRC failure messages.
3503148f677SMatthew Dillon
3513148f677SMatthew Dillon    For the moment the spec for dealing with these situations remains
3523148f677SMatthew Dillon    incomplete.
353