
			HAMMER2 Freemap Design Notes

				Overview

    HAMMER2 Media is broken down into 2 GByte zones.  Each 2 GByte zone
    contains a 4 MByte header (64 x 64K blocks = 0.2% of storage).  The
    blocks in this header are reserved for various purposes.  For example,
    block #0 is reserved for a volume header in the first four zones.  Most
    of the remaining 64KB blocks in this header are reserved for use by the
    freemap.

    The freemap only uses blocks from these reserved areas.  In order to
    ensure that any of the four volume headers can be used by the mount code
    (in case some are found to be corrupted), each freemap block in the
    logical freemap topology will iterate through up to 8 copies whose
    block numbers are taken from the reserved area.

    - Four copies, one for each of the four volume headers which H2 sequences
      through on each flush.  This ensures that a mount from any of the four
      volume headers is handed a consistent freemap topology.

    - One copy to ensure that recovery operations during mount do not modify
      the state of the freemap topology pointed to by older volume headers
      which are still valid.  Note that the freemap for volume headers
      indexed after the mount point being recovered may lose freemap
      consistency, so if you choose an older mount point for an RW mount,
      you have to stick with it.

    - One copy for live operations.  This allows HAMMER2 to retire the
      related buffer asynchronously in the background (or for the OS to
      retire the buffer cache buffer on its own) prior to the formal
      flush.  The later formal flush then has less work to do.

    - The two remaining copies add robustness to the specification.  For
      example, with appropriate feature code added, the filesystem can
      tolerate a limited number of bad blocks in the reserved area.

    For the moment we use a simple calculation for the freemap block.  In
    a later version I would like to mix the blocks up a bit so the blocks
    in each set of 8 are not situated near each other.

			    RW Mount Restrictions

    If an older volume header is explicitly selected by the mount code, any
    newer (presumably corrupt since the mount code didn't select it) volume
    headers will lose freemap consistency as the freemap code rotates into
    freemap blocks that might have been used by the topology pointed to by
    the newer (but not selected) volume headers.  For an RW mount, this means
    that if an older volume header is selected, the newer ones that were
    not selected WILL be formally invalidated by the mount code and cannot
    be used in a remount attempt.

    During normal operation, each filesystem flush rotates to a new volume
    header.  A filesystem may have up to four volume headers spread at 2GB
    intervals.  Filesystems smaller than ~9GB or so will have fewer volume
    headers to rotate through.

				Freemap Topology

    The freemap topology contains 4 levels of meta-data (blockref arrays),
    one of which is embedded in the volume header (so only three real
    meta-data levels), plus one level of leaf-data.  Unlike normal files,
    which use a variable-radix, the freemap topology uses a fixed radix to
    simplify the algorithm and to ensure freemap locality to the blocks
    under management.

    Freemap blocks are allocated from the reserved area in each 2GB zone.
    The leafs represent data in the zone.  Higher levels in the freemap
    topology will cover more area but the physical freemap meta-data blocks
    always occur prior to the area being covered.  Thus a HAMMER2 filesystem
    of almost any size can be formatted and the related freemap blocks
    will always exist.

    Level 1 - (radix 10 + 21) 64KB representing 2GB.  This is represented
	      by a hammer2_bmap_data[1024] array.  Each entry represents
	      2MB worth of media storage x 1024 entries to represent 2GB.
	      Each entry contains a 128x2 bit bitmap representing 16KB
	      of storage in 2 bits (128 x 16KB = 2MB).

    Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
    Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
    Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
    Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
	      (the per-level radixes sum to 21+10+10+10+10+3 = 64 bits, and
	      the 8 blockrefs conveniently eat one 512-byte 'sector' of the
	      64KB volume header).

    Each level is assigned reserved blocks in the 4MB header per 2GB zone.
    Since we use block 0 for the volume header, the first freemap reserved
    block in the zone begins at block 1.

    Freemap copy #0:
	Level 1 uses block 1 (this is the leaf block)
	Level 2 uses block 2
	Level 3 uses block 3
	Level 4 uses block 4

    Freemap copy #1:
	Level 1 uses block 5 (this is the leaf block)
	Level 2 uses block 6
	Level 3 uses block 7
	Level 4 uses block 8

    ... and so forth up to Freemap copy #7 using blocks 29, 30, 31, and 32.
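
    As a concrete illustration, the reserved block for a given freemap copy
    and level follows from the table above with simple arithmetic (a minimal
    sketch; the macro and function names are hypothetical, not the kernel's
    actual helpers):

	#include <stdint.h>

	#define H2_ZONE_BYTES	(2048ULL * 1024 * 1024)	/* 2GB zone */
	#define H2_RESBLK_BYTES	(64 * 1024)		/* 64KB blocks */

	/*
	 * Media byte offset of the reserved freemap block for copy 0-7
	 * and level 1-4.  'area_base' is the base of the area the block
	 * covers; higher levels simply use the reserved area of the
	 * first 2GB zone they cover.  Block 0 is the volume header, so
	 * copy #0 uses blocks 1-4, copy #1 uses blocks 5-8, ... and
	 * copy #7 uses blocks 29-32.
	 */
	static uint64_t
	h2_freemap_blkoff(uint64_t area_base, int copy, int level)
	{
		uint64_t zone = area_base & ~(H2_ZONE_BYTES - 1);
		uint64_t blkno = 1 + copy * 4 + (level - 1);

		return (zone + blkno * H2_RESBLK_BYTES);
	}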
105
106				    Flushing
107
108    The freemap does not have to be flushed by fsync/sync, but should probably
109    be flushed at least once a minute by the normal filesystem sync.  The
110    reason it does not have to be flushed with fsync is that freemap recovery
111    is executed on-mount and will use the last fully flushed freemap TID
112    stored in the volume header to do an incremental meta-data scan of the
113    H2 filesystem between that TID and the last flushed TID.  All blocks not
114    found to have been marked allocated will be marked allocated.  Simple as
115    that.  Since the scan is incremental, this typically costs very little
116    time.
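
    In outline the recovery pass reduces to the following (a minimal sketch;
    the structure and function names are hypothetical stand-ins and the real
    code walks the media topology rather than an array):

	#include <stdint.h>
	#include <stddef.h>

	/* Hypothetical stand-in for a blockref gathered by the scan. */
	struct h2_blockref {
		uint64_t modify_tid;	/* TID of last modification */
		uint64_t data_off;	/* media offset referenced */
		uint64_t bytes;		/* bytes referenced */
	};

	/* Stub: set the covering leaf bitmap bits to 11 (ALLOCATED). */
	static void
	h2_freemap_mark(uint64_t data_off, uint64_t bytes)
	{
		(void)data_off;
		(void)bytes;
	}

	/*
	 * Mark as allocated every block referenced by meta-data whose
	 * modify TID falls in the window between the last fully flushed
	 * freemap TID and the last flushed filesystem TID.  Marking is
	 * idempotent, so already-allocated blocks are unaffected.
	 */
	static void
	h2_freemap_recover(const struct h2_blockref *brefs, size_t count,
			   uint64_t freemap_tid, uint64_t mirror_tid)
	{
		size_t i;

		for (i = 0; i < count; ++i) {
			if (brefs[i].modify_tid > freemap_tid &&
			    brefs[i].modify_tid <= mirror_tid)
				h2_freemap_mark(brefs[i].data_off,
				    brefs[i].bytes);
		}
	}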

				Freemap Granularity

    The freemap granularity is 16KB (radix of 14) but the minimum
    allocation radix is 1KB (radix of 10), and allocations can be made
    in multiples of 1KB with some additional coding.  1KB inodes can hold
    up to 512 bytes of direct data, so tiny files eat exactly 1KB of media
    storage inclusive of the inode.

    The freemap keeps track of partial allocations in-memory but not
    on-media, so even a normal umount will cause partially allocated
    blocks to appear fully allocated until some later date when the
    bulk scan code defragments them.

				 Block Selection

    Block selection is localized to be near the inode's (or nearby data)
    blockref.  The algorithmic complexity of determining locality is not
    defined here at the moment.

			     Freemap Leaf Substructure

    * linear - Linear sub-granular allocation offset.  Allows ~1KB granular
	       linear allocations.

    * class  - Allocation clustering class ((type << 8) | radix).

    * avail  - Available space in bytes, currently only used by the layer 1
	       leaf.  Used as an allocation clustering aid.

    * bitmap - Eight 32-bit words representing ~2MB in 16KB allocation chunks
	       at 2 bits per chunk.  The filesystem allocation granularity
	       can be smaller (currently ~1KB minimum), and the live
	       filesystem caches iterations when allocating multiple chunks.
	       However, on remount any partial allocations out of a 64KB
	       allocation block MAY cause the entire 64KB to be considered
	       allocated.  Fragmented space can potentially be reclaimed
	       and/or relocated by the bulk block free scan.

	       The 2-bit bitmap fields are assigned as follows:

	       00	FREE
	       01	POSSIBLY FREE (type 1)
	       10	POSSIBLY FREE (type 2)
	       11	ALLOCATED
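
    Putting the leaf fields together, the substructure and its 2-bit bitmap
    accessors might be modeled as follows (a sketch only; the on-media
    structure in hammer2_disk.h differs in field widths, padding, and order):

	#include <stdint.h>

	#define H2_FREE		0	/* 00 */
	#define H2_MAYFREE1	1	/* 01 POSSIBLY FREE (type 1) */
	#define H2_MAYFREE2	2	/* 10 POSSIBLY FREE (type 2) */
	#define H2_ALLOC	3	/* 11 ALLOCATED */

	/* One leaf entry covering ~2MB (illustrative layout only). */
	struct h2_bmap_data {
		int32_t		linear;		/* sub-16KB linear offset */
		uint16_t	class;		/* (type << 8) | radix */
		uint64_t	avail;		/* available bytes */
		uint32_t	bitmap[8];	/* 128 x 2-bit chunk codes */
	};

	/* Fetch the 2-bit code for 16KB chunk n (0-127) of the entry. */
	static int
	h2_bmap_get(const struct h2_bmap_data *bmap, int n)
	{
		return ((bmap->bitmap[n >> 4] >> ((n & 15) * 2)) & 3);
	}

	/* Store the 2-bit code for 16KB chunk n of the entry. */
	static void
	h2_bmap_set(struct h2_bmap_data *bmap, int n, int code)
	{
		int shift = (n & 15) * 2;
		uint32_t *wp = &bmap->bitmap[n >> 4];

		*wp = (*wp & ~(3U << shift)) | ((uint32_t)code << shift);
	}

    Allocating a chunk is then h2_bmap_set(bmap, n, H2_ALLOC), matching the
    (00, 01, 10) -> (11) transition described later.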

			  Freemap Metadata Substructure
			     (Levels 2, 3, 4, and 5)

    Freemap levels 2, 3, 4, and 5 operate as arrays of blockrefs but steal
    some of the check area (a 24-byte area) for freemap-specific meta-data.
    We reserve a few fields to store information which allows the block
    allocator to do its work more efficiently.

    * bigmask - A mask of radixes available for allocation under this
		blockref.  Typically initialized to -1.

    * avail   - Total available space in bytes.
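
    For illustration, the stolen check-area fields can be pictured as a
    24-byte overlay (a sketch; the field widths and layout here are
    assumptions, not the on-media definition in hammer2_disk.h):

	#include <stdint.h>

	/*
	 * Illustrative 24-byte overlay of the blockref check area as
	 * used by the freemap (layout is an assumption).
	 */
	struct h2_check_freemap {
		uint32_t	icrc32;		/* meta-data check code */
		uint32_t	bigmask;	/* allocatable radix mask,
						 * typically init to -1 */
		uint64_t	avail;		/* available bytes below */
		uint8_t		reserved[8];	/* pad to 24 bytes */
	};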

    The freemap allocator uses a cylinder-group-like abstraction using
    the localized allocation concept first implemented by UFS.  In HAMMER2
    there is no such thing as a real cylinder group, nor are there specific
    reserved areas for inodes vs data, but we do the next best thing by
    roughly typing leafs (each leaf representing ~2MB) to hopefully allow
    the drive to employ its zone-cache to make both stat-only and tar-style
    bulk accesses efficient (in addition to normal file accesses).

    Levels 2, 3, and 4 each contain a blockmap[1024] array (64KB total),
    supplying 10 bits of address space each.  Level 5 is a blockmap[8]
    stored in the volume header supplying 3 bits of address space.
    (Level 1 supplies 10 + 21 bits of address space.)

    The Level 1 blockmap is HAMMER2's idea of a 'cylinder group', thus
    effectively fixed at multiples of ~2MB or so.

			        Initial Conditions

    newfs_hammer2 does not need to format the freemap.  Instead, newfs_hammer2
    simply leaves the associated top-level indirect blocks empty and uses
    the (voldata->allocator_beg) field to allocate space linearly, then
    leaves it to the live filesystem to initialize the freemap as more space
    gets allocated.
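
    That early-life behavior conceptually amounts to a bump allocator
    (a sketch; names other than allocator_beg are hypothetical, and details
    such as skipping each zone's reserved header blocks are omitted):

	#include <stdint.h>

	/* Minimal stand-in for the volume-header fields involved. */
	struct h2_voldata {
		uint64_t allocator_beg;	/* linear allocation point */
	};

	/*
	 * Allocate 'bytes' linearly before the freemap has been
	 * populated.  The updated allocator_beg is persisted with the
	 * next volume header flush.
	 */
	static uint64_t
	h2_alloc_linear(struct h2_voldata *voldata, uint64_t bytes)
	{
		uint64_t off = voldata->allocator_beg;

		voldata->allocator_beg += bytes;
		return (off);
	}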

    The freemap does NOT use a fixed 5-level radix tree.  It uses the same
    blockmap algorithm used for file blocks but restricts any recursion to
    specific radix values.  This means that small filesystems will have much
    smaller freemap depths.  2 layers (counting the blockrefs embedded in
    the volume header as a layer) gets us 16GB, 3 layers gets us 16TB.
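
    The depth needed for a given volume size follows directly from the radix
    values (a sketch; the layer counting matches the paragraph above, with
    the volume header blockrefs counted as a layer):

	#include <stdint.h>

	/*
	 * Layers needed to cover 'size' bytes, counting the 8 blockrefs
	 * embedded in the volume header as a layer and the level 1
	 * leaves as another.  Two layers cover 8 x 2GB = 16GB and each
	 * additional blockmap layer multiplies coverage by 1024, capped
	 * at the full 5-level (16EB) scheme.
	 */
	static int
	h2_freemap_layers(uint64_t size)
	{
		uint64_t cover = 8ULL << 31;	/* 16GB */
		int layers = 2;

		while (layers < 5 && cover < size) {
			cover <<= 10;
			++layers;
		}
		return (layers);
	}

    For example, h2_freemap_layers(16ULL << 30) returns 2 and
    h2_freemap_layers(16ULL << 40) returns 3, matching the 16GB and 16TB
    figures above.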

			How blocks are allocated and freed

    The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
    contains a linear allocation offset that can keep track of sub-16KB
    allocations with certain restrictions.  More random sub-16KB allocations
    are tracked in-memory, but will be lost (assumed to be a full 16KB) if
    a crash occurs.  Each 16KB chunk is denoted by a 2-bit pattern 00, 01, 10,
    or 11.

    NOTE!  All operations on the freemap occur on the current live version
	   of the freemap, including bulkfree operations.

    Blocks are allocated by transitioning the 2-bit pattern in the leaf
    to 11.  That is, (00, 01, 10) -> (11).

    The primary mechanism used to free a block is via the asynchronous
    bulkfree scan.  This scans all filesystem meta-data in two major passes
    (and potentially multiple sub-passes).

    Pass#1 - The first pass figures out which blocks might be freeable.  The
	     most recently flushed meta-data topology (including all four
	     volume headers and all snapshots) is scanned and an in-memory
	     copy of the freemap is built from scratch.  Multiple sub-scans
	     might be required to break the larger scan up into more easily
	     digested pieces based on the amount of memory available to hold
	     the temporary freemap.

	     Any block marked allocated (11) in the live freemap which the
	     scan found to be unreferenced is then transitioned to (10)
	     or (01).

	     The blocks are still assumed to be allocated at this time and
	     any new allocations will transition them back to (11).

    Pass#2 - The second pass is required to deal with races against the
	     live filesystem while the freemap scan was running.  It also
	     allows the freemap scans to run asynchronously from any flush,
	     improving concurrency.  However, at least one synchronous flush
	     is required between Pass#1 and Pass#2.

	     The second pass is a duplicate of the first pass.  The meta-data
	     topology is scanned and a freemap is built in-memory and then
	     compared against the live freemap.  Instead of transitioning
	     from (11) to (10)/(01), this pass transitions from (10)/(01)
	     to (00).

	     If a block the scan believes is free is found to be (11), no
	     transition occurs because this could be due to a race against
	     the live filesystem.

	     This pass will incidentally transition (10)/(01) back to (11)
	     if the block was found to be allocated after all, but it is
	     perfectly acceptable for the block to remain in a (10)/(01)
	     state after completion.
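
    Expressed against the 2-bit codes from the leaf sketch earlier, the two
    passes reduce to the following transitions (a sketch; the real bulkfree
    code operates on whole bitmap words and tracks more state):

	/*
	 * Pass #1: demote a live ALLOCATED (11) block to POSSIBLY FREE
	 * when the in-memory scan found no reference to it.  'type' is
	 * H2_MAYFREE1 or H2_MAYFREE2; which of the two codes is used
	 * is an implementation detail.
	 */
	static int
	h2_bulkfree_pass1(int live, int referenced, int type)
	{
		if (live == H2_ALLOC && !referenced)
			return (type);		/* 11 -> 01 or 10 */
		return (live);
	}

	/*
	 * Pass #2: a block still POSSIBLY FREE after the intervening
	 * synchronous flush becomes FREE if the second scan also found
	 * no reference, and is promoted back to ALLOCATED otherwise
	 * (leaving it possibly-free would also be acceptable).  A live
	 * (11) is never touched here since it may reflect a new
	 * allocation racing the scan.
	 */
	static int
	h2_bulkfree_pass2(int live, int referenced)
	{
		if (live == H2_MAYFREE1 || live == H2_MAYFREE2)
			return (referenced ? H2_ALLOC : H2_FREE);
		return (live);
	}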

    NOTE! The meta-data scanning passes must also explicitly scan blocks
	  associated with any open files, since these might represent
	  open-but-deleted files.  These blocks must not be accidentally freed
	  while the system is still using the file.  Again, since this is
	  done in two passes it does not have to be synchronized against
	  frontend operations.  So in total:

	  * Topology under all four volume headers.  This includes all
	    PFSs and snapshots.

	  * Topology under all open hammer2 files.

    The bulkfree operation is expensive but uses a bounded amount of RAM.
    The ADVANTAGE of this mechanism is that deletions in the live filesystem
    do not have to clean up the freemap and thus do not have to recurse
    the topology during the deletion.  In fact, a 'rm -rf' equivalent of a
    directory topology can be handled simply by blowing away the top-level
    directory inode.  This is instantaneous and thus can be dangerous, but
    you always have your snapshots to fall back on.

    The DISADVANTAGE is that all meta-data must be scanned.  Twice.  This
    can be mitigated by using swapcache(8) to cache the meta-data on a SSD.
    It is also mitigated by the fact that the bulkfree scan can be run less
    often on very large filesystems, which presumably have a lot of free
    space (so the interval is not as big an issue).  In a sense the operation
    does scale, in that it takes longer on larger filesystems but can also
    be run less often.

    The biggest issue is that *NO* space can be freed up by the live
    filesystem without the bulkfree process unless we optimize the case
    where data is created and deleted from within a single snapshot.
    This is made more difficult by the fact that each flush represents
    a fine-grained snapshot (up to four, representing the four volume
    headers the flush iterates through).

		      Snapshots and Replicated Topologies

    The bulkfree code maintains information in-memory to the best of its
    ability for a multitude of reasons, including attempting to detect
    snapshot recursions down block chains which have already been scanned
    via some other snapshot.  Without this, a large number of snapshots
    can cause a huge multiplication of disk I/O reads (but not writes) during
    the topology scan.

			Use of Generic indirect-block API

    I decided to use the same indirect-block allocation model for the
    freemap that normal files use, with a few special cases added to force
    specific radix values and to 'allocate' the freemap-related blocks
    and indirect blocks via a reserved-block calculation and (obviously)
    not via a recursive call to the allocator.

    The freemap is defined above as a fixed 5-level scheme (levels 1-5),
    but in actual operation the radix tree can be shortcut just as it
    is with normal files.  However, unlike normal files, shortcuts will
    be forced to use specific radix values in order to guarantee that
    reserved block numbers can be trivially calculated.  As the freemap
    becomes more fleshed out, the tree on-media will look more and more
    like the actual specification.

    One advantage of doing things this way is that smaller filesystems
    won't actually use a 5-level scheme.  A 16GB filesystem can use 8
    blockrefs in the volume header which point directly to layer 1 leaf
    blocks.  A 16TB filesystem can be managed with only three levels
    (layers 3, 2, and 1 only, where the 8 layer-3 blockrefs are stored
    in the volume header).  And so forth.

    At the moment we have no plans to return any of the unused 4MB zone
    header space (per 2GB of storage) back to the filesystem for general use.
    There are lots of things we may want to use the reserved areas for in
    the future.

				Emergency Deletions

    All filesystem modifications including deletions must allocate blocks
    in order to update the main topology all the way to the root.  H2 will
    reserve roughly 5% of the available blocks in the filesystem for
    deletions in order to allow a system operator to recover from a
    filesystem full condition.

    However, due to the snapshot capability as well as the possibility of
    fragmentation, it is possible that the administrator cannot delete
    enough to actually free up blocks.  Once the reserve is used up the
    filesystem can become unwritable.

    When this situation occurs the only way to recover is to update blocks
    in-place.  Updating blocks in-place will destroy the data on any
    related snapshots or otherwise corrupt the snapshots.  Emergency
    recovery thus recommends that all related snapshots be destroyed.  You
    can choose not to do this, in which case your snapshots might wind up
    containing broken links and generating CRC failure messages.

    For the moment the spec for dealing with these situations remains
    incomplete.
353