
			HAMMER2 Freemap Design Notes

				Overview

    HAMMER2 media is broken down into 2 GByte zones.  Each 2 GByte zone
    contains a 4 MByte header (64 x 64K blocks = 0.2% of storage).  The
    blocks in this header are reserved for various purposes.  For example,
    block #0 is reserved for a volume header.  Most of the remaining
    64KB blocks in this header are reserved for use by the freemap.

    The freemap only uses blocks from these reserved areas.  In order to
    ensure that any of the four volume headers can be used by the mount code
    (in case some are found to be corrupted), each freemap block in the
    logical freemap topology will iterate through up to 8 copies whose
    block numbers are taken from the reserved area.

    - Four copies, one for each of the four volume headers which H2 sequences
      through on each flush.  This ensures that a mount from any of the four
      volume headers is handed a consistent freemap topology.

    - One copy to ensure that recovery operations during mount do not modify
      the state of the freemap topology pointed to by older volume headers
      which are still valid.  Note that the freemap for volume headers
      indexed after the mount point being recovered may lose freemap
      consistency, so if you choose an older mount point for a RW mount,
      you have to stick with it.

    - One copy for live operations.  This allows HAMMER2 to retire the
      related buffer (or for the OS to retire the buffer cache buffer)
      prior to the next flush and also allows the buffers to be flushed
      asynchronously.

    - The two remaining copies add robustness to the specification.  For
      example, with appropriate feature code the filesystem can tolerate
      a limited number of bad blocks in the reserved area.

    For the moment we use a simple calculation for the freemap block.  In
    a later version I would like to mix the blocks up a bit so the blocks
    in each set of 8 are not situated near each other.

				RW Mount Restrictions

    If an older volume header is explicitly selected by the mount code, any
    newer (presumably corrupt since the mount code didn't select it) volume
    headers will lose freemap consistency as the freemap code rotates into
    freemap blocks that might have been used by the topology pointed to by
    the newer (but not selected) volume headers.  For a RW mount, this means
    that if an older volume header is selected, the newer ones that were
    not selected WILL be formally invalidated by the mount code and cannot
    be used in a remount attempt.

    During normal operation, each filesystem flush rotates to a new volume
    header.  A filesystem may have up to four volume headers spread at 2GB
    intervals.  Filesystems smaller than ~9GB or so will have fewer volume
    headers to rotate through.
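
    As a rough illustration, the number of volume headers a filesystem can
    rotate through might be computed as below.  This is a sketch only; the
    cutoff rule (a header is usable only when its whole 2GB zone fits on the
    media) is an assumption, not the kernel's exact test.

	#include <stdint.h>

	#define ZONE_BYTES	(2ULL * 1024 * 1024 * 1024)	/* 2GB zone */
	#define MAX_VOLHDRS	4

	/*
	 * Count how many of the four possible volume headers (at 2GB
	 * intervals starting at offset 0) a volume of 'volsize' bytes
	 * can rotate through.
	 */
	static int
	usable_volhdrs(uint64_t volsize)
	{
		uint64_t zones = volsize / ZONE_BYTES;	/* whole 2GB zones */

		if (zones < 1)
			return 1;		/* header #0 always exists */
		if (zones > MAX_VOLHDRS)
			return MAX_VOLHDRS;
		return (int)zones;
	}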

				Freemap Topology

    The freemap topology contains 4 levels of meta-data (blockref arrays),
    one of which is embedded in the volume header (so only three real
    meta-data levels), plus one level of leaf-data.  Unlike normal files,
    which use a variable radix, the freemap topology uses a fixed radix to
    simplify the algorithm and to ensure freemap locality to the blocks
    under management.

    Freemap blocks are allocated from the reserved area in each 2GB zone.
    The leafs represent data in the zone.  Higher levels in the freemap
    topology will cover more area but the physical freemap meta-data blocks
    always occur prior to the area being covered.  Thus a HAMMER2 filesystem
    of almost any size can be formatted and the related freemap blocks
    will always exist.

    Level 1 - (radix 10 + 21) 64KB representing 2GB.  This is represented
	      by a hammer2_bmap_data[1024] array.  Each entry represents
	      2MB worth of media storage x 1024 entries to represent 2GB.
	      Each entry contains a 128x2 bit bitmap representing 16KB
	      of storage in 2 bits (128 x 16KB = 2MB).

    Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
    Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
    Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
    Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
	      (this conveniently eats one 512-byte 'sector' of the 64KB
	      volume header).
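
    The coverage numbers above follow directly from the radix values: the
    leaf covers 2^(10+21) bytes and each meta-data level above it multiplies
    coverage by 2^10.  A minimal sketch of that arithmetic (illustration
    only, not kernel code):

	#include <stdint.h>

	#define FREEMAP_LEAF_RADIX	(10 + 21)	/* Level 1 covers 2GB */
	#define FREEMAP_META_RADIX	10		/* 1024 entries/level */

	/*
	 * Coverage of one freemap block at levels 1-4 (2GB, 2TB, 2PB, 2EB).
	 * Level 5 is the 8 blockrefs in the volume header and covers the
	 * full 2^64 byte address space, which does not fit in a shift here.
	 */
	static uint64_t
	freemap_level_coverage(int level)		/* level 1..4 */
	{
		int radix = FREEMAP_LEAF_RADIX +
			    (level - 1) * FREEMAP_META_RADIX;

		return (uint64_t)1 << radix;
	}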

    Each level is assigned reserved blocks in the 4MB header per 2GB zone.
    Since we use block 0 for the volume header, the first freemap reserved
    block in the zone begins at block 1.

    Freemap copy #0:
	Level 1 uses block 1 (this is the leaf block)
	Level 2 uses block 2
	Level 3 uses block 3
	Level 4 uses block 4

    Freemap copy #1:
	Level 1 uses block 5 (this is the leaf block)
	Level 2 uses block 6
	Level 3 uses block 7
	Level 4 uses block 8

    ... and so forth up to Freemap copy #7 using blocks 29, 30, 31, and 32.
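
    In other words the reserved block for a given copy and level is a simple
    linear function.  A small sketch of that calculation (matching the table
    above, but not the actual kernel routine):

	#include <stdint.h>

	#define FREEMAP_COPIES		8
	#define FREEMAP_LEVELS		4	/* on-media levels 1..4 */
	#define RESERVED_BLOCK_BYTES	(64 * 1024)

	/* Reserved 64KB block within a zone's 4MB header (block 0 = volhdr). */
	static int
	freemap_reserved_block(int copy, int level)	/* copy 0..7, level 1..4 */
	{
		return 1 + copy * FREEMAP_LEVELS + (level - 1);
	}

	/* Byte offset of that block relative to the start of its 2GB zone. */
	static uint64_t
	freemap_reserved_offset(int copy, int level)
	{
		return (uint64_t)freemap_reserved_block(copy, level) *
		       RESERVED_BLOCK_BYTES;
	}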

				    Flushing

    The freemap does not have to be flushed by fsync/sync, but should probably
    be flushed at least once a minute by the normal filesystem sync.  The
    reason it does not have to be flushed every time is that freemap recovery
    is executed on-mount and will use the last fully flushed freemap TID
    stored in the volume header to do an incremental meta-data scan of the
    H2 filesystem between that TID and the last flushed TID.  Any block
    referenced by that scan which is not already marked allocated in the
    freemap will be marked allocated.  Simple as that.
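
    The core of that recovery rule can be illustrated with a toy TID window
    check.  This is a sketch under assumed names (toy_blockref, modify_tid);
    it is not the HAMMER2 recovery code itself.

	#include <stdbool.h>
	#include <stdint.h>

	struct toy_blockref {
		uint64_t modify_tid;	/* TID when this blockref changed */
		uint64_t data_off;	/* media offset it references */
	};

	/*
	 * Meta-data modified at or before the freemap flush TID is already
	 * reflected in the on-media freemap; meta-data modified after the
	 * last filesystem flush never reached the media.  Only the window
	 * in between needs to be scanned, and every block referenced inside
	 * it is forced to the allocated state.
	 */
	static bool
	needs_recovery_scan(const struct toy_blockref *bref,
			    uint64_t freemap_flush_tid, uint64_t last_flush_tid)
	{
		return bref->modify_tid > freemap_flush_tid &&
		       bref->modify_tid <= last_flush_tid;
	}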

				Freemap Granularity

    The freemap granularity is 16KB (radix of 14) but the minimum
    allocation radix is 1KB (radix of 10) (and can be in multiples of
    1KB with some coding).  1KB inodes can hold up to 512 bytes of direct
    data, so tiny files eat exactly 1KB of media storage inclusive of the
    inode.
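
    As a minimal sketch of what those limits mean for an allocation request
    (illustration only, not the allocator): sizes are rounded up to a power
    of two no smaller than 1KB, while the freemap bitmap itself only tracks
    16KB chunks.

	#include <stddef.h>

	#define MIN_ALLOC_RADIX		10	/* 1KB minimum allocation */
	#define FREEMAP_BLOCK_RADIX	14	/* 16KB bitmap granularity */

	/* Smallest power-of-two radix >= bytes, clamped to the 1KB minimum. */
	static int
	alloc_radix(size_t bytes)
	{
		int radix = MIN_ALLOC_RADIX;

		while (((size_t)1 << radix) < bytes)
			radix++;
		return radix;
	}

	/*
	 * e.g. a 300-byte tiny file still consumes a 1KB inode:
	 * alloc_radix(300) == 10.
	 */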

    The freemap keeps track of partial allocations in-memory but not
    on-media, so even a normal umount will cause partially allocated
    blocks to appear fully allocated until some later date when the
    bulk scan code defragments it.

				 Block Selection

    Block selection is localized to be near the inode's (or nearby data)
    blockref.  The algorithmic complexity of determining locality is not
    defined here atm.

			     Freemap Leaf Substructure

    * linear - Linear sub-granular allocation offset.  Allows ~1KB granular
	       linear allocations.

    * class  - Allocation clustering class ((type << 8) | radix).

    * avail  - Available space in bytes, currently only used by layer 1 leaf.
	       Used as an allocation clustering aid.

    * bitmap - Eight 32 bit words representing ~2MB in 16KB allocation chunks
	       at 2 bits per chunk.  The filesystem allocation granularity
	       can be smaller (currently ~1KB minimum), and the live
	       filesystem caches iterations when allocating multiple chunks.
	       However, on remount any partial allocations out of a 64KB
	       allocation block MAY cause the entire 64KB to be considered
	       allocated.  Fragmented space can potentially be reclaimed
	       and/or relocated by the bulk block free scan.

	       The 2-bit bitmap fields are assigned as follows:

	       00	FREE
	       01	POSSIBLY FREE (type 1)
	       10	POSSIBLY FREE (type 2)
	       11	ALLOCATED
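
    A small sketch of how those 2-bit codes are addressed within a bmap
    entry's eight 32-bit words (field and function names here are
    illustrative, not the exact on-media structure): 16 chunks per word,
    128 chunks (~2MB) per entry.

	#include <stdint.h>

	#define BMAP_FREE	0	/* 00 */
	#define BMAP_POSSIBLE1	1	/* 01 possibly free (type 1) */
	#define BMAP_POSSIBLE2	2	/* 10 possibly free (type 2) */
	#define BMAP_ALLOCATED	3	/* 11 */

	struct toy_bmap {
		uint32_t bitmap[8];	/* 8 x 32 bits = 128 x 2-bit codes */
	};

	/* Extract the 2-bit code for a 16KB chunk (chunk 0..127). */
	static int
	bmap_get(const struct toy_bmap *bm, int chunk)
	{
		return (bm->bitmap[chunk >> 4] >> ((chunk & 15) * 2)) & 3;
	}

	/* Overwrite the 2-bit code for a 16KB chunk. */
	static void
	bmap_set(struct toy_bmap *bm, int chunk, int code)
	{
		int shift = (chunk & 15) * 2;

		bm->bitmap[chunk >> 4] &= ~((uint32_t)3 << shift);
		bm->bitmap[chunk >> 4] |= (uint32_t)code << shift;
	}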

			  Freemap Metadata Substructure
			     (Levels 2, 3, 4, and 5)

    Freemap layers 2, 3, 4, and 5 operate as arrays of blockrefs but steal
    some of the check area (a 24-byte area) for freemap-specific meta-data.
    We reserve a few fields to store information which allows the block
    allocator to do its work more efficiently.

    * bigmask - A mask of radixes available for allocation under this
		blockref.  Typically initialized to -1.

    * avail   - Total available space in bytes.
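
    Purely as an illustration of the idea (the real blockref layout lives in
    hammer2_disk.h and is not reproduced here), the stolen 24-byte check area
    of a freemap blockref could be pictured as:

	#include <stdint.h>

	/* Hypothetical picture of a freemap blockref's 24-byte check area. */
	struct toy_freemap_check {
		uint32_t bigmask;	/* radixes possibly allocatable below */
		uint32_t reserved;
		uint64_t avail;		/* total bytes available below */
		uint64_t unused;	/* pads the area out to 24 bytes */
	};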

    The freemap allocator uses a cylinder-group-like abstraction using
    the localized allocation concept first implemented by UFS.  In HAMMER2
    there is no such thing as a real cylinder group, nor are there specific
    reserved areas for inodes vs data, but we do the next best thing by
    roughly typing leafs (each leaf representing ~2MB) to hopefully allow
    the drive to employ its zone-cache to make both stat-only and tar-style
    bulk accesses efficient (in addition to normal file accesses).

    Levels 2, 3, and 4 contain an array blockmap[1024] (64KB total),
    supplying 10 bits of address space each.  Level 5 is a blockmap[8]
    stored in the volume header supplying 3 bits of address space.
    (Level 1 supplies 10 + 21 bits of address space).
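
    Put together, those radix fields slice a 64-bit media offset into the
    per-level indices.  A sketch of that decomposition (not the kernel's
    radix code; names are illustrative):

	#include <stdint.h>

	struct freemap_index {
		int chunk;	/* 16KB chunk within a 2MB bmap entry (0..127) */
		int l1;		/* bmap entry within the Level 1 leaf (0..1023) */
		int l2, l3, l4;	/* blockmap indices at levels 2-4 (0..1023) */
		int l5;		/* blockref within the volume header (0..7) */
	};

	/* 14 + 7 + 10 + 10 + 10 + 10 + 3 bits account for all 64 bits. */
	static struct freemap_index
	freemap_decompose(uint64_t off)
	{
		struct freemap_index ix;

		ix.chunk = (int)((off >> 14) & 127);
		ix.l1    = (int)((off >> 21) & 1023);
		ix.l2    = (int)((off >> 31) & 1023);
		ix.l3    = (int)((off >> 41) & 1023);
		ix.l4    = (int)((off >> 51) & 1023);
		ix.l5    = (int)((off >> 61) & 7);
		return ix;
	}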

    The Level 1 blockmap is HAMMER2's idea of a 'cylinder group', thus
    effectively fixed at multiples of ~2MB or so.

			        Initial Conditions

    The freemap is a multi-indirect block structure but there is no real
    reason to pre-format it in newfs_hammer2.  Instead, newfs_hammer2 simply
    leaves the associated top-level indirect blocks empty and uses the
    (voldata->allocator_beg) field to allocate space linearly, then leaves
    it to the live filesystem to initialize the freemap as more space gets
    allocated.

    The freemap does NOT use a fixed 5-level radix tree.  It uses the same
    blockmap algorithm used for file blocks but restricts any recursion to
    specific radix values.  This means that small filesystems will have much
    smaller freemap depths.  Two levels (counting the 8 volume header
    blockrefs as one level and the leaf blocks as the other) gets us 16GB,
    three levels gets us 16TB.
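
    The reach of each depth follows from the numbers above (8 header
    blockrefs, 1024 entries per blockmap, 2GB per leaf).  A sketch of that
    arithmetic, assumed from those figures rather than taken from the code:

	#include <stdint.h>

	/*
	 * How much media a freemap of the given depth can address, where
	 * the 8 volume header blockrefs count as one level.  Depth 2 means
	 * the header blockrefs point directly at 2GB leaves; each extra
	 * level multiplies the reach by 1024.  Depth 5 would be the full
	 * 2^64 bytes and overflows a 64-bit byte count.
	 */
	static uint64_t
	freemap_reach(int levels)		/* 2..4 */
	{
		uint64_t reach = 8ULL << 31;	/* 8 x 2GB leaves = 16GB */
		int i;

		for (i = 2; i < levels; i++)
			reach <<= 10;		/* one more 1024-entry level */
		return reach;			/* 16GB, 16TB, 16PB */
	}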

			How blocks are allocated and freed

    The H2 freemap leaf bitmap operates in 16KB chunks, but the leaf also
    contains a linear allocation offset that can keep track of sub-16KB
    allocations with certain restrictions.  More random sub-16KB allocations
    are tracked in-memory, but will be lost (assumed to be a full 16KB) if
    a crash occurs.

    NOTE!  All operations on the freemap occur on the current live version
	   of the freemap, including bulkfree operations.

    Blocks are allocated by transitioning the 2-bit pattern in the leaf
    to 11.  That is, (00, 01, 10) -> (11).  This handles races between the
    live allocator and the asynchronous bulkfree code.  A live allocation
    which occurs while the asynchronous bulkfree process is running will
    operate race-free by transitioning the (01) and (10) states back
    to (11), which prevents bulkfree from later marking those blocks
    FREE (00).
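
    Continuing the toy bitmap sketch from the leaf substructure section
    (again illustrative, not kernel code), the allocation transition and the
    corresponding bulkfree check look roughly like this:

	#include <stdbool.h>
	#include <stdint.h>

	/* (00, 01, 10) -> 11: the live allocator simply ORs in both bits. */
	static void
	bmap_allocate_chunk(uint32_t bitmap[8], int chunk)	/* 0..127 */
	{
		bitmap[chunk >> 4] |= (uint32_t)3 << ((chunk & 15) * 2);
	}

	/*
	 * bulkfree only transitions POSSIBLY FREE (01/10) chunks to FREE;
	 * a chunk the live system re-marked ALLOCATED (11) in the meantime
	 * is left alone, which is what makes the race harmless.
	 */
	static bool
	bulkfree_may_free_chunk(const uint32_t bitmap[8], int chunk)
	{
		int code = (bitmap[chunk >> 4] >> ((chunk & 15) * 2)) & 3;

		return (code == 1 || code == 2);
	}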

    Blocks can be freed several ways, but all block freeing operations
    require at least two passes before the related blocks can actually be
    reused.

    Method #1 - Removal in the filesystem marks the related freemap bitmap
		POSSIBLY FREE (either 01 or 10).  The asynchronous bulkfree
		process later determines that the block is actually free and
		transitions it to FREE (00), or moves it back to
		ALLOCATED (11).

		This works whether the blocks can actually be freed or not,
		so we don't care if the related blocks are part of some other
		snapshot or not.  bulkfree will figure it out.

    Method #2 - Removal in the filesystem ignores the freemap.  The freemap
		blocks are left alone (typically ALLOCATED (11)).

		In this situation bulkfree must make extra passes to determine
		if blocks are possibly free, then transition the leaf bitmap
		entries to POSSIBLY FREE (01 or 10).  bulkfree cannot directly
		transition the entries to FREE (00) without another pass.

		However, this method has numerous advantages including making
		snapshot manipulation (including deletions) instantaneous
		and allowing whole subtrees and/or large files to be rm -rf'd
		with only a single disk write to update the inode in the
		best case.

    Method #3 - Brute force.  *ALL* freemap bitmap entries are marked
		POSSIBLY FREE and bulkfree then must do multiple passes
		(particularly in order to ensure that its memory use remains
		bounded) to again transition all the freemap bitmap entries
		to either FREE (00) or ALLOCATED (11).

		This method can be faster than #2 but wastes a considerable
		amount of write bandwidth (and incurs SSD wear if the target
		drive is an SSD).

    In all cases the bulkfree code must make a final pass on the filesystem
    to do the final transition of POSSIBLY FREE blocks to FREE (00) or
    ALLOCATED (11).  Again, races for the FREE (00) transition are handled
    by observing whether the bitmap code was moved to ALLOCATED (11) by the
    live system while bulkfree ran asynchronously, and not transitioning the
    element to FREE (00) in that situation.

    All bulkfree passes are done on meta-data.  Actual data blocks do not
    need to be read unless the media is being verified.  H2 uses method #2
    by default and efficiency depends on how much RAM the system has to
    cache scan information.  That said, the bulkfree process is not only
    incremental but it is also interruptible and restartable.  It does not
    interfere with live operations other than using disk bandwidth, so
    there are several ways to run it including in the background.

    The biggest issue is that *NO* space can be freed up by the live
    filesystem without the bulkfree process unless we optimize the case
    where data is created and deleted from within a single snapshot.
    This is made more difficult by the fact that each flush represents
    a fine-grained snapshot (up to four, representing the four volume
    headers the flush iterates through).

		      Snapshots and Replicated Topologies

    The bulkfree code maintains information in-memory to the best of its
    ability for a multitude of reasons, including attempting to detect
    snapshot recursions down block chains which have already been scanned
    via some other snapshot.  Without this, a large number of snapshots
    can cause a huge multiplication of disk I/O reads (but not writes) during
    the topology scan.

			Use of Generic indirect-block API

    I decided to use the same indirect-block allocation model for the
    freemap that normal files use, with a few special cases added to force
    specific radix values and to 'allocate' the freemap-related blocks
    and indirect blocks via a reserved-block calculation and (obviously)
    not via a recursive call to the allocator.

    The Freemap is defined above as a fixed 5-level scheme (levels 1-5),
    but in actual operation the radix tree can be shortcut just as it
    is with normal files.  However, unlike normal files, shortcuts will
    be forced to use specific radix values in order to guarantee that
    reserved block numbers can be trivially calculated.  As the freemap
    becomes more fleshed out the tree on-media will look more and more like
    the actual specification.

    One advantage of doing things this way is that smaller filesystems
    won't actually use a 5-level scheme.  A 16GB filesystem can use 8
    blockrefs in the volume header which point directly to layer 1 leaf
    blocks.  A 16TB filesystem can be managed with only three levels
    (layer 3, 2, and 1 only where the 8 x layer 3 blockrefs are stored in
    the volume header).  And so forth.

    At the moment we have no plans to return any of the unused 4MB zone
    header space (per 2GB of storage) back to the filesystem for general use.
    There are lots of things we may want to use the reserved areas for in
    the future.

				Emergency Deletions

    All filesystem modifications including deletions must allocate blocks
    in order to update the main topology all the way to the root.  H2 will
    reserve roughly 5% of the available blocks in the filesystem for
    deletions in order to allow a system operator to recover from a
    filesystem full condition.

    Despite this, situations may come up, due to having snapshots, where
    deletions eat up available blocks but fail to create freeable space.
    When this situation occurs the system operator may be forced to issue
    emergency in-place deletions which replace existing blocks rather than
    allocate new blocks.  For the moment the spec for dealing with these
    situations remains incomplete.
335