xref: /dragonfly/sys/vfs/hammer2/FREEMAP (revision 3a25be87)
1
2			HAMMER2 Freemap Design Notes
3
4				Overview
5
6    HAMMER2 Media is broken down into 2 GByte zones.  Each 2 GByte zone
7    contains a 4 MByte header (64 x 64K blocks = 0.2% of storage).  The
8    blocks in this header are reserved for various purposes.  For example,
9    block #0 is reserved for a volume header or volume header backup.  Most
10    of the 64KB blocks in this header are reserved for use by the freemap.
11
12    The freemap only uses blocks from these reserved areas.  In order to
13    ensure that any of the four volume headers can be used for the mount
14    (in case some are found to be corrupted), each freemap block in the
15    logical freemap topology iterates through 6 different copies.  The
16    freemap is a 4-layer topology (+1 3-bit layer in the volume header),
17    so 6x4 = 24 of the 64 reserved blocks are dedicated to freemap
18    operations.
19
20    Any given modification of a freemap block that crosses a flush group
21    must cycle to the next copy of the freemap block.  Having 6 copies
22    ensures that:
23
24    - Each of the four backup volume headers points to a consistent
25      freemap topology.  This eats 4 copies.
26
27    - That recovery operations during mount do not modify the state of the
28      freemap topology pointed to by older volume headers that are still
29      valid.  This eats 1 copy.
30
31    - The bulk free scan eats 1 copy to use as spool-off space if the
32      thread hits its ram limits.  This copy is not part of the normal
33      rotation.
34
35    - Total is 6 copies.
36
37    However, there is one major restriction: If an older volume header is
38    selected by the mount code, any newer (presumably corrupt since the
39    mount code didn't select it) volume headers will lose freemap consistency
40    as the freemap code rotates into freemap blocks that might have been used
41    by the topology pointed to by the newer (but not selected) backup
42    volume headers.  For a RW mount, this means that if an older volume
43    backup is selected, the newer ones that were not selected MUST be
44    formally invalidated and cannot be used in a remount attempt.  To
45    mitigate the potential loss of data, any volume headers lost in this
46    manner can be snapshotted and the freemap recovery scan (in a RW mount)
47    can also scan the snapshots to try to ensure that the blocks are marked
48    as allocated.  The system operator can then check the snapshot manually.
49
50    During normal operation, each filesystem flush rotates to a new backup
51    volume header (a filesystem has up to four) and retains full consistency
52    for the older volume headers.  Each logical freemap block in the topology
53    rotates through the 6 possible versions (on-modify only).
54
55				Freemap Topology
56
57    The freemap topology contains 4 levels of meta-data (blockref arrays),
58    one of which is embedded in the volume header (so only three real
59    meta-data levels), plus one level of leaf-data.  Unlike normal files,
60    which use a variable-radix, the freemap topology uses a fixed radix to
61    simplify the algorithm and to ensure freemap locality to the blocks
62    under management.
63
64    Level 1 - (radix 10 + 21) 64KB representing 2GB.  This is represented
65	      by a hammer2_bmap_data[1024] array.  Each entry represents
66	      2MB worth of media storage x 1024 entries to represent 2GB.
67	      Each entry contains a 128x2 bit bitmap representing 16KB
68	      of storage in 2 bits (128 x 16KB = 2MB).
69
70    Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
71    Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
72    Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
73    Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
74	      (this conveniently eats one 512-byte 'sector' of the 64KB
75	      volume header).
76
77    Each level is assign reserved blocks in the 4MB header per 2GB zone.
78    Since we use block 0 for the volume header / volume header backup,
79    our level names above can simply also represent the relative block
80    number.  Level 1 uses block 1 through level 4 using block 4.  Level 5
81    is stored in the volume header.
82
83				    Flushing
84
85    The freemap does not have to be flushed by fsync/sync, but should probably
86    be flushed at least once a minute by the normal filesystem sync.  The
87    reason it does not have to be flushed every time is that the freemap
88    recovery (using the last fully flushed freemap TID) will simply do an
89    incremental scan of the main filesystem tree between the freemap TID
90    and the main filesystem tree's TID to ensure that blocks allocated in
91    the interim are properly allocated in the freemap.  Simple as that.
92
93				Freemap Granularity
94
95    The freemap granularity is 16KB (radix of 14) but the minimum
96    allocation radix is 1KB (radix of 10) (and can be in multiples of
97    1KB with some coding).  1KB inodes can hold up to 512 bytes of direct
98    data, so tiny files eat exactly 1KB of media storage inclusive of the
99    inode.
100
101    The freemap keeps track of partial allocations in-memory but not
102    on-media, so even a normal umount will cause partially allocated
103    blocks to appear fully allocated until some later date when the
104    bulk scan code defragments it.
105
106				   Block Selection
107
108    Block selection is localized to be near the inode's (or nearby data)
109    blockref.  The algorithmic complexity of determining locality is not
110    defined here atm.
111
112				Leaf Substructure
113
114    * radix  - Clustering radix.  All allocations for any given ~2MB zone
115	       are always the same size, allowing the filesystem code to
116	       cluster buffer cache I/O.
117
118    * bitmap - four 32 bit words representing ~2MB in 16KB allocation chunks
119	       at 2 bits per chunk.  The filesystem allocation granularity
120	       can be smaller (currently ~1KB minimum), and the live
121	       filesystem keeps caches iterations when allocating multiple
122	       chunks.  However, on remount any partial allocations out of
123	       a 64KB allocation block causes the entire 64KB to be
124	       considered allocated.  Fragmented space can potentially be
125	       reclaimed and/or relocated by the bulk block free scan.
126
127	       The 2-bit bitmap fields are assigned as follows:
128
129	       00	FREE
130	       01	ARMED for free stage (future use)
131	       10	ARMED for free stage (future use)
132	       11	ALLOCATED
133
134	       It should be noted that in some cases, such as snapshot
135	       destruction, H2 does not bother to actually ARM the related
136	       blocks (which would take a long time).  Instead, the bulk
137	       free-scan may have to do a more exhaustive scan.
138
139			      Blockref Substructure
140
141    The blockref substructure at each level steals some space from the
142    check code area (a 24-byte area).  We only need 4 bytes for the check
143    code icrc.  We use some of the remaining space to store information
144    that allows the block allocator to do its work more efficiently.
145
146    * bigmask - A mask of radixes available for allocation under this
147		blockref.  Typically initialized to -1.
148
149    * avail   - Total available space in bytes.
150
151    The freemap allocator uses a cylinder-group-like abstraction using
152    the localized allocation concept first implemented by UFS.  In HAMMER2
153    there is no such thing as a real cylinder group, but we do the next
154    best thing by implementing our layer 1 blockmap representing 2GB.
155
156				Level 2, 3, 4, 5
157
158    Levels 2, 3, and 4 contains an array blockmap[1024] (64KB total),
159    supplying 10 bits of address space each.  Level 5 is a blockmap[8] stored
160    in the volume header supplying 3 bits of address space.  (level 0
161    supplies 10 + 21 bits of address space).
162
163    The Level1 blockmap is HAMMER2's idea of a 'cylinder group', thus
164    effectively fixed at multiples of ~2MB or so.
165
166			        Initial Conditions
167
168    The freemap is a multi-indirect block structure but there is no real
169    reason to pre-format it in newfs_hammer2.  Instead, newfs_hammer2 simply
170    leaves the associated top-level indirect blocks empty and uses the
171    (voldata->allocator_beg) field to allocate space linearly, then leaves
172    it to the live filesystem to initialize the freemap as more space gets
173    allocated.
174
175    The freemap does NOT use a fixed 5-level radix tree.  It uses the same
176    blockmap algorithm used for file blocks but restricts any recursion to
177    specific radix values.  This means that small filesystems will have much
178    smaller freemap depths.  2 layers (and not counting the volume header as
179    a layer) gets us 16GB, 3 layers gets us 16TB.
180
181			How blocks are allocated and freed
182
183    H2 keeps track of sub-16KB allocations in-memory.  On a crash/reboot any
184    partial allocations effectively become full 16KB block allocations until
185    the bulk freeing code comes along and fixes it.  2-bit patterns are as
186    follows:
187
188       00	FREE
189       01	ARMED (for free) (future use)
190       10	ARMED (for free) (future use)
191       11	ALLOCATED
192
193    Currently H2 only implements 00 and 11.  When a file, topology, or
194    snapshot is deleted H2 simply leaves the blocks marked allocated but
195    records the related freezone/radix(s) in memory.
196
197    At some point a background bulk free-scan will run.  This code must
198    scan meta-data and has a limited cache to detect duplicative sub-trees
199    (due to snapshots).  It uses the freezone/radix information recorded
200    in memory to reduce the complexity of the scan, find all references to
201    the related blocks in the meta-data, and determines what can actually
202    be freed.  Once this determination is made the bulk free-scan sets
203    the related freemap bits to FREE (00).
204
205    An exhaustive free-scan is not usually required during normal operation
206    but is typically run incrementally by cron every so often to ensure, over
207    time, that all freeable blocks are actually freed.  This is most useful
208    when maintaining multiple snapshots.
209
210			Use of Generic indirect-block API
211
212    I decided to use the same indirect-block allocation model for the
213    freemap that normal files use, with a few special cases added to force
214    specific radix values and to 'allocate' the freemap-related blocks
215    and indirect blocks via a reserved-block calculation and (obviously)
216    not via a recursive call to the allocator.
217
218    The Freemap is defined above as a fixed 5-level scheme (level 1-5),
219    but in actual operation the radix tree can be shortcut just as it
220    is with normal files.  However, shorcuts are forced into the radix
221    values of this specification and reserved blocks are calculated based
222    on the radix level and offset, so as the freemap becomes more fleshed
223    out the tree looks more and more like the specification.
224
225    One advantage of doing things this way is that smaller filesystems
226    won't actually use a 6-level scheme.  A 16GB filesystem can use 8
227    blockrefs at layer 5 (in the volume header) that point directly to
228    layer 1.  A 16TB filesystem can use 8 blockrefs at layer5 that point
229    to layer 2.  And so forth.
230
231    At the moment we have no plans to return any of the unused 4MB zone
232    header space (per 2GB of storage) back to the filesystem for general use.
233    There are lots of things we may want to use the reserved areas for in
234    the future.
235