xref: /dragonfly/sys/vfs/hammer2/FREEMAP (revision a1113e4e)
1
2			HAMMER2 Freemap Design Notes
3
4				Overview
5
6    HAMMER2 Media is broken down into 2 GByte zones.  Each 2 GByte zone
7    contains a 4 MByte header (64 x 64K blocks).  The blocks in this header
8    are reserved for various purposes.  For example, block #0 is used for
9    the volume header or for a volume header backup.
10
11    * It is very important to remember that the Freemap only uses blocks
12      from these reserved areas.  Freemap blocks are NOT dynamically
13      allocated.
14
15    * On-mount, the synchronization TID for the main H2 filesystem is
16      compared against the synchronization TID of the freemap and the
17      H2 topology is incrementally iterated using mirror_tid to update
18      the freemap with any missing information.  This way the freemap flush
19      does not need to be synchronized with the normal H2 flush.  This
20      can be done very quickly on-mount.
21
22    * The freemap is flushed in a manner similar to the normal H2 filesystem,
23      but as mentioned above it does not have to be synchronized.  One freemap
24      flush could cover several H2 flushes.  A freemap flush is not necessary
25      for e.g. a fsync() or sync() to complete successfully.
26
27    * The freemap granularity is 64KB (radix of 16) but the minimum
28      allocation radix for code is 1KB (radix of 10).  1KB inodes can
29      hold up to 512 bytes of direct data, so small files eat exactly
30      1KB of media storage.
31
32    * Representation of storage is block-oriented with ~1KB granularity
33      in the filesystem topology.  However, H2 also stores freemap locality
34      hints in the inode at all levels which specifies which freemap zones
35      all storage allocations made by the sub-tree are allocated from.  Up
36      to four zones may be listed in each inode.  The zones are power-of-sized
37      and aligned the same and use a base/radix representation (same as used
38      for blockref->data_off).
39
40      During updates higher level inodes may not have a sufficient number of
41      entries to represent the storage used on a fine-grain.  In this
42      situation the representations back-off to larger radix values.
43
44      Ultimately these representations will be optimized by background scans.
45      That is, ultimately storage localization can be optimized bottom-up
46      such that it winds up being fairly optimal.  This includes the ability
47      to detect when a writable snapshot has differentiated sufficiently to
48      warrant a split.  This optimization should NOT attempt to dup common
49      data blocks.
50
51    * The zone oriented forward storage references in the inode (the four
52      entries) is used by the bulk free-scan to reduce the amount of
53      meta-data which must be duplicatively scanned.  More specifically,
54      when the sysadmin deletes storage and/or files or even whole directory
55      subhierachies, it is possible for a bulk free-scan to incrementally
56      scan the meta-data topology that covers ONLY those areas to determine
57      if it is possible to free up any actual blocks.
58
59    * H2 does not require that a rm -rf or snapshot destruction, truncation,
60      or any other operation actually modify the freemap.  That is, the
61      freemap elements can remain set to ALLOCATED (11).  In fact, it is
62      possible to just delete the directory inode and not even recursively
63      scan sub-directories.  The related storage will eventually be freed
64      by an exhaustive bulk free-scan anyway.
65
66				Freemap Topology
67
68    The freemap topology contains 4 levels of meta-data (blockref arrays),
69    one of which is embedded in the volume header (so only three real
70    meta-data levels), plus one level of leaf-data.
71
72    Level 1 - (radix 10) 64KB blockmap representing 2GB.  There are 1024
73	      entries representing ~2MB worth of media storage per entry.
74
75	      Each entry maps 32 x 64KB allocations @ 2 bits per allocation,
76	      plus contains additional meta-data which allows H2 to cluster
77	      I/O operations.  Each entry locks the allocation granularity
78	      (e.g. to 1KB = radix 10 for inodes).
79
80    Level 2 - (radix 10) 64KB blockmap representing 2TB (~2GB per entry)
81    Level 3 - (radix 10) 64KB blockmap representing 2PB (~2TB per entry)
82    Level 4 - (radix 10) 64KB blockmap representing 2EB (~2PB per entry)
83    Level 5 - (radix 3) blockref x 8 in volume header representing 16EB (2^64)
84	      (this conveniently eats one 512-byte 'sector' of the 64KB
85	      volume header).
86
87    Each level is assign reserved blocks in the 4MB header per 2GB zone.
88    Since we use block 0 for the volume header / volume header backup,
89    our level names above can simply also represent the relative block
90    number.  Level 1 uses block 1 through level 4 using block 4.  Level 5
91    is stored in the volume header.
92
93    In addition there are FOUR SETS, A, B, C, and D, each containing blocks
94    for level 1-4.  Hammer2 alternates between sets on a block-by-block basis
95    in order to maintain consistency when updating the freemap.
96
97				Leaf Substructure
98
99    * radix  - Clustering radix.  All allocations for any given ~2MB zone
100	       are always the same size, allowing the filesystem code to
101	       cluster buffer cache I/O.
102
103    * bitmap - four 32 bit words representing ~2MB in 64KB allocation chunks
104	       at 2 bits per chunk.  The filesystem allocation granularity
105	       can be smaller (currently ~1KB minimum), and the live
106	       filesystem keeps caches iterations when allocating multiple
107	       chunks.  However, on remount any partial allocations out of
108	       a 64KB allocation block causes the entire 64KB to be
109	       considered allocated.  Fragmented space can potentially be
110	       reclaimed and/or relocated by the bulk block free scan.
111
112	       The 2-bit bitmap fields are assigned as follows:
113
114	       00	FREE
115	       01	ARMED for free stage (future use)
116	       10	ARMED for free stage (future use)
117	       11	ALLOCATED
118
119	       It should be noted that in some cases, such as snapshot
120	       destruction, H2 does not bother to actually ARM the related
121	       blocks (which would take a long time).  Instead, the bulk
122	       free-scan may have to do a more exhaustive scan.
123
124			      Blockref Substructure
125
126    The blockref substructure at each level steals some space from the
127    check code area (a 24-byte area).  We only need 4 bytes for the check
128    code icrc.  We use some of the remaining space to store information
129    that allows the block allocator to do its work more efficiently.
130
131    * bigmask - A mask of radixes available for allocation under this
132		blockref.  Typically initialized to -1.
133
134    * avail   - Total available space in bytes.
135
136    The freemap allocator uses a cylinder-group-like abstraction using
137    the localized allocation concept first implemented by UFS.  In HAMMER2
138    there is no such thing as a real cylinder group, but we do the next
139    best thing by implementing our layer 1 blockmap representing 2GB.
140
141    The layer 1 blockmap is an array of 1024 blockrefs, so each blockref
142    covers 2MB worth of media storage.  HAMMER2's 'cylinder group' concept
143    thus has a minimum granularity of 2MB.  A typical setting might be e.g.
144    10MB.
145
146    By localizing allocations to cylinder groups based on various bits of
147    information, HAMMER2 tries to allocate space on the disk and still leave
148    some left over for localized expansion and to reduce fragmentation at
149    the same time.  Not an easy task, especially considering the copy-on-write
150    nature of the filesystem.  This part of the algorithm likely needs a lot
151    of work but I hope I've laid down a media format that will not have to be
152    changed down the line to accomodate better allocation strategies.
153
154			        Initial Conditions
155
156    The freemap is a multi-indirect block structure but there is no real
157    reason to pre-format it in newfs_hammer2.  Instead, newfs_hammer2 simply
158    leaves the associated top-level indirect blocks empty and uses the
159    (voldata->allocator_beg) field to allocate space linearly, then leaves
160    it to the live filesystem to initialize the freemap as more space gets
161    allocated.
162
163				How blocks are freed
164
165    The freemap bit patterns for each 64KB block are as follows:
166
167       00	FREE
168       01	ARMED (for free) (future use)
169       10	ARMED (for free) (future use)
170       11	ALLOCATED
171
172    Currently H2 only implements 00 and 11.  When a file, topology, or
173    snapshot is deleted H2 simply leaves the blocks marked allocated but
174    records the related freezone/radix(s) in memory.
175
176    At some point a background bulk free-scan will run.  This code must
177    scan meta-data and has a limited cache to detect duplicative sub-trees
178    (due to snapshots).  It uses the freezone/radix information recorded
179    in memory to reduce the complexity of the scan, find all references to
180    the related blocks in the meta-data, and determines what can actually
181    be freed.  Once this determination is made the bulk free-scan sets
182    the related freemap bits to FREE (00).
183
184    An exhaustive free-scan is not usually required during normal operation
185    but is typically run incrementally by cron every so often to ensure, over
186    time, that all freeable blocks are actually freed.  This is most useful
187    when maintaining multiple snapshots.
188
189			Use of Generic indirect-block API
190
191    I decided to use the same indirect-block allocation model for the
192    freemap that normal files use, with a few special cases added to force
193    specific radix values and to 'allocate' the freemap-related blocks
194    and indirect blocks via a reserved-block calculation and (obviously)
195    not via a recursive call to the allocator.
196
197    The Freemap is defined above as a fixed 5-level scheme (level 1-5),
198    but in actual operation the radix tree can be shortcut just as it
199    is with normal files.  However, shorcuts are forced into the radix
200    values of this specification and reserved blocks are calculated based
201    on the radix level and offset, so as the freemap becomes more fleshed
202    out the tree looks more and more like the specification.
203
204    One advantage of doing things this way is that smaller filesystems
205    won't actually use a 6-level scheme.  A 16GB filesystem can use 8
206    blockrefs at layer 5 (in the volume header) that point directly to
207    layer 1.  A 16TB filesystem can use 8 blockrefs at layer5 that point
208    to layer 2.  And so forth.
209
210    At the moment we have no plans to return any of the unused 4MB zone
211    header space (per 2GB of storage) back to the filesystem for general use.
212    There are lots of things we may want to use the reserved areas for in
213    the future.
214