xref: /dragonfly/sys/vfs/hammer2/DESIGN (revision a4f37ab4)
1
2			    HAMMER2 DESIGN DOCUMENT
3
4				Matthew Dillon
5			     dillon@backplane.com
6
7			       24-Jul-2017 (v5)
8			       09-Jul-2016 (v4)
9			       03-Apr-2015 (v3)
10			       14-May-2013 (v2)
11			       08-Feb-2012 (v1)
12
13			Current Status as of document date
14
15* Filesystem Core	- operational
16  - bulkfree		- operational
17  - Compression		- operational
18  - Snapshots		- operational
19  - Deduper		- live operational, batch specced
20  - Subhierarchy quotas - scrapped (still possible on a limited basis)
21  - Logical Encryption	- not specced yet
22  - Copies		- not specced yet
23  - fsync bypass	- not specced yet
24
25* Clustering core
26  - Network msg core	- operational
27  - Network blk device	- operational
28  - Error handling	- under development
29  - Quorum Protocol	- under development
30  - Synchronization	- under development
31  - Transaction replay	- not specced yet
32  - Cache coherency	- not specced yet
33
34			    Intended Feature List
35		       (not all features yet implemented)
36
37* Standard filesystem semantics with full hardlink and softlink support.
38
39* Filesystems can be constructed across multiple nodes.  Each low-level
40  H2 device can accommodate nodes belonging to multiple cluster components
41  as well as nodes that are simply local to the device or machine.
42
43* A dynamic radix tree on each formatted device is used to index the
44  topology, with 64-bit keys.  Elements can be ranged in powers of 2.
45  The tree is built efficiently, bottom-up.  Indirect blocks are only
46  created when a layer fills up.
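
  As a rough illustration of the power-of-2 ranging (a sketch only, not the
  kernel code; the names below are hypothetical), a 64-bit key plus a bit
  count describes an aligned key range, and containment can be tested as
  follows:

      #include <stdbool.h>
      #include <stdint.h>

      /*
       * Illustrative only: an element covers 2^keybits keys starting at an
       * aligned base.  keybits == 0 describes a single key.
       */
      static inline uint64_t
      range_base(uint64_t key, int keybits)
      {
          return (keybits >= 64) ? 0 : (key & ~((UINT64_C(1) << keybits) - 1));
      }

      static inline bool
      range_contains(uint64_t key, int keybits, uint64_t probe)
      {
          return range_base(key, keybits) == range_base(probe, keybits);
      }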
47
48* Utilizes a copy-on-write block mechanic for both the main topology
49  and the freemap.  Media-level block frees are delayed and flushes rotate
50  between 4 volume headers (maxes out at 4 if the filesystem is > ~8GB).
51  Flushes will allocate new blocks up to the root in order to propagate
52  block table changes and transaction ids.  Recovery will choose the most
53  recent valid volume root and can thus work around failures which cause
54  partial volume header writes.
55
56* Utilizes a fat blockref structure (128 bytes) which can store up to
57  64 bytes (512 bits) of check code data.  Defaults to simpler 64-bit
58  hashes.
59
60* 1024 byte fat inode structure also contains helpful meta-data for
61  debugging catastrophic recovery, up to 512 bytes of direct-data for
62  small files, or 4 indirect blocks (instead of the direct-data) as
63  the data root.
64
65  Inodes are stored as hidden elements under each node directory in the
66  super-root.  H2 originally tried to embed inodes in the directories in
67  which they reside, or in the nearest common parent directory when multiple
68  hardlinks were present, but this wound up being too difficult to get
69  right and made NFS support impossible (would have required adding more
70  complexity to index inode numbers, which I didn't want to do).
71
72* Directory entries are indexed in the radix tree just like everything
73  else based on a hash of the filename plus an iterator to deal with
74  collisions nicely.  Directory entries with filenames <= 64 bytes
75  fit entirely within the 128-byte blockref structure without requiring
76  a data reference for highly optimal directory scans.  In the optimal
77  case the directory entry simply uses the 64-byte check field to store
78  the filename (since there is no data reference).
79
80  Directory entries record inode number, file type, and filename.
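
  A condensed sketch of this optimization (illustrative only; the demo_*
  names and exact field layout are assumptions, not the on-media format):

      #include <stdint.h>
      #include <string.h>

      #define DEMO_CHECK_BYTES 64          /* check[] area reused for short names */

      /* What a directory entry records, per the description above. */
      struct demo_dirent {
          uint64_t inum;                   /* target inode number */
          uint8_t  type;                   /* regular file, directory, softlink, ... */
          uint8_t  namelen;
          char     name[DEMO_CHECK_BYTES]; /* embedded only if namelen <= 64 */
      };

      /* Returns 1 if the filename fits in the blockref itself (no data block
       * reference needed), 0 if a separate data block must hold the name. */
      static int
      demo_dirent_embedded(const char *name)
      {
          return strlen(name) <= DEMO_CHECK_BYTES;
      }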
81
82* The top-level for each formatted device is called the super-root.  The
83  top-level cannot be directly mounted.  Instead, it contains inodes
84  representing pieces of nodes in the cluster which can be individually
85  mounted.
86
87  This means that H2 filesystems can create multiple roots if desired.
88  In fact, snapshots are stored as discretely mountable nodes at this
89  level, so it is possible to boot from or mount root from a snapshot.
90
91  (HAMMER1 had only one root but multiple 'PFS's; however, they weren't
92  nearly as flexible).
93
94  Each formatted H2 device can hold pieces of or whole nodes belonging
95  to multiple filesystems.  This allows independent cluster components to
96  be configured within a single formatted H2 filesystem.  Each component is
97  a super-root entry, a cluster identifier, and a unique identifier.  The
98  network protocol integrates the component into the cluster when it is
99  created.
100
101* Snapshots are implemented as nodes under the super-root.  Snapshots
102  are writable and easy to create (in HAMMER1 snapshots were read-only).
103  However, HAMMER2 snapshots are not as fine-grained as HAMMER1 snapshots,
104  and also not automatic.  They must be explicitly created but are cheap
105  enough that fine-grained H2 snapshots can be created on a schedule if
106  desired.
107
108* Utilizes a Frontend-Backend operational design with multiple dedicated
109  threads for each cluster component.  This allows the frontend to dispatch
110  parallel ops to backend components and then finalize the frontend operation
111  the instant a sufficient number of components agree (depending on the
112  replication mode), even if other nodes in the cluster are stalled.
113
114  H2 can deal with stalled backend threads without stalling the frontend.
115
116* Flush handling is really difficult to deal with because we want to
117  utilize the system's normal buffer cache for efficiency.  This means
118  that some flush elements (dirty data buffer cache buffers) are not
119  necessarily in sync with dirty meta-data tracked by the filesystem code.
120
121  If H2 locks the entire filesystem during a flush, then many front-end
122  operations can wind up stalling for a very long time (depending on how
123  much dirty data the filesystem and operating system let build up).
124
125  Currently HAMMER2 tries to deal with this by allowing for an almost-fully
126  asynchronous flush.  Essentially, everything related to data and meta-data
127  except the volume header itself can be flushed asynchronously.  This
128  means it can also be flushed concurrently with front-end operations.
129
130  In order to make the 'final' flush of the volume header itself meaningful,
131  the flush code will first attempt to asynchronously flush all pending
132  buffers and meta-data, then will lock the filesystem and do a second
133  flush of anything that slipped through while the first flush was running,
134  and then flush the volume header.
135
136  CURRENT STATUS: This work is still in progress and there are still
137  stall issues in the handling of flushes.
138
139* Low memory footprint.  Except for the volume header, the buffer cache
140  is completely asynchronous and dirty buffers can be retired by the OS
141  directly to backing store with no further interactions with the filesystem.
142
143* Compression support.  Multiple algorithms are supported and can be
144  configured on a subdirectory hierarchy or individual file basis.
145  Block compression up to 64KB will be used.  Only compression ratios at
146  powers of 2 that are at least 2:1 (e.g. 2:1, 4:1, 8:1, etc) will work in
147  this scheme because physical block allocations in HAMMER2 are always
148  power-of-2.  Modest compression can be achieved with low overhead, is
149  turned on by default, and is compatible with deduplication.
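
  A minimal sketch of the power-of-2 allocation rule (an illustrative helper
  with assumed names, not the kernel's routine):

      #include <stddef.h>

      #define DEMO_LBLKSIZE   65536u   /* logical block size */
      #define DEMO_MINALLOC    1024u   /* smallest physical allocation */

      /*
       * Round a compressed result up to the next power of 2 (>= 1KB) and
       * return the physical allocation size to use.  If compression does not
       * save at least half the block, the data is stored uncompressed.
       */
      static unsigned int
      demo_compressed_alloc(size_t compressed_bytes)
      {
          unsigned int psize = DEMO_MINALLOC;

          while (psize < compressed_bytes)
              psize <<= 1;
          return (psize < DEMO_LBLKSIZE) ? psize : DEMO_LBLKSIZE;
      }

  E.g. a 64KB logical block that compresses to 16KB+1 bytes is stored in a
  32KB physical block; a result that cannot fit in 32KB is stored
  uncompressed.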
150
151* De-duplication support.  HAMMER2 uses a relatively simple freemap
152  scheme that allows the filesystem to discard block references
153  asynchronously, and the same scheme allows essentially unlimited
154  references to the same data block in the hierarchy.  Thus, both live
155  de-duplication and bulk de-duplication are relatively easy to implement.
156
157* Zero detection on write (writing all-zeros), which requires the data
158  buffer to be scanned, is fully supported.  This allows the writing of 0's
159  to create holes.
160
161  Generally speaking, pre-writing zeroed blocks to reserve space doesn't work
162  well on copy-on-write filesystems.  However, if both compression and
163  check codes are disabled on a file, H2 will also disable zero-detection,
164  allow pre-reservation of file blocks (by pre-zeroing), and allow data
165  overwrites to write to the same sector.
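
  A minimal sketch of the zero-detection scan (illustrative; demo_* is an
  assumed name, not the kernel's function):

      #include <stddef.h>
      #include <stdint.h>

      /*
       * Return 1 if the logical buffer contains only zeros, in which case the
       * write can be turned into a hole.  Assumes bytes is a multiple of 8,
       * which holds for 1KB..64KB logical buffers.
       */
      static int
      demo_buffer_is_zero(const void *data, size_t bytes)
      {
          const uint64_t *p = data;
          size_t i;

          for (i = 0; i < bytes / 8; ++i) {
              if (p[i] != 0)
                  return 0;
          }
          return 1;
      }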
166
167* In addition to the above, sector overwrites (avoiding the copy-on-write)
168  are also allowed when multiple writes to the same block occur in-between
169  flush operations.
170
171* Incremental synchronization via highest-transaction id propagation
172  within the radix tree.  This is a queueless, incremental design.
173
174  CURRENT STATUS: Due to the flat inode hierarchy now being employed,
175  the current synchronization code, which silently recurses indirect nodes,
176  will be inefficient because all the inodes are at the
177  same logical level in the topology.  To fix this, the code will need
178  to explicitly iterate indirect nodes and keep track of the related
179  key ranges to match them up on an indirect-block basis, which would
180  be incredibly efficient.
181
182* Background synchronization and mirroring occurs at the logical layer
183  rather than the physical layer.  This allows cluster components to
184  have differing storage arrangements.
185
186  In addition, this mechanism will fully correct any out of sync nodes
187  in the cluster as long as a sufficient number of other nodes agree on
188  what the proper state should be.
189
190			DESIGN PENDING ON THESE FEATURES
191
192* Encryption.  Whole-disk encryption is supported by another layer, but I
193  intend to give H2 an encryption feature at the logical layer which works
194  approximately as follows:
195
196  - Encryption controlled by the client on an inode/sub-tree basis.
197  - Server has no visibility to decrypted data.
198  - Encrypt filenames in directory entries.  Since the filename[] array
199    is 256 bytes wide, the client can add random bytes after the normal
200    terminator to make it virtually impossible for an attacker to figure
201    out the filename.
202  - Encrypt file size and most inode contents.
203  - Encrypt file data (holes are not encrypted).
204  - Encryption occurs after compression, with random filler.
205  - Check codes calculated after encryption & compression (not before).
206
207  - Blockrefs are not encrypted.
208  - Directory and File Topology is not encrypted.
209  - Encryption does not provide sub-topology validation.  The client would
210    have to keep track of that itself.  The server or other clients can
211    still e.g. remove files, rename, etc.
212
213  In particular, note that even though the file size field can be encrypted,
214  the server does have visibility on the block topology and thus has a pretty
215  good idea how big the file is.  However, a client could add junk blocks
216  at the end of a file to make this less apparent, at the cost of space.
217
218  If a client really wants a fully validated H2-encrypted space the easiest
219  solution is to format a filesystem within an encrypted file by treating it
220  as a block device, but I digress.
221
222* Device ganging, copies for redundancy, and file splitting.
223
224  Device ganging - The idea here is not to gang devices into a single
225  physical volume but to instead format each device independently
226  and allow crossover-references in the blockref to other devices in
227  the set.
228
229  One of the things we want to accomplish is to ensure that a failed
230  device does not prevent access to radix tree elements in other devices
231  in the gang, and that the failed device can be reconstructed.  To do
232  this, each device implements complete reachability from the node root
233  to all elements underneath it.  When a device fails, the synchronization
234  code can theoretically reconstruct the missing material from the other
235  devices making up the gang.  New devices can be added to the gang and
236  existing devices can be removed from the gang.
237
238  Redundant copies - This is actually a fairly tough problem.  The
239  solution I would like to implement is to use the device ganging feature
240  to also implement redundancy, that way if a device fails within the
241  gang there's a good chance that it can still remain completely functional
242  without having to resynchronize.  But making this work is difficult to say
243  the least.
244
245* MESI Cache coherency for multi-master/multi-client clustering operations.
246  The servers hosting the MASTERs are also responsible for keeping track of
247  the cache state.
248
249  This is a feature that we would need in order to implement coherent
250  cross-machine multi-threading and migration.
251
252* Implement unverified de-duplication (where only the check code is tested,
253  avoiding having to actually read data blocks to calculate a de-duplication).
254  This would make use of the blockref structure's widest check field
255  (512 bits).
256
257  Out of necessity this type of feature would be settable on a file or
258  recursive directory tree basis, but should only be used when the data
259  is throw-away or can be reconstructed since data corruption (mismatched
260  duplicates with the same hash) is still possible even with a 512-bit
261  check code.
262
263  The Unverified dedup feature is intended only for those files where
264  occasional corruption is ok, such as in a web-crawler data store or
265  other situations where the data content is not critically important
266  or can be externally recovered if it becomes corrupt.
267
268				GENERAL DESIGN
269
270HAMMER2 generally implements a copy-on-write block design for the filesystem,
271which is very different from HAMMER1's B-Tree design.  Because the design
272is copy-on-write it can be trivially snapshotted simply by making a copy
273of the block table we desire to snapshot.  Snapshotting the root inode
274effectively snapshots the entire filesystem, whereas snapshotting a file
275inode only snapshots that one file.  Snapshotting a directory inode is
276generally unhelpful since it only contains directory entries and the
277underlying files are not arranged under it in the radix tree.
278
279The copy-on-write design implements a block table as a radix-tree,
280with a small fan-out in the volume header and inode (typically 4x) and
281a large fan-out for indirect blocks (typically 128x and 512x depending).
282The table is built bottom-up.  Intermediate radii are only created when
283necessary so small files and directories will have a much shallower radix
284tree.
285
286HAMMER2 implements several space optimizations:
287
288  1. Directory entries with filenames <= 64 characters will fit entirely
289     in the 128-byte blockref structure and do not require additional data
290     block references.  Since blockrefs are the core elements making up
291     block tables, most directories should have good locality of reference
292     for directory scans.
293
294  2. Inodes embed 4 blockrefs, so files up to 256KB and directories with
295     up to four directory entries (not including "." or "..") can be
296     accommodated without requiring any indirect blocks.
297
298  3. Indirect blocks can be sized to any power of two up to 65536 bytes,
299     and H2 typically uses 16384 and 65536 bytes.  The smaller size is
300     used for initial indirect blocks to reduce storage overhead for
301     medium-sized files and directories.
302
303  4. The File inode itself can directly hold the data for small
304     files <= 512 bytes in size.
305
306  5. The last block in a file will have a storage allocation in powers
307     of 2 from 1KB to 64KB as needed.  Thus a small file in excess of
308     512 bytes but less than 64KB will not waste a full 64KB block.
309
310  6. When compression is enabled, small physical blocks will be allocated
311     when possible.  However, only reductions in powers of 2 are supported.
312     So if a 64KB data block can be compressed to (16KB+1) to 32KB, then
313     a 32KB block will be used.  This gives H2 modest compression at very
314     low cost without too much added complexity.
315
316  7. Live de-dup will attempt to share data blocks when file copying is
317     detected, significantly reducing actual physical writes to storage
318     and the storage used.  Bulk de-dup (when implemented) will catch
319     other cases of de-duplication.
320
321Directories contain directory entries which are indexed using a hash of
322their filename.  The hash is carefully designed to maintain some natural
323sort ordering.
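
The exact hash lives in the HAMMER2 sources; the sketch below only
illustrates the general idea of combining a coarse sort key with hashed low
bits while reserving room for a collision iterator.  All names and constants
here are hypothetical, not the real algorithm.

    #include <stdint.h>
    #include <string.h>

    /*
     * Hypothetical filename hash: the top bits come from the leading
     * character so related names cluster in roughly sorted order, the low
     * bits come from a simple mixing hash, and the very lowest bits are
     * left free for an iterator that resolves collisions.
     */
    static uint64_t
    demo_dirhash(const char *name)
    {
        uint64_t hash = 0;
        size_t i, len = strlen(name);

        for (i = 0; i < len; ++i)
            hash = hash * 1099511628211ULL + (uint8_t)name[i]; /* FNV-style mix */
        hash &= 0x0000FFFFFFFFFF00ULL;             /* low byte reserved for iterator */
        hash |= (uint64_t)(uint8_t)name[0] << 56;  /* coarse sort key in top bits */
        return hash;
    }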
324
325The copy-on-write nature of the filesystem means that any modification
326whatsoever will have to eventually synchronize new disk blocks all the way
327to the super-root of the filesystem and the volume header itself.  This forms
328the basis for crash recovery and also ensures that recovery occurs on a
329completed high-level transaction boundary.  All disk writes are to new blocks
330except for the volume header (which cycles through 4 copies), thus allowing
331all writes to run asynchronously and concurrently prior to and during a flush,
332and then just doing a final synchronization and volume header update at the
333  end.  Many of HAMMER2's features are enabled by this core design.
334
335The Freemap is also implemented using a radix tree via a set of pre-reserved
336blocks (approximately 4MB for every 2GB of storage), and also cycles through
337multiple copies to ensure that crash recovery can restore the state of the
338filesystem quickly at mount time.
339
340HAMMER2 tries to maintain a small footprint and one way it does this is
341by using the normal buffer cache for data and meta-data, and allowing the
342kernel to asynchronously flush device buffers at any time (even during
343  synchronization).  The volume root is flushed on its own, separated from
344the asynchronous flushes by a synchronizing BUF_CMD_FLUSH op.  This means
345that HAMMER2 has very low resource overhead from the point of view of the
346operating system and is very much unlike HAMMER1 which had to lock dirty
347buffers into memory for long periods of time.  HAMMER2 has no such
348requirement.
349
350Buffer cache overhead is very well bounded and can handle filesystem
351operations of any complexity, even on boxes with very small amounts
352of physical memory.  Buffer cache overhead is significantly lower with H2
353than with H1 (and orders of magnitude lower than ZFS).
354
355At some point I intend to implement a shortcut to make fsync()'s run fast,
356  and that is to allow deep updates to blockrefs to shortcut to auxiliary
357space in the volume header to satisfy the fsync requirement.  The related
358blockref is then recorded when the filesystem is mounted after a crash and
359the update chain is reconstituted when a matching blockref is encountered
360again during normal operation of the filesystem.
361
362			MIRROR_TID, MODIFY_TID, UPDATE_TID
363
364In HAMMER2, the core block reference is a 128-byte structure called a blockref.
365The blockref contains various bits of information including the 64-bit radix
366key (typically a directory hash if a directory entry, inode number if a
367hidden hardlink target, or file offset if a file block), number of significant
368key bits for ranged recursion of indirect blocks, a 64-bit device seek that
369encodes the radix of the physical block size in the low bits (physical block
370size can be different from logical block size due to compression),
371three 64-bit transaction ids, type information, and up to 512 bits worth
372  of check data for the block being referenced, which can be anything from
373a simple CRC to a strong cryptographic hash.
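
A simplified, illustrative layout of such a blockref is sketched below.  The
authoritative definition is hammer2_blockref_t in hammer2_disk.h; the field
names and packing here are approximations for readability only.

    #include <stdint.h>

    struct demo_blockref {               /* 128 bytes total */
        uint8_t   type;                  /* inode, indirect, data, dirent, ... */
        uint8_t   methods;               /* check and compression methods */
        uint8_t   copyid;
        uint8_t   keybits;               /* #significant key bits for ranged recursion */
        uint8_t   vradix;
        uint8_t   flags;
        uint16_t  leaf_count;
        uint64_t  key;                   /* dir hash, inode number, or file offset */
        uint64_t  mirror_tid;            /* media-centric flush tid */
        uint64_t  modify_tid;            /* cluster-centric modification tid */
        uint64_t  update_tid;            /* cluster synchronization tid */
        uint64_t  data_off;              /* device offset; low bits = size radix */
        uint8_t   embed[16];             /* small embedded data (e.g. dirent info) */
        uint8_t   check[64];             /* up to 512 bits of check data */
    };

    _Static_assert(sizeof(struct demo_blockref) == 128, "blockref must be 128 bytes");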
374
375mirror_tid - This is a media-centric (as in physical disk partition)
376	     transaction id which tracks media-level updates.  The mirror_tid
377	     can be different at the same point on different nodes in a
378	     cluster.
379
380	     Whenever any block in the media topology is modified, its
381	     mirror_tid is updated with the flush id and will propagate
382	     upward during the flush all the way to the volume header.
383
384	     mirror_tid is monotonic.  It is primarily used for on-mount
385	     recovery and volume root validation.  The name is historical
386	     from H1; it is not used for nominal mirroring.
387
388modify_tid - This is a cluster-centric (as in across all the nodes used
389	     to build a cluster) transaction id which tracks filesystem-level
390	     updates.
391
392	     modify_tid is updated when the front-end of the filesystem makes
393	     a change to an inode or data block.  It does NOT propagate upward
394	     during a flush.
395
396update_tid - This is a cluster synchronization transaction id.  Modifications
397	     made to the topology will clear this field to 0 as they propagate
398	     up to the root.  This gives the synchronizer an easy way to
399	     determine what needs revalidation.
400
401	     The synchronizer revalidates the cluster bottom-up by validating
402	     a sub-topology and propagating the highest modify_tid in the
403	     validated sub-topology up via the update_tid field.
404
405	     Updates to this field may be optimized by the HAMMER2 VFS to
406	     avoid the double-transition.
407
408The synchronization code updates an out-of-sync node bottom-up and will
409dynamically set update_tid as it goes, but media flushes can occur at any
410time and these flushes will use mirror_tid for flush and freemap management.
411The mirror_tid for each flush propagates upward to the volume header on each
412flush.  modify_tid is set for any chains modified by a cluster op but does
413not propagate up, instead serving as a seed for update_tid.
414
415* The synchronization code is able to determine that a sub-tree is
416  synchronized simply by observing the update_tid at the root of the sub-tree,
417  on an inode-by-inode basis and also on a data-block-by-data-block basis.
418
419* The synchronization code is able to do an incremental update of an
420  out-of-sync node simply by skipping elements with a matching update_tid
421  (when not 0).
422
423* The synchronization code can be interrupted and restarted at any time,
424  and is able to pick up where it left off with very low overhead.
425
426* The synchronization code does not inhibit media flushes.  Media flushes
427  can occur (and must occur) while synchronization is ongoing.
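
A rough sketch of the rules above (illustrative pseudo-helpers with assumed
names, not the kernel's chain code):

    #include <stdint.h>

    struct demo_chain {
        struct demo_chain *parent;
        uint64_t mirror_tid;     /* propagates up on flush */
        uint64_t modify_tid;     /* set on modification, does not propagate */
        uint64_t update_tid;     /* set by the synchronizer, cleared by mods */
    };

    /* Front-end modification: stamp modify_tid, invalidate update_tid upward. */
    static void
    demo_modify(struct demo_chain *chain, uint64_t cluster_tid)
    {
        struct demo_chain *scan;

        chain->modify_tid = cluster_tid;
        for (scan = chain; scan; scan = scan->parent)
            scan->update_tid = 0;
    }

    /* Media flush: mirror_tid propagates to the root (and volume header). */
    static void
    demo_flush(struct demo_chain *chain, uint64_t flush_tid)
    {
        for (; chain; chain = chain->parent) {
            if (chain->mirror_tid < flush_tid)
                chain->mirror_tid = flush_tid;
        }
    }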
428
429There are several other stored transaction ids in HAMMER2.  There is a
430separate freemap_tid in the volume header that is used to allow freemap
431flushes to be deferred, and inodes have a pfs_psnap_tid which is used in
432conjuction with CHECK_NONE to allow blocks without a check code which do
433not violate the most recent snapshot to be overwritten in-place.
434
435Remember that since this is a copy-on-write filesystem, we can propagate
436a considerable amount of information up the tree to the volume header
437without adding to the I/O we already have to do.
438
439			    DIRECTORIES AND INODES
440
441Directories are hashed.  In HAMMER2, a directory can contain a mix of
442directory entries AND embedded inodes.  In the first iteration of HAMMER2
443I tried really hard to embed inodes (since most files are not usually
444hardlinked), but this created huge problems for NFS exports.  At the
445moment the super-root directory utilizes embedded inodes and filesystems
446under the super-root typically use normal directory entries.  The real
447inodes are in the mounted root directory as hidden entries.
448
449However, I reserve the right to implement embedded inodes within a
450mount to create export domains which can serve as mini-roots.  Such
451mini-roots would be able to have their own quotas and would be separately
452snapshottable, but would also have to be exported separately from the
453primary mount they exist under.
454
455Hardlinks are implemented normally, with directory entries and the maintenance
456of a nlinks count in the target inode.
457
458				    RECOVERY
459
460H2 allows freemap flushes to lag behind topology flushes.  The freemap flush
461tracks a separate transaction id (via mirror_tid) in the volume header.
462
463On mount, HAMMER2 will first locate the highest-sequenced check-code-validated
464volume header from the 4 copies available (if the filesystem is big enough,
465e.g. > ~10GB or so, there will be 4 copies of the volume header).
466
467HAMMER2 will then run an incremental scan of the topology for mirror_tid
468transaction ids between the last freemap flush tid and the last topology
469flush tid in order to synchronize the freemap.  Because this scan is
470incremental the time it takes to run will be relatively short and well-bounded
471at mount-time.  This is NOT an fsck.  Freemap flushes can be avoided for any
472number of normal topology flushes but should still occur frequently enough
473to avoid long recovery times in case of a crash.
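
A rough sketch of the mount-time sequence described above (illustrative
structures and names only):

    #include <stdint.h>

    struct demo_volhdr {
        uint64_t mirror_tid;       /* last topology flush */
        uint64_t freemap_tid;      /* last freemap flush */
        int      crc_ok;           /* check code validated */
    };

    /* Pick the most recent valid volume header out of the (up to) 4 copies. */
    static const struct demo_volhdr *
    demo_pick_volhdr(const struct demo_volhdr *hdrs, int count)
    {
        const struct demo_volhdr *best = 0;
        int i;

        for (i = 0; i < count; ++i) {
            if (hdrs[i].crc_ok &&
                (best == 0 || hdrs[i].mirror_tid > best->mirror_tid))
                best = &hdrs[i];
        }
        return best;
    }

    /*
     * The incremental freemap recovery then only has to visit blocks whose
     * mirror_tid falls in (freemap_tid, mirror_tid], re-marking their
     * allocations in the freemap.  (Recursive scan elided.)
     */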
474
475The filesystem is then ready for use.
476
477			    DISK I/O OPTIMIZATIONS
478
479The freemap implements a 1KB allocation resolution.  Each 2MB segment managed
480  by the freemap is zoned and has a tendency to collect inodes, small data,
481indirect blocks, and larger data blocks into separate segments.  The idea is
482to greatly improve I/O performance (particularly by laying inodes down next
483to each other which has a huge effect on directory scans).
484
485The current implementation of HAMMER2 implements a fixed I/O block size
486of 64KB in order to allow the mapping of hammer2_dio's in its IO subsystem
487  to consumers that might desire different sizes.  This way we don't have to
488worry about matching the buffer cache / DIO cache to the variable block
489size of underlying elements.  In addition, 64KB I/Os allow compatibility
490with physical sector sizes up to 64KB in the underlying physical storage
491with no change in the byte-by-byte format of the filesystem.
492
493The biggest reason for the fixed 64KB I/O size is not to help nominal
494front-end access but to reduce the
495complexity of having to deal with mixed block sizes in the buffer cache,
496particularly when blocks are freed and then later reused with a different
497block size.  HAMMER1 had to have specialized code to check for and
498invalidate buffer cache buffers in the free/reuse case.  HAMMER2 does not
499need such code.
500
501That said, HAMMER2 places no major restrictions on mixing block sizes within
502a 64KB block.  The only restriction is that a HAMMER2 block cannot cross
503a 64KB boundary.  The soft restrictions the block allocator puts in place
504exist primarily for performance reasons (i.e. try to collect 1K inodes
505together).  The 2MB freemap zone granularity should work very well in this
506regard.
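
A small sketch of the boundary rule (illustrative, assumed names):

    #include <stdint.h>

    #define DEMO_IOSIZE  65536u    /* fixed device I/O block size */

    /*
     * A physical allocation of 'bytes' (a power of 2, 1KB..64KB) at device
     * offset 'off' is legal only if it does not cross a 64KB I/O boundary.
     * Because allocations are naturally aligned powers of 2 no larger than
     * 64KB, alignment alone guarantees this.
     */
    static int
    demo_alloc_ok(uint64_t off, uint32_t bytes)
    {
        return (off & (bytes - 1)) == 0 &&
               (off / DEMO_IOSIZE) == ((off + bytes - 1) / DEMO_IOSIZE);
    }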
507
508HAMMER2 also allows OS support for ganging buffers together into even
509larger blocks for I/O (OS buffer cache 'clustering'), OS-supported read-ahead,
510OS-driven asynchronous retirement, and other performance features typically
511provided by the OS at the block-level to ensure smooth system operation.
512
513By avoiding wiring buffers/memory and allowing these features to run normally,
514HAMMER2 winds up with very low OS overhead.
515
516				FREEMAP NOTES
517
518The freemap is stored in the reserved blocks situated in the ~4MB reserved
519area at the base of every ~1GB level-1 zone.  The current implementation
520reserves 8 copies of every freemap block and cycles through them in order
521to make the freemap operate in a copy-on-write fashion.
522
523    - Freemap is copy-on-write.
524    - Freemap operations are transactional, same as everything else.
525    - All backup volume headers are consistent on-mount.
526
527The Freemap is organized using the same radix blockmap algorithm used for
528files and directories, but with fixed radix values.  For a maximally-sized
529filesystem the Freemap will wind up being a 5-level-deep radix blockmap,
530but the top-level is embedded in the volume header so insofar as performance
531goes it is really just a 4-level blockmap.
532
533The freemap radix allocation mechanism is also the same, meaning that it is
534bottom-up and will not allocate unnecessary intermediate levels for smaller
535filesystems.  The number of blockmap levels not including the volume header
536for various filesystem sizes is as follows:
537
538	up-to		#of freemap levels
539	1GB		1-level
540	256GB		2-level
541	64TB		3-level
542	16PB		4-level
543	4EB		5-level
544	16EB		6-level
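
Each additional level multiplies the coverage by the per-level fan-out.  The
sketch below reproduces the table, assuming ~1GB covered per level-1 block
and a 256-way fan-out per level (illustrative constants and names):

    #include <stdint.h>

    #define DEMO_L1_BYTES   (1ULL << 30)   /* ~1GB covered by one level-1 block */
    #define DEMO_FANOUT     256ULL         /* assumed fan-out per freemap level */

    /* Number of freemap levels (not counting the volume header) needed to
     * cover a volume of the given size.  Matches the table above. */
    static int
    demo_freemap_levels(uint64_t volume_bytes)
    {
        uint64_t coverage = DEMO_L1_BYTES;
        int levels = 1;

        while (levels < 6 && coverage < volume_bytes) {
            if (coverage > UINT64_MAX / DEMO_FANOUT)
                coverage = UINT64_MAX;     /* saturate near 16EB */
            else
                coverage *= DEMO_FANOUT;
            levels++;
        }
        return levels;
    }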
545
546The Freemap has bitmap granularity down to 16KB and a linear iterator that
547can linearly allocate space down to 1KB.  Due to fragmentation it is possible
548for the linear allocator to become marginalized, but it is relatively easy
549to do a reallocation of small blocks every once in a while (like once a year
550if you care at all) and once the old data cycles out of the snapshots, or you
551also rewrite the snapshots (which you can do), the freemap should wind up
552relatively optimal again.  Generally speaking I believe that algorithms can
553be developed to make this a non-problem without requiring any media structure
554changes.
555
556In order to implement fast snapshots (and writable snapshots for that
557matter), HAMMER2 does NOT ref-count allocations.  All the freemap does is
558keep track of 100% free blocks plus some extra bits for staging the bulkfree
559scan.  The lack of ref-counting makes it possible to:
560
561    - Completely trivialize HAMMER2's snapshot operations.
562    - Allow any volume header backup to be used trivially.
563    - Allow whole sub-trees to be destroyed without having to scan them.
564    - Simplify normal crash recovery operations.
565    - Simplify catastrophic recovery operations.
566
567Normal crash recovery is simply a matter of doing an incremental scan
568of the topology between the last flushed freemap TID and the last flushed
569topology TID.  This usually takes only a few seconds and allows:
570
571    - Freemap flushes to be deferred for any number of topology flush
572      cycles.
573    - The freemap to be skipped for fsync flushes, reducing fsync overhead.
574
575				FREEMAP - BULKFREE
576
577Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan.
578Blocks are first marked as being possibly free and then finalized in the
579second scan.  Live filesystem operations are allowed to run during these
580scans and any freemap block that is allocated or adjusted after the first
581scan will simply be re-marked as allocated and the second scan will not
582transition it to being free.
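
A condensed sketch of the two-stage transition (illustrative states and
names; the real implementation operates on freemap bitmaps, not a per-block
enum):

    /* Illustrative per-block freemap states used by the two bulkfree passes. */
    enum demo_fm_state {
        DEMO_FM_ALLOCATED,      /* in use */
        DEMO_FM_POSSIBLY_FREE,  /* pass 1 found no reference */
        DEMO_FM_FREE            /* pass 2 confirmed, block may be reused */
    };

    /* Pass 1: every block not reachable from the meta-data scan is marked
     * possibly-free.  Live allocations that occur afterwards simply flip the
     * state back to allocated. */
    static enum demo_fm_state
    demo_bulkfree_pass1(int referenced)
    {
        return referenced ? DEMO_FM_ALLOCATED : DEMO_FM_POSSIBLY_FREE;
    }

    /* Pass 2: only blocks still possibly-free (and still unreferenced) make
     * the final transition to free. */
    static enum demo_fm_state
    demo_bulkfree_pass2(int referenced, enum demo_fm_state s)
    {
        if (!referenced && s == DEMO_FM_POSSIBLY_FREE)
            return DEMO_FM_FREE;
        return DEMO_FM_ALLOCATED;
    }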
583
584The cost of not doing ref-count tracking is that HAMMER2 must perform two
585bulkfree scans of the meta-data to determine which blocks can actually be
586freed.  This can be complicated by the volume header backups and snapshots
587which cause the same meta-data topology to be scanned over and over again,
588but mitigated somewhat by keeping a cache of higher-level nodes to detect
589when we would scan a sub-topology that we have already scanned.  Due to the
590copy-on-write nature of the filesystem, such detection is easy to implement.
591
592Part of the ongoing design work is finding ways to reduce the scope of this
593meta-data scan so the entire filesystem's meta-data does not need to be
594scanned (though in tests with HAMMER1, even full meta-data scans have
595turned out to be fairly low cost).  In other words, it's an area where
596improvements can be made without any media format changes.
597
598Another advantage of operating the freemap like this is that some future
599version of HAMMER2 might decide to completely change how the freemap works
600and would be able to make the change with relatively low downtime.
601
602				  CLUSTERING
603
604Clustering, as always, is the most difficult bit but we have some advantages
605with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
606structures generally follow the kernel's filesystem hierarchy, which allows
607cluster operations to use topology cache and lock state.  Second,
608HAMMER2's writable snapshots make it possible to implement several forms
609of multi-master clustering.
610
611The mount device path you specify serves to bootstrap your entry into
612the cluster.  This is typically local media.  It can even be a ram-disk
613that only contains placemarkers that help HAMMER2 connect to a fully
614networked cluster.
615
616With HAMMER2 you mount a directory entry under the super-root.  This entry
617will contain a cluster identifier that helps HAMMER2 identify and integrate
618with the nodes making up the cluster.  HAMMER2 will automatically integrate
619*all* entries under the super-root when you mount one of them.  You have to
620mount at least one for HAMMER2 to integrate the block device in the larger
621cluster.
622
623For cluster servers every HAMMER2-formatted partition has a "LOCAL" MASTER
624which can be mounted in order to make the rest of the elements under the
625super-root available to the network.  (In a prior specification I emplaced
626the cluster connections in the volume header's configuration space but I no
627longer do that).
628
629Connecting to the wider networked cluster involves setting up the /etc/hammer2
630directory with appropriate IP addresses and keys.  The user-mode hammer2
631service daemon maintains the connections and performs graph operations
632via libdmsg.
633
634Node types within the cluster:
635
636    DUMMY	- Used as a local placeholder (typically in ramdisk)
637    CACHE	- Used as a local placeholder and cache (typically on a SSD)
638    SLAVE	- A SLAVE in the cluster, can source data on quorum agreement.
639    MASTER	- A MASTER in the cluster, can source and sink data on quorum
640		  agreement.
641    SOFT_SLAVE	- A SLAVE in the cluster, can source data locally without
642		  quorum agreement (must be directly mounted).
643    SOFT_MASTER	- A local MASTER but *not* a MASTER in the cluster.  Can source
644		  and sink data locally without quorum agreement, intended to
645		  be synchronized with the real MASTERs when connectivity
646		  allows.  Operations are not coherent with the real MASTERS
647		  even when they are available.
648
649    NOTE: SNAPSHOT, AUTOSNAP, etc represent sub-types, typically under a
650	  SLAVE.  A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no longer
651	  synchronized against current masters.
652
653    NOTE: Any SLAVE or other copy can be turned into its own writable MASTER
654	  by giving it a unique cluster id, taking it out of the cluster that
655	  originally spawned it.
656
657There are four major protocols:
658
659    Quorum protocol
660
661	This protocol is used between MASTER nodes to vote on operations
662	and resolve deadlocks.
663
664	This protocol is used between SOFT_MASTER nodes in a sub-cluster
665	to vote on operations, resolve deadlocks, determine what the latest
666	transaction id for an element is, and to perform commits.
667
668    Cache sub-protocol
669
670	This is the MESI sub-protocol which runs under the Quorum
671	protocol.  This protocol is used to maintain cache state for
672	sub-trees to ensure that operations remain cache coherent.
673
674	Depending on administrative rights this protocol may or may
675	not allow a leaf node in the cluster to hold a cache element
676	indefinitely.  The administrative controller may preemptively
677	downgrade a leaf with insufficient administrative rights
678	without giving it a chance to synchronize any modified state
679	back to the cluster.
680
681    Proxy protocol
682
683	The Quorum and Cache protocols only operate between MASTER
684	and SOFT_MASTER nodes.  All other node types must use the
685	Proxy protocol to perform similar actions.  This protocol
686	differs in that proxy requests are typically sent to just
687	one adjacent node and that node then maintains state and
688	forwards the request or performs the required operation.
689	When the link is lost to the proxy, the proxy automatically
690	forwards a deletion of the state to the other nodes based on
691	what it has recorded.
692
693	If a leaf has insufficient administrative rights it may not
694	be allowed to actually initiate a quorum operation and may only
695	be allowed to maintain partial MESI cache state or perhaps none
696	at all (since cache state can block other machines in the
697	cluster).  Instead a leaf with insufficient rights will have to
698	make do with a preemptive loss of cache state and any allowed
699	modifying operations will have to be forwarded to the proxy which
700	continues forwarding it until a node with sufficient administrative
701	rights is encountered.
702
703	To reduce issues and give the cluster more breathing room, sub-clusters
704	made up of SOFT_MASTERs can be formed in order to provide full
705	cache coherency within a subset of machines and yet still tie them
706	into a greater cluster that they normally would not have such
707	access to.  This effectively makes it possible to create a two
708	or three-tier fan-out of groups of machines which are cache-coherent
709	within the group, but perhaps not between groups, and use other
710	means to synchronize between the groups.
711
712    Media protocol
713
714	This is basically the physical media protocol.
715
716		       MASTER & SLAVE SYNCHRONIZATION
717
718With HAMMER2 I really want to be hard-nosed about the consistency of the
719filesystem, including the consistency of SLAVEs (snapshots, etc).  In order
720to guarantee consistency we take advantage of the copy-on-write nature of
721the filesystem by forking consistent nodes and using the forked copy as the
722source for synchronization.
723
724Similarly, the target for synchronization is not updated on the fly but instead
725is also forked and the forked copy is updated.  When synchronization is
726complete, forked sources can be thrown away and forked copies can replace
727the original synchronization target.
728
729This may seem complex, but 'forking a copy' is actually a virtually free
730operation.  The top-level inode (under the super-root), on-media, is simply
731copied to a new inode and poof, we have an unchanging snapshot to work with.
732
733	- Making a snapshot is fast... almost instantaneous.
734
735	- Snapshots are used for various purposes, including synchronization
736	  of out-of-date nodes.
737
738	- A snapshot can be converted into a MASTER or some other PFS type.
739
740	- A snapshot can be forked off from its parent cluster entirely and
741	  turned into its own writable filesystem, either as a single MASTER
742	  or this can be done across the cluster by forking a quorum+ of
743	  existing MASTERs and transferring them all to a new cluster id.
744
745More complex is reintegrating the target once the synchronization is complete.
746For SLAVEs we just delete the old SLAVE and rename the copy to the same name.
747However, if the SLAVE is mounted and not optioned as a static mount (that is
748the mounter wants to see updates as they are synchronized), a reconciliation
749must occur on the live mount to clean up the vnode, inode, and chain caches
750and shift any remaining vnodes over to the updated copy.
751
752	- A mounted SLAVE can track updates made to the SLAVE but the
753	  actual mechanism is that the SLAVE PFS is replaced with an
754	  updated copy, typically every 30-60 seconds.
755
756Reintegrating a MASTER which has fallen out of the quorum due to being out
757of date is also somewhat more complex.  The same updating mechanic is used;
758we actually have to throw the 'old' MASTER away once the new one has been
759updated.  However if the cluster is undergoing heavy modifications the
760updated MASTER will be out of date almost the instant its source is
761snapshotted.  Reintegrating a MASTER thus requires a somewhat more complex
762interaction.
763
764	- If a MASTER is really out of date we can run one or more
765	  synchronization passes concurrent with modifying operations.
766	  The quorum can remain live.
767
768	- A final synchronization pass is required with quorum operations
769	  blocked to reintegrate the now up-to-date MASTER into the cluster.
770
771
772				QUORUM OPERATIONS
773
774Quorum operations can be broken down into HARD BLOCK operations and NETWORK
775operations.  If your MASTERs are all local mounts, then failures and
776sequencing are easy to deal with.
777
778Quorum operations on a networked cluster are more complex.  The problems:
779
780    - Masters cannot rely on clients to moderate quorum transactions.
781      Apart from the reliance being unsafe, the client could also
782      lose contact with one or more masters during the transaction and
783      leave one or more masters out-of-sync without the master(s) knowing
784      they are out of sync.
785
786    - When many clients are present, we do not want a flaky network
787      link from one of them to cause one or more masters to go out of
788      synchronization and potentially stall the whole works.
789
790    - Normal hammer2 mounts allow a virtually unlimited number of modifying
791      transactions between actual flushes.  The media flush rolls everything
792      up into a single transaction id per flush.  Detection of 'missing'
793      transactions in a concurrent multi-client setup when one or more clients
794      temporarily lose connectivity is thus difficult.
795
796    - Clients have a limited amount of time to reconnect to a cluster after
797      a network disconnect before their MESI cache states are lost.
798
799    - Clients may proceed with several transactions before knowing for sure
800      that earlier transactions were completely successful.  Performance is
801      important, we won't be waiting for a full quorum-verified synchronous
802      flush to media before allowing a system call to return.
803
804    - Masters can decide that a client's MESI cache states were lost (i.e.
805      that the transaction was too slow) as well.
806
807The solutions (for modifying transactions):
808
809    - Masters handle quorum confirmation amongst themselves and do not rely
810      on the client for that purpose.
811
812    - A client can connect to one or more masters regardless of the size of
813      the quorum and can submit modifying operations to a single master if
814      desired.  The master will take care of the rest.
815
816      A client must still validate the quorum (and obtain MESI cache states)
817      when doing read-only operations in order to present the correct data
818      to the user process for the VOP.
819
820    - Masters will run a 2-phase commit amongst themselves, often concurrent
821      with other non-conflicting transactions, and will serialize operations
822      and/or enforce synchronization points for 2-phase completion on
823      serialized transactions from the same client or when cache state
824      ownership is shifted from one client to another.
825
826    - Clients will usually allow operations to run asynchronously and return
827      from system calls more or less ASAP once they own the necessary cache
828      coherency locks.  The client can select the validation mode to wait for
829      with mount options:
830
831      (1) Fully async		(mount -o async)
832      (2) Wait for phase-1 ack	(mount)
833      (3) Wait for phase-2 ack	(mount -o sync)		(fsync - wait p2ack)
834      (4) Wait for flush	(mount -o sync)		(fsync - wait flush)
835
836      Modifying system calls cannot be told to wait for a full media
837      flush, as full media flushes are prohibitively expensive.  You
838      still have to fsync().
839
840      The fsync wait mode for network links can be selected, either to
841      return after the phase-2 ack or to return after the media flush.
842      The default is to wait for the phase-2 ack, which at least guarantees
843      that a network failure after that point will not disrupt operations
844      issued before the fsync.
845
846    - Clients must adjust the chain state for modifying operations prior to
847      releasing chain locks / returning from the system call, even if the
848      masters have not finished the transaction.  A late failure by the
849      cluster will result in desynchronized state which requires erroring
850      out the whole filesystem or resynchronizing somehow.
851
852    - Clients can opt to keep a record of transactions through the phase-2
853      ack or the actual media flush on the masters.
854
855      However, replaying/revalidating the log cannot necessarily guarantee
856      success.  If the masters lose synchronization due to network issues
857      between masters (or if the client was mounted fully-async), or if enough
858      masters crash simultaneously such that a quorum fails to flush even
859      after the phase-2 ack, then it is possible that by the time a client
860      is able to replay/revalidate, some other client has squeezed in and
861      committed something that would conflict.
862
863      If the client crashes it works similarly to a crash with a local storage
864      mount... many dirty buffers might be lost.  And the same happens in
865      the cluster case.
866
867				TRANSACTION LOG
868
869Keeping a short-term transaction log, much less being able to properly replay
870it, is fraught with difficulty and I've made it a separate development task.
871For now HAMMER2 does not have one.
872
873