			    HAMMER2 DESIGN DOCUMENT

				Matthew Dillon
			     dillon@backplane.com

			       03-Apr-2015 (v3)
			       14-May-2013 (v2)
			       08-Feb-2012 (v1)

			Current Status as of document date

* Filesystem Core	- operational
  - bulkfree		- operational
  - Compression		- operational
  - Snapshots		- operational
  - Deduper		- specced
  - Subhierarchy quotas - specced
  - Logical Encryption	- not specced yet
  - Copies		- not specced yet
  - fsync bypass	- not specced yet

* Clustering core
  - Network msg core	- operational
  - Network blk device	- operational
  - Error handling	- under development
  - Quorum Protocol	- under development
  - Synchronization	- under development
  - Transaction replay	- not specced yet
  - Cache coherency	- not specced yet

				    Feature List

* Block topology (both the main topology and the freemap) uses a copy-on-write
  design.  Media-level block frees are delayed and flushes rotate between
  4 volume headers (maxes out at 4 if the filesystem is > ~8GB).  Flushes
  will allocate new blocks up to the root in order to propagate block table
  changes and transaction ids.

* Incremental synchronization is queueless and trivial by design.

* Multiple roots, with many features.  This is implemented via the super-root
  concept.  When mounting a HAMMER2 filesystem you specify a device path and
  a directory name in the super-root.  (HAMMER1 had only one root).

* All cluster types and multiple PFSs (belonging to the same or different
  clusters) can be mixed on one physical filesystem.

  This allows independent cluster components to be configured within a
  single formatted H2 filesystem.  Each component is a super-root entry,
  a cluster identifier, and a unique identifier.  The network protocol
  integrates the component into the cluster when it is created.

* Roots are really no different from snapshots (HAMMER1 distinguished between
  its root mount and its PFS's.  HAMMER2 does not).

* I/O and chain locking thread separation.  I/O stalls and lock stalls can
  cause any filesystem which purports to operate over multiple physical and
  network devices to implode.  HAMMER2 incorporates a frontend/backend design
  which separates media operations into support threads and allows the
  frontend to validate the cluster, proceed with an operation, and disconnect
  any remaining running operation even when backend ops have not completed
  on all nodes.  This allows the frontend to return 'early' (so to speak).

* Early return on best data-path supported by virtue of the above.  In a
  multi-master system, frontend ops will issue I/O on all cluster elements
  concurrently and will return the instant incoming data validates the
  cluster.

* Snapshots are writable (in HAMMER1 snapshots were read-only).

* Snapshots are explicit but trivial to create.  In HAMMER1 snapshots were
  both explicit and fine-grained/automatic.  HAMMER2 does not implement
  automatic fine-grained snapshots.  H2 snapshots are cheap enough that you
  can create fine-grained snapshots if you desire.

* HAMMER2 formalizes a synchronization point for the flush, does a pre-flush
  that does not update the volume root, then waits for all running modifying
  operations to complete to memory (not to disk) while temporarily stalling
  new modifying operation initiations.  The final flush is then executed.

  At the moment we do not allow concurrent modifying operations during the
  final flush phase.  Ultimately I would like to, but doing so can be complex.

* HAMMER2 flushes and synchronization points do not bisect VOPs (system calls).
  (HAMMER1 flushes could wind up bisecting VOPs).  This means the H2 flushes
  leave the filesystem in a far more consistent state than H1 flushes did.

* Directory sub-hierarchy-based quotas for space and inode usage tracking.
  Any directory can be used.

* Low memory footprint.  Except for the volume header, the buffer cache
  is completely asynchronous and dirty buffers can be retired by the OS
  directly to backing store with no further interactions with the filesystem.

* Background synchronization and mirroring occurs at the logical level.
  When a failure occurs or a normal validation scan comes up with
  discrepancies, the synchronization thread will use the quorum to figure
  out which information is not correct and update accordingly.

* Support for multiple compression algorithms configured on a subdirectory
  tree basis and on a file basis.  Block compression up to 64KB will be used.
  Only compression ratios at powers of 2 that are at least 2:1 (e.g. 2:1,
  4:1, 8:1, etc) will work in this scheme because physical block allocations
  in HAMMER2 are always power-of-2.  Modest compression can be achieved with
  low overhead, is turned on by default, and is compatible with deduplication.

* Encryption.  Whole-disk encryption is supported by another layer, but I
  intend to give H2 an encryption feature at the logical layer which works
  approximately as follows:

  - Encryption controlled by the client on an inode/sub-tree basis.
  - Server has no visibility to decrypted data.
  - Encrypt filenames in directory entries.  Since the filename[] array
    is 256 bytes wide, client can add random bytes after the normal
    terminator to make it virtually impossible for an attacker to figure
    out the filename.
  - Encrypt file size and most inode contents.
  - Encrypt file data (holes are not encrypted).
  - Encryption occurs after compression, with random filler.
  - Check codes calculated after encryption & compression (not before).

  - Blockrefs are not encrypted.
  - Directory and File Topology is not encrypted.
  - Encryption is not sub-topology validation.  Client would have to keep
    track of that itself.  Server or other clients can still e.g. remove
    files, rename, etc.

  In particular, note that even though the file size field can be encrypted,
  the server does have visibility on the block topology and thus has a pretty
  good idea how big the file is.  However, a client could add junk blocks
  at the end of a file to make this less apparent, at the cost of space.

  If a client really wants a fully validated H2-encrypted space the easiest
  solution is to format a filesystem within an encrypted file by treating it
  as a block device, but I digress.

* Zero detection on write (writing all-zeros), which requires the data
  buffer to be scanned, is fully supported.  This allows the writing of 0's
  to create holes.

* Copies support for redundancy within a single physical filesystem.
  Up to 256 physical disks and/or partitions can be ganged to form a
  single physical filesystem.  If you use a disk or RAID aggregation
  layer then the actual number of physical disks that can be associated
  with a single H2 filesystem is unbounded.

  H2 puts an 8-bit copyid in the blockref structure to represent potentially
  multiple copies of a block.  The copyid corresponds to a configuration
  specification in the volume header.  The full algorithm has not been
  specced yet.

  Copies support is implemented by having multiple blockref entries for
  the same key, each with a different copyid.  The copyid represents which
  of the 256 slots is used.  Meta-data is also subject to the copies
  mechanism.  However, for both meta-data and data, each copy should be
  identical so the check fields in the blockref for all copies should wind
  up being the same, and any valid copy can be used by the block-level
  hammer2_chain code to access the filesystem.  File accesses will attempt
  to use the same copy.  If an I/O read error occurs, a different copy will
  be chosen.  Modifying operations must update all copies and/or create
  new copies as needed.  If a write error occurs on a copy and other copies
  are available, the errored target will be taken offline.

  It is possible to configure H2 to write out fewer copies on-write and then
  use a background scan to beef-up the number of copies to improve real-time
  throughput.

* MESI Cache coherency for multi-master/multi-client clustering operations.
  The servers hosting the MASTERs are also responsible for keeping track of
  the cache state.

* Hardlinks and softlinks are supported.  Hardlinks are somewhat complex to
  deal with and there is still an edge case.  I am trying to avoid storing
  the hardlinks at the root level because that messes up my concept for
  sub-tree quotas and is unnecessarily burdensome in terms of SMP collisions
  under heavy loads.

* The media blockref structure is now large enough to support up to a 192-bit
  check value, which would typically be a cryptographic hash of some sort.
  Multiple check value algorithms will be supported with the default being
  a simple 32-bit iSCSI CRC.

* Fully verified deduplication will be supported and automatic (and
  necessary in many respects).

* Unverified de-duplication will be supported as a configurable option on a
  file or subdirectory tree.  Unverified deduplication must use the largest
  available check code (192 bits).  It will not verify that data content with
  the same check code is actually identical during the dedup pass, resulting
  in approximately 100x to 1000x the deduplication performance but at the cost
  of potentially corrupting some data.

  The Unverified dedup feature is intended only for those files where
  occasional corruption is ok, such as in a web-crawler data store or
  other situations where the data content is not critically important
  or can be externally recovered if it becomes corrupt.

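The power-of-2 constraint in the compression item above can be made concrete
with a small sketch.  The helper name physical_block_size() and the 1KB
minimum allocation are assumptions for illustration, not HAMMER2 code: the
compressed size is rounded up to the next power of 2, and the block is left
at its logical size when that rounding saves nothing.

```python
def physical_block_size(logical_size, compressed_size, min_alloc=1024):
    """Return the power-of-2 allocation for a compressed block.

    Hypothetical helper: rounds the compressed size up to the next power
    of 2 and falls back to the logical size when nothing is gained.
    """
    alloc = min_alloc
    while alloc < compressed_size:
        alloc *= 2
    # A result that does not fit in half the logical block saves no space.
    return alloc if alloc < logical_size else logical_size


for compressed in (40000, 30000, 9000):
    size = physical_block_size(65536, compressed)
    print(f"{compressed:6d} bytes compressed -> {size:6d} bytes allocated")
```

With a 64KB logical block, a 40000-byte result is stored at the full 64KB,
while 30000 bytes fits a 32KB allocation (2:1) and 9000 bytes fits 16KB (4:1).
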
				GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the filesystem,
which is very different from HAMMER1's B-Tree design.  Because the design
is copy-on-write it can be trivially snapshotted simply by referencing an
existing block, and because the media structures logically match a standard
filesystem directory/file hierarchy snapshots and other similar operations
can be trivially performed on an entire subdirectory tree at any level in
the filesystem.

The copy-on-write design implements a block table in a radix-tree format,
with a small 8x fan-out in the volume header and inode and a large 256x or
1024x fan-out for indirect blocks.  The table is built bottom-up.
Intermediate radii are only created when necessary so small files will use
much shallower radix block trees.  The inode itself can accommodate files
up to 512KB (65536x8).  Directories also use a radix block table and directory
inodes can accommodate up to 8 entries before pushing an indirect radix block.

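The reach of the radix tree can be estimated with a little arithmetic.  This
sketch assumes 64KB data blocks, the 8 blockrefs embedded in the inode, and a
uniform 256x indirect fan-out (the text also allows 1024x).  indirect_levels()
is a hypothetical helper, and because the real tree is sparse and built
bottom-up the actual depth depends on which key ranges are populated.

```python
BLOCK_SIZE = 65536      # data block size used in the text (64KB)
INODE_FANOUT = 8        # blockrefs embedded directly in the inode


def indirect_levels(file_size, indirect_fanout=256):
    """Count the indirect-block levels needed below the inode."""
    capacity = INODE_FANOUT * BLOCK_SIZE    # 512KB with no indirection
    levels = 0
    while capacity < file_size:
        capacity *= indirect_fanout         # each level multiplies the reach
        levels += 1
    return levels


for size in (512 * 1024, 512 * 1024 + 1, 1 << 30):
    print(size, "bytes ->", indirect_levels(size), "indirect level(s)")
```
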
The copy-on-write nature of the filesystem implies that any modification
whatsoever will have to eventually synchronize new disk blocks all the way
to the super-root of the filesystem and the volume header itself.  This forms
the basis for crash recovery and also ensures that recovery occurs on a
completed high-level transaction boundary.  All disk writes are to new blocks
except for the volume header (which cycles through 4 copies), thus allowing
all writes to run asynchronously and concurrently prior to and during a flush,
and then just doing a final synchronization and volume header update at the
end.  Many of HAMMER2's features are enabled by this core design feature.

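A minimal model of this update path, assuming an in-memory tree of blocks that
are never modified after creation.  Block, cow_write() and read() are
hypothetical names.  The point is only that a write allocates new blocks from
the modified leaf up to the root, while the old root stays valid and untouched
sub-trees are shared between the two versions.

```python
class Block:
    """A block that is never modified once built: data plus child refs."""

    def __init__(self, data=None, children=None):
        self.data = data
        self.children = dict(children or {})


def cow_write(root, path, data):
    """Return a new root with `data` stored at `path`; `root` is untouched."""
    if not path:
        return Block(data, root.children if root else None)
    key, rest = path[0], path[1:]
    old_child = root.children.get(key) if root else None
    new_children = dict(root.children) if root else {}
    new_children[key] = cow_write(old_child, rest, data)
    return Block(root.data if root else None, new_children)


def read(root, path):
    """Follow `path` down from `root` and return the data found there."""
    node = root
    for key in path:
        node = node.children[key]
    return node.data


v1 = cow_write(None, ["docs", "a.txt"], "old contents")
v1 = cow_write(v1, ["src", "main.c"], "int main;")
snapshot = v1                                   # a snapshot is just a root ref
v2 = cow_write(v1, ["docs", "a.txt"], "new contents")

print(read(snapshot, ["docs", "a.txt"]))        # the snapshot is unchanged
print(read(v2, ["docs", "a.txt"]))              # the live root sees the update
print(v2.children["src"] is v1.children["src"]) # unmodified sub-tree is shared
```
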
Clearly this method requires intermediate modifications to the chain to be
cached so multiple modifications can be aggregated prior to being
synchronized.  One advantage, however, is that the normal buffer cache can
be used and intermediate elements can be retired to disk by H2 or the OS
at any time.  This means that HAMMER2 has very low resource overhead from the
point of view of the operating system.  Unlike HAMMER1, which had to lock
dirty buffers in memory for long periods of time, HAMMER2 has no such
requirement.

Buffer cache overhead is very well bounded and can handle filesystem
operations of any complexity, even on boxes with very small amounts
of physical memory.  Buffer cache overhead is significantly lower with H2
than with H1 (and orders of magnitude lower than ZFS).

At some point I intend to implement a shortcut to make fsync()'s run fast,
and that is to allow deep updates to blockrefs to shortcut to auxiliary
space in the volume header to satisfy the fsync requirement.  The related
blockref is then recorded when the filesystem is mounted after a crash and
the update chain is reconstituted when a matching blockref is encountered
again during normal operation of the filesystem.

			MIRROR_TID, MODIFY_TID, UPDATE_TID

In HAMMER2, the core block reference is a 128-byte structure called a blockref.
The blockref contains various bits of information including the 64-bit radix
key (typically a directory hash if a directory entry, inode number if a
hidden hardlink target, or file offset if a file block), 64-bit data offset
with the physical block size radix encoded in it (physical block size can be
different from logical block size due to compression), three 64-bit
transaction ids, type information, and up to 512 bits worth of check data
for the block being referenced, which can be anything from a simple CRC to
a strong cryptographic hash.

mirror_tid - This is a media-centric (as in physical disk partition)
	     transaction id which tracks media-level updates.  The mirror_tid
	     can be different at the same point on different nodes in a
	     cluster.

	     Whenever any block in the media topology is modified, its
	     mirror_tid is updated with the flush id and will propagate
	     upward during the flush all the way to the volume header.

	     mirror_tid is monotonic.  It is primarily used for on-mount
	     recovery and volume root validation.  The name is historical,
	     from H1; it is not used for nominal mirroring.

modify_tid - This is a cluster-centric (as in across all the nodes used
	     to build a cluster) transaction id which tracks filesystem-level
	     updates.

	     modify_tid is updated when the front-end of the filesystem makes
	     a change to an inode or data block.  It does NOT propagate upward
	     during a flush.

update_tid - This is a cluster synchronization transaction id.  Modifications
	     made to the topology will clear this field to 0 as they propagate
	     up to the root.  This gives the synchronizer an easy way to
	     determine what needs revalidation.

	     The synchronizer revalidates the cluster bottom-up by validating
	     a sub-topology and propagating the highest modify_tid in the
	     validated sub-topology up via the update_tid field.

	     Update to this field may be optimized by the HAMMER2 VFS to
	     avoid the double-transition.

The synchronization code updates an out-of-sync node bottom-up and will
dynamically set update_tid as it goes, but media flushes can occur at any
time and these flushes will use mirror_tid for flush and freemap management.
The mirror_tid for each flush propagates upward to the volume header on each
flush.  modify_tid is set for any chains modified by a cluster op but does
not propagate up, instead serving as a seed for update_tid.

* The synchronization code is able to determine that a sub-tree is
  synchronized simply by observing the update_tid at the root of the sub-tree,
  on an inode-by-inode basis and also on a data-block-by-data-block basis.

* The synchronization code is able to do an incremental update of an
  out-of-sync node simply by skipping elements with a matching update_tid
  (when not 0).

* The synchronization code can be interrupted and restarted at any time,
  and is able to pick up where it left off with very little overhead.

* The synchronization code does not inhibit media flushes.  Media flushes
  can occur (and must occur) while synchronization is ongoing.

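The interplay of modify_tid and update_tid described above can be sketched
with a toy in-memory tree.  Node, modify() and revalidate() are hypothetical
names and this is only a model of the rules as stated in the text, not the
synchronizer itself: a modification clears update_tid up to the root, and a
bottom-up revalidation skips any sub-tree whose update_tid is still non-zero.

```python
class Node:
    """One element of the topology (an inode or a data block)."""

    def __init__(self, name, tid, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        self.modify_tid = tid       # set by the frontend, never propagated
        self.update_tid = 0         # 0 means "needs revalidation"
        for child in self.children:
            child.parent = self


def modify(node, tid):
    """Frontend change: stamp modify_tid, clear update_tid up to the root."""
    node.modify_tid = tid
    while node is not None:
        node.update_tid = 0
        node = node.parent


def revalidate(node, visited):
    """Bottom-up revalidation; returns the update_tid for `node`."""
    if node.update_tid != 0:
        return node.update_tid      # sub-tree already validated, skip it
    visited.append(node.name)
    tid = node.modify_tid
    for child in node.children:
        tid = max(tid, revalidate(child, visited))
    node.update_tid = tid           # highest modify_tid seen below here
    return tid


file_a, file_b, file_c = Node("file_a", 1), Node("file_b", 1), Node("file_c", 1)
root = Node("root", 1, [Node("dir1", 1, [file_a, file_b]),
                        Node("dir2", 1, [file_c])])

first = []
revalidate(root, first)     # the initial pass has to visit everything
modify(file_b, 2)           # one change dirties only the path to the root
second = []
revalidate(root, second)    # the next pass revisits just that path
print(first)
print(second)
```
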
There are several other stored transaction ids in HAMMER2.  There is a
separate freemap_tid in the volume header that is used to allow freemap
flushes to be deferred, and inodes have an attr_tid and a dirent_tid which
track attribute changes and (for directories) create/rename/delete changes.
The inode TIDs are used as an aid for the cache coherency subsystem.

Remember that since this is a copy-on-write filesystem, we can propagate
a considerable amount of information up the tree to the volume header
without adding to the I/O we already have to do.

			    DIRECTORIES AND INODES

Directories are hashed, and another major design element is that directory
entries ARE inodes.  They are one and the same, with a special placemarker
for hardlinks.  Inodes are 1KB.

Hardlinks are implemented with placemarkers as directory entries which simply
represent the inode number.  The actual file resides in a parent directory
that is common to all hardlinks to that file.  If the hardlinks are all within
a single directory, the actual hardlink inode is in that directory.  The
hardlink target, as we call it, is a hidden directory entry in a common parent
whose key is basically just the inode number itself, so lookups are fast.

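As a rough illustration of where a hardlink target could live, the sketch
below picks the deepest directory that is common to every link.  Choosing the
deepest such directory is an assumption (the text only requires a common
parent), and hardlink_target_dir() is a hypothetical helper that works on path
strings rather than on HAMMER2 structures.

```python
import posixpath


def hardlink_target_dir(link_paths):
    """Deepest directory containing (directly or not) every link."""
    return posixpath.commonpath([posixpath.dirname(p) for p in link_paths])


# All links in one directory: the target stays in that directory.
print(hardlink_target_dir(["/home/a/notes.txt", "/home/a/notes-link.txt"]))
# Links in different directories: the target moves up to a common parent.
print(hardlink_target_dir(["/home/a/notes.txt", "/home/b/sub/notes.txt"]))
```
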
Half of the inode structure (512 bytes) is used to hold top-level blockrefs
to the radix block tree representing the file contents.  Files which are
less than or equal to 512 bytes in size will simply store the file contents
in this area instead of a blockref array.  So files <= 512 bytes take only
1KB of space inclusive of the inode.

Inode numbers are not spatially referenced, which complicates NFS servers
but doesn't complicate anything else.  The inode number is stored in the
inode itself, an absolute necessity required to properly support HAMMER2's
hugely flexible snapshots.  I would like to support NFS services but it
would require (probably) a lookaside index in the root for inode lookups
and might not happen quickly.

				    RECOVERY

H2 allows freemap flushes to lag behind topology flushes.  The freemap flush
tracks a separate transaction id (via mirror_tid) in the volume header.

On mount, HAMMER2 will first locate the highest-sequenced check-code-validated
volume header from the 4 copies available (if the filesystem is big enough,
e.g. > ~10GB or so, there will be 4 copies of the volume header).

HAMMER2 will then run an incremental scan of the topology for mirror_tid
transaction ids between the last freemap flush tid and the last topology
flush tid in order to synchronize the freemap.  Because this scan is
incremental the time it takes to run will be relatively short and well-bounded
at mount-time.  This is NOT fsck.  Freemap flushes can be avoided for any
number of normal topology flushes but should still occur frequently enough
to avoid long recovery times in case of a crash.

The filesystem is then ready for use.

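The two mount-time steps above can be sketched as follows.  Everything here is
a toy model under stated assumptions: the header layout (8-byte sequence,
payload, 4-byte check code) is invented, zlib.crc32 stands in for the real
check code, and the scan assumes that a block whose mirror_tid is not newer
than the last freemap flush has nothing newer beneath it, which follows from
mirror_tid propagating upward on every flush.

```python
import zlib


def make_header(sequence, payload):
    """Build a toy header: 8-byte sequence, payload, 4-byte check code."""
    body = sequence.to_bytes(8, "little") + payload
    return body + zlib.crc32(body).to_bytes(4, "little")


def pick_volume_header(copies):
    """Return (sequence, payload) of the newest copy that validates."""
    best = None
    for raw in copies:
        body, stored = raw[:-4], int.from_bytes(raw[-4:], "little")
        if zlib.crc32(body) != stored:
            continue                    # torn or corrupt copy, ignore it
        sequence = int.from_bytes(body[:8], "little")
        if best is None or sequence > best[0]:
            best = (sequence, body[8:])
    return best


def blocks_to_rescan(block, freemap_tid, found):
    """Collect blocks flushed after the last freemap flush."""
    if block["mirror_tid"] <= freemap_tid:
        return found                    # nothing newer below this point
    found.append(block["name"])
    for child in block.get("children", ()):
        blocks_to_rescan(child, freemap_tid, found)
    return found


good = [make_header(seq, b"root") for seq in (40, 41, 39)]
torn = bytearray(make_header(42, b"root"))
torn[-1] ^= 0xFF                        # simulate a torn write of the newest copy
best = pick_volume_header(good + [bytes(torn)])

topology = {"name": "root", "mirror_tid": 41, "children": [
    {"name": "A", "mirror_tid": 30},
    {"name": "B", "mirror_tid": 41, "children": [
        {"name": "B1", "mirror_tid": 41},
        {"name": "B2", "mirror_tid": 35},
    ]},
]}
rescan = blocks_to_rescan(topology, 38, [])
print(best[0], rescan)
```
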
			    DISK I/O OPTIMIZATIONS

The freemap implements a 1KB allocation resolution.  Each 2MB segment managed
by the freemap is zoned and has a tendency to collect inodes, small data,
indirect blocks, and larger data blocks into separate segments.  The idea is
to greatly improve I/O performance (particularly by laying inodes down next
to each other which has a huge effect on directory scans).

The current implementation of HAMMER2 implements a fixed block size of 64KB
in order to allow the mapping of hammer2_dio's in its IO subsystem to
consumers that might desire different sizes.  This way we don't have to
worry about matching the buffer cache / DIO cache to the variable block
size of underlying elements.

The main purpose of the fixed 64KB I/O size is not actually to help nominal
front-end access but instead to reduce the complexity when blocks are freed
and reused for another purpose.  HAMMER1 had to have specialized code to
check for and invalidate buffer cache buffers in the free/reuse case.
HAMMER2 does not need such code.

That said, HAMMER2 places no major restrictions on mixing block sizes within
a 64KB block.  The only restriction is that a HAMMER2 block cannot cross
a 64KB boundary.  The soft restrictions the block allocator puts in place
exist primarily for performance reasons (i.e. try to collect 1K inodes
together).  The 2MB freemap zone granularity should work very well in this
regard.

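The one hard restriction above is easy to state in code.  This is a sketch,
assuming power-of-2 block sizes of at least 1KB and offsets aligned to the
1KB allocation resolution; crosses_io_boundary() and is_valid_allocation()
are hypothetical helpers, not allocator functions.

```python
IO_BLOCK = 65536        # fixed I/O block size described above
MIN_ALLOC = 1024        # freemap allocation resolution


def crosses_io_boundary(offset, size):
    """True if [offset, offset + size) spans two 64KB I/O blocks."""
    return offset // IO_BLOCK != (offset + size - 1) // IO_BLOCK


def is_valid_allocation(offset, size):
    """Check size, alignment, and the 64KB boundary rule."""
    power_of_two = size >= MIN_ALLOC and (size & (size - 1)) == 0
    aligned = offset % MIN_ALLOC == 0
    return power_of_two and aligned and not crosses_io_boundary(offset, size)


print(is_valid_allocation(63 * 1024, 1024))   # last 1KB of the first 64KB block
print(is_valid_allocation(63 * 1024, 2048))   # would spill into the next block
```
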
HAMMER2 also allows OS support for ganging buffers together into even
larger blocks for I/O (OS buffer cache 'clustering'), OS-supported read-ahead,
OS-driven asynchronous retirement, and other performance features typically
provided by the OS at the block-level to ensure smooth system operation.

By avoiding wiring buffers/memory and allowing these features to run normally,
HAMMER2 winds up with very low OS overhead.

				FREEMAP NOTES

The freemap is stored in the reserved blocks situated in the ~4MB reserved
area at the base of every ~1GB level-1 zone.  The current implementation
reserves 8 copies of every freemap block and cycles through them in order
to make the freemap operate in a copy-on-write fashion.

    - Freemap is copy-on-write.
    - Freemap operations are transactional, same as everything else.
    - All backup volume headers are consistent on-mount.

The Freemap is organized using the same radix blockmap algorithm used for
files and directories, but with fixed radix values.  For a maximally-sized
filesystem the Freemap will wind up being a 5-level-deep radix blockmap,
but the top-level is embedded in the volume header so insofar as performance
goes it is really just a 4-level blockmap.

The freemap radix allocation mechanism is also the same, meaning that it is
bottom-up and will not allocate unnecessary intermediate levels for smaller
filesystems.  The number of blockmap levels not including the volume header
for various filesystem sizes is as follows:

	up-to		# of freemap levels
	1GB		1-level
	256GB		2-level
	64TB		3-level
	16PB		4-level
	4EB		5-level
	16EB		6-level

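The table follows a simple pattern: each additional level covers 256 times
more space than the last, capped at the 16EB limit of a 64-bit byte offset.
The sketch below reproduces it.  The 256x step is inferred from the table
itself, and freemap_levels() is a hypothetical helper.

```python
LEVEL1_SPAN = 1 << 30   # one level-1 freemap zone covers ~1GB
FANOUT = 256            # inferred from the table: each level is 256x the last
MAX_SIZE = 1 << 64      # 16EB, the limit of a 64-bit byte offset


def freemap_levels(fs_size):
    """Levels needed (not counting the volume header) for fs_size bytes."""
    levels, span = 1, LEVEL1_SPAN
    while span < min(fs_size, MAX_SIZE):
        span *= FANOUT
        levels += 1
    return levels


for label, size in (("1GB", 1 << 30), ("256GB", 1 << 38), ("64TB", 1 << 46),
                    ("16PB", 1 << 54), ("4EB", 1 << 62), ("16EB", 1 << 64)):
    print(f"{label:>6}  {freemap_levels(size)}-level")
```
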
The Freemap has bitmap granularity down to 16KB and a linear iterator that
can linearly allocate space down to 1KB.  Due to fragmentation it is possible
for the linear allocator to become marginalized, but it is relatively easy
to force a reallocation of small blocks every once in a while (like once a
year if you care at all) and once the old data cycles out of the snapshots, or
you also rewrite the snapshots (which you can do), the freemap should wind up
relatively optimal again.  Generally speaking I believe that algorithms can
be developed to make this a non-problem without requiring any media structure
changes.

In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations.  All the freemap does is
keep track of 100% free blocks plus some extra bits for staging the bulkfree
scan.  The lack of ref-counting makes it possible to:

    - Completely trivialize HAMMER2's snapshot operations.
    - Use any volume header backup trivially.
    - Destroy whole sub-trees without having to scan them.
    - Simplify normal crash recovery operations.
    - Simplify catastrophic recovery operations.

Normal crash recovery is simply a matter of doing an incremental scan
of the topology between the last flushed freemap TID and the last flushed
topology TID.  This usually takes only a few seconds and allows:

    - Freemap flushes to be deferred for any number of topology flush
      cycles.
    - The freemap to go unflushed for fsync, reducing fsync overhead.

				FREEMAP - BULKFREE

Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan.
Blocks are first marked as being possibly free and then finalized in the
second scan.  Live filesystem operations are allowed to run during these
scans and any freemap block that is allocated or adjusted after the first
scan will simply be re-marked as allocated and the second scan will not
transition it to being free.

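A small state-machine sketch of the two-stage scan, assuming each block is in
one of three states and that the set of referenced blocks is handed in by the
meta-data scan.  The Freemap class and its method names are hypothetical and
do not mirror the on-media bitmap encoding.

```python
ALLOCATED, POSSIBLY_FREE, FREE = "allocated", "possibly-free", "free"


class Freemap:
    """Toy freemap tracking one state per block number."""

    def __init__(self, blocks):
        self.state = {block: ALLOCATED for block in blocks}

    def allocate(self, block):
        """Live allocation or adjustment: re-marks the block as allocated."""
        self.state[block] = ALLOCATED

    def scan_mark(self, referenced):
        """Stage 1: unreferenced blocks become possibly free."""
        for block, state in self.state.items():
            if state == ALLOCATED and block not in referenced:
                self.state[block] = POSSIBLY_FREE

    def scan_finalize(self, referenced):
        """Stage 2: only blocks still possibly free are actually freed."""
        for block, state in self.state.items():
            if state == POSSIBLY_FREE and block not in referenced:
                self.state[block] = FREE


fm = Freemap(blocks=[1, 2, 3, 4])
fm.scan_mark(referenced={1, 2})         # blocks 3 and 4 look unused
fm.allocate(3)                          # a live op reuses block 3 between scans
fm.scan_finalize(referenced={1, 2, 3})
print(fm.state)
```

Block 3 survives because the live allocation re-marked it before the second
scan; only block 4 ends up free.
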
The cost of not doing ref-count tracking is that HAMMER2 must perform two
bulkfree scans of the meta-data to determine which blocks can actually be
freed.  This can be complicated by the volume header backups and snapshots
which cause the same meta-data topology to be scanned over and over again,
but mitigated somewhat by keeping a cache of higher-level nodes to detect
when we would scan a sub-topology that we have already scanned.  Due to the
copy-on-write nature of the filesystem, such detection is easy to implement.

Part of the ongoing design work is finding ways to reduce the scope of this
meta-data scan so the entire filesystem's meta-data does not need to be
scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low cost).  In other words, it's an area where
improvements can be made without any media format changes.

Another advantage of operating the freemap like this is that some future
version of HAMMER2 might decide to completely change how the freemap works
and would be able to make the change with relatively low downtime.

				  CLUSTERING

Clustering, as always, is the most difficult bit but we have some advantages
with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
structures generally follow the kernel's filesystem hierarchy which allows
cluster operations to use topology cache and lock state.  Second,
HAMMER2's writable snapshots make it possible to implement several forms
of multi-master clustering.

The mount device path you specify serves to bootstrap your entry into
the cluster.  This is typically local media.  It can even be a ram-disk
that only contains placemarkers that help HAMMER2 connect to a fully
networked cluster.

With HAMMER2 you mount a directory entry under the super-root.  This entry
will contain a cluster identifier that helps HAMMER2 identify and integrate
with the nodes making up the cluster.  HAMMER2 will automatically integrate
*all* entries under the super-root when you mount one of them.  You have to
mount at least one for HAMMER2 to integrate the block device in the larger
cluster.

For cluster servers every HAMMER2-formatted partition has a "LOCAL" MASTER
which can be mounted in order to make the rest of the elements under the
super-root available to the network.  (In a prior specification I emplaced
the cluster connections in the volume header's configuration space but I no
longer do that).

Connecting to the wider networked cluster involves setting up the /etc/hammer2
directory with appropriate IP addresses and keys.  The user-mode hammer2
service daemon maintains the connections and performs graph operations
via libdmsg.

Node types within the cluster:

    DUMMY	- Used as a local placeholder (typically in ramdisk)
    CACHE	- Used as a local placeholder and cache (typically on a SSD)
    SLAVE	- A SLAVE in the cluster, can source data on quorum agreement.
    MASTER	- A MASTER in the cluster, can source and sink data on quorum
		  agreement.
    SOFT_SLAVE	- A SLAVE in the cluster, can source data locally without
		  quorum agreement (must be directly mounted).
    SOFT_MASTER	- A local MASTER but *not* a MASTER in the cluster.  Can source
		  and sink data locally without quorum agreement, intended to
		  be synchronized with the real MASTERs when connectivity
		  allows.  Operations are not coherent with the real MASTERs
		  even when they are available.

    NOTE: SNAPSHOT, AUTOSNAP, etc represent sub-types, typically under a
	  SLAVE.  A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no longer
	  synchronized against current masters.

    NOTE: Any SLAVE or other copy can be turned into its own writable MASTER
	  by giving it a unique cluster id, taking it out of the cluster that
	  originally spawned it.

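"Quorum agreement" is not defined precisely in this document and the quorum
protocol is still under development, so the following is only a hypothetical
sketch.  It assumes a quorum is a strict majority of the MASTER nodes
reporting the same transaction id.  quorum_tid() is an invented name; it can
be re-evaluated as replies arrive, which is what would let a frontend return
before every node has answered.

```python
from collections import Counter


def quorum_tid(responses, total_masters):
    """Return the tid a strict majority of MASTERs agree on, else None.

    `responses` holds the transaction id reported by each MASTER that
    has answered so far (assumed model, not the real protocol).
    """
    needed = total_masters // 2 + 1
    for tid, votes in Counter(responses).most_common():
        if votes >= needed:
            return tid
    return None


print(quorum_tid([17], total_masters=3))        # one reply is not enough
print(quorum_tid([17, 17], total_masters=3))    # two of three agree
print(quorum_tid([17, 16], total_masters=3))    # replies disagree, keep waiting
```
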
There are four major protocols:

    Quorum protocol

	This protocol is used between MASTER nodes to vote on operations
	and resolve deadlocks.

	This protocol is used between SOFT_MASTER nodes in a sub-cluster
	to vote on operations, resolve deadlocks, determine what the latest
	transaction id for an element is, and to perform commits.

    Cache sub-protocol

	This is the MESI sub-protocol which runs under the Quorum
	protocol.  This protocol is used to maintain cache state for
	sub-trees to ensure that operations remain cache coherent.

	Depending on administrative rights this protocol may or may
	not allow a leaf node in the cluster to hold a cache element
	indefinitely.  The administrative controller may preemptively
	downgrade a leaf with insufficient administrative rights
	without giving it a chance to synchronize any modified state
	back to the cluster.

    Proxy protocol

	The Quorum and Cache protocols only operate between MASTER
	and SOFT_MASTER nodes.  All other node types must use the
	Proxy protocol to perform similar actions.  This protocol
	differs in that proxy requests are typically sent to just
	one adjacent node and that node then maintains state and
	forwards the request or performs the required operation.
	When the link is lost to the proxy, the proxy automatically
	forwards a deletion of the state to the other nodes based on
	what it has recorded.

	If a leaf has insufficient administrative rights it may not
	be allowed to actually initiate a quorum operation and may only
	be allowed to maintain partial MESI cache state or perhaps none
	at all (since cache state can block other machines in the
	cluster).  Instead a leaf with insufficient rights will have to
	make do with a preemptive loss of cache state and any allowed
	modifying operations will have to be forwarded to the proxy which
	continues forwarding it until a node with sufficient administrative
	rights is encountered.

	To reduce issues and give the cluster more breadth, sub-clusters
	made up of SOFT_MASTERs can be formed in order to provide full
	cache coherency within a subset of machines and yet still tie them
	into a greater cluster that they normally would not have such
	access to.  This effectively makes it possible to create a two
	or three-tier fan-out of groups of machines which are cache-coherent
	within the group, but perhaps not between groups, and use other
	means to synchronize between the groups.

    Media protocol

	This is basically the physical media protocol.

604		       MASTER & SLAVE SYNCHRONIZATION
605
606With HAMMER2 I really want to be hard-nosed about the consistency of the
607filesystem, including the consistency of SLAVEs (snapshots, etc).  In order
608to guarantee consistency we take advantage of the copy-on-write nature of
609the filesystem by forking consistent nodes and using the forked copy as the
610source for synchronization.
611
612Similarly, the target for synchronization is not updated on the fly but instead
613is also forked and the forked copy is updated.  When synchronization is
614complete, forked sources can be thrown away and forked copies can replace
615the original synchronization target.
616
617This may seem complex, but 'forking a copy' is actually a virtually free
618operation.  The top-level inode (under the super-root), on-media, is simply
619copied to a new inode and poof, we have an unchanging snapshot to work with.
620
621	- Making a snapshot is fast... almost instantaneous.
622
623	- Snapshots are used for various purposes, including synchronization
624	  of out-of-date nodes.
625
626	- A snapshot can be converted into a MASTER or some other PFS type.
627
628	- A snapshot can be forked off from its parent cluster entirely and
629	  turned into its own writable filesystem, either as a single MASTER
630	  or this can be done across the cluster by forking a quorum+ of
631	  existing MASTERs and transferring them all to a new cluster id.
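
To make the 'virtually free' claim concrete, here is a minimal sketch of
forking by copying the top-level inode.  It is a toy model under stated
assumptions, not the on-media format: struct pfs_inode, struct super_root,
the root_ref field (standing in for the block reference at the top of the
PFS) and all of the function names are hypothetical.  The fork copies one
small inode and no data.  Because the topology is copy-on-write, a later
flush of the live PFS repoints only the live inode and the fork keeps
seeing the old tree.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SROOT_MAX	16	/* hypothetical limits */
#define PFS_NAME_LEN	32

/* Stand-in for a top-level PFS inode under the super-root. */
struct pfs_inode {
	char	 name[PFS_NAME_LEN];
	uint64_t root_ref;	/* reference to the PFS's top-level block */
	uint64_t modify_tid;	/* transaction id of the last flush */
};

struct super_root {
	struct pfs_inode pfs[SROOT_MAX];
	int		 count;
};

struct pfs_inode *
sroot_lookup(struct super_root *sr, const char *name)
{
	int i;

	for (i = 0; i < sr->count; ++i) {
		if (strcmp(sr->pfs[i].name, name) == 0)
			return &sr->pfs[i];
	}
	return NULL;
}

/* Add a PFS entry.  Returns NULL if the name is too long or taken,
 * or if the super-root is full. */
struct pfs_inode *
sroot_create(struct super_root *sr, const char *name,
	     uint64_t root_ref, uint64_t tid)
{
	struct pfs_inode *ip;

	assert(sr != NULL && name != NULL);
	if (sr->count == SROOT_MAX || strlen(name) >= PFS_NAME_LEN ||
	    sroot_lookup(sr, name) != NULL)
		return NULL;
	ip = &sr->pfs[sr->count++];
	strcpy(ip->name, name);		/* length checked above */
	ip->root_ref = root_ref;
	ip->modify_tid = tid;
	return ip;
}

/* Fork: copy the top-level inode under a new name; no data is copied. */
struct pfs_inode *
sroot_fork(struct super_root *sr, const char *src, const char *dst)
{
	struct pfs_inode *ip = sroot_lookup(sr, src);

	if (ip == NULL)
		return NULL;
	return sroot_create(sr, dst, ip->root_ref, ip->modify_tid);
}

/* Copy-on-write flush of a live PFS: only its own inode is repointed. */
void
pfs_flush(struct pfs_inode *ip, uint64_t new_root_ref, uint64_t tid)
{
	ip->root_ref = new_root_ref;
	ip->modify_tid = tid;
}
```

Forking "ROOT" to "ROOT.sync" and then flushing "ROOT" leaves the fork's
root_ref and modify_tid untouched, which is exactly the unchanging
snapshot the synchronizer needs as its source.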
632
633More complex is reintegrating the target once the synchronization is complete.
634For SLAVEs we just delete the old SLAVE and rename the copy to the same name.
635However, if the SLAVE is mounted and not optioned as a static mount (that is
636the mounter wants to see updates as they are synchronized), a reconciliation
637must occur on the live mount to clean up the vnode, inode, and chain caches
638and shift any remaining vnodes over to the updated copy.
639
640	- A mounted SLAVE can track updates made to the SLAVE but the
641	  actual mechanism is that the SLAVE PFS is replaced with an
642	  updated copy, typically every 30-60 seconds.
643
644Reintegrating a MASTER which has fallen out of the quorum due to being out
645of date is also somewhat more complex.  The same updating mechanic is used;
646we actually have to throw the 'old' MASTER away once the new one has been
647updated.  However if the cluster is undergoing heavy modifications the
648updated MASTER will be out of date almost the instant its source is
649snapshotted.  Reintegrating a MASTER thus requires a somewhat more complex
650interaction.
651
652	- If a MASTER is really out of date we can run one or more
653	  synchronization passes concurrent with modifying operations.
654	  The quorum can remain live.
655
656	- A final synchronization pass is required with quorum operations
657	  blocked to reintegrate the now up-to-date MASTER into the cluster.
658
659
660				QUORUM OPERATIONS
661
662Quorum operations can be broken down into HARD BLOCK operations and NETWORK
663operations.  If your MASTERs are all local mounts, then failures and
664sequencing are easy to deal with.
665
666Quorum operations on a networked cluster are more complex.  The problems:
667
668    - Masters cannot rely on clients to moderate quorum transactions.
669      Apart from the reliance being unsafe, the client could also
670      lose contact with one or more masters during the transaction and
671      leave one or more masters out-of-sync without the master(s) knowing
672      they are out of sync.
673
674    - When many clients are present, we do not want a flaky network
675      link from one to cause one or more masters to go out of
676      synchronization and potentially stall the whole works.
677
678    - Normal hammer2 mounts allow a virtually unlimited number of modifying
679      transactions between actual flushes.  The media flush rolls everything
680      up into a single transaction id per flush.  Detection of 'missing'
681      transactions in a concurrent multi-client setup when one or more clients
682      temporarily lose connectivity is thus difficult.
683
684    - Clients have a limited amount of time to reconnect to a cluster after
685      a network disconnect before their MESI cache states are lost.
686
687    - Clients may proceed with several transactions before knowing for sure
688      that earlier transactions were completely successful.  Performance is
689      important; we won't be waiting for a full quorum-verified synchronous
690      flush to media before allowing a system call to return.
691
692    - Masters can decide that a client's MESI cache states were lost (i.e.
693      that the transaction was too slow) as well.
694
695The solutions (for modifying transactions):
696
697    - Masters handle quorum confirmation amongst themselves and do not rely
698      on the client for that purpose.
699
700    - A client can connect to one or more masters regardless of the size of
701      the quorum and can submit modifying operations to a single master if
702      desired.  The master will take care of the rest.
703
704      A client must still validate the quorum (and obtain MESI cache states)
705      when doing read-only operations in order to present the correct data
706      to the user process for the VOP.
707
708    - Masters will run a 2-phase commit amongst themselves, often concurrent
709      with other non-conflicting transactions, and will serialize operations
710      and/or enforce synchronization points for 2-phase completion on
711      serialized transactions from the same client or when cache state
712      ownership is shifted from one client to another.
713
714    - Clients will usually allow operations to run asynchronously and return
715      from system calls more or less ASAP once they own the necessary cache
716      coherency locks.  The client can select the validation mode to wait for
717      with mount options:
718
719      (1) Fully async		(mount -o async)
720      (2) Wait for phase-1 ack	(mount)
721      (3) Wait for phase-2 ack	(mount -o sync)		(fsync - wait p2ack)
722      (4) Wait for flush	(mount -o sync)		(fsync - wait flush)
723
724      Modifying system calls cannot be told to wait for a full media
725      flush, as full media flushes are prohibitively expensive.  You
726      still have to fsync().
727
728      The fsync wait mode for network links can be selected, either to
729      return after the phase-2 ack or to return after the media flush.
730      The default is to wait for the phase-2 ack, which at least guarantees
731      that a network failure after that point will not disrupt operations
732      issued before the fsync.
733
734    - Clients must adjust the chain state for modifying operations prior to
735      releasing chain locks / returning from the system call, even if the
736      masters have not finished the transaction.  A late failure by the
737      cluster will result in desynchronized state which requires erroring
738      out the whole filesystem or resynchronizing somehow.
739
740    - Clients can opt to keep a record of transactions through the phase-2
741      ack or the actual media flush on the masters.
742
743      However, replaying/revalidating the log cannot necessarily guarantee
744      success.  If the masters lose synchronization due to network issues
745      between masters (or if the client was mounted fully-async), or if enough
746	      masters crash simultaneously such that a quorum fails to flush even
747      after the phase-2 ack, then it is possible that by the time a client
748	      is able to replay/revalidate, some other client has squeezed in and
749      committed something that would conflict.
750
751      If the client crashes it works similarly to a crash with a local storage
752      mount... many dirty buffers might be lost.  And the same happens in
753      the cluster case.
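
The four validation modes listed above can be sketched as a predicate
over per-transaction progress.  Everything here is hypothetical: the enum
and struct names are invented for the example, and the sketch assumes a
quorum is a strict majority of the masters.  It shows only when a waiter
(a modifying system call, or fsync() for the last mode) would be allowed
to return, not how the masters run the 2-phase commit.

```c
#include <assert.h>

/* Validation modes from the list above, weakest first. */
enum wait_mode {
	WAIT_ASYNC,	/* (1) fully async */
	WAIT_P1ACK,	/* (2) wait for phase-1 ack */
	WAIT_P2ACK,	/* (3) wait for phase-2 ack */
	WAIT_FLUSH	/* (4) wait for media flush (fsync only) */
};

/* Per-transaction progress as tracked among the masters. */
struct txn {
	int nmasters;	/* masters in the cluster */
	int p1_acks;	/* masters that acknowledged phase 1 */
	int p2_acks;	/* masters that acknowledged phase 2 */
	int flushed;	/* masters that flushed it to media */
};

/* Assumption: a quorum is a strict majority of the masters. */
int
quorum_reached(int acks, int nmasters)
{
	assert(nmasters > 0 && acks >= 0 && acks <= nmasters);
	return acks > nmasters / 2;
}

/* May the waiter (system call or fsync) return under this mode? */
int
txn_wait_done(const struct txn *t, enum wait_mode mode)
{
	switch (mode) {
	case WAIT_ASYNC:
		return 1;
	case WAIT_P1ACK:
		return quorum_reached(t->p1_acks, t->nmasters);
	case WAIT_P2ACK:
		return quorum_reached(t->p2_acks, t->nmasters);
	case WAIT_FLUSH:
		return quorum_reached(t->flushed, t->nmasters);
	}
	return 0;
}
```

With three masters, two phase-1 acks and one phase-2 ack, a default mount
would return from the system call while a waiter in phase-2 mode would
still be waiting.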
754
755				TRANSACTION LOG
756
757Keeping a short-term transaction log, much less being able to properly replay
758it, is fraught with difficulty and I've made it a separate development task.
759
760
761