1 2 HAMMER2 DESIGN DOCUMENT 3 4 Matthew Dillon 5 dillon@backplane.com 6 7 03-Apr-2015 (v3) 8 14-May-2013 (v2) 9 08-Feb-2012 (v1) 10 11 Current Status as of document date 12 13* Filesystem Core - operational 14 - bulkfree - operational 15 - Compression - operational 16 - Snapshots - operational 17 - Deduper - specced 18 - Subhierarchy quotas - specced 19 - Logical Encryption - not specced yet 20 - Copies - not specced yet 21 - fsync bypass - not specced yet 22 23* Clustering core 24 - Network msg core - operational 25 - Network blk device - operational 26 - Error handling - under development 27 - Quorum Protocol - under development 28 - Synchronization - under development 29 - Transaction replay - not specced yet 30 - Cache coherency - not specced yet 31 32 Feature List 33 34* Block topology (both the main topology and the freemap) use a copy-on-write 35 design. Media-level block frees are delayed and flushes rotate between 36 4 volume headers (maxes out at 4 if the filesystem is > ~8GB). Flushes 37 will allocate new blocks up to the root in order to propagate block table 38 changes and transaction ids. 39 40* Incremental update scans are trivial by design. 41 42* Multiple roots, with many features. This is implemented via the super-root 43 concept. When mounting a HAMMER2 filesystem you specify a device path and 44 a directory name in the super-root. (HAMMER1 had only one root). 45 46* All cluster types and multiple PFSs (belonging to the same or different 47 clusters) can be mixed on one physical filesystem. 48 49 This allows independent cluster components to be configured within a 50 single formatted H2 filesystem. Each component is a super-root entry, 51 a cluster identifier, and a unique identifier. The network protocl 52 integrates the component into the cluster when it is created 53 54* Roots are really no different from snapshots (HAMMER1 distinguished between 55 its root mount and its PFS's. HAMMER2 does not). 56 57* Snapshots are writable (in HAMMER1 snapshots were read-only). 58 59* Snapshots are explicit but trivial to create. In HAMMER1 snapshots were 60 both explicit and fine-grained/automatic. HAMMER2 does not implement 61 automatic fine-grained snapshots. H2 snapshots are cheap enough that you 62 can create fine-grained snapshots if you desire. 63 64* HAMMER2 formalizes a synchronization point for the flush, does a pre-flush 65 that does not update the volume root, then waits for all running modifying 66 operations to complete to memory (not to disk) while temporarily stalling 67 new modifying operation initiations. The final flush is then executed. 68 69 At the moment we do not allow concurrent modifying operations during the 70 final flush phase. Ultimately I would like to, but doing so can be complex. 71 72* HAMMER2 flushes and synchronization points do not bisect VOPs (system calls). 73 (HAMMER1 flushes could wind up bisecting VOPs). This means the H2 flushes 74 leave the filesystem in a far more consistent state than H1 flushes did. 75 76* Directory sub-hierarchy-based quotas for space and inode usage tracking. 77 Any directory can be used. 78 79* Low memory footprint. Except for the volume header, the buffer cache 80 is completely asynchronous and dirty buffers can be retired by the OS 81 directly to backing store with no further interactions with the filesystem. 82 83* Background synchronization and mirroring occurs at the logical level. 84 When a failure occurs or a normal validation scan comes up with 85 discrepancies, the synchronization thread will use the quorum to figure 86 out which information is not correct and update accordingly. 87 88* Support for multiple compression algorithms configured on subdirectory 89 tree basis and on a file basis. Block compression up to 64KB will be used. 90 Only compression ratios at powers of 2 that are at least 2:1 (e.g. 2:1, 91 4:1, 8:1, etc) will work in this scheme because physical block allocations 92 in HAMMER2 are always power-of-2. Modest compression can be achieved with 93 low overhead, is turned on by default, and is compatible with deduplication. 94 95* Encryption. Whole-disk encryption is supported by another layer, but I 96 intend to give H2 an encryption feature at the logical layer which works 97 approximately as follows: 98 99 - Encryption controlled by the client on an inode/sub-tree basis. 100 - Server has no visibility to decrypted data. 101 - Encrypt filenames in directory entries. Since the filename[] array 102 is 256 bytes wide, client can add random bytes after the normal 103 terminator to make it virtually impossible for an attacker to figure 104 out the filename. 105 - Encrypt file size and most inode contents. 106 - Encrypt file data (holes are not encrypted). 107 - Encryption occurs after compression, with random filler. 108 - Check codes calculated after encryption & compression (not before). 109 110 - Blockrefs are not encrypted. 111 - Directory and File Topology is not encrypted. 112 - Encryption is not sub-topology validation. Client would have to keep 113 track of that itself. Server or other clients can still e.g. remove 114 files, rename, etc. 115 116 In particular, note that even though the file size field can be encrypted, 117 the server does have visibility on the block topology and thus has a pretty 118 good idea how big the file is. However, a client could add junk blocks 119 at the end of a file to make this less apparent, at the cost of space. 120 121 If a client really wants a fully validated H2-encrypted space the easiest 122 solution is to format a filesystem within an encrypted file by treating it 123 as a block device, but I digress. 124 125* Zero detection on write (writing all-zeros), which requires the data 126 buffer to be scanned, is fully supported. This allows the writing of 0's 127 to create holes. 128 129* Copies support for redundancy within a single physical filesystem. 130 Up to 256 physical disks and/or partitions can be ganged to form a 131 single physical filesystem. If you use a disk or RAID aggregation 132 layer then the actual number of physical disks that can be associated 133 with a single H2 filesystem is unbounded. 134 135 H2 puts an 8-bit copyid in the blockref structure to represent potentially 136 multiple copies of a block. The copyid corresponds to a configuration 137 specification in the volume header. The full algorithm has not been 138 specced yet. 139 140 Copies support is implemented by having multiple blockref entries for 141 the same key, each with a different copyid. The copyid represents which 142 of the 256 slots is used. Meta-data is also subject to the copies 143 mechanism. However, for both meta-data and data, each copy should be 144 identical so the check fields in the blockref for all copies should wind 145 up being the same, and any valid copy can be used by the block-level 146 hammer2_chain code to access the filesystem. File accesses will attempt 147 to use the same copy. If an I/O read error occurs, a different copy will 148 be chosen. Modifying operations must update all copies and/or create 149 new copies as needed. If a write error occurs on a copy and other copies 150 are available, the errored target will be taken offline. 151 152 It is possible to configure H2 to write out fewer copies on-write and then 153 use a background scan to beef-up the number of copies to improve real-time 154 throughput. 155 156* MESI Cache coherency for multi-master/multi-client clustering operations. 157 The servers hosting the MASTERs are also responsible for keeping track of 158 the cache state. 159 160* Hardlinks and softlinks are supported. Hardlinks are somewhat complex to 161 deal with and there is still an edge case. I am trying to avoid storing 162 the hardlinks at the root level because that messes up my concept for 163 sub-tree quotas and is unnecessarily burdensome in terms of SMP collisions 164 under heavy loads. 165 166* The media blockref structure is now large enough to support up to a 192-bit 167 check value, which would typically be a cryptographic hash of some sort. 168 Multiple check value algorithms will be supported with the default being 169 a simple 32-bit iSCSI CRC. 170 171* Fully verified deduplication will be supported and automatic (and 172 necessary in many respects). 173 174* Unverified de-duplication will be supported as a configurable option on a 175 file or subdirectory tree. Unverified deduplication must use the largest 176 available check code (192 bits). It will not verify that data content with 177 the same check code is actually identical during the dedup pass, resulting 178 in approximately 100x to 1000x the deduplication performance but at the cost 179 of potentially corrupting some data. 180 181 The Unverified dedup feature is intended only for those files where 182 occassional corruption is ok, such as in a web-crawler data store or 183 other situations where the data content is not critically important 184 or can be externally recovered if it becomes corrupt. 185 186 GENERAL DESIGN 187 188HAMMER2 generally implements a copy-on-write block design for the filesystem, 189which is very different from HAMMER1's B-Tree design. Because the design 190is copy-on-write it can be trivially snapshotted simply by referencing an 191existing block, and because the media structures logically match a standard 192filesystem directory/file hierarchy snapshots and other similar operations 193can be trivially performed on an entire subdirectory tree at any level in 194the filesystem. 195 196The copy-on-write design implements a block table in a radix-tree format, 197with a small 8x fan-out in the volume header and inode and a large 256x or 1981024x fan-out for indirect blocks. The table is built bottom-up. 199Intermediate radii are only created when necessary so small files will use 200much shallower radix block trees. The inode itself can accomodate files 201up 512KB (65536x8). Directories also use a radix block table and directory 202inodes can accomodate up to 8 entries before pushing an indirect radix block. 203 204The copy-on-write nature of the filesystem implies that any modification 205whatsoever will have to eventually synchronize new disk blocks all the way 206to the super-root of the filesystem and the volume header itself. This forms 207the basis for crash recovery and also ensures that recovery occurs on a 208completed high-level transaction boundary. All disk writes are to new blocks 209except for the volume header (which cycles through 4 copies), thus allowing 210all writes to run asynchronously and concurrently prior to and during a flush, 211and then just doing a final synchronization and volume header update at the 212end. Many of HAMMER2s features are enabled by this core design feature. 213 214Clearly this method requires intermediate modifications to the chain to be 215cached so multiple modifications can be aggregated prior to being 216synchronized. One advantage, however, is that the normal buffer cache can 217be used and intermediate elements can be retired to disk by H2 or the OS 218at any time. This means that HAMMER2 has very low resource overhead from the 219point of view of the operating system. Unlike HAMMER1 which had to lock 220dirty buffers in memory for long periods of time, HAMMER2 has no such 221requirement. 222 223Buffer cache overhead is very well bounded and can handle filesystem 224operations of any complexity, even on boxes with very small amounts 225of physical memory. Buffer cache overhead is significantly lower with H2 226than with H1 (and orders of magnitude lower than ZFS). 227 228At some point I intend to implement a shortcut to make fsync()'s run fast, 229and that is to allow deep updates to blockrefs to shortcut to auxillary 230space in the volume header to satisfy the fsync requirement. The related 231blockref is then recorded when the filesystem is mounted after a crash and 232the update chain is reconstituted when a matching blockref is encountered 233again during normal operation of the filesystem. 234 235 MIRROR_TID, MODIFY_TID 236 237In HAMMER2, the core block reference is 64-byte structure called a blockref. 238The blockref contains various bits of information including the 64-bit radix 239key (typically a directory hash if a directory entry, inode number if a 240hidden hardlink target, or file offset if a file block), 64-bit data offset 241with the physical block size radix encoded in it (physical block size can be 242different from logical block size due to compression), two 64-bit transaction 243ids, type information, and 192 bits worth of check data for the block being 244reference which can be a simple CRC or stronger HASH. 245 246Both mirror_tid and modify_tid propagate upward from the change point all the 247way to the root, but serve different purposes and work in slightly different 248ways. 249 250mirror_tid - This is a media-centric (as in physical disk partition) 251 transaction id which tracks media-level updates. 252 253 Whenever any block in the media topology is modified, its 254 mirror_tid is updated with the flush id and will propagate 255 upward during the flush all the way to the volume header. 256 257 mirror_tid is monotonic. 258 259modify_tid - This is a cluster-centric (as in across all the nodes used 260 to build a cluster) transaction id which tracks filesystem-level 261 updates. 262 263 modify_tid is updated when the front-end of the filesystem makes 264 a change to an inode or data block. It will also propagate 265 upward, stopping at the root of the PFS (the mount point for 266 the cluster). 267 268The major difference between mirror_tid and modify_tid is that for any given 269element in the topology residing on different nodes. e.g. file "x" on node 1 270and file "x" on node 2, if the files are synchronized with each other they 271will have the same modify_tid on a block-by-block basis, and a single check 272of the inode's modify_tid is sufficient to determine that the files are fully 273synchronized and identical. These same inodes and representitive blocks will 274have very different mirror_tids because the nodes will reside on different 275physical media. 276 277I noted above that modify_tids also propagate upward, but not in all cases. 278A node which is undergoing SYNCHRONIZATION only updates the modify_tid of 279a block when it has determined that the block and its entire sub-block 280hierarchy has been synchronized to that point. 281 282The synchronization code updates an out-of-sync node bottom-up and will 283definitely set modify_tid as it goes, but media flushes can occur at any 284time and these flushes will use mirror_tid for flush and freemap management. 285The mirror_tid for each flush propagates upward to the volume header on each 286flush. 287 288* The synchronization code is able to determine that a sub-tree is 289 synchronized simply by observing the modify_tid at the root of the sub-tree, 290 on a directory-by-directory basis. 291 292* The synchronization code is able to do an incremental update of an 293 out-of-sync node simply by skipping elements with matching modify_tids. 294 295* The synchronization code can be interrupted and restarted at any time, 296 and is able to pick up where it left off with very little overhead. 297 298* The synchronization code does not inhibit media flushes. Media flushes 299 can occur (and must occur) while synchronization is ongoing. 300 301There are several other stored transaction ids in HAMMER2. There is a 302separate freemap_tid in the volume header that is used to allow freemap 303flushes to be deferred, and inodes have an attr_tid and a dirent_tid which 304tracks attribute changes and (for directories) create/rename/delete changes. 305The inode TIDs are used as an aid for the cache coherency subsystem. 306 307Remember that since this is a copy-on-write filesystem, we can propagate 308a considerable amount of information up the tree to the volume header 309without adding to the I/O we already have to do. 310 311 DIRECTORIES AND INODES 312 313Directories are hashed, and another major design element is that directory 314entries ARE inodes. They are one and the same, with a special placemarker 315for hardlinks. Inodes are 1KB. 316 317Hardlinks are implemented with placemarkers as directory entries which simply 318represent the inode number. The actual file resides in a parent directory 319that is common to all hardlinks to that file. If the hardlinks are all within 320a single directory, the actual hardlink inode is in that directory. The 321hardlink target, as we call it, is a hidden directory entry in a common parent 322whos key is basically just the inode number itself, so lookups are fast. 323 324Half of the inode structure (512 bytes) is used to hold top-level blockrefs 325to the radix block tree representing the file contents. Files which are 326less than or equal to 512 bytes in size will simply store the file contents 327in this area instead of a blockref array. So files <= 512 bytes take only 3281KB of space inclusive of the inode. 329 330Inode numbers are not spatially referenced, which complicates NFS servers 331but doesn't complicate anything else. The inode number is stored in the 332inode itself, an absolute necessity required to properly support HAMMER2s 333hugely flexible snapshots. I would like to support NFS services but it 334would require (probably) a lookaside index in the root for inode lookups 335and might not happen quickly. 336 337 RECOVERY 338 339H2 allows freemap flushes to lag behind topology flushes. The freemap flush 340tracks a separate transaction id (via mirror_tid) in the volume header. 341 342On mount, HAMMER2 will first locate the highest-sequenced check-code-validated 343volume header from the 4 copies available (if the filesystem is big enough, 344e.g. > ~10GB or so, there will be 4 copies of the volume header). 345 346HAMMER2 will then run an incremental scan of the topology for mirror_tid 347transaction ids between the last freemap flush tid and the last topology 348flush tid in order to synchronize the freemap. Because this scan is 349incremental the time it takes to run will be relatively short and well-bounded 350at mount-time. This is NOT fsck. Freemap flushes can be avoided for any 351number of normal topology flushes but should still occur frequently enough 352to avoid long recovery times in case of a crash. 353 354The filesystem is then ready for use. 355 356 DISK I/O OPTIMIZATIONS 357 358The freemap implements a 1KB allocation resolution. Each 2MB segment managed 359by the freemap is zoned and has a tendancy to collect inodes, small data, 360indirect blocks, and larger data blocks into separate segments. The idea is 361to greatly improve I/O performance (particularly by laying inodes down next 362to each other which has a huge effect on directory scans). 363 364The current implementation of HAMMER2 implements a fixed block size of 64KB 365in order to allow the mapping of hammer2_dio's in its IO subsystem to 366conumers that might desire different sizes. This way we don't have to 367worry about matching the buffer cache / DIO cache to the variable block 368size of underlying elements. 369 370The biggest issue we are avoiding by having a fixed 64KB I/O size is not 371actually to help nominal front-end access issue but instead to reduce the 372complexity when blocks are freed and reused for another purpose. HAMMER1 373had to have specialized code to check for and invalidate buffer cache buffers 374in the free/reuse case. HAMMER2 does not need such code. 375 376That said, HAMMER2 places no major restrictions on mixing block sizes within 377a 64KB block. The only restriction is that a HAMMER2 block cannot cross 378a 64KB boundary. The soft restrictions the block allocator puts in place 379exist primarily for performance reasons (i.e. try to collect 1K inodes 380together). The 2MB freemap zone granularity should work very well in this 381regard. 382 383HAMMER2 also allows OS support for ganging buffers together into even 384larger blocks for I/O (OS buffer cache 'clustering'), OS-supported read-ahead, 385OS-driven asynchronous retirement, and other performance features typically 386provided by the OS at the block-level to ensure smooth system operation. 387 388By avoiding wiring buffers/memory and allowing these features to run normally, 389HAMMER2 winds up with very low OS overhead. 390 391 FREEMAP NOTES 392 393The freemap is stored in the reserved blocks situated in the ~4MB reserved 394area at the baes of every ~1GB level-1 zone. The current implementation 395reserves 8 copies of every freemap block and cycles through them in order 396to make the freemap operate in a copy-on-write fashion. 397 398 - Freemap is copy-on-write. 399 - Freemap operations are transactional, same as everything else. 400 - All backup volume headers are consistent on-mount. 401 402The Freemap is organized using the same radix blockmap algorithm used for 403files and directories, but with fixed radix values. For a maximally-sized 404filesystem the Freemap will wind up being a 5-level-deep radix blockmap, 405but the top-level is embedded in the volume header so insofar as performance 406goes it is really just a 4-level blockmap. 407 408The freemap radix allocation mechanism is also the same, meaning that it is 409bottom-up and will not allocate unnecessary intermediate levels for smaller 410filesystems. The number of blockmap levels not including the volume header 411for various filesystem sizes is as follows: 412 413 up-to #of freemap levels 414 1GB 1-level 415 256GB 2-level 416 64TB 3-level 417 16PB 4-level 418 4EB 5-level 419 16EB 6-level 420 421The Freemap has bitmap granularity down to 16KB and a linear iterator that 422can linearly allocate space down to 1KB. Due to fragmentation it is possible 423for the linear allocator to become marginalized, but it is relatively easy 424to for a reallocation of small blocks every once in a while (like once a year 425if you care at all) and once the old data cycles out of the snapshots, or you 426also rewrite the snapshots (which you can do), the freemap should wind up 427relatively optimal again. Generally speaking I believe that algorithms can 428be developed to make this a non-problem without requiring any media structure 429changes. 430 431In order to implement fast snapshots (and writable snapshots for that 432matter), HAMMER2 does NOT ref-count allocations. All the freemap does is 433keep track of 100% free blocks plus some extra bits for staging the bulkfree 434scan. The lack of ref-counting makes it possible to: 435 436 - Completely trivialize HAMMER2s snapshot operations. 437 - Allows any volume header backup to be used trivially. 438 - Allows whole sub-trees to be destroyed without having to scan them. 439 - Simplifies normal crash recovery operations. 440 - Simplifies catastrophic recovery operations. 441 442Normal crash recovery is simply a matter of doing an incremental scan 443of the topology between the last flushed freemap TID and the last flushed 444topology TID. This usually takes only a few seconds and allows: 445 446 - Freemap flushes to be be deferred for any number of topology flush 447 cycles. 448 - Does not have to be flushed for fsync, reducing fsync overhead. 449 450 FREEMAP - BULKFREE 451 452Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan. 453Blocks are first marked as being possibly free and then finalized in the 454second scan. Live filesystem operations are allowed to run during these 455scans and any freemap block that is allocated or adjusted after the first 456scan will simply be re-marked as allocated and the second scan will not 457transition it to being free. 458 459The cost of not doing ref-count tracking is that HAMMER2 must perform two 460bulkfree scans of the meta-data to determine which blocks can actually be 461freed. This can be complicated by the volume header backups and snapshots 462which cause the same meta-data topology to be scanned over and over again, 463but mitigated somewhat by keeping a cache of higher-level nodes to detect 464when we would scan a sub-topology that we have already scanned. Due to the 465copy-on-write nature of the filesystem, such detection is easy to implement. 466 467Part of the ongoing design work is finding ways to reduce the scope of this 468meta-data scan so the entire filesystem's meta-data does not need to be 469scanned (though in tests with HAMMER1, even full meta-data scans have 470turned out to be fairly low cost). In other words, its an area where 471improvements can be made without any media format changes. 472 473Another advantage of operating the freemap like this is that some future 474version of HAMMER2 might decide to completely change how the freemap works 475and would be able to make the change with relatively low downtime. 476 477 CLUSTERING 478 479Clustering, as always, is the most difficult bit but we have some advantages 480with HAMMER2 that we did not have with HAMMER1. First, HAMMER2's media 481structures generally follow the kernel's filesystem hiearchy which allows 482cluster operations to use topology cache and lock state. Second, 483HAMMER2's writable snapshots make it possible to implement several forms 484of multi-master clustering. 485 486The mount device path you specify serves to bootstrap your entry into 487the cluster. This is typically local media. It can even be a ram-disk 488that only contains placemarkers that help HAMMER2 connect to a fully 489networked cluster. 490 491With HAMMER2 you mount a directory entry under the super-root. This entry 492will contain a cluster identifier that helps HAMMER2 identify and integrate 493with the nodes making up the cluster. HAMMER2 will automatically integrate 494*all* entries under the super-root when you mount one of them. You have to 495mount at least one for HAMMER2 to integrate the block device in the larger 496cluster. This mount will typically be a SOFT_MASTER, DUMMY, SLAVE, or CACHE 497mount that simply serves to cause hammer to integrate the rest of the 498represented cluster. ALL CLUSTER ELEMENTS ARE TREATED ACCORDING TO TYPE 499NO MATTER WHICH ONE YOU MOUNT. 500 501For cluster servers every HAMMER2-formatted partition has a "LOCAL" MASTER 502which can be mounted in order to make the rest of the elements under the 503super-root available to the network. (In a prior specification I emplaced 504the cluster connections in the volume header's configuration space but I no 505longer do that). 506 507Connecting to the wider networked cluster involves setting up the /etc/hammer2 508directory with appropriate IP addresses and keys. The user-mode hammer2 509service daemon maintains the connections and performs graph operations 510via libdmsg. 511 512Node types within the cluster: 513 514 DUMMY - Used as a local placeholder (typically in ramdisk) 515 CACHE - Used as a local placeholder and cache (typically on a SSD) 516 SLAVE - A SLAVE in the cluster, can source data on quorum agreement. 517 MASTER - A MASTER in the cluster, can source and sink data on quorum 518 agreement. 519 SOFT_SLAVE - A SLAVE in the cluster, can source data locally without 520 quorum agreement (must be directly mounted). 521 SOFT_MASTER - A local MASTER but *not* a MASTER in the cluster. Can source 522 and sink data locally without quorum agreement, intended to 523 be synchronized with the real MASTERs when connectivity 524 allows. Operations are not coherent with the real MASTERS 525 even when they are available. 526 527 NOTE: SNAPSHOT, AUTOSNAP, etc represent sub-types, typically under a 528 SLAVE. A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no longer 529 synchronized against current masters. 530 531 NOTE: Any SLAVE or other copy can be turned into its own writable MASTER 532 by giving it a unique cluster id, taking it out of the cluster that 533 originally spawned it. 534 535There are four major protocols: 536 537 Quorum protocol 538 539 This protocol is used between MASTER nodes to vote on operations 540 and resolve deadlocks. 541 542 This protocol is used between SOFT_MASTER nodes in a sub-cluster 543 to vote on operations, resolve deadlocks, determine what the latest 544 transaction id for an element is, and to perform commits. 545 546 Cache sub-protocol 547 548 This is the MESI sub-protocol which runs under the Quorum 549 protocol. This protocol is used to maintain cache state for 550 sub-trees to ensure that operations remain cache coherent. 551 552 Depending on administrative rights this protocol may or may 553 not allow a leaf node in the cluster to hold a cache element 554 indefinitely. The administrative controller may preemptively 555 downgrade a leaf with insufficient administrative rights 556 without giving it a chance to synchronize any modified state 557 back to the cluster. 558 559 Proxy protocol 560 561 The Quorum and Cache protocols only operate between MASTER 562 and SOFT_MASTER nodes. All other node types must use the 563 Proxy protocol to perform similar actions. This protocol 564 differs in that proxy requests are typically sent to just 565 one adjacent node and that node then maintains state and 566 forwards the request or performs the required operation. 567 When the link is lost to the proxy, the proxy automatically 568 forwards a deletion of the state to the other nodes based on 569 what it has recorded. 570 571 If a leaf has insufficient administrative rights it may not 572 be allowed to actually initiate a quorum operation and may only 573 be allowed to maintain partial MESI cache state or perhaps none 574 at all (since cache state can block other machines in the 575 cluster). Instead a leaf with insufficient rights will have to 576 make due with a preemptive loss of cache state and any allowed 577 modifying operations will have to be forwarded to the proxy which 578 continues forwarding it until a node with sufficient administrative 579 rights is encountered. 580 581 To reduce issues and give the cluster more breath, sub-clusters 582 made up of SOFT_MASTERs can be formed in order to provide full 583 cache coherent within a subset of machines and yet still tie them 584 into a greater cluster that they normally would not have such 585 access to. This effectively makes it possible to create a two 586 or three-tier fan-out of groups of machines which are cache-coherent 587 within the group, but perhaps not between groups, and use other 588 means to synchronize between the groups. 589 590 Media protocol 591 592 This is basically the physical media protocol. 593 594 MASTER & SLAVE SYNCHRONIZATION 595 596With HAMMER2 I really want to be hard-nosed about the consistency of the 597filesystem, including the consistency of SLAVEs (snapshots, etc). In order 598to guarantee consistency we take advantage of the copy-on-write nature of 599the filesystem by forking consistent nodes and using the forked copy as the 600source for synchronization. 601 602Similarly, the target for synchronization is not updated on the fly but instead 603is also forked and the forked copy is updated. When synchronization is 604complete, forked sources can be thrown away and forked copies can replace 605the original synchronization target. 606 607This may seem complex, but 'forking a copy' is actually a virtually free 608operation. The top-level inode (under the super-root), on-media, is simply 609copied to a new inode and poof, we have an unchanging snapshot to work with. 610 611 - Making a snapshot is fast... almost instantanious. 612 613 - Snapshots are used for various purposes, including synchronization 614 of out-of-date nodes. 615 616 - A snapshot can be converted into a MASTER or some other PFS type. 617 618 - A snapshot can be forked off from its parent cluster entirely and 619 turned into its own writable filesystem, either as a single MASTER 620 or this can be done across the cluster by forking a quorum+ of 621 existing MASTERs and transfering them all to a new cluster id. 622 623More complex is reintegrating the target once the synchronization is complete. 624For SLAVEs we just delete the old SLAVE and rename the copy to the same name. 625However, if the SLAVE is mounted and not optioned as a static mount (that is 626the mounter wants to see updates as they are synchronized), a reconciliation 627must occur on the live mount to clean up the vnode, inode, and chain caches 628and shift any remaining vnodes over to the updated copy. 629 630 - A mounted SLAVE can track updates made to the SLAVE but the 631 actual mechanism is that the SLAVE PFS is replaced with an 632 updated copy, typically every 30-60 seconds. 633 634Reintegrating a MASTER which has fallen out of the quorum due to being out 635of date is also somewhat more complex. The same updating mechanic is used, 636we actually have to throw the 'old' MASTER away once the new one has been 637updated. However if the cluster is undergoing heavy modifications the 638updated MASTER will be out of date almost the instant its source is 639snapshotted. Reintegrating a MASTER thus requires a somewhat more complex 640interaction. 641 642 - If a MASTER is really out of date we can run one or more 643 synchronization passes concurrent with modifying operations. 644 The quorum can remain live. 645 646 - A final synchronization pass is required with quorum operations 647 blocked to reintegrate the now up-to-date MASTER into the cluster. 648 649 650 QUORUM OPERATIONS 651 652Quorum operations can be broken down into HARD BLOCK operations and NETWORK 653operations. If your MASTERs are all local mounts, then failures and 654sequencing is easy to deal with. 655 656Quorum operations on a networked cluster are more complex. The problems: 657 658 - Masters cannot rely on clients to moderate quorum transactions. 659 Apart from the reliance being unsafe, the client could also 660 lose contact with one or more masters during the transaction and 661 leave one or more masters out-of-sync without the master(s) knowing 662 they are out of sync. 663 664 - When many clients are present, we do not want a flakey network 665 link from one to cause one or more masters to go out of 666 synchronization and potentially stall the whole works. 667 668 - Normal hammer2 mounts allow a virtually unlimited number of modifying 669 transactions between actual flushes. The media flush rolls everything 670 up into a single transaction id per flush. Detection of 'missing' 671 transactions in a concurrent multi-client setup when one or more client 672 temporarily loses connectivity is thus difficult. 673 674 - Clients have a limited amount of time to reconnect to a cluster after 675 a network disconnect before their MESI cache states are lost. 676 677 - Clients may proceed with several transactions before knowing for sure 678 that earlier transactions were completely successful. Performance is 679 important, we won't be waiting for a full quorum-verified synchronous 680 flush to media before allowing a system call to return. 681 682 - Masters can decide that a client's MESI cache states were lost (i.e. 683 that the transaction was too slow) as well. 684 685The solutions (for modifying transactions): 686 687 - Masters handle quorum confirmation amongst themselves and do not rely 688 on the client for that purpose. 689 690 - A client can connect to one or more masters regardless of the size of 691 the quorum and can submit modifying operations to a single master if 692 desired. The master will take care of the rest. 693 694 A client must still validate the quorum (and obtain MESI cache states) 695 when doing read-only operations in order to present the correct data 696 to the user process for the VOP. 697 698 - Masters will run a 2-phase commit amongst themselves, often concurrent 699 with other non-conflicting transactions, and will serialize operations 700 and/or enforce synchronization points for 2-phase completion on 701 serialized transactions from the same client or when cache state 702 ownership is shifted from one client to another. 703 704 - Clients will usually allow operations to run asynchronously and return 705 from system calls more or less ASAP once they own the necessary cache 706 coherency locks. The client can select the validation mode to wait for 707 with mount options: 708 709 (1) Fully async (mount -o async) 710 (2) Wait for phase-1 ack (mount) 711 (3) Wait for phase-2 ack (mount -o sync) (fsync - wait p2ack) 712 (4) Wait for flush (mount -o sync) (fsync - wait flush) 713 714 Modifying system calls cannot be told to wait for a full media 715 flush, as full media flushes are prohibitively expensive. You 716 still have to fsync(). 717 718 The fsync wait mode for network links can be selected, either to 719 return after the phase-2 ack or to return after the media flush. 720 The default is to wait for the phase-2 ack, which at least guarantees 721 that a network failure after that point will not disrupt operations 722 issued before the fsync. 723 724 - Clients must adjust the chain state for modifying operations prior to 725 releasing chain locks / returning from the system call, even if the 726 masters have not finished the transaction. A late failure by the 727 cluster will result in desynchronized state which requires erroring 728 out the whole filesystem or resynchronizing somehow. 729 730 - Clients can opt to keep a record of transactions through the phase-2 731 ack or the actual media flush on the masters. 732 733 However, replaying/revalidating the log cannot necessarily guarantee 734 success. If the masters lose synchronization due to network issues 735 between masters (or if the client was mounted fully-async), or if enough 736 masters crash simultaniously such that a quorum fails to flush even 737 after the phase-2 ack, then it is possible that by the time a client 738 is able to replay/revalidate, some other client has squeeded in and 739 committed something that would conflict. 740 741 If the client crashes it works similarly to a crash with a local storage 742 mount... many dirty buffers might be lost. And the same happens in 743 the cluster case. 744 745 TRANSACTION LOG 746 747Keeping a short-term transaction log, much less being able to properly replay 748it, is fraught with difficulty and I've made it a separate development task. 749 750 751