			    HAMMER2 DESIGN DOCUMENT

				Matthew Dillon
			     dillon@backplane.com

			       09-Jul-2016 (v4)
			       03-Apr-2015 (v3)
			       14-May-2013 (v2)
			       08-Feb-2012 (v1)

		     Current Status as of document date

* Filesystem Core          - operational
  - bulkfree               - operational
  - Compression            - operational
  - Snapshots              - operational
  - Deduper                - live operational, batch specced
  - Subhierarchy quotas    - (may have to be discarded)
  - Logical Encryption     - not specced yet
  - Copies                 - not specced yet
  - fsync bypass           - not specced yet

* Clustering core
  - Network msg core       - operational
  - Network blk device     - operational
  - Error handling         - under development
  - Quorum Protocol        - under development
  - Synchronization        - under development
  - Transaction replay     - not specced yet
  - Cache coherency        - not specced yet

				Feature List

* Block topology (both the main topology and the freemap) use a copy-on-write
  design.  Media-level block frees are delayed and flushes rotate between
  4 volume headers (maxes out at 4 if the filesystem is > ~8GB).  Flushes
  will allocate new blocks up to the root in order to propagate block table
  changes and transaction ids.

* Incremental synchronization is queueless and trivial by design.

* Multiple roots, with many features.  This is implemented via the super-root
  concept.  When mounting a HAMMER2 filesystem you specify a device path and
  a directory name in the super-root.  (HAMMER1 had only one root).

* All cluster types and multiple PFSs (belonging to the same or different
  clusters) can be mixed on one physical filesystem.

  This allows independent cluster components to be configured within a
  single formatted H2 filesystem.  Each component is a super-root entry,
  a cluster identifier, and a unique identifier.  The network protocol
  integrates the component into the cluster when it is created.

* Roots are really no different from snapshots (HAMMER1 distinguished between
  its root mount and its PFS's.  HAMMER2 does not).

* I/O and chain locking thread separation.  I/O stalls and lock stalls can
  cause any filesystem which purports to operate over multiple physical and
  network devices to implode.  HAMMER2 incorporates a frontend/backend design
  which separates media operations into support threads and allows the
  frontend to validate the cluster, proceed with an operation, and disconnect
  any remaining running operation even when backend ops have not completed
  on all nodes.  This allows the frontend to return 'early' (so to speak).

* Early return on best data-path supported by virtue of the above.  In a
  multi-master system, frontend ops will issue I/O on all cluster elements
  concurrently and will return the instant incoming data validates the
  cluster.

* Snapshots are writable (in HAMMER1 snapshots were read-only).

* Snapshots are explicit but trivial to create.  In HAMMER1 snapshots were
  both explicit and fine-grained/automatic.  HAMMER2 does not implement
  automatic fine-grained snapshots.  H2 snapshots are cheap enough that you
  can create fine-grained snapshots if you desire.

* HAMMER2 formalizes a synchronization point for the flush, does a pre-flush
  that does not update the volume root, then waits for all running modifying
  operations to complete to memory (not to disk) while temporarily stalling
  new modifying operation initiations.  The final flush is then executed.

  At the moment we do not allow concurrent modifying operations during the
  final flush phase.  Ultimately I would like to, but doing so can be
  complex.
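  The sequencing described in the bullet above boils down to something like
  the following sketch.  This is illustrative pseudo-C only; the mount handle
  and all helper names are hypothetical and do not correspond to HAMMER2's
  actual internal API:

    struct h2_mount;                        /* hypothetical mount handle */
    void preflush_topology(struct h2_mount *);
    void block_new_modifying_ops(struct h2_mount *);
    void drain_modifying_ops_to_memory(struct h2_mount *);
    void final_flush_topology(struct h2_mount *);
    void write_next_volume_header(struct h2_mount *);
    void unblock_modifying_ops(struct h2_mount *);

    static void
    h2_flush_sketch(struct h2_mount *mp)
    {
        preflush_topology(mp);              /* volume root not updated */
        block_new_modifying_ops(mp);        /* stall new initiations */
        drain_modifying_ops_to_memory(mp);  /* running ops complete to memory */
        final_flush_topology(mp);           /* the final flush */
        write_next_volume_header(mp);       /* rotates among the 4 headers */
        unblock_modifying_ops(mp);
    }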
* HAMMER2 flushes and synchronization points do not bisect VOPs (system
  calls).  (HAMMER1 flushes could wind up bisecting VOPs).  This means that
  H2 flushes leave the filesystem in a far more consistent state than H1
  flushes did.

* Directory sub-hierarchy-based quotas for space and inode usage tracking.
  Any directory can be used.

* Low memory footprint.  Except for the volume header, the buffer cache
  is completely asynchronous and dirty buffers can be retired by the OS
  directly to backing store with no further interactions with the filesystem.

* Background synchronization and mirroring occurs at the logical level.
  When a failure occurs or a normal validation scan comes up with
  discrepancies, the synchronization thread will use the quorum to figure
  out which information is not correct and update accordingly.

* Support for multiple compression algorithms configured on a subdirectory
  tree basis and on a file basis.  Block compression up to 64KB will be used.
  Only compression ratios at powers of 2 that are at least 2:1 (e.g. 2:1,
  4:1, 8:1, etc) will work in this scheme because physical block allocations
  in HAMMER2 are always power-of-2.  Modest compression can be achieved with
  low overhead, is turned on by default, and is compatible with deduplication.

* Encryption.  Whole-disk encryption is supported by another layer, but I
  intend to give H2 an encryption feature at the logical layer which works
  approximately as follows:

  - Encryption controlled by the client on an inode/sub-tree basis.
  - Server has no visibility to decrypted data.
  - Encrypt filenames in directory entries.  Since the filename[] array
    is 256 bytes wide, the client can add random bytes after the normal
    terminator to make it virtually impossible for an attacker to figure
    out the filename.
  - Encrypt file size and most inode contents.
  - Encrypt file data (holes are not encrypted).
  - Encryption occurs after compression, with random filler.
  - Check codes are calculated after encryption & compression (not before).

  - Blockrefs are not encrypted.
  - Directory and file topology is not encrypted.
  - Encryption does not provide sub-topology validation.  The client would
    have to keep track of that itself.  The server or other clients can
    still e.g. remove files, rename, etc.

  In particular, note that even though the file size field can be encrypted,
  the server does have visibility on the block topology and thus has a pretty
  good idea how big the file is.  However, a client could add junk blocks
  at the end of a file to make this less apparent, at the cost of space.

  If a client really wants a fully validated H2-encrypted space the easiest
  solution is to format a filesystem within an encrypted file by treating it
  as a block device, but I digress.

* Zero detection on write (writing all-zeros), which requires the data
  buffer to be scanned, is fully supported.  This allows the writing of 0's
  to create holes.

* Allow sector overwrite (avoid copy-on-write) under certain circumstances.
  Overwrite is allowed on file data blocks if the file check mode is set to
  NONE, as long as the data block's modify_tid does not violate the last
  snapshot taken (if it does, a copy is made and overwrites are allowed on
  the copy until the next snapshot).
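  As a rough illustration of the overwrite test in the bullet above
  (hypothetical names, not HAMMER2's actual code), the decision reduces to
  a check mode and a transaction-id comparison:

    #include <stdbool.h>
    #include <stdint.h>

    #define CHECK_NONE 0    /* illustrative value for the NONE check mode */

    /*
     * Overwrite in place is only permitted when no check code must be
     * kept consistent and the block was last modified after the most
     * recent snapshot; otherwise a copy-on-write copy is made and that
     * copy becomes overwritable until the next snapshot.
     */
    static bool
    can_overwrite_in_place(int check_mode, uint64_t block_modify_tid,
                           uint64_t last_snapshot_tid)
    {
        return (check_mode == CHECK_NONE &&
                block_modify_tid > last_snapshot_tid);
    }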
* Copies support for redundancy within a single physical filesystem.
  Up to 256 physical disks and/or partitions can be ganged to form a
  single physical filesystem.  If you use a disk or RAID aggregation
  layer then the actual number of physical disks that can be associated
  with a single H2 filesystem is unbounded.

  H2 puts an 8-bit copyid in the blockref structure to represent potentially
  multiple copies of a block.  The copyid corresponds to a configuration
  specification in the volume header.  The full algorithm has not been
  specced yet.

  Copies support is implemented by having multiple blockref entries for
  the same key, each with a different copyid.  The copyid represents which
  of the 256 slots is used.  Meta-data is also subject to the copies
  mechanism.  However, for both meta-data and data, each copy should be
  identical, so the check fields in the blockref for all copies should wind
  up being the same, and any valid copy can be used by the block-level
  hammer2_chain code to access the filesystem.  File accesses will attempt
  to use the same copy.  If an I/O read error occurs, a different copy will
  be chosen.  Modifying operations must update all copies and/or create
  new copies as needed.  If a write error occurs on a copy and other copies
  are available, the errored target will be taken offline.

  It is possible to configure H2 to write out fewer copies on-write and then
  use a background scan to beef up the number of copies to improve real-time
  throughput.

* MESI cache coherency for multi-master/multi-client clustering operations.
  The servers hosting the MASTERs are also responsible for keeping track of
  the cache state.

* Hardlinks and softlinks are supported.  Hardlinks are somewhat complex to
  deal with and there is still an edge case.  I am trying to avoid storing
  the hardlinks at the root level because that messes up my concept for
  sub-tree quotas and is unnecessarily burdensome in terms of SMP collisions
  under heavy loads.

* The media blockref structure is now large enough to support up to a 192-bit
  check value, which would typically be a cryptographic hash of some sort.
  Multiple check value algorithms will be supported, with the default being
  a simple 32-bit iSCSI CRC.

* Fully verified deduplication will be supported and automatic (and
  necessary in many respects).

* Unverified de-duplication will be supported as a configurable option on a
  file or subdirectory tree basis.  Unverified deduplication must use the
  largest available check code (192 bits).  It will not verify that data
  content with the same check code is actually identical during the dedup
  pass, resulting in approximately 100x to 1000x the deduplication
  performance but at the cost of potentially corrupting some data.

  The unverified dedup feature is intended only for those files where
  occasional corruption is ok, such as in a web-crawler data store or
  other situations where the data content is not critically important
  or can be externally recovered if it becomes corrupt.
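To make the verified/unverified distinction concrete, here is an
illustrative sketch (not HAMMER2 code) of the dedup match decision.
Unverified dedup trusts a match of the large check code alone; verified
dedup must also compare the data itself:

    #include <stdint.h>
    #include <string.h>

    static int
    dedup_match(const uint8_t chk_a[24], const uint8_t chk_b[24],
                const void *data_a, const void *data_b, size_t bytes,
                int verify)
    {
        if (memcmp(chk_a, chk_b, 24) != 0)      /* 192-bit check code */
            return 0;
        if (verify)                             /* verified dedup */
            return (memcmp(data_a, data_b, bytes) == 0);
        return 1;       /* unverified: ~100x-1000x faster, small risk */
    }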
			       GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the filesystem,
which is very different from HAMMER1's B-Tree design.  Because the design
is copy-on-write it can be trivially snapshotted simply by referencing an
existing block, and because the media structures logically match a standard
filesystem directory/file hierarchy, snapshots and other similar operations
can be trivially performed on an entire subdirectory tree at any level in
the filesystem.

The copy-on-write design implements a block table in a radix-tree format,
with a small 8x fan-out in the volume header and inode and a large 256x or
1024x fan-out for indirect blocks.  The table is built bottom-up.
Intermediate radii are only created when necessary, so small files will use
much shallower radix block trees.  The inode itself can accommodate files
up to 512KB (65536x8).  Directories also use a radix block table, and
directory inodes can accommodate up to 8 entries before pushing out an
indirect radix block.

The copy-on-write nature of the filesystem implies that any modification
whatsoever will have to eventually synchronize new disk blocks all the way
to the super-root of the filesystem and the volume header itself.  This forms
the basis for crash recovery and also ensures that recovery occurs on a
completed high-level transaction boundary.  All disk writes are to new blocks
except for the volume header (which cycles through 4 copies), thus allowing
all writes to run asynchronously and concurrently prior to and during a flush,
and then just doing a final synchronization and volume header update at the
end.  Many of HAMMER2's features are enabled by this core design feature.

Clearly this method requires intermediate modifications to the chain to be
cached so multiple modifications can be aggregated prior to being
synchronized.  One advantage, however, is that the normal buffer cache can
be used and intermediate elements can be retired to disk by H2 or the OS
at any time.  This means that HAMMER2 has very low resource overhead from the
point of view of the operating system.  Unlike HAMMER1, which had to lock
dirty buffers in memory for long periods of time, HAMMER2 has no such
requirement.

Buffer cache overhead is very well bounded and can handle filesystem
operations of any complexity, even on boxes with very small amounts
of physical memory.  Buffer cache overhead is significantly lower with H2
than with H1 (and orders of magnitude lower than ZFS).

At some point I intend to implement a shortcut to make fsync()'s run fast,
and that is to allow deep updates to blockrefs to shortcut to auxiliary
space in the volume header to satisfy the fsync requirement.  The related
blockref is then recorded when the filesystem is mounted after a crash and
the update chain is reconstituted when a matching blockref is encountered
again during normal operation of the filesystem.
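Before moving on, a back-of-the-envelope check on the radix fan-outs
described at the top of this section.  This standalone program computes the
maximum file size per indirect level, assuming 64KB blocks, 8 inode-embedded
blockrefs, and a 1024x fan-out at each indirect level (the large fan-out
case above):

    #include <stdio.h>

    int
    main(void)
    {
        unsigned long long bytes = 65536ULL * 8;   /* inode only: 512KB */
        int level;

        for (level = 0; level <= 3; ++level) {
            printf("%d indirect level(s): %llu KB max file size\n",
                   level, bytes / 1024);
            bytes *= 1024;          /* each indirect level adds 1024x */
        }
        return 0;
    }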
		    MIRROR_TID, MODIFY_TID, UPDATE_TID

In HAMMER2, the core block reference is a 128-byte structure called a
blockref.  The blockref contains various bits of information including the
64-bit radix key (typically a directory hash if a directory entry, inode
number if a hidden hardlink target, or file offset if a file block), a 64-bit
data offset with the physical block size radix encoded in it (physical block
size can be different from logical block size due to compression), three
64-bit transaction ids, type information, and up to 512 bits worth of check
data for the block being referenced, which can be anything from a simple CRC
to a strong cryptographic hash.

mirror_tid - This is a media-centric (as in physical disk partition)
	     transaction id which tracks media-level updates.  The mirror_tid
	     can be different at the same point on different nodes in a
	     cluster.

	     Whenever any block in the media topology is modified, its
	     mirror_tid is updated with the flush id and will propagate
	     upward during the flush all the way to the volume header.

	     mirror_tid is monotonic.  It is primarily used for on-mount
	     recovery and volume root validation.  The name is historical
	     from H1; it is not used for nominal mirroring.

modify_tid - This is a cluster-centric (as in across all the nodes used
	     to build a cluster) transaction id which tracks filesystem-level
	     updates.

	     modify_tid is updated when the front-end of the filesystem makes
	     a change to an inode or data block.  It does NOT propagate upward
	     during a flush.

update_tid - This is a cluster synchronization transaction id.  Modifications
	     made to the topology will clear this field to 0 as they propagate
	     up to the root.  This gives the synchronizer an easy way to
	     determine what needs revalidation.

	     The synchronizer revalidates the cluster bottom-up by validating
	     a sub-topology and propagating the highest modify_tid in the
	     validated sub-topology up via the update_tid field.

	     Updates to this field may be optimized by the HAMMER2 VFS to
	     avoid the double-transition.

The synchronization code updates an out-of-sync node bottom-up and will
dynamically set update_tid as it goes, but media flushes can occur at any
time and these flushes will use mirror_tid for flush and freemap management.
The mirror_tid for each flush propagates upward to the volume header on each
flush.  modify_tid is set for any chains modified by a cluster op but does
not propagate up, instead serving as a seed for update_tid.

* The synchronization code is able to determine that a sub-tree is
  synchronized simply by observing the update_tid at the root of the sub-tree,
  on an inode-by-inode basis and also on a data-block-by-data-block basis.

* The synchronization code is able to do an incremental update of an
  out-of-sync node simply by skipping elements with a matching update_tid
  (when not 0).

* The synchronization code can be interrupted and restarted at any time,
  and is able to pick up where it left off with very little overhead.

* The synchronization code does not inhibit media flushes.  Media flushes
  can occur (and must occur) while synchronization is ongoing.
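The incremental walk implied by these bullets can be sketched as follows.
The struct and its layout are hypothetical, purely for illustration; only
the update_tid/modify_tid logic is taken from the description above:

    #include <stdint.h>

    struct blockref {                   /* illustrative, not on-media */
        uint64_t        modify_tid;
        uint64_t        update_tid;     /* 0 = needs revalidation */
        int             nchildren;
        struct blockref **children;
    };

    static uint64_t
    sync_subtree(struct blockref *bref, uint64_t want_tid)
    {
        uint64_t highest = bref->modify_tid;
        int i;

        /* a matching non-zero update_tid lets us skip the whole sub-tree */
        if (bref->update_tid != 0 && bref->update_tid >= want_tid)
            return bref->update_tid;

        for (i = 0; i < bref->nchildren; ++i) {
            uint64_t sub = sync_subtree(bref->children[i], want_tid);
            if (sub > highest)
                highest = sub;
        }
        /* propagate the highest validated modify_tid up via update_tid */
        bref->update_tid = highest;
        return highest;
    }

Because update_tid is cleared by modifications and only set by the
synchronizer, an interrupted pass simply re-descends and skips everything
it already finished, which is why restarting costs very little.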
There are several other stored transaction ids in HAMMER2.  There is a
separate freemap_tid in the volume header that is used to allow freemap
flushes to be deferred, and inodes have a pfs_psnap_tid which is used in
conjunction with CHECK_NONE to allow blocks without a check code which do
not violate the most recent snapshot to be overwritten in-place.

Remember that since this is a copy-on-write filesystem, we can propagate
a considerable amount of information up the tree to the volume header
without adding to the I/O we already have to do.

			 DIRECTORIES AND INODES

Directories are hashed, and another major design element is that directory
entries ARE inodes.  They are one and the same, with a special placemarker
for hardlinks.  Inodes are 1KB.

Hardlinks are implemented with placemarkers as directory entries which simply
represent the inode number.  The actual file resides in a parent directory
that is common to all hardlinks to that file.  If the hardlinks are all within
a single directory, the actual hardlink inode is in that directory.  The
hardlink target, as we call it, is a hidden directory entry in a common parent
whose key is basically just the inode number itself, so lookups are fast.

Half of the inode structure (512 bytes) is used to hold top-level blockrefs
to the radix block tree representing the file contents.  Files which are
less than or equal to 512 bytes in size will simply store the file contents
in this area instead of a blockref array.  So files <= 512 bytes take only
1KB of space inclusive of the inode.

Inode numbers are not spatially referenced, which complicates NFS servers
but doesn't complicate anything else.  The inode number is stored in the
inode itself, an absolute necessity required to properly support HAMMER2's
hugely flexible snapshots.  I would like to support NFS services but it
would require (probably) a lookaside index in the root for inode lookups
and might not happen quickly.

				RECOVERY

H2 allows freemap flushes to lag behind topology flushes.  The freemap flush
tracks a separate transaction id (via mirror_tid) in the volume header.

On mount, HAMMER2 will first locate the highest-sequenced check-code-validated
volume header from the 4 copies available (if the filesystem is big enough,
e.g. > ~10GB or so, there will be 4 copies of the volume header).

HAMMER2 will then run an incremental scan of the topology for mirror_tid
transaction ids between the last freemap flush tid and the last topology
flush tid in order to synchronize the freemap.  Because this scan is
incremental the time it takes to run will be relatively short and well-bounded
at mount-time.  This is NOT fsck.  Freemap flushes can be avoided for any
number of normal topology flushes but should still occur frequently enough
to avoid long recovery times in case of a crash.

The filesystem is then ready for use.
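A hedged sketch of that mount-time scan, with illustrative types and a
hypothetical freemap helper (not HAMMER2's actual code).  Because
mirror_tid propagates to the root on every flush, any sub-tree whose
mirror_tid is already covered by the last freemap flush can be pruned,
which is what keeps the scan incremental and well-bounded:

    #include <stdint.h>

    struct bref {                       /* minimal illustrative blockref */
        uint64_t        mirror_tid;
        int             nchildren;
        struct bref     **children;
    };

    void mark_allocated(struct bref *bp);   /* hypothetical freemap helper */

    static void
    recover_freemap(struct bref *bp, uint64_t freemap_tid)
    {
        int i;

        /* the last freemap flush already covers this sub-tree: prune */
        if (bp->mirror_tid <= freemap_tid)
            return;
        mark_allocated(bp);             /* (re)record this allocation */
        for (i = 0; i < bp->nchildren; ++i)
            recover_freemap(bp->children[i], freemap_tid);
    }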
			  DISK I/O OPTIMIZATIONS

The freemap implements a 1KB allocation resolution.  Each 2MB segment managed
by the freemap is zoned and has a tendency to collect inodes, small data,
indirect blocks, and larger data blocks into separate segments.  The idea is
to greatly improve I/O performance (particularly by laying inodes down next
to each other, which has a huge effect on directory scans).

The current implementation of HAMMER2 implements a fixed block size of 64KB
in order to allow the mapping of hammer2_dio's in its IO subsystem to
consumers that might desire different sizes.  This way we don't have to
worry about matching the buffer cache / DIO cache to the variable block
size of underlying elements.

The biggest issue we are avoiding by having a fixed 64KB I/O size is not
the nominal front-end access path, but rather the complexity that would
otherwise arise when blocks are freed and reused for another purpose.
HAMMER1 had to have specialized code to check for and invalidate buffer
cache buffers in the free/reuse case.  HAMMER2 does not need such code.

That said, HAMMER2 places no major restrictions on mixing block sizes within
a 64KB block.  The only restriction is that a HAMMER2 block cannot cross
a 64KB boundary.  The soft restrictions the block allocator puts in place
exist primarily for performance reasons (i.e. try to collect 1K inodes
together).  The 2MB freemap zone granularity should work very well in this
regard.

HAMMER2 also allows OS support for ganging buffers together into even
larger blocks for I/O (OS buffer cache 'clustering'), OS-supported read-ahead,
OS-driven asynchronous retirement, and other performance features typically
provided by the OS at the block-level to ensure smooth system operation.

By avoiding wiring buffers/memory and allowing these features to run normally,
HAMMER2 winds up with very low OS overhead.

			     FREEMAP NOTES

The freemap is stored in the reserved blocks situated in the ~4MB reserved
area at the base of every ~1GB level-1 zone.  The current implementation
reserves 8 copies of every freemap block and cycles through them in order
to make the freemap operate in a copy-on-write fashion.

    - Freemap is copy-on-write.
    - Freemap operations are transactional, same as everything else.
    - All backup volume headers are consistent on-mount.

The freemap is organized using the same radix blockmap algorithm used for
files and directories, but with fixed radix values.  For a maximally-sized
filesystem the freemap will wind up being a 5-level-deep radix blockmap,
but the top level is embedded in the volume header, so insofar as performance
goes it is really just a 4-level blockmap.

The freemap radix allocation mechanism is also the same, meaning that it is
bottom-up and will not allocate unnecessary intermediate levels for smaller
filesystems.  The number of blockmap levels not including the volume header
for various filesystem sizes is as follows:

	up-to		#of freemap levels
	1GB		1-level
	256GB		2-level
	64TB		3-level
	16PB		4-level
	4EB		5-level
	16EB		6-level

The freemap has bitmap granularity down to 16KB and a linear iterator that
can linearly allocate space down to 1KB.  Due to fragmentation it is possible
for the linear allocator to become marginalized, but it is relatively easy
to reallocate small blocks every once in a while (like once a year if you
care at all), and once the old data cycles out of the snapshots, or you
also rewrite the snapshots (which you can do), the freemap should wind up
relatively optimal again.  Generally speaking I believe that algorithms can
be developed to make this a non-problem without requiring any media structure
changes.
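The level table above reduces to simple arithmetic: each additional freemap
level multiplies coverage by 256x (1GB -> 256GB -> 64TB -> 16PB -> 4EB).
An illustrative, non-authoritative computation:

    #include <stdio.h>

    static int
    freemap_levels(unsigned long long fs_bytes)
    {
        unsigned long long cover = 1ULL << 30;  /* 1GB level-1 zone */
        int levels = 1;

        while (cover < fs_bytes && levels < 6) {
            cover *= 256;                       /* 256x per level */
            ++levels;
        }
        return levels;
    }

    int
    main(void)
    {
        /* 500GB exceeds the 256GB 2-level limit, so: 3 levels */
        printf("%d\n", freemap_levels(500ULL << 30));
        return 0;
    }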
In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations.  All the freemap does is
keep track of 100% free blocks plus some extra bits for staging the bulkfree
scan.  The lack of ref-counting makes it possible to:

    - Completely trivialize HAMMER2's snapshot operations.
    - Allow any volume header backup to be used trivially.
    - Allow whole sub-trees to be destroyed without having to scan them.
    - Simplify normal crash recovery operations.
    - Simplify catastrophic recovery operations.

Normal crash recovery is simply a matter of doing an incremental scan
of the topology between the last flushed freemap TID and the last flushed
topology TID.  This usually takes only a few seconds and allows:

    - Freemap flushes to be deferred for any number of topology flush
      cycles.
    - The freemap to be skipped during fsync, reducing fsync overhead.

			   FREEMAP - BULKFREE

Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan.
Blocks are first marked as being possibly free and then finalized in the
second scan.  Live filesystem operations are allowed to run during these
scans, and any freemap block that is allocated or adjusted after the first
scan will simply be re-marked as allocated and the second scan will not
transition it to being free.

The cost of not doing ref-count tracking is that HAMMER2 must perform two
bulkfree scans of the meta-data to determine which blocks can actually be
freed.  This can be complicated by the volume header backups and snapshots,
which cause the same meta-data topology to be scanned over and over again,
but it is mitigated somewhat by keeping a cache of higher-level nodes to
detect when we would scan a sub-topology that we have already scanned.  Due
to the copy-on-write nature of the filesystem, such detection is easy to
implement.

Part of the ongoing design work is finding ways to reduce the scope of this
meta-data scan so the entire filesystem's meta-data does not need to be
scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low cost).  In other words, it's an area where
improvements can be made without any media format changes.

Another advantage of operating the freemap like this is that some future
version of HAMMER2 might decide to completely change how the freemap works
and would be able to make the change with relatively low downtime.
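An illustrative reduction of the two-stage scan to a flat bitmap.  HAMMER2's
real freemap is the radix blockmap described earlier and the state encoding
here is hypothetical; the sketch only shows the state transitions and why a
racing live allocation is safe:

    enum bstate { B_ALLOCATED, B_POSSIBLY_FREE, B_FREE };

    /*
     * Stage 1: blocks not reachable from the live topology become
     * possibly-free.  A live allocation racing the scan simply sets
     * the state back to B_ALLOCATED, so stage 2 will leave it alone.
     */
    static void
    bulkfree_stage1(enum bstate *map, const int *reachable, int nblocks)
    {
        int i;

        for (i = 0; i < nblocks; ++i) {
            if (map[i] == B_ALLOCATED && !reachable[i])
                map[i] = B_POSSIBLY_FREE;
        }
    }

    /* Stage 2: only blocks still possibly-free transition to free. */
    static void
    bulkfree_stage2(enum bstate *map, int nblocks)
    {
        int i;

        for (i = 0; i < nblocks; ++i) {
            if (map[i] == B_POSSIBLY_FREE)
                map[i] = B_FREE;
        }
    }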
				CLUSTERING

Clustering, as always, is the most difficult bit, but we have some advantages
with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
structures generally follow the kernel's filesystem hierarchy, which allows
cluster operations to use topology cache and lock state.  Second,
HAMMER2's writable snapshots make it possible to implement several forms
of multi-master clustering.

The mount device path you specify serves to bootstrap your entry into
the cluster.  This is typically local media.  It can even be a ram-disk
that only contains placemarkers that help HAMMER2 connect to a fully
networked cluster.

With HAMMER2 you mount a directory entry under the super-root.  This entry
will contain a cluster identifier that helps HAMMER2 identify and integrate
with the nodes making up the cluster.  HAMMER2 will automatically integrate
*all* entries under the super-root when you mount one of them.  You have to
mount at least one for HAMMER2 to integrate the block device in the larger
cluster.

For cluster servers every HAMMER2-formatted partition has a "LOCAL" MASTER
which can be mounted in order to make the rest of the elements under the
super-root available to the network.  (In a prior specification I emplaced
the cluster connections in the volume header's configuration space, but I
no longer do that).

Connecting to the wider networked cluster involves setting up the /etc/hammer2
directory with appropriate IP addresses and keys.  The user-mode hammer2
service daemon maintains the connections and performs graph operations
via libdmsg.

Node types within the cluster:

    DUMMY	- Used as a local placeholder (typically in ramdisk).
    CACHE	- Used as a local placeholder and cache (typically on a SSD).
    SLAVE	- A SLAVE in the cluster, can source data on quorum agreement.
    MASTER	- A MASTER in the cluster, can source and sink data on quorum
		  agreement.
    SOFT_SLAVE	- A SLAVE in the cluster, can source data locally without
		  quorum agreement (must be directly mounted).
    SOFT_MASTER	- A local MASTER but *not* a MASTER in the cluster.  Can
		  source and sink data locally without quorum agreement,
		  intended to be synchronized with the real MASTERs when
		  connectivity allows.  Operations are not coherent with the
		  real MASTERs even when they are available.

    NOTE: SNAPSHOT, AUTOSNAP, etc represent sub-types, typically under a
	  SLAVE.  A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no
	  longer synchronized against current masters.

    NOTE: Any SLAVE or other copy can be turned into its own writable MASTER
	  by giving it a unique cluster id, taking it out of the cluster that
	  originally spawned it.
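For illustration only, the node types could be enumerated as follows.  The
names mirror the list above but the values are hypothetical, not HAMMER2's
actual PFS type encoding:

    enum h2_pfs_type {
        H2_PFS_DUMMY,           /* local placeholder (ramdisk) */
        H2_PFS_CACHE,           /* local placeholder + cache (SSD) */
        H2_PFS_SLAVE,           /* sources data on quorum agreement */
        H2_PFS_MASTER,          /* sources and sinks data on quorum */
        H2_PFS_SOFT_SLAVE,      /* sources locally, no quorum needed */
        H2_PFS_SOFT_MASTER      /* local master, synced when possible */
    };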
There are four major protocols:

    Quorum protocol

	This protocol is used between MASTER nodes to vote on operations
	and resolve deadlocks.

	This protocol is used between SOFT_MASTER nodes in a sub-cluster
	to vote on operations, resolve deadlocks, determine what the latest
	transaction id for an element is, and to perform commits.

    Cache sub-protocol

	This is the MESI sub-protocol which runs under the Quorum
	protocol.  This protocol is used to maintain cache state for
	sub-trees to ensure that operations remain cache coherent.

	Depending on administrative rights this protocol may or may
	not allow a leaf node in the cluster to hold a cache element
	indefinitely.  The administrative controller may preemptively
	downgrade a leaf with insufficient administrative rights
	without giving it a chance to synchronize any modified state
	back to the cluster.

    Proxy protocol

	The Quorum and Cache protocols only operate between MASTER
	and SOFT_MASTER nodes.  All other node types must use the
	Proxy protocol to perform similar actions.  This protocol
	differs in that proxy requests are typically sent to just
	one adjacent node and that node then maintains state and
	forwards the request or performs the required operation.
	When the link is lost to the proxy, the proxy automatically
	forwards a deletion of the state to the other nodes based on
	what it has recorded.

	If a leaf has insufficient administrative rights it may not
	be allowed to actually initiate a quorum operation and may only
	be allowed to maintain partial MESI cache state or perhaps none
	at all (since cache state can block other machines in the
	cluster).  Instead a leaf with insufficient rights will have to
	make do with a preemptive loss of cache state, and any allowed
	modifying operations will have to be forwarded to the proxy,
	which continues forwarding the request until a node with
	sufficient administrative rights is encountered.

	To reduce issues and give the cluster more breadth, sub-clusters
	made up of SOFT_MASTERs can be formed in order to provide full
	cache coherency within a subset of machines and yet still tie them
	into a greater cluster that they normally would not have such
	access to.  This effectively makes it possible to create a two-
	or three-tier fan-out of groups of machines which are cache-coherent
	within the group, but perhaps not between groups, and to use other
	means to synchronize between the groups.

    Media protocol

	This is basically the physical media protocol.

		      MASTER & SLAVE SYNCHRONIZATION

With HAMMER2 I really want to be hard-nosed about the consistency of the
filesystem, including the consistency of SLAVEs (snapshots, etc).  In order
to guarantee consistency we take advantage of the copy-on-write nature of
the filesystem by forking consistent nodes and using the forked copy as the
source for synchronization.

Similarly, the target for synchronization is not updated on the fly but is
instead also forked, and the forked copy is updated.  When synchronization
is complete, forked sources can be thrown away and forked copies can replace
the original synchronization target.

This may seem complex, but 'forking a copy' is actually a virtually free
operation.  The top-level inode (under the super-root), on-media, is simply
copied to a new inode and poof, we have an unchanging snapshot to work with.

    - Making a snapshot is fast... almost instantaneous.

    - Snapshots are used for various purposes, including synchronization
      of out-of-date nodes.

    - A snapshot can be converted into a MASTER or some other PFS type.

    - A snapshot can be forked off from its parent cluster entirely and
      turned into its own writable filesystem, either as a single MASTER
      or this can be done across the cluster by forking a quorum+ of
      existing MASTERs and transferring them all to a new cluster id.

More complex is reintegrating the target once the synchronization is complete.
For SLAVEs we just delete the old SLAVE and rename the copy to the same name.
However, if the SLAVE is mounted and not optioned as a static mount (that is,
the mounter wants to see updates as they are synchronized), a reconciliation
must occur on the live mount to clean up the vnode, inode, and chain caches
and shift any remaining vnodes over to the updated copy.

    - A mounted SLAVE can track updates made to the SLAVE but the
      actual mechanism is that the SLAVE PFS is replaced with an
      updated copy, typically every 30-60 seconds.
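The fork-and-replace cycle for a SLAVE can be sketched as follows, with
every type and function a hypothetical placeholder (not HAMMER2's actual
API):

    struct pfs;                         /* hypothetical PFS handle */
    struct pfs *pfs_snapshot(struct pfs *);
    void copy_out_of_date_chains(struct pfs *src, struct pfs *dst);
    void pfs_replace(struct pfs *oldp, struct pfs *newp);
    void pfs_destroy(struct pfs *);

    static void
    sync_slave(struct pfs *src_master, struct pfs *slave)
    {
        struct pfs *src_snap;
        struct pfs *new_slave;

        src_snap  = pfs_snapshot(src_master);  /* frozen, consistent source */
        new_slave = pfs_snapshot(slave);       /* forked target to update */
        copy_out_of_date_chains(src_snap, new_slave);
        pfs_replace(slave, new_slave);         /* rename copy over old SLAVE */
        pfs_destroy(src_snap);                 /* forked source thrown away */
    }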
Reintegrating a MASTER which has fallen out of the quorum due to being out
of date is also somewhat more complex.  The same updating mechanic is used;
we actually have to throw the 'old' MASTER away once the new one has been
updated.  However, if the cluster is undergoing heavy modifications the
updated MASTER will be out of date almost the instant its source is
snapshotted.  Reintegrating a MASTER thus requires a somewhat more complex
interaction.

    - If a MASTER is really out of date we can run one or more
      synchronization passes concurrent with modifying operations.
      The quorum can remain live.

    - A final synchronization pass is required with quorum operations
      blocked to reintegrate the now up-to-date MASTER into the cluster.


			   QUORUM OPERATIONS

Quorum operations can be broken down into HARD BLOCK operations and NETWORK
operations.  If your MASTERs are all local mounts, then failures and
sequencing are easy to deal with.

Quorum operations on a networked cluster are more complex.  The problems:

    - Masters cannot rely on clients to moderate quorum transactions.
      Apart from the reliance being unsafe, the client could also
      lose contact with one or more masters during the transaction and
      leave one or more masters out-of-sync without the master(s) knowing
      they are out of sync.

    - When many clients are present, we do not want a flaky network
      link from one to cause one or more masters to go out of
      synchronization and potentially stall the whole works.

    - Normal hammer2 mounts allow a virtually unlimited number of modifying
      transactions between actual flushes.  The media flush rolls everything
      up into a single transaction id per flush.  Detection of 'missing'
      transactions in a concurrent multi-client setup when one or more
      clients temporarily lose connectivity is thus difficult.

    - Clients have a limited amount of time to reconnect to a cluster after
      a network disconnect before their MESI cache states are lost.

    - Clients may proceed with several transactions before knowing for sure
      that earlier transactions were completely successful.  Performance is
      important; we won't be waiting for a full quorum-verified synchronous
      flush to media before allowing a system call to return.

    - Masters can decide that a client's MESI cache states were lost (i.e.
      that the transaction was too slow) as well.

The solutions (for modifying transactions):

    - Masters handle quorum confirmation amongst themselves and do not rely
      on the client for that purpose.

    - A client can connect to one or more masters regardless of the size of
      the quorum and can submit modifying operations to a single master if
      desired.  The master will take care of the rest.

      A client must still validate the quorum (and obtain MESI cache states)
      when doing read-only operations in order to present the correct data
      to the user process for the VOP.

    - Masters will run a 2-phase commit amongst themselves (a sketch follows
      this list), often concurrently with other non-conflicting transactions,
      and will serialize operations and/or enforce synchronization points
      for 2-phase completion on serialized transactions from the same client
      or when cache state ownership is shifted from one client to another.

    - Clients will usually allow operations to run asynchronously and return
      from system calls more or less ASAP once they own the necessary cache
      coherency locks.  The client can select the validation mode to wait
      for with mount options:

	(1) Fully async			(mount -o async)
	(2) Wait for phase-1 ack	(mount)
	(3) Wait for phase-2 ack	(mount -o sync) (fsync - wait p2ack)
	(4) Wait for flush		(mount -o sync) (fsync - wait flush)

      Modifying system calls cannot be told to wait for a full media
      flush, as full media flushes are prohibitively expensive.  You
      still have to fsync().

      The fsync wait mode for network links can be selected, either to
      return after the phase-2 ack or to return after the media flush.
      The default is to wait for the phase-2 ack, which at least guarantees
      that a network failure after that point will not disrupt operations
      issued before the fsync.

    - Clients must adjust the chain state for modifying operations prior to
      releasing chain locks / returning from the system call, even if the
      masters have not finished the transaction.  A late failure by the
      cluster will result in desynchronized state which requires erroring
      out the whole filesystem or resynchronizing somehow.

    - Clients can opt to keep a record of transactions through the phase-2
      ack or the actual media flush on the masters.

      However, replaying/revalidating the log cannot necessarily guarantee
      success.  If the masters lose synchronization due to network issues
      between masters (or if the client was mounted fully-async), or if
      enough masters crash simultaneously such that a quorum fails to flush
      even after the phase-2 ack, then it is possible that by the time a
      client is able to replay/revalidate, some other client has squeezed
      in and committed something that would conflict.

      If the client crashes it works similarly to a crash with a local
      storage mount... many dirty buffers might be lost.  And the same
      happens in the cluster case.
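A minimal sketch of the 2-phase commit mentioned in the solutions list
above.  The types and messaging helpers are hypothetical; the only part
taken from the design is that masters confirm amongst themselves and that
commit requires acknowledgment from a majority quorum:

    struct master;                      /* hypothetical peer handle */
    struct txn;                         /* hypothetical transaction */
    int  send_prepare(struct master *, struct txn *);  /* 0 = acked */
    void send_abort(struct master *, struct txn *);
    void send_commit(struct master *, struct txn *);

    static int
    master_2phase_commit(struct master **m, int nmasters, struct txn *txn)
    {
        int acks = 0;
        int i;

        for (i = 0; i < nmasters; ++i)          /* phase 1: propose */
            if (send_prepare(m[i], txn) == 0)
                ++acks;
        if (acks <= nmasters / 2) {             /* no majority: abort */
            for (i = 0; i < nmasters; ++i)
                send_abort(m[i], txn);
            return -1;
        }
        for (i = 0; i < nmasters; ++i)          /* phase 2: commit */
            send_commit(m[i], txn);
        return 0;
    }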
			     TRANSACTION LOG

Keeping a short-term transaction log, much less being able to properly replay
it, is fraught with difficulty, and I've made it a separate development task.