                            HAMMER2 DESIGN DOCUMENT

                                Matthew Dillon
                            dillon@backplane.com

                               24-Jul-2017 (v5)
                               09-Jul-2016 (v4)
                               03-Apr-2015 (v3)
                               14-May-2013 (v2)
                               08-Feb-2012 (v1)

                      Current Status as of document date

* Filesystem Core           - operational
  - bulkfree                - operational
  - Compression             - operational
  - Snapshots               - operational
  - Deduper                 - live operational, batch specced
  - Subhierarchy quotas     - scrapped (still possible on a limited basis)
  - Logical Encryption      - not specced yet
  - Copies                  - not specced yet
  - fsync bypass            - not specced yet

* Clustering core
  - Network msg core        - operational
  - Network blk device      - operational
  - Error handling          - under development
  - Quorum Protocol         - under development
  - Synchronization         - under development
  - Transaction replay      - not specced yet
  - Cache coherency         - not specced yet

                            Intended Feature List
                       (not all features yet implemented)

* Standard filesystem semantics with full hardlink and softlink support.

* Filesystems can be constructed across multiple nodes.  Each low-level
  H2 device can accommodate nodes belonging to multiple cluster components
  as well as nodes that are simply local to the device or machine.

* A dynamic radix tree on each formatted device is used to index the
  topology, with 64-bit keys.  Elements can be ranged in powers of 2.
  The tree is built efficiently, bottom-up.  Indirect blocks are only
  created when a layer fills up.

* Utilizes a copy-on-write block mechanic for both the main topology
  and the freemap.  Media-level block frees are delayed and flushes rotate
  between 4 volume headers (the count maxes out at 4 if the filesystem is
  > ~8GB).  Flushes will allocate new blocks up to the root in order to
  propagate block table changes and transaction ids.  Recovery will choose
  the most recent valid volume root and can thus work around failures which
  cause partial volume header writes.
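The recovery rule in the bullet above can be sketched as follows.  This is
an illustrative assumption, not H2's on-media format: the header layout,
the use of CRC32, and the function name are all hypothetical stand-ins for
whatever check code the volume header actually carries.

```python
# Sketch: recovery scans the (up to) 4 volume header copies and mounts from
# the newest copy whose check code validates, so a torn/partial write of one
# header never prevents recovery.  All names here are illustrative.
import zlib

def choose_volume_root(headers):
    """headers: list of (mirror_tid, payload_bytes, stored_crc) tuples."""
    best = None
    for tid, payload, crc in headers:
        if zlib.crc32(payload) != crc:
            continue                    # torn / partial write: skip this copy
        if best is None or tid > best[0]:
            best = (tid, payload)
    return best                         # None => no valid header at all

headers = [
    (101, b"root-v101", zlib.crc32(b"root-v101")),
    (102, b"root-v102", zlib.crc32(b"root-v102") ^ 1),  # newest, but torn
    (100, b"root-v100", zlib.crc32(b"root-v100")),
]
assert choose_volume_root(headers) == (101, b"root-v101")
```

The newest header copy is deliberately corrupt in this example; recovery
falls back to the most recent copy that still validates.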
* Utilizes a fat blockref structure (128 bytes) which can store up to
  64 bytes (512 bits) of check code data.  Defaults to simpler 64-bit
  hashes.

* The 1024-byte fat inode structure also contains helpful meta-data for
  debugging catastrophic recovery, up to 512 bytes of direct-data for
  small files, or 4 indirect blocks (instead of the direct-data) as
  the data root.

  Inodes are stored as hidden elements under each node directory in the
  super-root.  H2 originally tried to embed inodes in the directories in
  which they reside, or in the nearest common parent directory when multiple
  hardlinks were present, but this wound up being too difficult to get
  right and made NFS support impossible (it would have required adding more
  complexity to index inode numbers, which I didn't want to do).

* Directory entries are indexed in the radix tree just like everything
  else, based on a hash of the filename plus an iterator to deal with
  collisions nicely.  Directory entries with filenames <= 64 bytes
  fit entirely within the 128-byte blockref structure without requiring
  a data reference, allowing highly optimal directory scans.  In the optimal
  case the directory entry simply uses the 64-byte check field to store
  the filename (since there is no data reference).

  Directory entries record inode number, file type, and filename.

* The top-level for each formatted device is called the super-root.  The
  top-level cannot be directly mounted.  Instead, it contains inodes
  representing pieces of nodes in the cluster which can be individually
  mounted.

  This means that H2 filesystems can create multiple roots if desired.
  In fact, snapshots are stored as discretely mountable nodes at this
  level, so it is possible to boot from or mount root from a snapshot.

  (HAMMER1 had only one root, plus multiple 'PFS's, but they weren't
  nearly as flexible.)
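The hash-plus-iterator indexing of directory entries described above can be
sketched as follows.  This is a toy model under stated assumptions: the real
H2 directory hash preserves some natural sort ordering, which this stand-in
(BLAKE2b) does not, and the 16-bit iterator field is an arbitrary choice for
illustration.

```python
# Sketch: assign a 64-bit radix key to a directory entry.  High bits come
# from a filename hash; reserved low bits act as an iterator so colliding
# names still get unique keys.  All names/sizes here are hypothetical.
import hashlib

def dirent_key(index, filename: bytes) -> int:
    base = int.from_bytes(
        hashlib.blake2b(filename, digest_size=8).digest(), "big")
    base &= ~0xFFFF                 # reserve low 16 bits for the iterator
    it = 0
    while (base | it) in index:
        it += 1                     # collision: bump the iterator
    key = base | it
    index[key] = filename
    return key

index = {}
k1 = dirent_key(index, b"README")
k2 = dirent_key(index, b"README")   # same name: guaranteed hash collision
assert k2 == k1 + 1                 # iterator resolved the collision
```

Inserting the same name twice forces a collision, showing the iterator at
work; distinct names almost always differ in the hash bits alone.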
  Each formatted H2 device can hold pieces of or whole nodes belonging
  to multiple filesystems.  This allows independent cluster components to
  be configured within a single formatted H2 filesystem.  Each component is
  a super-root entry, a cluster identifier, and a unique identifier.  The
  network protocol integrates the component into the cluster when it is
  created.

* Snapshots are implemented as nodes under the super-root.  Snapshots
  are writable and easy to create (in HAMMER1 snapshots were read-only).
  However, HAMMER2 snapshots are not as fine-grained as HAMMER1 snapshots,
  and also not automatic.  They must be explicitly created, but are cheap
  enough that fine-grained H2 snapshots can be created on a schedule if
  desired.

* Utilizes a frontend-backend operational design with multiple dedicated
  threads for each cluster component.  This allows the frontend to dispatch
  parallel ops to backend components and then finalize the frontend operation
  the instant a sufficient number of components agree (depending on the
  replication mode), even if other nodes in the cluster are stalled.

  H2 can deal with stalled backend threads without stalling the frontend.

* Flush handling is really difficult to deal with because we want to
  utilize the system's normal buffer cache for efficiency.  This means
  that some flush elements (dirty data buffer cache buffers) are not
  necessarily in sync with dirty meta-data tracked by the filesystem code.

  If H2 locks the entire filesystem during a flush, then many front-end
  operations can wind up stalling for a very long time (depending on how
  much dirty data the filesystem and operating system let build up).

  Currently HAMMER2 deals with this by allowing for an almost-fully
  asynchronous flush.  Essentially, everything related to data and meta-data
  except the volume header itself can be flushed asynchronously.
  This means it can also be flushed concurrently with front-end operations.

  In order to make the 'final' flush of the volume header itself meaningful,
  the flush code will first attempt to asynchronously flush all pending
  buffers and meta-data, then will lock the filesystem and do a second
  flush of anything that slipped through while the first flush was running,
  and then will flush the volume header.

  CURRENT STATUS: This work is still in progress and there are still
  stall issues in the handling of flushes.

* Low memory footprint.  Except for the volume header, the buffer cache
  is completely asynchronous and dirty buffers can be retired by the OS
  directly to backing store with no further interactions with the filesystem.

* Compression support.  Multiple algorithms are supported and can be
  configured on a subdirectory hierarchy or individual file basis.
  Block compression up to 64KB will be used.  Only compression ratios at
  powers of 2 that are at least 2:1 (e.g. 2:1, 4:1, 8:1, etc) will work in
  this scheme because physical block allocations in HAMMER2 are always
  power-of-2.  Modest compression can be achieved with low overhead, is
  turned on by default, and is compatible with deduplication.

* De-duplication support.  HAMMER2 uses a relatively simple freemap
  scheme that allows the filesystem to discard block references
  asynchronously, and the same scheme allows essentially unlimited
  references to the same data block in the hierarchy.  Thus, both live
  de-duplication and bulk de-duplication are relatively easy to implement.

* Zero detection on write (writing all-zeros), which requires the data
  buffer to be scanned, is fully supported.  This allows the writing of 0's
  to create holes.

  Generally speaking, pre-writing zeroed blocks to reserve space doesn't
  work well on copy-on-write filesystems.
  However, if both compression and check codes are disabled on a file,
  H2 will also disable zero-detection, allow pre-reservation of file blocks
  (by pre-zeroing), and allow data overwrites to write to the same sector.

* In addition to the above, sector overwrites (avoiding the copy-on-write)
  are also allowed when multiple writes to the same block occur in-between
  flush operations.

* Incremental synchronization via highest-transaction-id propagation
  within the radix tree.  This is a queueless, incremental design.

  CURRENT STATUS: Due to the flat inode hierarchy now being employed,
  the current synchronization code, which silently recurses indirect nodes,
  will be inefficient due to the fact that all the inodes are at the
  same logical level in the topology.  To fix this, the code will need
  to explicitly iterate indirect nodes and keep track of the related
  key ranges to match them up on an indirect-block basis, which would
  be incredibly efficient.

* Background synchronization and mirroring occur at the logical layer
  rather than the physical layer.  This allows cluster components to
  have differing storage arrangements.

  In addition, this mechanism will fully correct any out-of-sync nodes
  in the cluster as long as a sufficient number of other nodes agree on
  what the proper state should be.

                      DESIGN PENDING ON THESE FEATURES

* Encryption.  Whole-disk encryption is supported by another layer, but I
  intend to give H2 an encryption feature at the logical layer which works
  approximately as follows:

  - Encryption controlled by the client on an inode/sub-tree basis.
  - Server has no visibility to decrypted data.
  - Encrypt filenames in directory entries.  Since the filename[] array
    is 256 bytes wide, the client can add random bytes after the normal
    terminator to make it virtually impossible for an attacker to figure
    out the filename.
  - Encrypt file size and most inode contents.
  - Encrypt file data (holes are not encrypted).
  - Encryption occurs after compression, with random filler.
  - Check codes are calculated after encryption & compression (not before).

  - Blockrefs are not encrypted.
  - Directory and file topology is not encrypted.
  - Encryption does not provide sub-topology validation.  The client would
    have to keep track of that itself.  The server or other clients can
    still e.g. remove files, rename, etc.

  In particular, note that even though the file size field can be encrypted,
  the server does have visibility on the block topology and thus has a pretty
  good idea how big the file is.  However, a client could add junk blocks
  at the end of a file to make this less apparent, at the cost of space.

  If a client really wants a fully validated H2-encrypted space, the easiest
  solution is to format a filesystem within an encrypted file by treating it
  as a block device, but I digress.

* Device ganging, copies for redundancy, and file splitting.

  Device ganging - The idea here is not to gang devices into a single
  physical volume but to instead format each device independently
  and allow crossover-references in the blockref to other devices in
  the set.

  One of the things we want to accomplish is to ensure that a failed
  device does not prevent access to radix tree elements in other devices
  in the gang, and that the failed device can be reconstructed.  To do
  this, each device implements complete reachability from the node root
  to all elements underneath it.  When a device fails, the synchronization
  code can theoretically reconstruct the missing material in other
  devices making up the gang.  New devices can be added to the gang and
  existing devices can be removed from the gang.

  Redundant copies - This is actually a fairly tough problem.
  The solution I would like to implement is to use the device ganging
  feature to also implement redundancy, that way if a device fails within
  the gang there's a good chance that it can still remain completely
  functional without having to resynchronize.  But making this work is
  difficult, to say the least.

* MESI cache coherency for multi-master/multi-client clustering operations.
  The servers hosting the MASTERs are also responsible for keeping track of
  the cache state.

  This is a feature that we would need to implement coherent cross-machine
  multi-threading and migration.

* Implement unverified de-duplication (where only the check code is tested,
  avoiding having to actually read data blocks to verify a de-duplication).
  This would make use of the blockref structure's widest check field
  (512 bits).

  Out of necessity this type of feature would be settable on a file or
  recursive directory tree basis, but should only be used when the data
  is throw-away or can be reconstructed, since data corruption (mismatched
  duplicates with the same hash) is still possible even with a 512-bit
  check code.

  The unverified dedup feature is intended only for those files where
  occasional corruption is ok, such as in a web-crawler data store or
  other situations where the data content is not critically important
  or can be externally recovered if it becomes corrupt.

                               GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the filesystem,
which is very different from HAMMER1's B-Tree design.  Because the design
is copy-on-write it can be trivially snapshotted simply by making a copy
of the block table we desire to snapshot.  Snapshotting the root inode
effectively snapshots the entire filesystem, whereas snapshotting a file
inode only snapshots that one file.
Snapshotting a directory inode is
generally unhelpful since it only contains directory entries and the
underlying files are not arranged under it in the radix tree.

The copy-on-write design implements a block table as a radix tree,
with a small fan-out in the volume header and inode (typically 4x) and
a large fan-out for indirect blocks (typically 128x or 512x, depending).
The table is built bottom-up.  Intermediate radii are only created when
necessary, so small files and directories will have a much shallower radix
tree.

HAMMER2 implements several space optimizations:

    1. Directory entries with filenames <= 64 bytes will fit entirely
       in the 128-byte blockref structure and do not require additional data
       block references.  Since blockrefs are the core elements making up
       block tables, most directories should have good locality of reference
       for directory scans.

    2. Inodes embed 4 blockrefs, so files up to 256KB and directories with
       up to four directory entries (not including "." or "..") can be
       accommodated without requiring any indirect blocks.

    3. Indirect blocks can be sized to any power of two up to 65536 bytes,
       and H2 typically uses 16384 and 65536 bytes.  The smaller size is
       used for initial indirect blocks to reduce storage overhead for
       medium-sized files and directories.

    4. The file inode itself can directly hold the data for small
       files <= 512 bytes in size.

    5. The last block in a file will have a storage allocation in powers
       of 2 from 1KB to 64KB as needed.  Thus a small file in excess of
       512 bytes but less than 64KB will not waste a full 64KB block.

    6. When compression is enabled, small physical blocks will be allocated
       when possible.  However, only reductions in powers of 2 are supported.
       So if a 64KB data block compresses to anything between 16KB+1 and
       32KB, a 32KB block will be used.
       This gives H2 modest compression
       at very low cost without too much added complexity.

    7. Live de-dup will attempt to share data blocks when file copying is
       detected, significantly reducing actual physical writes to storage
       and the storage used.  Bulk de-dup (when implemented) will catch
       other cases of de-duplication.

Directories contain directory entries which are indexed using a hash of
their filename.  The hash is carefully designed to maintain some natural
sort ordering.

The copy-on-write nature of the filesystem means that any modification
whatsoever will have to eventually synchronize new disk blocks all the way
to the super-root of the filesystem and the volume header itself.  This forms
the basis for crash recovery and also ensures that recovery occurs on a
completed high-level transaction boundary.  All disk writes are to new blocks
except for the volume header (which cycles through 4 copies), thus allowing
all writes to run asynchronously and concurrently prior to and during a flush,
and then just doing a final synchronization and volume header update at the
end.  Many of HAMMER2's features are enabled by this core design feature.

The freemap is also implemented using a radix tree via a set of pre-reserved
blocks (approximately 4MB for every 2GB of storage), and also cycles through
multiple copies to ensure that crash recovery can restore the state of the
filesystem quickly at mount time.

HAMMER2 tries to maintain a small footprint and one way it does this is
by using the normal buffer cache for data and meta-data, and allowing the
kernel to asynchronously flush device buffers at any time (even during
synchronization).  The volume root is flushed separately, separated from
the asynchronous flushes by a synchronizing BUF_CMD_FLUSH op.
This means
that HAMMER2 has very low resource overhead from the point of view of the
operating system and is very much unlike HAMMER1, which had to lock dirty
buffers into memory for long periods of time.  HAMMER2 has no such
requirement.

Buffer cache overhead is very well bounded and can handle filesystem
operations of any complexity, even on boxes with very small amounts
of physical memory.  Buffer cache overhead is significantly lower with H2
than with H1 (and orders of magnitude lower than ZFS).

At some point I intend to implement a shortcut to make fsync()'s run fast,
and that is to allow deep updates to blockrefs to shortcut to auxiliary
space in the volume header to satisfy the fsync requirement.  The related
blockref is then recorded when the filesystem is mounted after a crash and
the update chain is reconstituted when a matching blockref is encountered
again during normal operation of the filesystem.

                     MIRROR_TID, MODIFY_TID, UPDATE_TID

In HAMMER2, the core block reference is a 128-byte structure called a blockref.
The blockref contains various bits of information including the 64-bit radix
key (typically a directory hash if a directory entry, inode number if a
hidden hardlink target, or file offset if a file block), the number of
significant key bits for ranged recursion of indirect blocks, a 64-bit device
seek that encodes the radix of the physical block size in the low bits
(the physical block size can be different from the logical block size due
to compression), three 64-bit transaction ids, type information, and up to
512 bits worth of check data for the block being referenced, which can be
anything from a simple CRC to a strong cryptographic hash.

mirror_tid - This is a media-centric (as in physical disk partition)
             transaction id which tracks media-level updates.  The mirror_tid
             can be different at the same point on different nodes in a
             cluster.
             Whenever any block in the media topology is modified, its
             mirror_tid is updated with the flush id and will propagate
             upward during the flush all the way to the volume header.

             mirror_tid is monotonic.  It is primarily used for on-mount
             recovery and volume root validation.  The name is historical,
             from H1; it is not used for nominal mirroring.

modify_tid - This is a cluster-centric (as in across all the nodes used
             to build a cluster) transaction id which tracks filesystem-level
             updates.

             modify_tid is updated when the front-end of the filesystem makes
             a change to an inode or data block.  It does NOT propagate upward
             during a flush.

update_tid - This is a cluster synchronization transaction id.  Modifications
             made to the topology will clear this field to 0 as they propagate
             up to the root.  This gives the synchronizer an easy way to
             determine what needs revalidation.

             The synchronizer revalidates the cluster bottom-up by validating
             a sub-topology and propagating the highest modify_tid in the
             validated sub-topology up via the update_tid field.

             Updates to this field may be optimized by the HAMMER2 VFS to
             avoid the double-transition.

The synchronization code updates an out-of-sync node bottom-up and will
dynamically set update_tid as it goes, but media flushes can occur at any
time and these flushes will use mirror_tid for flush and freemap management.
The mirror_tid for each flush propagates upward to the volume header on each
flush.  modify_tid is set for any chains modified by a cluster op but does
not propagate up, instead serving as a seed for update_tid.

* The synchronization code is able to determine that a sub-tree is
  synchronized simply by observing the update_tid at the root of the sub-tree,
  on an inode-by-inode basis and also on a data-block-by-data-block basis.
* The synchronization code is able to do an incremental update of an
  out-of-sync node simply by skipping elements with a matching update_tid
  (when not 0).

* The synchronization code can be interrupted and restarted at any time,
  and is able to pick up where it left off with very low overhead.

* The synchronization code does not inhibit media flushes.  Media flushes
  can occur (and must occur) while synchronization is ongoing.

There are several other stored transaction ids in HAMMER2.  There is a
separate freemap_tid in the volume header that is used to allow freemap
flushes to be deferred, and inodes have a pfs_psnap_tid which is used in
conjunction with CHECK_NONE to allow blocks without a check code which do
not violate the most recent snapshot to be overwritten in-place.

Remember that since this is a copy-on-write filesystem, we can propagate
a considerable amount of information up the tree to the volume header
without adding to the I/O we already have to do.

                          DIRECTORIES AND INODES

Directories are hashed.  In HAMMER2, a directory can contain a mix of
directory entries AND embedded inodes.  In the first iteration of HAMMER2
I tried really hard to embed inodes (since most files are not usually
hardlinked), but this created huge problems for NFS exports.  At the
moment the super-root directory utilizes embedded inodes and filesystems
under the super-root typically use normal directory entries.  The real
inodes are in the mounted root directory as hidden entries.

However, I reserve the right to implement embedded inodes within a
mount to create export domains which can serve as mini-roots.  Such
mini-roots would be able to have their own quotas and would be separately
snapshottable, but would also have to be exported separately from the
primary mount they exist under.
Hardlinks are implemented normally, with directory entries and the maintenance
of a nlinks count in the target inode.

                                  RECOVERY

H2 allows freemap flushes to lag behind topology flushes.  The freemap flush
tracks a separate transaction id (via mirror_tid) in the volume header.

On mount, HAMMER2 will first locate the highest-sequenced, check-code-validated
volume header from the 4 copies available (if the filesystem is big enough,
e.g. > ~10GB or so, there will be 4 copies of the volume header).

HAMMER2 will then run an incremental scan of the topology for mirror_tid
transaction ids between the last freemap flush tid and the last topology
flush tid in order to synchronize the freemap.  Because this scan is
incremental the time it takes to run will be relatively short and well-bounded
at mount-time.  This is NOT an fsck.  Freemap flushes can be avoided for any
number of normal topology flushes but should still occur frequently enough
to avoid long recovery times in case of a crash.

The filesystem is then ready for use.

                          DISK I/O OPTIMIZATIONS

The freemap implements a 1KB allocation resolution.  Each 2MB segment managed
by the freemap is zoned and has a tendency to collect inodes, small data,
indirect blocks, and larger data blocks into separate segments.  The idea is
to greatly improve I/O performance (particularly by laying inodes down next
to each other, which has a huge effect on directory scans).

The current implementation of HAMMER2 implements a fixed I/O block size
of 64KB in order to allow the mapping of hammer2_dio's in its IO subsystem
to consumers that might desire different sizes.  This way we don't have to
worry about matching the buffer cache / DIO cache to the variable block
size of underlying elements.
In addition, 64KB I/Os allow compatibility
with physical sector sizes up to 64KB in the underlying physical storage
with no change in the byte-by-byte format of the filesystem.

The biggest issue the fixed 64KB I/O size avoids is not nominal front-end
access performance but rather the complexity of having to deal with mixed
block sizes in the buffer cache, particularly when blocks are freed and
then later reused with a different block size.  HAMMER1 had to have
specialized code to check for and invalidate buffer cache buffers in the
free/reuse case.  HAMMER2 does not need such code.

That said, HAMMER2 places no major restrictions on mixing block sizes within
a 64KB block.  The only restriction is that a HAMMER2 block cannot cross
a 64KB boundary.  The soft restrictions the block allocator puts in place
exist primarily for performance reasons (i.e. try to collect 1K inodes
together).  The 2MB freemap zone granularity should work very well in this
regard.

HAMMER2 also allows OS support for ganging buffers together into even
larger blocks for I/O (OS buffer cache 'clustering'), OS-supported read-ahead,
OS-driven asynchronous retirement, and other performance features typically
provided by the OS at the block-level to ensure smooth system operation.

By avoiding wiring buffers/memory and allowing these features to run normally,
HAMMER2 winds up with very low OS overhead.

                               FREEMAP NOTES

The freemap is stored in the reserved blocks situated in the ~4MB reserved
area at the base of every ~1GB level-1 zone.  The current implementation
reserves 8 copies of every freemap block and cycles through them in order
to make the freemap operate in a copy-on-write fashion.

    - Freemap is copy-on-write.
    - Freemap operations are transactional, same as everything else.
    - All backup volume headers are consistent on-mount.
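The 8-copy rotation described above can be reduced to a one-line sketch:
each new version of a freemap block goes to the next of its 8 reserved
slots, so the previous version survives intact until it cycles around.
The constant and function name are illustrative only.

```python
# Sketch of the rotation noted above: writing a freemap block always targets
# the next of its 8 reserved copies, giving copy-on-write semantics without
# relocating the block.  Names here are hypothetical.
FREEMAP_COPIES = 8

def next_freemap_slot(current_slot: int) -> int:
    return (current_slot + 1) % FREEMAP_COPIES

assert next_freemap_slot(0) == 1
assert next_freemap_slot(7) == 0    # wraps, overwriting the oldest copy
```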
The Freemap is organized using the same radix blockmap algorithm used for
files and directories, but with fixed radix values.  For a maximally-sized
filesystem the Freemap will wind up being a 5-level-deep radix blockmap,
but the top level is embedded in the volume header, so insofar as performance
goes it is really just a 4-level blockmap.

The freemap radix allocation mechanism is also the same, meaning that it is
bottom-up and will not allocate unnecessary intermediate levels for smaller
filesystems.  The number of blockmap levels not including the volume header
for various filesystem sizes is as follows:

        up-to           #of freemap levels
        1GB             1-level
        256GB           2-level
        64TB            3-level
        16PB            4-level
        4EB             5-level
        16EB            6-level

The Freemap has bitmap granularity down to 16KB and a linear iterator that
can linearly allocate space down to 1KB.  Due to fragmentation it is possible
for the linear allocator to become marginalized, but it is relatively easy
to force a reallocation of small blocks every once in a while (like once a
year if you care at all), and once the old data cycles out of the snapshots,
or you also rewrite the snapshots (which you can do), the freemap should
wind up relatively optimal again.  Generally speaking I believe that
algorithms can be developed to make this a non-problem without requiring any
media structure changes.

In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations.  All the freemap does is
keep track of 100% free blocks, plus some extra bits for staging the bulkfree
scan.  The lack of ref-counting makes it possible to:

        - Completely trivialize HAMMER2's snapshot operations.
        - Allow any volume header backup to be used trivially.
        - Allow whole sub-trees to be destroyed without having to scan them.
        - Simplify normal crash recovery operations.
        - Simplify catastrophic recovery operations.

Normal crash recovery is simply a matter of doing an incremental scan
of the topology between the last flushed freemap TID and the last flushed
topology TID.  This usually takes only a few seconds and allows:

        - Freemap flushes to be deferred for any number of topology flush
          cycles.
        - The freemap to not be flushed for fsync, reducing fsync overhead.

                            FREEMAP - BULKFREE

Blocks are freed via a bulkfree scan, which is a two-stage meta-data scan.
Blocks are first marked as being possibly free and then finalized in the
second scan.  Live filesystem operations are allowed to run during these
scans and any freemap block that is allocated or adjusted after the first
scan will simply be re-marked as allocated and the second scan will not
transition it to being free.

The cost of not doing ref-count tracking is that HAMMER2 must perform two
bulkfree scans of the meta-data to determine which blocks can actually be
freed.  This can be complicated by the volume header backups and snapshots
which cause the same meta-data topology to be scanned over and over again,
but it is mitigated somewhat by keeping a cache of higher-level nodes to
detect when we would scan a sub-topology that we have already scanned.  Due
to the copy-on-write nature of the filesystem, such detection is easy to
implement.

Part of the ongoing design work is finding ways to reduce the scope of this
meta-data scan so the entire filesystem's meta-data does not need to be
scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low cost).  In other words, it's an area where
improvements can be made without any media format changes.
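The two-stage scan described above can be sketched with a trivial in-memory
"freemap".  Pass 1 candidate-marks every allocated block, concurrent live
activity re-marks anything it touches, and pass 2 frees only the blocks
still candidate-marked.  The states and function names here are hypothetical
stand-ins for the staging bits the real freemap keeps.

```python
# Illustrative two-pass bulkfree sketch (hypothetical names/states).
ALLOCATED, POSSIBLY_FREE, FREE = "alloc", "maybe", "free"

def bulkfree(freemap, reachable, allocated_during_scan):
    # pass 1: candidate-mark everything currently allocated
    for blk, st in freemap.items():
        if st == ALLOCATED:
            freemap[blk] = POSSIBLY_FREE
    # concurrent activity: blocks the live filesystem reaches or allocates
    # between the passes are simply re-marked as allocated
    for blk in reachable | allocated_during_scan:
        freemap[blk] = ALLOCATED
    # pass 2: finalize - only untouched candidates transition to free
    for blk, st in freemap.items():
        if st == POSSIBLY_FREE:
            freemap[blk] = FREE
    return freemap

fm = {1: ALLOCATED, 2: ALLOCATED, 3: ALLOCATED}
out = bulkfree(fm, reachable={1}, allocated_during_scan={3})
assert out == {1: ALLOCATED, 2: FREE, 3: ALLOCATED}
```

Block 2 is unreferenced and untouched between the passes, so only it
transitions to free; no ref-counts are consulted anywhere.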
Another advantage of operating the freemap like this is that some future
version of HAMMER2 might decide to completely change how the freemap works
and would be able to make the change with relatively low downtime.

                                 CLUSTERING

Clustering, as always, is the most difficult bit but we have some advantages
with HAMMER2 that we did not have with HAMMER1.  First, HAMMER2's media
structures generally follow the kernel's filesystem hierarchy, which allows
cluster operations to use topology cache and lock state.  Second,
HAMMER2's writable snapshots make it possible to implement several forms
of multi-master clustering.

The mount device path you specify serves to bootstrap your entry into
the cluster.  This is typically local media.  It can even be a ram-disk
that only contains placemarkers that help HAMMER2 connect to a fully
networked cluster.

With HAMMER2 you mount a directory entry under the super-root.  This entry
will contain a cluster identifier that helps HAMMER2 identify and integrate
with the nodes making up the cluster.  HAMMER2 will automatically integrate
*all* entries under the super-root when you mount one of them.  You have to
mount at least one for HAMMER2 to integrate the block device into the larger
cluster.

For cluster servers, every HAMMER2-formatted partition has a "LOCAL" MASTER
which can be mounted in order to make the rest of the elements under the
super-root available to the network.  (In a prior specification I emplaced
the cluster connections in the volume header's configuration space, but I
no longer do that.)

Connecting to the wider networked cluster involves setting up the /etc/hammer2
directory with appropriate IP addresses and keys.  The user-mode hammer2
service daemon maintains the connections and performs graph operations
via libdmsg.
Node types within the cluster:

    DUMMY	- Used as a local placeholder (typically in a ramdisk).

    CACHE	- Used as a local placeholder and cache (typically on an SSD).

    SLAVE	- A SLAVE in the cluster, can source data on quorum agreement.

    MASTER	- A MASTER in the cluster, can source and sink data on quorum
		  agreement.

    SOFT_SLAVE	- A SLAVE in the cluster, can source data locally without
		  quorum agreement (must be directly mounted).

    SOFT_MASTER	- A local MASTER but *not* a MASTER in the cluster.  Can
		  source and sink data locally without quorum agreement,
		  intended to be synchronized with the real MASTERs when
		  connectivity allows.  Operations are not coherent with the
		  real MASTERs even when they are available.

    NOTE: SNAPSHOT, AUTOSNAP, etc. represent sub-types, typically under a
	  SLAVE.  A SNAPSHOT or AUTOSNAP is a SLAVE sub-type that is no
	  longer synchronized against current masters.

    NOTE: Any SLAVE or other copy can be turned into its own writable MASTER
	  by giving it a unique cluster id, taking it out of the cluster that
	  originally spawned it.

There are four major protocols:

    Quorum protocol

	This protocol is used between MASTER nodes to vote on operations
	and resolve deadlocks.

	This protocol is used between SOFT_MASTER nodes in a sub-cluster
	to vote on operations, resolve deadlocks, determine what the latest
	transaction id for an element is, and to perform commits.

    Cache sub-protocol

	This is the MESI sub-protocol which runs under the Quorum
	protocol.  This protocol is used to maintain cache state for
	sub-trees to ensure that operations remain cache coherent.

	Depending on administrative rights, this protocol may or may
	not allow a leaf node in the cluster to hold a cache element
	indefinitely.
The administrative controller may preemptively
	downgrade a leaf with insufficient administrative rights
	without giving it a chance to synchronize any modified state
	back to the cluster.

    Proxy protocol

	The Quorum and Cache protocols only operate between MASTER
	and SOFT_MASTER nodes.  All other node types must use the
	Proxy protocol to perform similar actions.  This protocol
	differs in that proxy requests are typically sent to just
	one adjacent node; that node then maintains state and
	forwards the request or performs the required operation.
	When the link to the proxy is lost, the proxy automatically
	forwards a deletion of the state to the other nodes based on
	what it has recorded.

	If a leaf has insufficient administrative rights it may not
	be allowed to actually initiate a quorum operation, and may only
	be allowed to maintain partial MESI cache state or perhaps none
	at all (since cache state can block other machines in the
	cluster).  Instead, a leaf with insufficient rights will have to
	make do with a preemptive loss of cache state, and any allowed
	modifying operations will have to be forwarded to the proxy, which
	continues forwarding them until a node with sufficient
	administrative rights is encountered.

	To reduce issues and give the cluster more breathing room,
	sub-clusters made up of SOFT_MASTERs can be formed in order to
	provide full cache coherency within a subset of machines and yet
	still tie them into a greater cluster that they normally would not
	have such access to.  This effectively makes it possible to create
	a two- or three-tier fan-out of groups of machines which are
	cache-coherent within the group, but perhaps not between groups,
	and to use other means to synchronize between the groups.

    Media protocol

	This is basically the physical media protocol.
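The proxy forwarding described above can be sketched roughly as follows.
The Node class, its fields, and the forward_via_proxy helper are
hypothetical illustrations of the routing rule, not HAMMER2 structures:

```python
# Sketch: a leaf's modifying operation is handed from proxy to proxy
# until a node that can initiate the quorum operation is encountered.

class Node:
    def __init__(self, name, node_type, admin_ok=True):
        self.name = name
        self.node_type = node_type
        self.admin_ok = admin_ok        # administrative rights

    def can_initiate_quorum(self):
        # Only MASTER and SOFT_MASTER nodes speak the Quorum protocol
        # directly, and only with sufficient administrative rights.
        return self.node_type in ("MASTER", "SOFT_MASTER") and self.admin_ok

def forward_via_proxy(chain, op):
    # Each proxy records state and forwards the request to the next
    # adjacent node until a quorum-capable node is found.
    for node in chain:
        if node.can_initiate_quorum():
            return (node.name, op)      # this node initiates the quorum op
    raise RuntimeError("no quorum-capable node reachable")
```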
			MASTER & SLAVE SYNCHRONIZATION

With HAMMER2 I really want to be hard-nosed about the consistency of the
filesystem, including the consistency of SLAVEs (snapshots, etc).  In order
to guarantee consistency we take advantage of the copy-on-write nature of
the filesystem by forking consistent nodes and using the forked copy as the
source for synchronization.

Similarly, the target for synchronization is not updated on the fly but
instead is also forked, and the forked copy is updated.  When
synchronization is complete, forked sources can be thrown away and forked
copies can replace the original synchronization target.

This may seem complex, but 'forking a copy' is actually a virtually free
operation.  The top-level inode (under the super-root), on-media, is simply
copied to a new inode and poof, we have an unchanging snapshot to work with.

	- Making a snapshot is fast... almost instantaneous.

	- Snapshots are used for various purposes, including synchronization
	  of out-of-date nodes.

	- A snapshot can be converted into a MASTER or some other PFS type.

	- A snapshot can be forked off from its parent cluster entirely and
	  turned into its own writable filesystem, either as a single MASTER
	  or, across the cluster, by forking a quorum+ of existing MASTERs
	  and transferring them all to a new cluster id.

More complex is reintegrating the target once the synchronization is
complete.  For SLAVEs we just delete the old SLAVE and rename the copy to
the same name.  However, if the SLAVE is mounted and not optioned as a
static mount (that is, the mounter wants to see updates as they are
synchronized), a reconciliation must occur on the live mount to clean up
the vnode, inode, and chain caches and shift any remaining vnodes over to
the updated copy.
	- A mounted SLAVE can track updates made to the SLAVE, but the
	  actual mechanism is that the SLAVE PFS is replaced with an
	  updated copy, typically every 30-60 seconds.

Reintegrating a MASTER which has fallen out of the quorum due to being out
of date is also somewhat more complex.  The same updating mechanic is used;
we actually have to throw the 'old' MASTER away once the new one has been
updated.  However, if the cluster is undergoing heavy modifications the
updated MASTER will be out of date almost the instant its source is
snapshotted.  Reintegrating a MASTER thus requires a somewhat more complex
interaction.

	- If a MASTER is really out of date we can run one or more
	  synchronization passes concurrent with modifying operations.
	  The quorum can remain live.

	- A final synchronization pass is required with quorum operations
	  blocked in order to reintegrate the now up-to-date MASTER into
	  the cluster.


			    QUORUM OPERATIONS

Quorum operations can be broken down into HARD BLOCK operations and NETWORK
operations.  If your MASTERs are all local mounts, then failures and
sequencing are easy to deal with.

Quorum operations on a networked cluster are more complex.  The problems:

	- Masters cannot rely on clients to moderate quorum transactions.
	  Apart from that reliance being unsafe, the client could also
	  lose contact with one or more masters during the transaction and
	  leave one or more masters out-of-sync without the master(s)
	  knowing they are out of sync.

	- When many clients are present, we do not want a flakey network
	  link from one of them to cause one or more masters to go out of
	  synchronization and potentially stall the whole works.

	- Normal hammer2 mounts allow a virtually unlimited number of
	  modifying transactions between actual flushes.  The media flush
	  rolls everything up into a single transaction id per flush.
Detection of 'missing'
	  transactions in a concurrent multi-client setup when one or more
	  clients temporarily lose connectivity is thus difficult.

	- Clients have a limited amount of time to reconnect to a cluster
	  after a network disconnect before their MESI cache states are lost.

	- Clients may proceed with several transactions before knowing for
	  sure that earlier transactions were completely successful.
	  Performance is important; we won't be waiting for a full
	  quorum-verified synchronous flush to media before allowing a
	  system call to return.

	- Masters can likewise decide that a client's MESI cache states
	  were lost (i.e. that the transaction was too slow).

The solutions (for modifying transactions):

	- Masters handle quorum confirmation amongst themselves and do not
	  rely on the client for that purpose.

	- A client can connect to one or more masters regardless of the size
	  of the quorum and can submit modifying operations to a single
	  master if desired.  The master will take care of the rest.

	  A client must still validate the quorum (and obtain MESI cache
	  states) when doing read-only operations in order to present the
	  correct data to the user process for the VOP.

	- Masters will run a 2-phase commit amongst themselves, often
	  concurrent with other non-conflicting transactions, and will
	  serialize operations and/or enforce synchronization points for
	  2-phase completion on serialized transactions from the same client
	  or when cache state ownership is shifted from one client to
	  another.

	- Clients will usually allow operations to run asynchronously and
	  return from system calls more or less ASAP once they own the
	  necessary cache coherency locks.
The client can select the
	  validation mode to wait for with mount options:

	  (1) Fully async		(mount -o async)
	  (2) Wait for phase-1 ack	(mount)
	  (3) Wait for phase-2 ack	(mount -o sync)	(fsync - wait p2ack)
	  (4) Wait for flush		(mount -o sync)	(fsync - wait flush)

	  Modifying system calls cannot be told to wait for a full media
	  flush, as full media flushes are prohibitively expensive.  You
	  still have to fsync().

	  The fsync wait mode for network links can be selected, either to
	  return after the phase-2 ack or to return after the media flush.
	  The default is to wait for the phase-2 ack, which at least
	  guarantees that a network failure after that point will not
	  disrupt operations issued before the fsync.

	- Clients must adjust the chain state for modifying operations prior
	  to releasing chain locks / returning from the system call, even if
	  the masters have not finished the transaction.  A late failure by
	  the cluster will result in desynchronized state which requires
	  erroring out the whole filesystem or resynchronizing somehow.

	- Clients can opt to keep a record of transactions through the
	  phase-2 ack or the actual media flush on the masters.

	  However, replaying/revalidating the log cannot necessarily
	  guarantee success.  If the masters lose synchronization due to
	  network issues between masters (or if the client was mounted
	  fully-async), or if enough masters crash simultaneously such that
	  a quorum fails to flush even after the phase-2 ack, then it is
	  possible that by the time a client is able to replay/revalidate,
	  some other client has squeezed in and committed something that
	  would conflict.

	  If the client crashes it works similarly to a crash with a local
	  storage mount... many dirty buffers might be lost.  And the same
	  happens in the cluster case.
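The master-side 2-phase commit described above can be sketched roughly like
this.  The Master class and its prepare/commit/abort hooks are hypothetical
illustrations, not HAMMER2 code:

```python
# Sketch of a 2-phase commit run amongst the masters.  The client can be
# answered at various points (phase-1 ack, phase-2 ack, or media flush)
# depending on the selected validation mode.

class Master:
    def __init__(self, up=True):
        self.up = up                # reachable over the network?
        self.committed = []

    def prepare(self, txn):
        # Phase 1: validate and stage the transaction.  An unreachable
        # master cannot acknowledge.
        return self.up

    def commit(self, txn):
        # Phase 2: make the staged transaction visible.  The media flush
        # itself may still be deferred.
        self.committed.append(txn)

    def abort(self, txn):
        # Roll back a staged transaction.
        pass

def two_phase_commit(masters, txn):
    quorum = len(masters) // 2 + 1
    # Phase 1: collect prepare acks.
    acks = [m for m in masters if m.prepare(txn)]
    if len(acks) < quorum:
        # Not enough masters prepared; roll back the ones that did.
        for m in acks:
            m.abort(txn)
        return False
    # Phase 2: commit on every master that prepared.
    for m in acks:
        m.commit(txn)
    return True
```

An out-of-date or unreachable master simply misses the transaction and must
later be reintegrated via the snapshot-based synchronization described in
the previous section.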
			    TRANSACTION LOG

Keeping a short-term transaction log, much less being able to properly
replay it, is fraught with difficulty, and I've made it a separate
development task.  For now HAMMER2 does not have one.