			    HAMMER2 DESIGN DOCUMENT

				Matthew Dillon
				 08-Feb-2012
				 14-May-2013
			     dillon@backplane.com

* These features have been speced in the media structures.

* Implementation work has begun.

* The filesystem core is now operational; cluster messaging links are
  primitive but work (and are fully encrypted).  Work continues on the
  block allocator, and work has not yet begun on copies, block-encryption,
  block-compression, mirroring, or quorum/cluster ops.

* Obviously a fully functional filesystem is not yet ready, but once the
  freemap and the backend garbage collector are implemented the HAMMER2
  filesystem will be usable.  Missing many features, but usable.

* Design of all media elements is complete.

				Feature List

* Multiple roots (allowing snapshots to be mounted).  This is implemented
  via the super-root concept.  When mounting a HAMMER2 filesystem you
  specify a device path and a directory name in the super-root.  (HAMMER1
  had only one root.)

* Roots are really no different from snapshots (HAMMER1 distinguished
  between its root mount and its PFS's.  HAMMER2 does not).

* Snapshots are writable (in HAMMER1 snapshots were read-only).

* Snapshots are explicit but trivial to create.  In HAMMER1 snapshots were
  both explicit and fine-grained/automatic.  HAMMER2 does not implement
  automatic fine-grained snapshots.  H2 snapshots are cheap enough that you
  can create fine-grained snapshots yourself if you desire.

* HAMMER2 flushes formalize a synchronization point: the flush waits for
  all running modifying operations to complete to memory (not to disk)
  while temporarily stalling the initiation of new modifying operations.
  The flush is then able to proceed concurrently with the unstalling and
  with new modifying operations running.

* The flush is fully meta-data-synchronized in HAMMER2.  In HAMMER1 it was
  possible for flushes to bisect inode creation vs directory entry creation
  and to create problems with directory renames.  HAMMER2 has no issues
  with any of these.  Dealing with data synchronization is another matter,
  but it should be possible to address explicit write()s properly.
  mmap()'d R+W data... not so easy.

* Directory sub-hierarchy-based quotas for space and inode usage tracking.
  Any directory can be used.

* Low memory footprint.  Except for the volume header, the buffer cache
  is completely asynchronous and dirty buffers can be retired by the OS
  directly to backing store with no further interactions with the
  filesystem.

* Incremental queueless mirroring / mirroring-streams.  Because HAMMER2 is
  block-oriented and copy-on-write, each blockref tracks both direct
  modifications to the referenced data via (modify_tid) and indirect
  modifications to the referenced data or any sub-tree via (mirror_tid).
  This makes it possible to do an incremental scan of meta-data that covers
  only the changes made since the mirror_tid recorded in a prior run.

  This feature is also intended to be used to locate recently allocated
  blocks and thus be able to fix up the freemap after a crash.

  HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in
  that HAMMER2 does not keep track of 'deleted' records.  Instead, any
  recursion by the mirroring code which finds that (modify_tid) has
  been updated must also send the direct block table or indirect block
  table state it winds up recursing through, so the target can check
  similar key ranges and locate elements to be deleted.  This can be
  avoided if the mirroring stream is mostly caught up, in that very recent
  deletions will be cached in memory and can be queried, allowing shorter
  record deletions to be passed in the stream instead.

* Will support multiple compression algorithms, configured on a
  subdirectory tree basis and on a file basis.  Up to 64K block compression
  will be used.  Only compression ratios near powers of 2 that are at least
  2:1 (e.g. 2:1, 4:1, 8:1, etc) will work in this scheme because physical
  block allocations in HAMMER2 are always power-of-2.

  Compression algorithm #0 will mean no compression and no zero-checking.
  Compression algorithm #1 will mean zero-checking but no other
  compression.  Real compression will be supported starting with
  algorithm 2.

* Zero detection on write (writing all-zeros), which requires the data
  buffer to be scanned, will be supported as compression algorithm #1.
  This allows the writing of 0's to create holes and will be the default
  compression algorithm for HAMMER2.

* Copies support for redundancy.  Each copy has its own blockref.  The
  blockrefs representing the copies must exist within the same blockset
  (set of 8 blockrefs), though I may relax this requirement in the
  implementation.

  The design is such that the filesystem should be able to function at
  full speed even if disks are pulled or inserted, as long as at least one
  good copy is present.  A background task will be needed to resynchronize
  missing copies (or remove excessive copies in the case where the copies
  value is reduced on a live filesystem).

  Copies are specified using the same copyinfo[] array that is used to
  specify cluster interconnections for PFS's.

* Clusterable with MESI cache coherency and dynamic granularity.
  The media format for HAMMER1 was less conducive to logical clustering
  than I had hoped, so I was never able to get that aspect of my personal
  goals working with HAMMER1.  HAMMER2 effectively solves the issues that
  cropped up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect
  the logical file/directory hierarchy, making cache coherency very
  difficult).
* Hardlinks will be supported.  All other standard features will be
  supported too, of course.  Hardlinks in this sort of filesystem require
  significant work.

* The media blockref structure is now large enough to support up to a
  192-bit check value, which would typically be a cryptographic hash of
  some sort.  Multiple check value algorithms will be supported, with the
  default being a simple 32-bit iSCSI CRC.

* Fully verified deduplication will be supported and automatic (and
  necessary in many respects).

* Non-verified deduplication will be supported as a configurable option on
  a file or subdirectory tree basis.  Non-verified deduplication would use
  the largest available check code (192 bits) and not bother to verify
  that the data matches during the dedup pass, which is necessary on
  extremely large filesystems with a great deal of deduplicable data (as
  otherwise a large chunk of the media would have to be read to implement
  the dedup).

  This feature is intended only for those files where occasional
  corruption is ok, such as in a large data store of farmed web content.

				GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the
filesystem, which is very different from HAMMER1's B-Tree design.  Because
the design is copy-on-write it can be trivially snapshotted simply by
referencing an existing block, and because the media structures logically
match a standard filesystem directory/file hierarchy, snapshots and other
similar operations can be trivially performed on an entire subdirectory
tree at any level in the filesystem.

The copy-on-write nature of the filesystem implies that any modification
whatsoever will have to eventually synchronize new disk blocks all the way
to the super-root of the filesystem and the volume header itself.  This
forms the basis for crash recovery.  All disk writes are to new blocks
except for the volume header, thus allowing all writes to run concurrently
except for the volume header update at the end.

Clearly this method requires intermediate modifications to the chain to be
cached so multiple modifications can be aggregated prior to being
synchronized.  One advantage, however, is that the cache can be flushed at
any time WITHOUT having to allocate yet another new block when further
modifications are made, as long as the volume header has not yet been
flushed.  This means that buffer cache overhead is very well bounded and
can handle filesystem operations of any complexity, even on boxes with
very small amounts of physical memory.

I intend to implement a shortcut to make fsync()s run fast, and that is to
allow deep updates to blockrefs to shortcut to auxiliary space in the
volume header to satisfy the fsync requirement.  The related blockref is
then recorded when the filesystem is mounted after a crash, and the update
chain is reconstituted when a matching blockref is encountered again
during normal operation of the filesystem.

Basically this means that no real work needs to be done at mount-time
even after a crash.

Directories are hashed, and another major design element is that directory
entries ARE INODES.  They are one and the same.  In addition to directory
entries being inodes, the data for very small files (512 bytes or smaller)
can be directly embedded in the inode (overloaded onto the same space that
the direct blockref array uses).  This should result in very high
performance.

Inode numbers are not spatially referenced, which complicates NFS servers
but doesn't complicate anything else.  The inode number is stored in the
inode itself, an absolutely necessary feature in order to support the
hugely flexible snapshots that we want to have in HAMMER2.
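The directory-entry/inode unification and the small-file embedding
described above can be sketched as a C structure.  This is a minimal
illustration, not the actual HAMMER2 media layout: the field names and
sizes here are assumptions, with only the 512-byte embedding limit and the
8-blockref blockset taken from the text.

```c
#include <stdint.h>

#define H2_EMBEDDED_BYTES 512           /* per the text: files <= 512 bytes */

struct blockref_sketch {
	uint64_t data_off;              /* media offset of referenced block */
	uint64_t modify_tid;            /* direct modification */
	uint64_t mirror_tid;            /* any change in the sub-tree */
	/* ... check code, key, etc. elided ... */
};

struct inode_sketch {
	uint64_t inum;                  /* inode number stored in the inode */
	uint64_t size;
	uint32_t flags;
#define H2_INODE_DIRECTDATA 0x0001      /* u.data[] valid, no blockset */
	union {
		struct blockref_sketch blockset[8];  /* normal indirection */
		uint8_t data[H2_EMBEDDED_BYTES];     /* small-file fast path */
	} u;
};
```

The point of the union is that the embedded data costs nothing: it simply
overloads the space the direct blockref array would otherwise occupy.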
			    DISK I/O OPTIMIZATIONS

The freemap implements a 1KB allocation resolution.  The minimum I/O size
is 16KB.  HAMMER2 typically implements 16KB and 64KB physical I/O sizes
and will cluster larger I/Os.

Each 2MB segment managed by the freemap handles just one particular
physical I/O size.  Typically this means that inodes, small data, and
initial (small) indirect blocks get clustered together, and also that
large 64KB file-data and indirect blocks get clustered together.

				HARDLINKS

Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
a spatial reference to the inode number.  We do not want to have to have
an index of inode numbers for any basic HAMMER2 feature if we can help it.

Hardlinks are handled by placing the inode for a multiply-hardlinked file
in the closest common parent directory.  If "a/x" and "a/y" are
hardlinked, the inode for the hardlinked file will be placed in directory
"a", e.g. "a/3239944", but it will be invisible and will reside in an
out-of-band namespace.  The directory entries "a/x" and "a/y" will be
given the same inode number but will in fact just be placemarks that cause
HAMMER2 to recurse upwards through the directory tree to find the
invisible inode.

Because directories are hashed and a different namespace (hash key range)
is used for hardlinked inodes, standard directory scans are able to
trivially skip this invisible namespace, and inode-specific lookups can
restrict their lookup to within this space.

The nature of snapshotting makes handling the link-count 2->1 and 1->2
cases trivial.  Basically the inode media structure is copied as needed to
break up or re-form the standard directory entry/inode.  There are no
backpointers in HAMMER2 and no reference counts on the blocks (see FREEMAP
NOTES below), so it is an utterly trivial operation.
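The upward recursion can be sketched as follows.  This is a toy in-memory
model for illustration only: the `h2_dir` structure and the flat `hidden`
array standing in for the out-of-band hash namespace are invented here,
not taken from the HAMMER2 implementation.

```c
#include <stddef.h>
#include <stdint.h>

struct h2_inode {
	uint64_t inum;
};

struct h2_dir {
	struct h2_dir *parent;          /* NULL at the filesystem root */
	struct h2_inode *hidden;        /* toy stand-in for the invisible
					 * hardlink namespace */
	size_t nhidden;
};

/* Look up inum in one directory's invisible namespace. */
struct h2_inode *
dir_find_hidden(struct h2_dir *dir, uint64_t inum)
{
	for (size_t i = 0; i < dir->nhidden; i++) {
		if (dir->hidden[i].inum == inum)
			return &dir->hidden[i];
	}
	return NULL;
}

/*
 * A directory entry that is really a hardlink placemark carries only the
 * target inode number; the real inode lives in some common parent
 * directory, so walk upward until it is found.
 */
struct h2_inode *
hardlink_resolve(struct h2_dir *dir, uint64_t inum)
{
	struct h2_inode *ip;

	for (; dir != NULL; dir = dir->parent) {
		if ((ip = dir_find_hidden(dir, inum)) != NULL)
			return ip;
	}
	return NULL;    /* no common parent holds it: corrupt link */
}
```

Since the real inode is always at or above the closest common parent, the
walk terminates at the first hit and never needs a global inode index.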
				FREEMAP NOTES

In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations.  The freemap, which is
still under design, just won't do that.  All the freemap does is keep
track of 100% free blocks.

This not only trivializes all the snapshot features, it also trivializes
hardlink handling and solves the problem of keeping the freemap
synchronized in the event of a crash.  Now all we have to do after a crash
is make sure blocks allocated before the freemap was flushed are properly
marked as allocated in the allocmap.  This is a trivial exercise using the
same algorithm the mirror streaming code uses (which is very similar to
HAMMER1's)... an incremental meta-data scan that covers only the blocks
that might have been allocated between the last allocation map sync and
now.

Thus the freemap does not have to be synchronized during an fsync().

The complexity is in figuring out what can be freed... that is, when one
can mark blocks in the freemap as being free.  HAMMER2 implements this as
a background task which essentially must scan available meta-data to
determine which blocks are not being referenced.

Part of the ongoing design work is finding ways to reduce the scope of
this meta-data scan so the entire filesystem's meta-data does not need to
be scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low-cost).  In other words, it's an area that we
can continue to improve on as the filesystem matures.  Not only that, but
we can completely change the freemap algorithms without creating
incompatibilities (at worst simply having to require that a R+W mount do
a full meta-data scan when upgrading or downgrading the freemap
algorithm).

				CLUSTERING

Clustering, as always, is the most difficult bit, but we have some
advantages with HAMMER2 that we did not have with HAMMER1.  First,
HAMMER2's media structures generally follow the kernel's filesystem
hierarchy.  Second, HAMMER2's writable snapshots make it possible to
implement several forms of multi-master clustering.

The mount device path you specify serves to bootstrap your entry into the
cluster.  This can be local media or directly specify a network cluster
connection (or several).  When a local media mount is used, the volume
header is scanned for local copies and the best volume header is selected
from all available copies.  Multiple devices may be specified for
redundancy.

The volume header on local media also contains cluster connection
specifications keyed by super-root pfsid.  Network connections are
maintained to all targets.  ALL ELEMENTS ARE TREATED ACCORDING TO TYPE
NO MATTER WHICH ONE YOU MOUNT FROM.

The actual networked cluster may be far larger than the elements you list
in the hammer2_copy_data[] array, but your machine will only make direct
connections as specified by the array.

In the simplest case you simply network a few machines together as ring 0
masters and each client connects directly to all the masters (and/or the
clients are the masters themselves).  Thus any quorum operation is
straightforward.  These master nodes are labeled 'ring 0'.

If you have too many clients to reasonably connect directly, you set up
sub-clusters as satellites.  This is called 'ring 1'.  Ring 1 may contain
several sub-clusters.  A client then connects to all the nodes in a
particular sub-cluster (typically 3).  The quorum protocol runs as per
normal except that once the operation is resolved against the sub-cluster,
an aggregation must be resolved against the master nodes (ring 0).  The
sub-cluster does this for the client... all the client sees is the normal
quorum operation against the sub-cluster.

Since each node in the sub-cluster connects to all master nodes we get a
multiplication.  If we set a reasonable upper limit of, say, 256
connections at each master node, then ring 1 may contain 85 sub-clusters
of 3 nodes each.

In the most complex case, when one wishes to support potentially millions
of clients, further fan-out is required into ring 2, ring 3, and so forth.
However, each sub-cluster in ring 2 must connect to only 1 sub-cluster in
ring 1 (otherwise the cache state will become mightily confused).  Using
reasonable metrics this will allow ring 2 to contain 85 * 85 = 7225
sub-clusters.  At this point you could have 1000 clients connect to each
sub-cluster and support 7.2 million clients, and if that isn't enough,
each additional ring multiplies capacity by 85 again, and so forth.

Each ring imposes additional latencies for cache operations, but the key
to making this work efficiently is that the satellite clusters can
negotiate coarse-grained cache coherency locks with the next lower ring
and then fan out finer-grained locks to the next higher ring.  Since
caching can occur anywhere (including on the connecting client), it is the
cache coherency lock that ultimately dictates efficiency and allows a
client (or satellite) to access large amounts of data from local storage.

Modifying operations, particularly commits, also have higher latencies
when multiple rings are in use.  In this situation it is possible to
short-cut localized operations by having competing clients connect to
sub-clusters which are near each other topologically... having the
competing clients connect to the same sub-cluster would be the most
optimal.

In addition, sub-clusters (typically in ring 1) can act in SOFT_MASTER
mode, which allows the sub-cluster to acknowledge a full commit within its
own quorum only, and then resolve asynchronously to the masters in ring 0.
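The fan-out arithmetic above can be captured in a few lines.  This
function and its parameters are illustrative only, not part of any HAMMER2
API; the only inputs taken from the text are the 256-connection limit per
master and the 3-node sub-clusters, giving an 85x multiplier per ring.

```c
/*
 * Back-of-the-envelope client capacity for a given ring depth.
 * Each ring multiplies the number of sub-clusters by 256/3 = 85.
 */
unsigned long
ring_clients(unsigned ring, unsigned long clients_per_subcluster)
{
	const unsigned long fanout = 256 / 3;   /* 85 sub-clusters/level */
	unsigned long subs = 1;

	while (ring-- > 0)
		subs *= fanout;
	return subs * clients_per_subcluster;
}
```

For example, ring 2 with 1000 clients per sub-cluster gives
85 * 85 * 1000 = 7,225,000 clients, matching the 7.2 million figure in the
text.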
The nodes in these intermediate rings can be pure proxies with only memory
caches, can use local media for a persistent cache, or can use local media
to completely slave the filesystem.

    ADMIN       - Media does not participate, administrative proxy only
    CLIENT      - Media does not participate, client only
    CACHE       - Media only acts as a persistent cache
    COPY        - Media only acts as a local copy
    SLAVE       - Media is a RO slave that can be mounted RW

    SOFT_SLAVE  - This is a SLAVE which can become writable when
                  the quorum is not available, but is not guaranteed
                  to be able to be merged back when the quorum becomes
                  available again.  Elements which cannot be merged
                  back remain localized and writable until manual
                  or scripted intervention recombines them.

    SOFT_MASTER - Similar to the above but can form a sub-cluster
                  and run the quorum protocol within the sub-cluster
                  to serve machines that connect to the sub-cluster
                  when the master cluster is not available.

                  The SOFT_MASTER nodes in a sub-cluster must be
                  fully interconnected with each other.

    MASTER      - This is a MASTER node in the quorum protocol.

                  The MASTER nodes in a cluster must be fully
                  interconnected with each other.

There are four major protocols:

    Quorum protocol

        This protocol is used between MASTER nodes to vote on operations
        and resolve deadlocks.

        This protocol is also used between SOFT_MASTER nodes in a
        sub-cluster to vote on operations, resolve deadlocks, determine
        what the latest transaction id for an element is, and to perform
        commits.

    Cache sub-protocol

        This is the MESI sub-protocol which runs under the Quorum
        protocol.  This protocol is used to maintain cache state for
        sub-trees to ensure that operations remain cache coherent.

        Depending on administrative rights, this protocol may or may
        not allow a leaf node in the cluster to hold a cache element
        indefinitely.  The administrative controller may preemptively
        downgrade a leaf with insufficient administrative rights
        without giving it a chance to synchronize any modified state
        back to the cluster.

    Proxy protocol

        The Quorum and Cache protocols only operate between MASTER
        and SOFT_MASTER nodes.  All other node types must use the
        Proxy protocol to perform similar actions.  This protocol
        differs in that proxy requests are typically sent to just
        one adjacent node, and that node then maintains state and
        forwards the request or performs the required operation.
        When the link to the proxy is lost, the proxy automatically
        forwards a deletion of the state to the other nodes based on
        what it has recorded.

        If a leaf has insufficient administrative rights it may not
        be allowed to actually initiate a quorum operation and may only
        be allowed to maintain partial MESI cache state, or perhaps none
        at all (since cache state can block other machines in the
        cluster).  Instead, a leaf with insufficient rights will have to
        make do with a preemptive loss of cache state, and any allowed
        modifying operations will have to be forwarded to the proxy,
        which continues forwarding them until a node with sufficient
        administrative rights is encountered.

        To reduce issues and give the cluster more breadth, sub-clusters
        made up of SOFT_MASTERs can be formed in order to provide full
        cache coherency within a subset of machines while still tying
        them into a greater cluster that they normally would not have
        such access to.  This effectively makes it possible to create a
        two- or three-tier fan-out of groups of machines which are
        cache-coherent within the group, but perhaps not between groups,
        and to use other means to synchronize between the groups.

    Media protocol

        This is basically the physical media protocol.

There are lots of ways to implement multi-master environments using the
above core features, but the implementation is going to be fairly complex
even with HAMMER2's feature set.

Keep in mind that modifications propagate all the way to the super-root
and volume header, so in any clustered arrangement the use of (modify_tid)
and (mirror_tid) is critical in determining the synchronization state of
portion(s) of the filesystem.

Specifically, since any modification propagates to the root, the
(mirror_tid) in higher-level directories is going to be in a constant
state of flux.  This state of flux DOES NOT invalidate the cache state
for these higher levels of directories.  Instead, the (modify_tid) is
used on a node-by-node basis to determine cache state at any given level,
and (mirror_tid) is used to determine whether any recursively underlying
state is desynchronized.  The inode structure also has two additional
transaction ids used to optimize path lookups, stat, and directory
lookup/scan operations.
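The (modify_tid)/(mirror_tid) recursion that both the mirroring stream and
the desynchronization check rely on can be sketched as follows: an
unchanged mirror_tid prunes an entire sub-tree, while modify_tid selects
the individual blockrefs that were directly modified.  The structures
here are illustrative, not the real HAMMER2 media format.

```c
#include <stddef.h>
#include <stdint.h>

struct bref_sketch {
	uint64_t modify_tid;            /* direct modification */
	uint64_t mirror_tid;            /* any change in this sub-tree */
	struct bref_sketch *children;   /* NULL for leaf blocks */
	size_t nchildren;
};

/*
 * Count (and optionally emit) every blockref changed since last_tid.
 * Sub-trees whose mirror_tid has not advanced past last_tid are
 * skipped entirely, which is what makes the scan incremental.
 */
size_t
incremental_scan(const struct bref_sketch *bref, uint64_t last_tid,
    void (*emit)(const struct bref_sketch *))
{
	size_t count = 0;

	if (bref->mirror_tid <= last_tid)       /* nothing changed below */
		return 0;

	if (bref->modify_tid > last_tid) {      /* this block changed */
		if (emit != NULL)
			emit(bref);
		count++;
	}
	for (size_t i = 0; i < bref->nchildren; i++)
		count += incremental_scan(&bref->children[i], last_tid, emit);
	return count;
}
```

The same shape serves the mirroring stream (pass an emit callback that
sends blockrefs to the target) and the post-crash freemap fixup (count
and mark blocks allocated since the last freemap sync).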