			    HAMMER2 DESIGN DOCUMENT

				Matthew Dillon
				  08-Feb-2012
			     dillon@backplane.com

* These features have been speced in the media structures.

* Implementation work has begun.

* A working filesystem with some features implemented is expected by
  July 2012.

* A fully functional filesystem with most (but not all) features is expected
  by the end of 2012.

* All elements of the filesystem have been designed except for the freemap
  (which isn't needed for initial work).  8MB per 2GB of filesystem
  storage has been reserved for the freemap.  The design of the freemap
  is expected to be completely speced by mid-year.

* This is my only project this year.  I'm not going to be doing any major
  kernel bug hunting this year.

			       Feature List

* Multiple roots (allowing snapshots to be mounted).  This is implemented
  via the super-root concept.  When mounting a HAMMER2 filesystem you
  specify a device path and a directory name in the super-root.

* HAMMER1 had PFSs.  HAMMER2 does not.  Instead, in HAMMER2 any directory
  in the tree can be configured as a PFS, causing all elements recursively
  underneath that directory to become a part of that PFS.

* Writable snapshots.  Any subdirectory tree can be snapshotted.  Snapshots
  show up in the super-root.  It is possible to snapshot a subdirectory
  and then later snapshot a parent of that subdirectory... really there are
  no limitations here.

* Directory sub-hierarchy based quotas and space and inode usage tracking.
  Any directory sub-tree, whether at a mount point or not, tracks aggregate
  inode use and data space use.  This is stored in the directory inode all
  the way up the chain.

* Incremental queueless mirroring / mirroring-streams.
  Because HAMMER2 is
  block-oriented and copy-on-write, each blockref tracks both direct
  modifications to the referenced data via (modify_tid) and indirect
  modifications to the referenced data or any sub-tree via (mirror_tid).
  This makes it possible to do an incremental scan of meta-data that covers
  only changes made since the mirror_tid recorded in a prior run.

  This feature is also intended to be used to locate recently allocated
  blocks and thus be able to fix up the freemap after a crash.

  HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in
  that HAMMER2 does not keep track of 'deleted' records.  Instead any
  recursion by the mirroring code which finds that (modify_tid) has
  been updated must also send the direct block table or indirect block
  table state it winds up recursing through so the target can check
  similar key ranges and locate elements to be deleted.  This can be
  avoided if the mirroring stream is mostly caught up, in that very recent
  deletions will be cached in memory and can be queried, allowing shorter
  record deletions to be passed in the stream instead.

* Multiple compression algorithms will be supported, configured on a
  subdirectory-tree basis and on a file basis.  Up to 64K block compression
  will be used.  Only compression ratios near powers of 2 that are at least
  2:1 (e.g. 2:1, 4:1, 8:1, etc) will work in this scheme because physical
  block allocations in HAMMER2 are always power-of-2.

  Compression algorithm #0 will mean no compression and no zero-checking.
  Compression algorithm #1 will mean zero-checking but no other compression.
  Real compression will be supported starting with algorithm 2.

* Zero detection on write (writing all-zeros), which requires the data
  buffer to be scanned, will be supported as compression algorithm #1.
  This allows the writing of 0's to create holes and will be the default
  compression algorithm for HAMMER2.
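  As a sketch of what algorithm #1's scan amounts to (the function name and
  interface are illustrative, not HAMMER2's actual code; it assumes block
  sizes are a multiple of 8 bytes, which power-of-2 blocks always are):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch of "compression algorithm #1": scan the write
 * buffer and report whether it is entirely zero.  If it is, no physical
 * block needs to be allocated and the write becomes a hole.
 */
static int
buffer_is_zero(const void *data, size_t bytes)
{
	const uint64_t *p = data;
	size_t n = bytes / sizeof(uint64_t);
	size_t i;

	for (i = 0; i < n; ++i) {
		if (p[i] != 0)
			return 0;	/* non-zero data, must allocate */
	}
	return 1;			/* all zero: create a hole */
}
```

  The scan is the entire cost of this "compression" mode, which is why it can
  reasonably be the default.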
* Copies support for redundancy.  Each copy has its own blockref.  The
  blockrefs representing the copies must exist within the same blockset
  (set of 8 blockrefs), though I may relax this requirement in the
  implementation.

  The design is such that the filesystem should be able to function at
  full speed even if disks are pulled or inserted, as long as at least one
  good copy is present.  A background task will be needed to resynchronize
  missing copies (or remove excessive copies in the case where the copies
  value is reduced on a live filesystem).

  Copies are specified using the same copyinfo[] array that is used to
  specify cluster interconnections for PFSs.

* Clusterable with MESI cache coherency and dynamic granularity.
  The media format for HAMMER1 was less conducive to logical clustering
  than I had hoped, so I was never able to get that aspect of my personal
  goals working with HAMMER1.  HAMMER2 effectively solves the issues that
  cropped up with HAMMER1 (mainly that HAMMER1's B-Tree did not reflect the
  logical file/directory hierarchy, making cache coherency very difficult).

* Hardlinks will be supported.  All other standard features will be
  supported too, of course.  Hardlinks in this sort of filesystem require
  significant work.

* The media blockref structure is now large enough to support up to a
  192-bit check value, which would typically be a cryptographic hash of
  some sort.  Multiple check value algorithms will be supported, with the
  default being a simple 32-bit iSCSI CRC.

* Fully verified deduplication will be supported and automatic (and
  necessary in many respects).

* Non-verified deduplication will be supported as a configurable option on
  a file or subdirectory tree.
  Non-verified deduplication would use the
  largest available check code (192 bits) and not bother to verify data
  matches during the dedup pass, which is necessary on extremely large
  filesystems with a great deal of deduplicable data (as otherwise a large
  chunk of the media would have to be read to implement the dedup).

  This feature is intended only for those files where occasional corruption
  is ok, such as in a large data store of farmed web content.

			      GENERAL DESIGN

HAMMER2 generally implements a copy-on-write block design for the
filesystem, which is very different from HAMMER1's B-Tree design.  Because
the design is copy-on-write it can be trivially snapshotted simply by
referencing an existing block, and because the media structures logically
match a standard filesystem directory/file hierarchy, snapshots and other
similar operations can be trivially performed on an entire subdirectory
tree at any level in the filesystem.

The copy-on-write nature of the filesystem implies that any modification
whatsoever will have to eventually synchronize new disk blocks all the way
to the super-root of the filesystem and the volume header itself.  This
forms the basis for crash recovery.  All disk writes are to new blocks
except for the volume header, thus allowing all writes to run concurrently
except for the volume header update at the end.

Clearly this method requires intermediate modifications to the chain to be
cached so multiple modifications can be aggregated prior to being
synchronized.  One advantage, however, is that the cache can be flushed at
any time WITHOUT having to allocate yet another new block when further
modifications are made, as long as the volume header has not yet been
flushed.  This means that buffer cache overhead is very well bounded and
can handle filesystem operations of any complexity, even on boxes with very
small amounts of physical memory.
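The propagate-to-root behavior can be sketched as generation-tagged
copy-on-write: modifying any block requires a fresh copy of it and of every
ancestor, while blocks already copied in the current flush generation are
reused.  All names here are illustrative, not the actual chain code:

```c
#include <stdlib.h>

struct block {
	struct block	*parent;	/* NULL at the super-root */
	int		gen;		/* flush generation that wrote it */
};

/*
 * Copy-on-write a block for flush generation 'gen'.  If the block was
 * last written by an older generation, allocate a fresh copy and
 * recursively COW the parent chain so it can reference the new block.
 */
static struct block *
cow_block(struct block *bp, int gen)
{
	struct block *np;

	if (bp->gen == gen)
		return bp;		/* already copied this flush */
	np = malloc(sizeof(*np));
	*np = *bp;
	np->gen = gen;
	if (np->parent)
		np->parent = cow_block(np->parent, gen);
	return np;
}
```

This is why further modifications within the same aggregation interval do
not allocate yet more blocks: the generation check short-circuits the walk.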
I intend to implement a shortcut to make fsync()'s run fast, and that is to
allow deep updates to blockrefs to shortcut to auxiliary space in the
volume header to satisfy the fsync requirement.  The related blockref is
then recorded when the filesystem is mounted after a crash and the update
chain is reconstituted when a matching blockref is encountered again during
normal operation of the filesystem.

Basically this means that no real work needs to be done at mount-time
even after a crash.

Directories are hashed, and another major design element is that directory
entries ARE INODES.  They are one and the same.  In addition to directory
entries being inodes, the data for very small files (512 bytes or smaller)
can be directly embedded in the inode (overloaded onto the same space that
the direct blockref array uses).  This should result in very high
performance.

Inode numbers are not spatially referenced, which complicates NFS servers
but doesn't complicate anything else.  The inode number is stored in the
inode itself, an absolutely necessary feature in order to support the
hugely flexible snapshots that we want to have in HAMMER2.

				HARDLINKS

Hardlinks are a particularly sticky problem for HAMMER2 due to the lack of
a spatial reference to the inode number.  We do not want to have to have
an index of inode numbers for any basic HAMMER2 feature if we can help it.

Hardlinks are handled by placing the inode for a multiply-hardlinked file
in the closest common parent directory.  If "a/x" and "a/y" are hardlinked
the inode for the hardlinked file will be placed in directory "a", e.g.
"a/3239944", but it will be invisible and will be in an out-of-band
namespace.  The directory entries "a/x" and "a/y" will be given the same
inode number but in fact just be placemarks that cause HAMMER2 to recurse
upwards through the directory tree to find the invisible inode number.
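That upward recursion can be sketched as follows.  The structures are
hypothetical stand-ins, not HAMMER2's media format: each directory holds a
set of hidden (out-of-band) inode numbers, and a placemark entry is resolved
by walking toward the root until the directory parking the real inode is
found:

```c
#include <stddef.h>
#include <stdint.h>

struct dir {
	struct dir	*parent;	/* NULL at the filesystem root */
	const uint64_t	*hidden;	/* inode numbers parked here */
	size_t		nhidden;
};

/*
 * Resolve a hardlink placemark: starting at the directory containing
 * the placemark, walk upward until we find the directory holding the
 * invisible inode for this inode number.
 */
static const struct dir *
resolve_hardlink(const struct dir *dp, uint64_t inum)
{
	size_t i;

	for (; dp; dp = dp->parent) {
		for (i = 0; i < dp->nhidden; ++i) {
			if (dp->hidden[i] == inum)
				return dp;	/* common parent found */
		}
	}
	return NULL;	/* should not happen on a consistent filesystem */
}
```

In the "a/x" / "a/y" example above, a lookup starting in "a" (or any
subdirectory of it) terminates at "a", where the invisible "a/3239944"
inode lives.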
Because directories are hashed and a different namespace (hash key range)
is used for hardlinked inodes, standard directory scans are able to
trivially skip this invisible namespace and inode-specific lookups can
restrict their lookup to within this space.

The nature of snapshotting makes handling the link-count 2->1 and 1->2
cases trivial.  Basically the inode media structure is copied as needed to
break up or re-form the standard directory entry/inode.  There are no
backpointers in HAMMER2 and no reference counts on the blocks (see FREEMAP
NOTES below), so it is an utterly trivial operation.

			      FREEMAP NOTES

In order to implement fast snapshots (and writable snapshots for that
matter), HAMMER2 does NOT ref-count allocations.  The freemap, which
is still under design, just won't do that.  All the freemap does is
keep track of 100% free blocks.

This not only trivializes all the snapshot features, it also trivializes
hardlink handling and solves the problem of keeping the freemap
synchronized in the event of a crash.  Now all we have to do after a crash
is make sure blocks allocated before the freemap was flushed are properly
marked as allocated in the allocmap.  This is a trivial exercise using the
same algorithm the mirror streaming code uses (which is very similar to
HAMMER1's)... an incremental meta-data scan that covers only the blocks
that might have been allocated between the last allocation map sync and
now.

Thus the freemap does not have to be synchronized during an fsync().

The complexity is in figuring out what can be freed... that is, when one
can mark blocks in the freemap as being free.  HAMMER2 implements this as
a background task which essentially must scan available meta-data to
determine which blocks are not being referenced.
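The incremental scan referred to above prunes on (mirror_tid): any sub-tree
whose mirror_tid is at or below the last freemap sync cannot contain newer
allocations and is skipped entirely.  A sketch (the struct fields are
illustrative; the real media blockref layout differs):

```c
#include <stddef.h>
#include <stdint.h>

struct bref {
	uint64_t		mirror_tid;	/* covers whole sub-tree */
	const struct bref	*child;		/* child blockref array */
	size_t			nchild;
};

/*
 * Count blocks possibly allocated since the last freemap sync by
 * recursing only into sub-trees whose mirror_tid is newer than the
 * sync point.  A real fixup pass would mark these blocks allocated
 * in the allocmap instead of counting them.
 */
static size_t
count_recent(const struct bref *bp, uint64_t last_sync)
{
	size_t n, i;

	if (bp->mirror_tid <= last_sync)
		return 0;	/* sub-tree unchanged since sync: prune */
	n = 1;			/* this block was written recently */
	for (i = 0; i < bp->nchild; ++i)
		n += count_recent(&bp->child[i], last_sync);
	return n;
}
```

The pruning is what bounds the post-crash work to recently written
meta-data rather than the whole filesystem.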
Part of the ongoing design work is finding ways to reduce the scope of this
meta-data scan so the entire filesystem's meta-data does not need to be
scanned (though in tests with HAMMER1, even full meta-data scans have
turned out to be fairly low cost).  In other words, it's an area that we
can continue to improve on as the filesystem matures.  Not only that, but
we can completely change the freemap algorithms without creating
incompatibilities (at worst simply having to require that a R+W mount do
a full meta-data scan when upgrading or downgrading the freemap algorithm).

			       CLUSTERING

Clustering, as always, is the most difficult bit, but we have some
advantages with HAMMER2 that we did not have with HAMMER1.  First,
HAMMER2's media structures generally follow the kernel's filesystem
hierarchy.  Second, HAMMER2's writable snapshots make it possible to
implement several forms of multi-master clustering.

The mount device path you specify serves to bootstrap your entry into
the cluster.  This can be local media or directly specify a network
cluster connection (or several).  When a local media mount is used the
volume header is scanned for local copies and the best volume header is
selected from all available copies.  Multiple devices may be specified for
redundancy.

The volume header on local media also contains cluster connection
specifications keyed by super-root pfsid.  Network connections are
maintained to all targets.  ALL ELEMENTS ARE TREATED ACCORDING TO TYPE
NO MATTER WHICH ONE YOU MOUNT FROM.

The actual networked cluster may be far larger than the elements you list
in the hammer2_copy_data[] array, but your machine will only make direct
connections as specified by the array.

In the simplest case you simply network a few machines together as ring 0
masters and each client connects directly to all the masters (and/or the
clients are the masters themselves).
Thus any quorum operation is straightforward.
These master nodes are labeled 'ring 0'.

If you have too many clients to reasonably connect directly, you set up
sub-clusters as satellites.  This is called 'ring 1'.  Ring 1 may contain
several sub-clusters.  A client then connects to all the nodes in a
particular sub-cluster (typically 3).  The quorum protocol runs as per
normal except that once the operation is resolved against the sub-cluster,
an aggregation must be resolved against the master nodes (ring 0).  The
sub-cluster does this for the client... all the client sees is the normal
quorum operation against the sub-cluster.

Since each node in the sub-cluster connects to all master nodes we get
a multiplication.  If we set a reasonable upper limit of, say, 256
connections at each master node, then ring 1 may contain 85 sub-clusters
x 3 nodes in each sub-cluster.

In the most complex case, when one wishes to support potentially millions
of clients, further fan-out is required into ring 2, ring 3, and
so forth.  However, each sub-cluster in ring 2 must only connect to
1 sub-cluster in ring 1 (otherwise the cache state will become mightily
confused).  Using reasonable metrics this will allow ring 2 to contain
85 * 85 = 7225 sub-clusters.  At this point you could have 1000 clients
connect to each sub-cluster and support 7.2 million clients, but if that
isn't enough, each additional ring multiplies the number of sub-clusters
by another factor of 85, and so forth.

Each ring imposes additional latencies for cache operations, but the key
to making this work efficiently is that the satellite clusters can
negotiate coarse-grained cache coherency locks with the next lower ring
and then fan out finer-grained locks to the next higher ring.
Since caching can
occur anywhere (including on the connecting client), it is the cache
coherency lock that ultimately dictates efficiency and allows a client
(or satellite) to access large amounts of data from local storage.

Modifying operations, particularly commits, also have higher latencies
when multiple rings are in use.  In this situation it is possible to
short-cut localized operations by having competing clients connect to
sub-clusters which are near each other topologically... having the
competing clients connect to the same sub-cluster would be the most
optimal.

In addition, sub-clusters (typically in ring 1) can act in SOFT_MASTER
mode, which allows the sub-cluster to acknowledge a full commit within its
own quorum only, and then resolve asynchronously to the masters in ring 0.

The nodes in these intermediate rings can be pure proxies with only memory
caches, use local media for persistent cache, or use local media to
completely slave the filesystem.

    ADMIN	- Media does not participate, administrative proxy only
    CLIENT	- Media does not participate, client only
    CACHE	- Media only acts as a persistent cache
    COPY	- Media only acts as a local copy
    SLAVE	- Media is a RO slave that can be mounted RW

    SOFT_SLAVE	- This is a SLAVE which can become writable when
		  the quorum is not available, but is not guaranteed
		  to be able to be merged back when the quorum becomes
		  available again.  Elements which cannot be merged
		  back remain localized and writable until manual
		  or scripted intervention recombines them.

    SOFT_MASTER	- Similar to the above but can form a sub-cluster
		  and run the quorum protocol within the sub-cluster
		  to serve machines that connect to the sub-cluster
		  when the master cluster is not available.

		  The SOFT_MASTER nodes in a sub-cluster must be
		  fully interconnected with each other.
    MASTER	- This is a MASTER node in the quorum protocol.

		  The MASTER nodes in a cluster must be fully
		  interconnected with each other.

There are four major protocols:

    Quorum protocol

	This protocol is used between MASTER nodes to vote on operations
	and resolve deadlocks.

	This protocol is used between SOFT_MASTER nodes in a sub-cluster
	to vote on operations, resolve deadlocks, determine what the latest
	transaction id for an element is, and to perform commits.

    Cache sub-protocol

	This is the MESI sub-protocol which runs under the Quorum
	protocol.  This protocol is used to maintain cache state for
	sub-trees to ensure that operations remain cache coherent.

	Depending on administrative rights this protocol may or may
	not allow a leaf node in the cluster to hold a cache element
	indefinitely.  The administrative controller may preemptively
	downgrade a leaf with insufficient administrative rights
	without giving it a chance to synchronize any modified state
	back to the cluster.

    Proxy protocol

	The Quorum and Cache protocols only operate between MASTER
	and SOFT_MASTER nodes.  All other node types must use the
	Proxy protocol to perform similar actions.  This protocol
	differs in that proxy requests are typically sent to just
	one adjacent node and that node then maintains state and
	forwards the request or performs the required operation.
	When the link is lost to the proxy, the proxy automatically
	forwards a deletion of the state to the other nodes based on
	what it has recorded.

	If a leaf has insufficient administrative rights it may not
	be allowed to actually initiate a quorum operation and may only
	be allowed to maintain partial MESI cache state or perhaps none
	at all (since cache state can block other machines in the
	cluster).
	Instead a leaf with insufficient rights will have to
	make do with a preemptive loss of cache state, and any allowed
	modifying operations will have to be forwarded to the proxy,
	which continues forwarding the request until a node with
	sufficient administrative rights is encountered.

	To reduce issues and give the cluster more breadth, sub-clusters
	made up of SOFT_MASTERs can be formed in order to provide full
	cache coherency within a subset of machines and yet still tie
	them into a greater cluster that they normally would not have
	such access to.  This effectively makes it possible to create a
	two- or three-tier fan-out of groups of machines which are
	cache-coherent within the group, but perhaps not between groups,
	and use other means to synchronize between the groups.

    Media protocol

	This is basically the physical media protocol.

There are lots of ways to implement multi-master environments using the
above core features, but the implementation is going to be fairly complex
even with HAMMER2's feature set.

Keep in mind that modifications propagate all the way to the super-root
and volume header, so in any clustered arrangement the use of (modify_tid)
and (mirror_tid) is critical in determining the synchronization state of
portion(s) of the filesystem.

Specifically, since any modification propagates to the root, the
(mirror_tid) in higher level directories is going to be in a constant
state of flux.  This state of flux DOES NOT invalidate the cache state for
these higher levels of directories.  Instead, the (modify_tid) is used on
a node-by-node basis to determine cache state at any given level, and
(mirror_tid) is used to determine whether any recursively underlying state
is desynchronized.  The inode structure also has two additional
transaction ids used to optimize path lookups, stat, and directory
lookup/scan operations.
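The fan-out figures quoted in the CLUSTERING section follow from simple
multiplication.  A quick sanity check, assuming 256 connections per master
node and 3-node sub-clusters as in the text (the helper function is purely
illustrative):

```c
/*
 * Sub-cluster count for a given ring, assuming each sub-cluster in
 * ring N connects to exactly one sub-cluster in ring N-1 and each
 * ring fans out by (connections per node / nodes per sub-cluster).
 */
static long
ring_subclusters(int ring, long conns_per_node, long nodes_per_subcluster)
{
	long fanout = conns_per_node / nodes_per_subcluster; /* 256/3 = 85 */
	long total = 1;
	int i;

	for (i = 0; i < ring; ++i)
		total *= fanout;
	return total;
}
```

With these numbers ring 1 holds 85 sub-clusters, ring 2 holds 85 * 85 =
7225, and at 1000 clients per ring-2 sub-cluster that is about 7.2 million
clients, matching the figures above.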