The Crazy Hacker's Crazy Guide to Bup Craziness
===============================================

Despite what you might have heard, bup is not that crazy, and neither are
you if you're trying to figure out how it works. But it's also (as of this
writing) rather new and the source code doesn't have a lot of comments, so
it can be a little confusing at first glance. This document is designed to
make it easier for you to get started if you want to add a new feature, fix
a bug, or just understand how it all works.


Bup Source Code Layout
----------------------

As you're reading this, you might want to look at different parts of the bup
source code to follow along and see what we're talking about. bup's code is
written primarily in python with a bit of C code in speed-sensitive places.
Here are the most important things to know:

 - bup (symlinked to main.py) is the main program that runs when you type
   'bup'.

 - cmd/bup-* (mostly symlinked to cmd/*-cmd.py) are the individual
   subcommands, in a way similar to how git breaks all its subcommands into
   separate programs. Not all the programs have to be written in python;
   they could be in any language, as long as they end up named cmd/bup-*.
   We might end up re-coding large parts of bup in C eventually so that it
   can be even faster and (perhaps) more portable.

 - lib/bup/*.py are python library files used by the cmd/*.py commands.
   That directory name seems a little silly (and worse, redundant), but
   there seemed to be no better way to let programs write "from bup import
   index" and have it work. Putting bup in the top level conflicted with
   the 'bup' command; calling it anything other than 'bup' was fundamentally
   wrong, and doesn't work when you install bup on your system in /usr/lib
   somewhere. So we get the annoyingly long paths.


Repository Structure
====================

Before we can talk about how bup works, we first need to address what it
does. The purpose of bup is essentially to let you "replicate" data between
two main data structures:

1. Your computer's filesystem;

2. A bup repository. (Yes, we know, that part also resides in your
   filesystem. Stop trying to confuse yourself. Don't worry, we'll be
   plenty confusing enough as it is.)

Essentially, copying data from the filesystem to your repository is called
"backing stuff up," which is what bup specializes in. Normally you initiate
a backup using the 'bup save' command, but that's getting ahead of
ourselves.

For the inverse operation, i.e. copying from the repository to your
filesystem, you have several choices; the main ones are 'bup restore', 'bup
ftp', 'bup fuse', and 'bup web'.

Now, those are the basics of backups. In other words, we just spent about
half a page telling you that bup backs up and restores data. Are we having
fun yet?

The next thing you'll want to know is the format of the bup repository,
because hacking on bup is rather impossible unless you understand that part.
In short, a bup repository is a git repository. If you don't know about
git, you'll want to read about it now. A really good article to read is
"Git for Computer Scientists" - you can find it in Google. Go read it now.
We'll wait.

Got it? Okay, so now you're an expert in blobs, trees, commits, and refs,
the four building blocks of a git repository.
bup uses these four things,
and they're formatted in exactly the same way as git does it, so you can use
git to manipulate the bup repository if you want, and you probably won't
break anything. It's also a comfort to know you can squeeze data out using
git, just in case bup fails you, and as a developer, git offers some nice
tools (like 'git rev-list' and 'git log' and 'git diff' and 'git show' and
so on) that allow you to explore your repository and help debug when things
go wrong.

Now, bup does use these tools a little bit differently than plain git. We
need to do this in order to address three deficiencies in git when used for
large backups, namely a) git bogs down and crashes if you give it really
large files; b) git is too slow when you give it too many files; and c) git
doesn't store detailed filesystem metadata.

Let's talk about each of those problems in turn.


Handling large files (cmd/split, hashsplit.split_to_blob_or_tree)
--------------------

The primary reason git can't handle huge files is that it runs them through
xdelta, which generally means it tries to load the entire contents of a file
into memory at once. If it didn't do this, it would have to store the
entire contents of every single revision of every single file, even if you
only changed a few bytes of that file. That would be a terribly inefficient
use of disk space, and git is well known for its amazingly efficient
repository format.

Unfortunately, xdelta works great for small files and gets amazingly slow
and memory-hungry for large files. For git's main purpose, i.e. managing
your source code, this isn't a problem. But when backing up your
filesystem, you're going to have at least a few large files, and so it's a
non-starter. bup has to do something totally different.

What bup does instead of xdelta is what we call "hashsplitting." We wanted
a general-purpose way to efficiently back up *any* large file that might
change in small ways, without storing the entire file every time. In fact,
the original versions of bup could only store a single file at a time;
surprisingly enough, this was enough to give us a large part of bup's
functionality. If you just take your entire filesystem and put it in a
giant tarball each day, then send that tarball to bup, bup will be able to
efficiently store only the changes to that tarball from one day to the next.
For small files, bup's compression won't be as good as xdelta's, but for
anything over a few megabytes in size, bup's compression will actually
*work*, which is a big advantage over xdelta.

How does hashsplitting work? It's deceptively simple. We read through the
file one byte at a time, calculating a rolling checksum of the last 64
bytes. (Why 64? No reason. Literally. We picked it out of the air.
Probably some other number is better. Feel free to join the mailing list
and tell us which one and why.) (The rolling checksum idea is actually
stolen from rsync and xdelta, although we use it differently. And they use
some kind of variable window size based on a formula we don't totally
understand.)

The original rolling checksum algorithm we used was called "stupidsum,"
because it was based on the only checksum Avery remembered how to calculate
at the time.
He also remembered that it was the introductory checksum
algorithm in a whole article about how to make good checksums that he read
about 15 years ago, and it was thoroughly discredited in that article for
being very stupid. But, as so often happens, Avery couldn't remember any
better algorithms from the article. So what we got is stupidsum.

Since then, we have replaced the stupidsum algorithm with what we call
"rollsum," based on code in librsync. It's essentially the same as what
rsync does, except we use a fixed window size.

(If you're a computer scientist and can demonstrate that some other rolling
checksum would be faster and/or better and/or have fewer screwy edge cases,
we need your help! Avery's out of control! Join our mailing list! Please!
Save us! ... oh boy, I sure hope he doesn't read this)

In any case, rollsum seems to do pretty well at its job.
You can find it in bupsplit.c. Basically, it converts the last 64 bytes
read into a 32-bit integer. What we then do is take the lowest 13
bits of the rollsum, and if they're all 1's, we consider that to be the end
of a chunk. This happens on average once every 2^13 = 8192 bytes, so the
average chunk size is 8192 bytes.

(Why 13 bits? Well, we picked the number at random and... eugh. You're
getting the idea, right? Join the mailing list and tell us why we're
wrong.)

(Incidentally, even though the average chunk size is 8192 bytes, the actual
probability distribution of block sizes ends up being non-uniform; if we
remember our stats classes correctly, which we probably don't, it's probably
an "exponential distribution." The idea is that for each byte in the block,
the probability that it's the last byte of the block is one in 8192. Thus,
the block sizes end up being skewed toward the smaller end. That's not
necessarily for the best, but maybe it is. Computer science to the rescue?
You know the drill.)

Anyway, so we're dividing up those files into chunks based on the rolling
checksum. Then we store each chunk separately (indexed by its sha1sum) as a
git blob. Why do we split this way? Well, because the results are actually
really nice. Let's imagine you have a big mysql database dump (produced by
mysqldump) and it's basically 100 megs of SQL text. Tomorrow's database
dump adds 100 rows to the middle of the file somewhere, so it's 100.01 megs
of text.

A naive block splitting algorithm - for example, just dividing the file into
8192-byte blocks - would be a disaster. After the first bit of text has
changed, every block after that would have a different boundary, so most of
the blocks in the new backup would be different from the previous ones, and
you'd have to store the same data all over again. But with hashsplitting,
no matter how much data you add, modify, or remove in the middle of the
file, all the chunks *before* and *after* the affected chunk are absolutely
the same. All that matters to the hashsplitting algorithm is the
"separator" sequence, and a single change can only affect, at most, one
separator sequence or the bytes between two separator sequences. And
because of rollsum, about one in 8192 possible 64-byte sequences is a
separator sequence. Like magic, the hashsplit chunking algorithm will chunk
your file the same way every time, even without knowing how it had chunked
it previously.
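
To make that concrete, here's a minimal Python sketch of the chunking rule
we just described. This is not bup's actual code - the real implementation
is the much faster C in bupsplit.c, driven from lib/bup/hashsplit.py - but
the mechanics are the ones from the text: a 64-byte rolling window,
librsync-style sums, and a split wherever the lowest 13 bits of the
checksum are all 1's.

    WINDOW = 64                         # rolling window size, in bytes
    SPLIT_BITS = 13                     # 2^13 = 8192-byte average chunks
    SPLIT_MASK = (1 << SPLIT_BITS) - 1
    CHAR_OFFSET = 31                    # bias used by librsync-style sums

    def split_offsets(data):
        """Yield the offset just past each chunk boundary in data."""
        s1 = WINDOW * CHAR_OFFSET
        s2 = WINDOW * (WINDOW - 1) * CHAR_OFFSET
        window = bytearray(WINDOW)      # last 64 bytes seen; starts as zeros
        for i, add in enumerate(data):
            drop = window[i % WINDOW]   # byte sliding out of the window
            window[i % WINDOW] = add    # byte sliding in
            s1 = (s1 + add - drop) & 0xffffffff
            s2 = (s2 + s1 - WINDOW * (drop + CHAR_OFFSET)) & 0xffffffff
            if s2 & SPLIT_MASK == SPLIT_MASK:   # lowest 13 bits all 1's
                yield i + 1

Run the same bytes through split_offsets() and you get the same boundaries
every time; that's the self-synchronizing property the previous paragraph
depends on.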

The next problem is less obvious: after you store your series of chunks as
git blobs, how do you store their sequence? Each blob has a 20-byte sha1
identifier, which means the simple list of blobs is going to be 20/8192 =
0.25% of the file length. For a 200GB file, that's 488 megs of just
sequence data.

As an overhead percentage, 0.25% basically doesn't matter. 488 megs sounds
like a lot, but compared to the 200GB you have to store anyway, it's
irrelevant. What *is* relevant is that 488 megs is a lot of memory you have
to use in order to keep track of the list. Worse, if you back up an
almost-identical file tomorrow, you'll have *another* 488 meg blob to keep
track of, and it'll be almost but not quite the same as last time.

Hmm, big files, each one almost the same as the last... you know where this
is going, right?

Actually no! Ha! We didn't split this list in the same way. We could
have, in fact, but it wouldn't have been very "git-like", since we'd like to
store the list as a git 'tree' object in order to make sure git's
refcounting and reachability analysis doesn't get confused. Never mind the
fact that we want you to be able to 'git checkout' your data without any
special tools.

What we do instead is we extend the hashsplit algorithm a little further
using what we call "fanout." Instead of checking just the last 13 bits of
the checksum, we use additional checksum bits to produce additional splits.
Note that (most likely due to an implementation bug), the next higher bit
after the 13 bits (marked 'x'):

    ...... '..x1'1111'1111'1111

is actually ignored. Now, let's say we use a 4-bit fanout. That means
we'll break a series of chunks into its own tree object whenever the next
4 bits of the rolling checksum are 1, in addition to the 13 lowest ones.
Since the 13 lowest bits already have to be 1, the boundary of a group of
chunks is necessarily also always the boundary of a particular chunk.

And so on. Eventually you'll have too many chunk groups, but you can group
them into supergroups by using another 4 bits, and continue from there.

What you end up with is an actual tree of blobs - which git 'tree' objects
are ideal to represent. And if you think about it, just like the original
list of chunks, the tree itself is pretty stable across file modifications.
Any one modification will only affect the chunks actually containing the
modifications, thus only the groups containing those chunks, and so on up
the tree. Essentially, the number of changed git objects is O(log n)
where n is the number of chunks. Since log 200 GB, using a base of 16 or
so, is not a very big number, this is pretty awesome. Remember, any git
object we *don't* change in a new backup is one we can reuse from last time,
so the deduplication effect is pretty awesome.

Better still, the hashsplit-tree format is good for a) random instead of
sequential access to data (which you can see in action with 'bup fuse'); and
b) quickly showing the differences between huge files (which we haven't
really implemented because we don't need it, but you can try 'git diff -M -C
-C backup1 backup2 -- filename' for a good start).

So now we've split our 200 GB file into about 24 million pieces. That
brings us to git limitation number 2.
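
Here's that fanout rule as a hedged sketch, continuing the toy Python from
the previous section (compare bupsplit_find_ofs() in bupsplit.c; the exact
grouping policy in hashsplit.py may differ in details):

    SPLIT_BITS = 13
    FANOUT_BITS = 4                 # the "4-bit fanout" from the text

    def extra_bits(rollsum):
        """Count 1 bits past the low 13, skipping the ignored 'x' bit."""
        rsum = rollsum >> (SPLIT_BITS + 1)   # +1 drops the ignored bit
        n = 0
        while rsum & 1:
            n += 1
            rsum >>= 1
        return n

    def levels_closed(rollsum):
        """How many levels of chunk groups end at this split point."""
        return extra_bits(rollsum) // FANOUT_BITS

Each group boundary is always a chunk boundary, and every FANOUT_BITS extra
1 bits close one more level of tree object above it.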


Handling huge numbers of files (git.PackWriter)
------------------------------

git is designed for handling reasonably-sized repositories that change
relatively infrequently. (You might think you change your source code
"frequently" and that git handles much more frequent changes than, say, svn
can handle. But that's not the same kind of "frequently" we're talking
about. Imagine you're backing up all the files on your disk, and one of
those files is a 100 GB database file with hundreds of daily users. Your
disk changes so frequently you can't even back up all the revisions even if
you were backing stuff up 24 hours a day. That's "frequently.")

git's way of doing things works really nicely for the way software
developers write software, but it doesn't really work so well for everything
else. The #1 killer is the way it adds new objects to the repository: it
creates one file per blob. Then you later run 'git gc' and combine those
files into a single file (using highly efficient xdelta compression, and
ignoring any files that are no longer relevant).

'git gc' is slow, but for source code repositories, the resulting
super-efficient storage (and associated really fast access to the stored
files) is worth it. For backups, it's not; you almost never access your
backed-up data, so storage time is paramount, and retrieval time is mostly
unimportant.

To back up that 200 GB file with git and hashsplitting, you'd have to create
24 million little 8k files, then copy them into a 200 GB packfile, then
delete the 24 million files again. That would take about 400 GB of disk
space to run, require lots of random disk seeks, and require you to go
through your data twice.

So bup doesn't do that. It just writes packfiles directly. Luckily, these
packfiles are still git-formatted, so git can happily access them once
they're written.

But that leads us to our next problem.


Huge numbers of huge packfiles (midx.py, bloom.py, cmd/midx, cmd/bloom)
------------------------------

Git isn't actually designed to handle super-huge repositories. Most git
repositories are small enough that it's reasonable to merge all their packs
into a single packfile, which 'git gc' usually does eventually.

The problematic part of large packfiles isn't the packfiles themselves - git
is designed to expect the total size of all packs to be larger than
available memory, and once it can handle that, it can handle virtually any
amount of data about equally efficiently. The problem is the packfile
index (.idx) files. In bup we call these idx (pronounced "idix") files
instead of using the word "index," because the word index is already used
for something totally different in git (and thus bup) and we'll become
hopelessly confused otherwise.

Anyway, each packfile (*.pack) in git has an associated idx (*.idx) file
that's a sorted list of git object hashes and file offsets. If you're
looking for a particular object based on its sha1, you open the idx, binary
search it to find the right hash, then take the associated file offset,
seek to that offset in the packfile, and read the object contents.

The performance of the binary search is about O(log n) with the number of
hashes in the pack, with an optimized first step (you can read about it
elsewhere) that somewhat improves it to O(log(n)-7).
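
In code, an idx lookup is roughly the following sketch, assuming the .idx
has already been parsed into a 256-entry fanout table plus parallel sorted
lists of hashes and pack offsets (the real reader lives in lib/bup/git.py).
The fanout table is that "optimized first step": indexing it by the first
byte of the sha1 brackets the binary search immediately.

    import bisect

    def idx_lookup(fanout, hashes, offsets, sha1):
        """Return the pack offset of sha1, or None if not in this pack.

        fanout[b] is the number of objects whose sha1 begins with a
        byte <= b, as in git's idx format.
        """
        lo = fanout[sha1[0] - 1] if sha1[0] else 0
        hi = fanout[sha1[0]]
        i = bisect.bisect_left(hashes, sha1, lo, hi)
        if i < hi and hashes[i] == sha1:
            return offsets[i]
        return None

Every probe of that binary search may touch a different 4k page of the idx
file, which is what makes the many-packs case below so painful.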

Unfortunately, this breaks down a bit when you have *lots* of packs. Say
you have 24 million objects (containing around 200 GB of data) spread across
200 packfiles of 1GB each. To look for an object requires you to search
through about 122000 objects per pack; ceil(log2(122000)-7) = 10, so you'll
have to search 10 times. About 7 of those searches will be confined to a
single 4k memory page, so you'll probably have to page in about 3-4 pages
per file, times 200 files, which makes 600-800 4k pages (2.4-3.6 megs)...
every single time you want to look for an object.

This brings us to another difference between git's and bup's normal use
case. With git, there's a simple optimization possible here: when looking
for an object, always search the packfiles in MRU (most recently used)
order. Related objects are usually clustered together in a single pack, so
you'll usually end up searching around 3 pages instead of 600, which is a
tremendous improvement. (And since you'll quickly end up swapping in all
the pages in a particular idx file this way, it isn't long before searching
for a nearby object doesn't involve any swapping at all.)

bup isn't so lucky. git users spend most of their time examining existing
objects (looking at logs, generating diffs, checking out branches), which
lends itself to the above optimization. bup, on the other hand, spends most
of its time looking for *nonexistent* objects in the repository so that it
can back them up. When you're looking for objects that aren't in the
repository, there's no good way to optimize; you have to exhaustively check
all the packs, one by one, to ensure that none of them contain the data you
want.

To improve performance of this sort of operation, bup introduces midx
(pronounced "midix" and short for "multi-idx") files. As the name implies,
they index multiple packs at a time.

Imagine you had a midx file for your 200 packs. midx files are a lot like
idx files; they have a lookup table at the beginning that narrows down the
initial search, followed by a binary search. But unlike idx files (which
have a fixed-size 256-entry lookup table), midx files have a variably-sized
table that makes sure the entire binary search can be contained to a single
page of the midx file. Basically, the lookup table tells you which page to
load, and then you binary search inside that page. A typical search thus
only requires the kernel to swap in two pages, which is better than results
with even a single large idx file. And if you have lots of RAM, eventually
the midx lookup table (at least) will end up cached in memory, so only a
single page should be needed for each lookup.

You generate midx files with 'bup midx'. The downside of midx files is that
generating one takes a while, and you have to regenerate it every time you
add a few packs.

UPDATE: Brandon Low contributed an implementation of "bloom filters", which
have even better characteristics than midx for certain uses. Look it up in
Wikipedia. He also massively sped up both midx and bloom by rewriting the
key parts in C. The nicest thing about bloom filters is that we can update
them incrementally every time we get a new idx, without regenerating from
scratch. That makes the update phase much faster, and means we can also get
away with generating midxes less often.
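
To make the bloom filter idea concrete, here's a toy version. bup's real
filter (lib/bup/bloom.py plus the C helpers) packs the bits tightly and
sizes the table from the number of objects; this sketch spends a whole byte
per bit and derives its probe positions from slices of the sha1, which is
itself already a uniformly-distributed hash:

    K = 4   # probes per object

    def _probes(table, sha1):
        for i in range(K):
            yield int.from_bytes(sha1[4*i:4*i+4], 'big') % len(table)

    def bloom_add(table, sha1):         # call for every object in every idx
        for p in _probes(table, sha1):
            table[p] = 1

    def bloom_contains(table, sha1):
        # False means "definitely not present": skip the idx search.
        # True means only "maybe present": a real lookup must follow.
        return all(table[p] for p in _probes(table, sha1))

    table = bytearray(8 * 1024 * 1024)  # one byte per "bit" in this toy

The incremental-update property is visible right in the sketch: absorbing a
new idx just sets more bits via bloom_add(); nothing is rebuilt.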

midx files are a bup-specific optimization and git doesn't know what to do
with them. However, since they're stored as separate files, they don't
interfere with git's ability to read the repository.


Detailed Metadata
-----------------

So that's the basic structure of a bup repository, which is also a git
repository. There's just one more thing we have to deal with:
filesystem metadata. Git repositories are really only intended to
store file contents with a small bit of extra information, like
symlink targets and executable bits, so we have to store the rest
some other way.

Bup stores more complete metadata in the VFS in a file named .bupm in
each tree. This file contains one entry for each file in the tree
object, sorted in the same order as the tree. The first .bupm entry
is for the directory itself, i.e. ".", and its name is the empty
string, "".

Each .bupm entry contains a variable length sequence of records
containing the metadata for the corresponding path. Each record
records one type of metadata. Current types include a common record
type (containing the normal stat information), a symlink target type,
a hardlink target type, a POSIX1e ACL type, etc. See metadata.py for
the complete list.

The .bupm file is optional, and when it's missing, bup will behave as
it did before the addition of metadata, and restore files using the
tree information.

The nice thing about this design is that you can walk through each
file in a tree just by opening the tree and the .bupm contents, and
iterating through both at the same time (see the sketch at the end of
this section).

Since the contents of any .bupm file should match the state of the
filesystem when it was *indexed*, bup must record the detailed
metadata in the index. To do this, bup records four values in the
index: the atime, mtime, and ctime (as timespecs), and an integer
offset into a secondary "metadata store" which has the same name as
the index, but with ".meta" appended. This secondary store contains
the encoded Metadata object corresponding to each path in the index.

Currently, in order to decrease the storage required for the metadata
store, bup only writes unique values there, reusing offsets when
appropriate across the index. The effectiveness of this approach
relies on the expectation that there will be many duplicate metadata
records. Storing the full timestamps in the index is intended to make
that more likely, because it makes it unnecessary to record those
values in the secondary store. So bup clears them before encoding the
Metadata objects destined for the index, and timestamp differences
don't contribute to the uniqueness of the metadata.

Bup supports recording and restoring hardlinks, and it does so by
tracking sets of paths that correspond to the same dev/inode pair when
indexing. This information is stored in an optional file with the
same name as the index, but ending with ".hlink".

If there are multiple index runs, and the hardlinks change, bup will
notice this (within whatever subtree it is asked to reindex) and
update the .hlink information accordingly.

The current hardlink implementation will refuse to link to any file
that resides outside the restore tree, and if the restore tree spans a
different set of filesystems than the save tree, complete sets of
hardlinks may not be restored.
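
As promised above, here's a schematic of that parallel tree/.bupm walk.
The helpers tree_entries() and bupm_records() are hypothetical stand-ins
for bup's real parsers, and we assume .bupm doesn't describe itself; the
pairing rule is the one from the text: the first record is for ".", and
the rest line up with the tree entries in order.

    def walk_with_metadata(repo, tree_id):
        # bupm_records() and tree_entries() are hypothetical parsers.
        records = bupm_records(repo, tree_id)   # .bupm records, tree order
        yield '.', next(records)                # first record: the dir itself
        for entry in tree_entries(repo, tree_id):
            if entry.name != '.bupm':           # the metadata file itself
                yield entry.name, next(records)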


Filesystem Interaction
======================

Storing data is just half of the problem of making a backup; figuring out
what to store is the other half.

At the most basic level, piping the output of 'tar' into 'bup split' is an
easy way to offload that decision; just let tar do all the hard stuff. And
if you like tar files, that's a perfectly acceptable way to do it. But we
can do better.

Backing up with tarballs would totally be the way to go, except for two
serious problems:

1. The result isn't easily "seekable." Tar files have no index, so if (as
   commonly happens) you only want to restore one file in a 200 GB backup,
   you'll have to read up to 200 GB before you can get to the beginning of
   that file. tar is short for "tape archive"; on a tape, there was no
   better way to do it anyway, so they didn't try. But on a disk, random
   file access is much, much better when you can figure out how.

2. tar doesn't remember which files it backed up last time, so it has to
   read through the entire file contents again in order to generate the
   tarball, large parts of which will then be skipped by bup since they've
   already been stored. This is much slower than necessary.

(The second point isn't entirely true for all versions of tar. For example,
GNU tar has an "incremental" mode that can somewhat mitigate this problem,
if you're smart enough to know how to use it without hurting yourself. But
you still have to decide which backups are "incremental" and which ones will
be "full" and so on, so even when it works, it's more error-prone than bup.)

bup divides the backup process into two major steps: a) indexing the
filesystem, and b) saving file contents into the repository. Let's look at
those steps in detail.


Indexing the filesystem (cmd/drecurse, cmd/index, index.py)
-----------------------

Splitting the filesystem indexing phase into its own program is
nontraditional, but it gives us several advantages.

The first advantage is trivial, but might be the most important: you can
index files a lot faster than you can back them up. That means we can
generate the index (.bup/bupindex) first, then have a nice, reliable,
non-lying completion bar that tells you how much of your filesystem remains
to be backed up. The alternative would be annoying failures like counting
the number of *files* remaining (as rsync does), even though one of the
files is a virtual machine image of 80 GB, and the 1000 other files are each
under 10k. With bup, the percentage complete is the *real* percentage
complete, which is very pleasant.

Secondly, it makes it easier to debug and test; you can play with the index
without actually backing up any files.

Thirdly, you can replace the 'bup index' command with something else and not
have to change anything about the 'bup save' command. The current 'bup
index' implementation just blindly walks the whole filesystem looking for
files that have changed since the last time it was indexed; this works fine,
but something using inotify instead would be orders of magnitude faster.
Windows and MacOS both have inotify-like services too, but they're totally
different; if we want to support them, we can simply write new bup commands
that do the job, and they'll never interfere with each other.

And fourthly, git does it that way, and git is awesome, so who are we to
argue?

So let's look at how the index file works.

First of all, note that the ".bup/bupindex" file is not the same as git's
".git/index" file. The latter isn't used in bup; as far as git is
concerned, your bup repository is a "bare" git repository and doesn't have a
working tree, and thus it doesn't have an index either.

However, the bupindex file actually serves exactly the same purpose as git's
index file, which is why we still call it "the index." We just had to
redesign it for the usual bup-vs-git reasons, mostly that git just isn't
designed to handle millions of files in a single repository. (The only way
to find a file in git's index is to search it linearly; that's very fast in
git-sized repositories, but very slow in bup-sized ones.)

Let's not worry about the exact format of the bupindex file; it's still not
optimal, and will probably change again. The most important things to know
about bupindex are:

 - You can iterate through it much faster than you can iterate through the
   "real" filesystem (using something like the 'find' command).

 - If you delete it, you can get it back just by reindexing your filesystem
   (although that can be annoying to wait for); it's not critical to the
   repository itself.

 - You can iterate through only particular subtrees if you want.

 - There is no need to have more than one index for a particular filesystem,
   since it doesn't store anything about backups; it just stores file
   metadata. It's really just a cache (or 'index') of your filesystem's
   existing metadata. You could share the bupindex between repositories, or
   between multiple users on the same computer. If you back up your
   filesystem to multiple remote repositories to be extra safe, you can
   still use the same bupindex file across all of them, because it's the
   same filesystem every time.

 - Filenames in the bupindex are absolute paths, because that's the best way
   to ensure that you only need one bupindex file and that they're
   interchangeable.


A note on file "dirtiness"
--------------------------

The concept on which 'bup save' operates is simple enough; it reads through
the index and backs up any file that is "dirty," that is, doesn't already
exist in the repository.

Determining dirtiness is a little more complicated than it sounds. The
most dirtiness-relevant flag in the bupindex is IX_HASHVALID; if
this flag is reset, the file *definitely* is dirty and needs to be backed
up. But a file may be dirty even if IX_HASHVALID is set, and that's the
confusing part.

The index stores a listing of files, their attributes, and
their git object ids (sha1 hashes), if known. The "if known" is what
IX_HASHVALID is about. When 'bup save' backs up a file, it sets
the sha1 and sets IX_HASHVALID; when 'bup index' sees that a file has
changed, it leaves the sha1 alone and resets IX_HASHVALID.

Remember that the index can be shared between users, repositories, and
backups. So IX_HASHVALID doesn't mean your repository *has* that sha1 in
it; it only means that if you *do* have it, you don't need to back up
the file. Thus, 'bup save' needs to check every file in the index to make
sure its hash exists, not just that it's valid.
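
Condensed into code, the per-file decision looks something like this; the
flag value and the repo.exists() helper are illustrative, not bup's actual
API (the real logic lives in index.py and cmd/save-cmd.py):

    IX_HASHVALID = 1 << 14   # illustrative; see index.py for real values

    def needs_backup(entry, repo):
        if not (entry.flags & IX_HASHVALID):
            return True                # definitely dirty: no trusted sha1
        # The sha1 is trusted, but the index may be shared with other
        # repositories, so the object must also exist in *this* one.
        return not repo.exists(entry.sha)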

There's an optimization possible, however: if you know a particular tree's
hash is valid and exists (say /usr), then you don't need to check the
validity of all its children; because of the way git trees and blobs work,
if your repository is valid and you have a tree object, then you have all
the blobs it points to. You won't back up a tree object without backing up
its blobs first, so you don't need to double check it next time. (If you
really want to double check this, it belongs in a tool like 'bup fsck' or
'git fsck'.) So in short, 'bup save' on a "clean" index (all files are
marked IX_HASHVALID) can be very fast; we just check our repository and see
if the top level IX_HASHVALID sha1 exists. If it does, then we're done.

Similarly, if the index isn't entirely valid, you can still avoid recursing
into subtrees if those particular subtrees are IX_HASHVALID and their sha1s
are in the repository. The net result is that, as long as you never lose
your index, 'bup save' can always run very fast.

Another interesting trick is that you can skip backing up files even if
IX_HASHVALID *isn't* set, as long as you have that file's sha1 in the
repository. What that means is you've chosen not to back up the latest
version of that file; instead, your new backup set just contains the
most-recently-known valid version of that file. This is a good trick if you
want to do frequent backups of smallish files and infrequent backups of
large ones. Each of your backups will be "complete," in that they contain
all the small files and the large ones, but intermediate ones will just
contain out-of-date copies of the large files. Note that this isn't done
right now, and 'bup save --smaller' doesn't store bigger files _at all_.

A final game we can play with the bupindex involves restoring: when you
restore a directory from a previous backup, you can update the bupindex
right away. Then, if you want to restore a different backup on top, you can
compare the files in the index against the ones in the backup set, and
update only the ones that have changed. (Even more interesting things
happen if people are using the files on the restored system and you haven't
updated the index yet; the net result would be an automated merge of all
non-conflicting files.) This would be a poor man's distributed filesystem.
The only catch is that nobody has written this feature for 'bup restore'
yet. Someday!


How 'bup save' works (cmd/save)
--------------------

This section is too boring and has been omitted. Once you understand the
index, there's nothing special about bup save.


Retrieving backups: the bup vfs layer (vfs.py, cmd/ls, cmd/ftp, cmd/fuse)
=====================================

One of the neat things about bup's storage format, at least compared to most
backup tools, is that it's easy to read a particular file, or even part of a
file. That means a read-only virtual filesystem is easy to generate and
it'll have good performance characteristics. Because of git's commit
structure, you could even use branching and merging to make a transactional
read-write filesystem... but that's probably getting a little out of bup's
scope. Who knows what the future might bring, though?

Read-only filesystems are well within our reach today, however.
The 'bup
ls', 'bup ftp', and 'bup fuse' commands all use a "VFS" (virtual filesystem)
layer to let you access your repositories. Feel free to explore the source
code for these tools and vfs.py - they're pretty straightforward. Some
things to note:

 - None of these use the bupindex for anything.

 - For user-friendliness, they present your refs/commits/trees as a single
   hierarchy (i.e. a filesystem), which isn't really how git repositories
   are formatted. So don't get confused!


Handling Python 3's insistence on strings
=========================================

In Python 2, strings were bytes, and bup used them for all kinds of
data. Python 3 made a pervasive backward-incompatible change to treat
all strings as Unicode, i.e. in Python 2 'foo' and b'foo' were the
same thing, while u'foo' was a Unicode string. In Python 3 'foo'
became synonymous with u'foo', completely changing the type and
potential content, depending on the locale.

In addition, and particularly bad for bup, Python 3 also (initially)
insisted that all kinds of things were strings that just aren't (at
least not on many platforms), i.e. user names, groups, filesystem
paths, etc. There's no guarantee that any of those are always
representable in Unicode.

Over the years, Python 3 has gradually backed away from that initial
position, adding alternate interfaces like os.environb or allowing
bytes arguments to many functions like open(b'foo'...), so that in
those cases it's at least possible to accurately retrieve the system
data.

After a while, they devised the concept of
[byte smuggling](https://www.python.org/dev/peps/pep-0383/) as a more
comprehensive solution. In theory, this might be sufficient, but our
initial randomized testing discovered that some binary arguments would
crash Python during startup[1]. Eventually Johannes Berg tracked down
the [cause](https://sourceware.org/bugzilla/show_bug.cgi?id=26034),
and we hope that the problem will eventually be fixed in glibc or
worked around by Python, but in either case, it will be a long time
before any fix is widely available.

Before we tracked down that bug, we were pursuing an approach that
would let us sidestep the issue entirely by manipulating LC_CTYPE,
but that approach was somewhat complicated, and once we understood
what was causing the crashes, we decided to just let Python 3 operate
"normally" and work around the issues.

Consequently, we've had to wrap a number of things ourselves that
incorrectly return Unicode strings (libacl, libreadline, hostname,
etc.) and we've had to come up with a way to avoid the fatal crashes
caused by some command line arguments (sys.argv) described above. To
fix the latter, for the time being, we just use a trivial sh wrapper
to redirect all of the command line arguments through the environment
in BUP_ARGV_{0,1,2,...} variables, since the variables are unaffected,
and we can access them directly in Python 3 via environb.
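
Reading those back on the Python side can be as simple as the following
sketch; stopping at the first missing index is an assumed convention here,
and bup's actual wrapper and startup code are the authority:

    import os

    def recover_argv():
        """Rebuild argv as bytes from BUP_ARGV_0, BUP_ARGV_1, ..."""
        argv, i = [], 0
        while True:
            arg = os.environb.get(b'BUP_ARGV_%d' % i)
            if arg is None:
                return argv
            argv.append(arg)
            i += 1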

[1] Our randomized argv testing found that the byte smuggling approach
    was not working correctly for some values (initially discovered in
    Python 3.7, and observed in other versions). The interpreter
    would just crash while starting up like this:

        Fatal Python error: _PyMainInterpreterConfig_Read: memory allocation failed
        ValueError: character U+134bd2 is not in range [U+0000; U+10ffff]

        Current thread 0x00007f2f0e1d8740 (most recent call first):
        Traceback (most recent call last):
          File "t/test-argv", line 28, in <module>
            out = check_output(cmd)
          File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
            **kwargs).stdout
          File "/usr/lib/python3.7/subprocess.py", line 487, in run
            output=stdout, stderr=stderr)

We hope you'll enjoy bup. Looking forward to your patches!

-- apenwarr and the rest of the bup team

Local Variables:
mode: markdown
End: