
The Crazy Hacker's Crazy Guide to Bup Craziness
===============================================

Despite what you might have heard, bup is not that crazy, and neither are
you if you're trying to figure out how it works.  But it's also (as of this
writing) rather new and the source code doesn't have a lot of comments, so
it can be a little confusing at first glance.  This document is designed to
make it easier for you to get started if you want to add a new feature, fix
a bug, or just understand how it all works.


Bup Source Code Layout
----------------------

As you're reading this, you might want to look at different parts of the bup
source code to follow along and see what we're talking about.  bup's code is
written primarily in python with a bit of C code in speed-sensitive places.
Here are the most important things to know:

 - bup (symlinked to main.py) is the main program that runs when you type
   'bup'.

 - cmd/bup-* (mostly symlinked to cmd/*-cmd.py) are the individual
   subcommands, in a way similar to how git breaks all its subcommands into
   separate programs.  Not all the programs have to be written in python;
   they could be in any language, as long as they end up named cmd/bup-*.
   We might end up re-coding large parts of bup in C eventually so that it
   can be even faster and (perhaps) more portable.

 - lib/bup/*.py are python library files used by the cmd/*.py commands.
   That directory name seems a little silly (and worse, redundant) but there
   seemed to be no better way to let programs write "from bup import
   index" and have it work.  Putting bup in the top level conflicted with
   the 'bup' command; calling it anything other than 'bup' was fundamentally
   wrong, and doesn't work when you install bup on your system in /usr/lib
   somewhere.  So we get the annoyingly long paths.


Repository Structure
====================

Before we can talk about how bup works, we first need to address what it
does.  The purpose of bup is essentially to let you "replicate" data between
two main data structures:

1. Your computer's filesystem;

2. A bup repository. (Yes, we know, that part also resides in your
   filesystem.  Stop trying to confuse yourself.  Don't worry, we'll be
   plenty confusing enough as it is.)

Essentially, copying data from the filesystem to your repository is called
"backing stuff up," which is what bup specializes in.  Normally you initiate
a backup using the 'bup save' command, but that's getting ahead of
ourselves.

For the inverse operation, ie. copying from the repository to your
filesystem, you have several choices; the main ones are 'bup restore', 'bup
ftp', 'bup fuse', and 'bup web'.

Now, those are the basics of backups.  In other words, we just spent about
half a page telling you that bup backs up and restores data.  Are we having
fun yet?

The next thing you'll want to know is the format of the bup repository,
because hacking on bup is rather impossible unless you understand that part.
In short, a bup repository is a git repository.  If you don't know about
git, you'll want to read about it now.  A really good article to read is
"Git for Computer Scientists" - you can find it in Google.  Go read it now.
We'll wait.

Got it?  Okay, so now you're an expert in blobs, trees, commits, and refs,
the four building blocks of a git repository.  bup uses these four things,
and they're formatted in exactly the same way as git does it, so you can use
git to manipulate the bup repository if you want, and you probably won't
break anything.  It's also a comfort to know you can squeeze data out using
git, just in case bup fails you, and as a developer, git offers some nice
tools (like 'git rev-list' and 'git log' and 'git diff' and 'git show' and
so on) that allow you to explore your repository and help debug when things
go wrong.

Now, bup does use these tools a little bit differently than plain git.  We
need to do this in order to address three deficiencies in git when used for
large backups, namely a) git bogs down and crashes if you give it really
large files; b) git is too slow when you give it too many files; and c) git
doesn't store detailed filesystem metadata.

Let's talk about each of those problems in turn.


Handling large files (cmd/split, hashsplit.split_to_blob_or_tree)
--------------------

The primary reason git can't handle huge files is that it runs them through
xdelta, which generally means it tries to load the entire contents of a file
into memory at once.  If it didn't do this, it would have to store the
entire contents of every single revision of every single file, even if you
only changed a few bytes of that file.  That would be a terribly inefficient
use of disk space, and git is well known for its amazingly efficient
repository format.

Unfortunately, xdelta works great for small files and gets amazingly slow
and memory-hungry for large files.  For git's main purpose, ie. managing
your source code, this isn't a problem.  But when backing up your
filesystem, you're going to have at least a few large files, and so it's a
non-starter.  bup has to do something totally different.

What bup does instead of xdelta is what we call "hashsplitting."  We wanted
a general-purpose way to efficiently back up *any* large file that might
change in small ways, without storing the entire file every time.  In fact,
the original versions of bup could only store a single file at a time;
surprisingly enough, this was enough to give us a large part of bup's
functionality.  If you just take your entire filesystem and put it in a
giant tarball each day, then send that tarball to bup, bup will be able to
efficiently store only the changes to that tarball from one day to the next.
For small files, bup's compression won't be as good as xdelta's, but for
anything over a few megabytes in size, bup's compression will actually
*work*, which is a big advantage over xdelta.

How does hashsplitting work?  It's deceptively simple.  We read through the
file one byte at a time, calculating a rolling checksum of the last 64
bytes.  (Why 64?  No reason.  Literally.  We picked it out of the air.
Probably some other number is better.  Feel free to join the mailing list
and tell us which one and why.)  (The rolling checksum idea is actually
stolen from rsync and xdelta, although we use it differently.  And they use
some kind of variable window size based on a formula we don't totally
understand.)

The original rolling checksum algorithm we used was called "stupidsum,"
because it was based on the only checksum Avery remembered how to calculate at
the time.  He also remembered that it was the introductory checksum
algorithm in a whole article about how to make good checksums that he read
about 15 years ago, and it was thoroughly discredited in that article for
being very stupid.  But, as so often happens, Avery couldn't remember any
better algorithms from the article.  So what we got is stupidsum.

Since then, we have replaced the stupidsum algorithm with what we call
"rollsum," based on code in librsync.  It's essentially the same as what
rsync does, except we use a fixed window size.

(If you're a computer scientist and can demonstrate that some other rolling
checksum would be faster and/or better and/or have fewer screwy edge cases,
we need your help!  Avery's out of control!  Join our mailing list!  Please!
Save us! ...  oh boy, I sure hope he doesn't read this)

In any case, rollsum seems to do pretty well at its job.
You can find it in bupsplit.c.  Basically, it converts the last 64 bytes
read into a 32-bit integer.  What we then do is take the lowest 13
bits of the rollsum, and if they're all 1's, we consider that to be the end
of a chunk.  This happens on average once every 2^13 = 8192 bytes, so the
average chunk size is 8192 bytes.

(Why 13 bits?  Well, we picked the number at random and... eugh.  You're
getting the idea, right?  Join the mailing list and tell us why we're
wrong.)

(Incidentally, even though the average chunk size is 8192 bytes, the actual
probability distribution of block sizes ends up being non-uniform; if we
remember our stats classes correctly, which we probably don't, it's probably
an "exponential distribution."  The idea is that for each byte in the block,
the probability that it's the last byte of the chunk is one in 8192.  Thus,
the block sizes end up being skewed toward the smaller end.  That's not
necessarily for the best, but maybe it is.  Computer science to the rescue?
You know the drill.)
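
If it helps to see the boundary rule in code, here's a rough sketch in
python (loosely modeled on the rollsum in bupsplit.c; the window size, bit
count, and character offset here are just illustrative, and the C code is
the real authority):

    # Illustrative content-defined chunker, loosely modeled on bup's
    # rollsum (see bupsplit.c for the real thing).  A chunk ends whenever
    # the low 13 bits of the rolling checksum are all ones, so boundaries
    # depend only on the last 64 bytes of data, not on file offsets.

    WINDOW = 64           # rolling window size in bytes
    SPLIT_BITS = 13       # 2**13 = 8192-byte average chunks
    CHAR_OFFSET = 31      # constant mixed in, as in librsync's rollsum

    def chunk_offsets(data):
        """Yield the end offset of each chunk in 'data'."""
        window = bytearray(WINDOW)
        wofs = 0
        s1 = WINDOW * CHAR_OFFSET
        s2 = WINDOW * (WINDOW - 1) * CHAR_OFFSET
        mask = (1 << SPLIT_BITS) - 1
        last = 0
        for i, add in enumerate(data):
            drop = window[wofs]
            s1 = (s1 + add - drop) & 0xffffffff
            s2 = (s2 + s1 - WINDOW * (drop + CHAR_OFFSET)) & 0xffffffff
            window[wofs] = add
            wofs = (wofs + 1) % WINDOW
            if (s2 & mask) == mask:    # low 13 bits all ones: boundary
                yield i + 1
                last = i + 1
        if last < len(data):
            yield len(data)            # whatever is left over

    import os
    ends = list(chunk_offsets(os.urandom(1000 * 1000)))
    print(len(ends))                   # roughly 1000000 / 8192 chunks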

Anyway, so we're dividing up those files into chunks based on the rolling
checksum.  Then we store each chunk separately (indexed by its sha1sum) as a
git blob.  Why do we split this way?  Well, because the results are actually
really nice.  Let's imagine you have a big mysql database dump (produced by
mysqldump) and it's basically 100 megs of SQL text.  Tomorrow's database
dump adds 100 rows to the middle of the file somewhere, so it's 100.01 megs
of text.

A naive block splitting algorithm - for example, just dividing the file into
8192-byte blocks - would be a disaster.  After the first bit of text has
changed, every block after that would have a different boundary, so most of
the blocks in the new backup would be different from the previous ones, and
you'd have to store the same data all over again.  But with hashsplitting,
no matter how much data you add, modify, or remove in the middle of the
file, all the chunks *before* and *after* the affected chunk are absolutely
the same.  All that matters to the hashsplitting algorithm is the
"separator" sequence, and a single change can only affect, at most, one
separator sequence or the bytes between two separator sequences.  And
because of rollsum, about one in 8192 possible 64-byte sequences is a
separator sequence.  Like magic, the hashsplit chunking algorithm will chunk
your file the same way every time, even without knowing how it had chunked
it previously.

The next problem is less obvious: after you store your series of chunks as
git blobs, how do you store their sequence?  Each blob has a 20-byte sha1
identifier, which means the simple list of blobs is going to be 20/8192 =
0.25% of the file length.  For a 200GB file, that's 488 megs of just
sequence data.

As an overhead percentage, 0.25% basically doesn't matter.  488 megs sounds
like a lot, but compared to the 200GB you have to store anyway, it's
irrelevant.  What *is* relevant is that 488 megs is a lot of memory you have
to use in order to keep track of the list.  Worse, if you back up an
almost-identical file tomorrow, you'll have *another* 488 meg blob to keep
track of, and it'll be almost but not quite the same as last time.

Hmm, big files, each one almost the same as the last... you know where this
is going, right?

Actually no!  Ha!  We didn't split this list in the same way.  We could
have, in fact, but it wouldn't have been very "git-like", since we'd like to
store the list as a git 'tree' object in order to make sure git's
refcounting and reachability analysis doesn't get confused.  Never mind the
fact that we want you to be able to 'git checkout' your data without any
special tools.

What we do instead is we extend the hashsplit algorithm a little further
using what we call "fanout." Instead of checking just the last 13 bits of
the checksum, we use additional checksum bits to produce additional splits.
Note that (most likely due to an implementation bug), the next higher bit
after the 13 bits (marked 'x'):

  ...... '..x1'1111'1111'1111

is actually ignored next. Now, let's say we use a 4-bit fanout. That means
we'll break a series of chunks into its own tree object whenever the next
4 bits of the rolling checksum are all 1, in addition to the 13 lowest ones.
Since the 13 lowest bits already have to be 1, the boundary of a group of
chunks is necessarily also always the boundary of a particular chunk.

And so on.  Eventually you'll have too many chunk groups, but you can group
them into supergroups by using another 4 bits, and continue from there.
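
To make the fanout idea concrete, here's a hedged little sketch (the
constants and function are made up for illustration; bup's real logic lives
in hashsplit.py, and it also has the skipped-bit quirk mentioned above,
which we ignore here):

    # Illustration only: at a chunk boundary, count how many *extra*
    # groups of fanout bits (beyond the 13 split bits) are also all ones.
    # That count is how many levels of the tree close at this boundary.

    SPLIT_BITS = 13
    FANOUT_BITS = 4        # 4 extra bits per level: ~16 chunks per group

    def split_level(rollsum):
        smask = (1 << SPLIT_BITS) - 1
        fmask = (1 << FANOUT_BITS) - 1
        if (rollsum & smask) != smask:
            return None                   # not a chunk boundary at all
        rest = rollsum >> SPLIT_BITS
        level = 0
        while rest & fmask == fmask:
            level += 1
            rest >>= FANOUT_BITS
        return level

    print(split_level(0b1111111111111))        # 0: ends a chunk
    print(split_level(0b11111111111111111))    # 1: also ends a chunk group

A level-0 boundary just ends a blob; a level-1 boundary also closes the
current group of blobs into a tree; a level-2 boundary closes the group of
groups, and so on.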

What you end up with is an actual tree of blobs - which git 'tree' objects
are ideal to represent.  And if you think about it, just like the original
list of chunks, the tree itself is pretty stable across file modifications.
Any one modification will only affect the chunks actually containing the
modifications, thus only the groups containing those chunks, and so on up
the tree.  Essentially, the number of changed git objects is O(log n)
where n is the number of chunks.  Since log 200 GB, using a base of 16 or
so, is not a very big number, this is pretty awesome.  Remember, any git
object we *don't* change in a new backup is one we can reuse from last time,
so the deduplication effect is pretty awesome.

Better still, the hashsplit-tree format is good for a) random instead of
sequential access to data (which you can see in action with 'bup fuse'); and
b) quickly showing the differences between huge files (which we haven't
really implemented because we don't need it, but you can try 'git diff -M -C
-C backup1 backup2 -- filename' for a good start).

So now we've split our 200 GB file into about 24 million pieces.  That
brings us to git limitation number 2.


Handling huge numbers of files (git.PackWriter)
------------------------------

git is designed for handling reasonably-sized repositories that change
relatively infrequently.  (You might think you change your source code
"frequently" and that git handles much more frequent changes than, say, svn
can handle.  But that's not the same kind of "frequently" we're talking
about.  Imagine you're backing up all the files on your disk, and one of
those files is a 100 GB database file with hundreds of daily users.  Your
disk changes so frequently you can't even back up all the revisions even if
you were backing stuff up 24 hours a day.  That's "frequently.")

git's way of doing things works really nicely for the way software
developers write software, but it doesn't really work so well for everything
else.  The #1 killer is the way it adds new objects to the repository: it
creates one file per blob.  Then you later run 'git gc' and combine those
files into a single file (using highly efficient xdelta compression, and
ignoring any files that are no longer relevant).

'git gc' is slow, but for source code repositories, the resulting
super-efficient storage (and associated really fast access to the stored
files) is worth it.  For backups, it's not; you almost never access your
backed-up data, so storage time is paramount, and retrieval time is mostly
unimportant.

To back up that 200 GB file with git and hashsplitting, you'd have to create
24 million little 8k files, then copy them into a 200 GB packfile, then
delete the 24 million files again.  That would take about 400 GB of disk
space to run, require lots of random disk seeks, and require you to go
through your data twice.

So bup doesn't do that.  It just writes packfiles directly.  Luckily, these
packfiles are still git-formatted, so git can happily access them once
they're written.

But that leads us to our next problem.


Huge numbers of huge packfiles (midx.py, bloom.py, cmd/midx, cmd/bloom)
------------------------------

Git isn't actually designed to handle super-huge repositories.  Most git
repositories are small enough that it's reasonable to merge all their packs
into a single packfile, which 'git gc' usually does eventually.

The problematic part of large packfiles isn't the packfiles themselves - git
is designed to expect the total size of all packs to be larger than
available memory, and once it can handle that, it can handle virtually any
amount of data about equally efficiently.  The problem is the packfile
index (.idx) files.  In bup we call these idx (pronounced "idix") files
instead of using the word "index," because the word index is already used
for something totally different in git (and thus bup) and we'll become
hopelessly confused otherwise.

Anyway, each packfile (*.pack) in git has an associated idx (*.idx) that's a
sorted list of git object hashes and file offsets.  If you're looking for a
particular object based on its sha1, you open the idx, binary search it to
find the right hash, then take the associated file offset, seek to that
offset in the packfile, and read the object contents.

The performance of the binary search is about O(log n) with the number of
hashes in the pack, with an optimized first step (you can read about it
elsewhere) that somewhat improves it to O(log(n)-7).
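
As a rough illustration of that lookup (this is not git's or bup's actual
.idx parsing code, just the idea), the 256-entry table jumps you to the
right slice of the sorted hash list and a binary search finishes the job:

    import bisect

    # Conceptual sketch of an idx lookup: 'hashes' stands in for the
    # sorted list of 20-byte sha1s in a pack's .idx file (really an
    # mmapped binary structure, not a python list).

    def build_fanout(hashes):
        """fanout[b] = how many hashes have a first byte <= b."""
        fanout = [0] * 256
        for h in hashes:
            fanout[h[0]] += 1
        total = 0
        for b in range(256):
            total += fanout[b]
            fanout[b] = total
        return fanout

    def lookup(hashes, fanout, want):
        lo = fanout[want[0] - 1] if want[0] else 0
        hi = fanout[want[0]]
        i = bisect.bisect_left(hashes, want, lo, hi)
        if i < hi and hashes[i] == want:
            return i           # a real .idx stores a pack offset here
        return None            # not in this pack; try the next one

    import hashlib, os
    hashes = sorted(hashlib.sha1(os.urandom(32)).digest()
                    for _ in range(100000))
    fanout = build_fanout(hashes)
    assert lookup(hashes, fanout, hashes[12345]) == 12345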

Unfortunately, this breaks down a bit when you have *lots* of packs.  Say
you have 24 million objects (containing around 200 GB of data) spread across
200 packfiles of 1GB each.  To look for an object, you have to search
through about 122000 objects per pack; ceil(log2(122000)-7) = 10, so each
binary search takes about 10 steps.  About 7 of those steps will be confined
to a single 4k memory page, so you'll probably have to page in about 3-4
pages per file, times 200 files, which makes 600-800 4k pages (2.4-3.6
megs)... every single time you want to look for an object.

This brings us to another difference between git's and bup's normal use
case.  With git, there's a simple optimization possible here: when looking
for an object, always search the packfiles in MRU (most recently used)
order.  Related objects are usually clustered together in a single pack, so
you'll usually end up searching around 3 pages instead of 600, which is a
tremendous improvement.  (And since you'll quickly end up swapping in all
the pages in a particular idx file this way, it isn't long before searching
for a nearby object doesn't involve any swapping at all.)

bup isn't so lucky.  git users spend most of their time examining existing
objects (looking at logs, generating diffs, checking out branches), which
lends itself to the above optimization.  bup, on the other hand, spends most
of its time looking for *nonexistent* objects in the repository so that it
can back them up.  When you're looking for objects that aren't in the
repository, there's no good way to optimize; you have to exhaustively check
all the packs, one by one, to ensure that none of them contain the data you
want.

To improve performance of this sort of operation, bup introduces midx
(pronounced "midix" and short for "multi-idx") files.  As the name implies,
they index multiple packs at a time.

Imagine you had a midx file for your 200 packs.  midx files are a lot like
idx files; they have a lookup table at the beginning that narrows down the
initial search, followed by a binary search.  But unlike idx files (which
have a fixed-size 256-entry lookup table), midx files have a variably-sized
table that makes sure the entire binary search can be contained to a single
page of the midx file.  Basically, the lookup table tells you which page to
load, and then you binary search inside that page.  A typical search thus
only requires the kernel to swap in two pages, which is better than you'd
get even with a single large idx file.  And if you have lots of RAM,
eventually the midx lookup table (at least) will end up cached in memory, so
only a single page should be needed for each lookup.
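
As a back-of-the-envelope sketch of that sizing rule (purely illustrative;
the real format and math live in midx.py):

    # Pick enough prefix bits that each lookup-table bucket's worth of
    # 20-byte hashes fits in a single 4k page, so one lookup costs the
    # table plus one page of hashes.  Illustrative only.

    PAGE = 4096
    SHA_BYTES = 20

    def table_bits(total_hashes):
        bits = 1
        while (total_hashes >> bits) * SHA_BYTES > PAGE:
            bits += 1
        return bits

    print(table_bits(24 * 1000 * 1000))    # 17 bits for ~24M objects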

You generate midx files with 'bup midx'.  The downside of midx files is that
generating one takes a while, and you have to regenerate it every time you
add a few packs.

UPDATE: Brandon Low contributed an implementation of "bloom filters", which
have even better characteristics than midx for certain uses.  Look it up in
Wikipedia.  He also massively sped up both midx and bloom by rewriting the
key parts in C.  The nicest thing about bloom filters is we can update them
incrementally every time we get a new idx, without regenerating from
scratch.  That makes the update phase much faster, and means we can also get
away with generating midxes less often.
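
For the curious, here's a tiny, self-contained sketch of the bloom filter
idea (bup's real implementation is in bloom.py plus its C helpers, and is
sized and tuned quite differently):

    import hashlib

    # Minimal bloom-filter sketch.  bup only ever asks "might this object
    # already be in the repo?"; a "no" is definite, while a "yes" still
    # has to be confirmed against the idx/midx files, so false positives
    # just cost a little extra searching.

    class TinyBloom:
        def __init__(self, nbits=1 << 20, nhashes=4):
            self.nbits, self.nhashes = nbits, nhashes
            self.bits = bytearray(nbits // 8)

        def _positions(self, sha):
            # A sha1 is already uniformly distributed, so slices of the
            # object id itself can serve as the hash functions.
            for k in range(self.nhashes):
                yield int.from_bytes(sha[4*k:4*k+4], 'big') % self.nbits

        def add(self, sha):
            for p in self._positions(sha):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, sha):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(sha))

    bf = TinyBloom()
    bf.add(hashlib.sha1(b'some chunk').digest())
    print(hashlib.sha1(b'some chunk').digest() in bf)    # True
    print(hashlib.sha1(b'other chunk').digest() in bf)   # almost surely False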

midx files are a bup-specific optimization and git doesn't know what to do
with them.  However, since they're stored as separate files, they don't
interfere with git's ability to read the repository.


Detailed Metadata
-----------------

So that's the basic structure of a bup repository, which is also a git
repository.  There's just one more thing we have to deal with:
filesystem metadata.  Git repositories are really only intended to
store file contents with a small bit of extra information, like
symlink targets and executable bits, so we have to store the rest
some other way.

Bup stores more complete metadata in the VFS in a file named .bupm in
each tree.  This file contains one entry for each file in the tree
object, sorted in the same order as the tree.  The first .bupm entry
is for the directory itself, i.e. ".", and its name is the empty
string, "".

Each .bupm entry contains a variable length sequence of records
containing the metadata for the corresponding path.  Each record
records one type of metadata.  Current types include a common record
type (containing the normal stat information), a symlink target type,
a hardlink target type, a POSIX1e ACL type, etc.  See metadata.py for
the complete list.

The .bupm file is optional, and when it's missing, bup will behave as
it did before the addition of metadata, and restore files using the
tree information.

The nice thing about this design is that you can walk through each
file in a tree just by opening the tree and the .bupm contents, and
iterating through both at the same time.
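
In other words (a hedged pseudo-sketch; these helper names are made up for
the example, and the real code is in metadata.py and vfs.py):

    # Hypothetical illustration of the paired walk described above: the
    # first .bupm record describes the directory itself (""), and each
    # following record lines up, in order, with the entries of the tree.

    def walk_tree_with_metadata(tree_entries, bupm_records):
        """tree_entries: [(name, sha), ...] in git tree order, without
        the .bupm entry itself; bupm_records: decoded metadata records,
        one for '.' followed by one per tree entry."""
        records = iter(bupm_records)
        yield ('.', None, next(records))     # metadata for the directory
        for (name, sha), meta in zip(tree_entries, records):
            yield (name, sha, meta)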

Since the contents of any .bupm file should match the state of the
filesystem when it was *indexed*, bup must record the detailed
metadata in the index.  To do this, bup records four values in the
index: the atime, mtime, and ctime (as timespecs), and an integer
offset into a secondary "metadata store" which has the same name as
the index, but with ".meta" appended.  This secondary store contains
the encoded Metadata object corresponding to each path in the index.

Currently, in order to decrease the storage required for the metadata
store, bup only writes unique values there, reusing offsets when
appropriate across the index.  The effectiveness of this approach
relies on the expectation that there will be many duplicate metadata
records.  Storing the full timestamps in the index is intended to make
that more likely, because it makes it unnecessary to record those
values in the secondary store.  So bup clears them before encoding the
Metadata objects destined for the index's metadata store, and
timestamp differences don't contribute to the uniqueness of the
metadata.

Bup supports recording and restoring hardlinks, and it does so by
tracking sets of paths that correspond to the same dev/inode pair when
indexing.  This information is stored in an optional file with the
same name as the index, but ending with ".hlink".

If there are multiple index runs, and the hardlinks change, bup will
notice this (within whatever subtree it is asked to reindex) and
update the .hlink information accordingly.

The current hardlink implementation will refuse to link to any file
that resides outside the restore tree, and if the restore tree spans a
different set of filesystems than the save tree, complete sets of
hardlinks may not be restored.


Filesystem Interaction
======================

Storing data is just half of the problem of making a backup; figuring out
what to store is the other half.

At the most basic level, piping the output of 'tar' into 'bup split' is an
easy way to offload that decision; just let tar do all the hard stuff.  And
if you like tar files, that's a perfectly acceptable way to do it.  But we
can do better.

Backing up with tarballs would totally be the way to go, except for two
serious problems:

1. The result isn't easily "seekable."  Tar files have no index, so if (as
   commonly happens) you only want to restore one file in a 200 GB backup,
   you'll have to read up to 200 GB before you can get to the beginning of
   that file.  tar is short for "tape archive"; on a tape, there was no
   better way to do it anyway, so they didn't try.  But on a disk, random
   file access is much, much better when you can figure out how.

2. tar doesn't remember which files it backed up last time, so it has to
   read through the entire file contents again in order to generate the
   tarball, large parts of which will then be skipped by bup since they've
   already been stored.  This is much slower than necessary.

(The second point isn't entirely true for all versions of tar. For example,
GNU tar has an "incremental" mode that can somewhat mitigate this problem,
if you're smart enough to know how to use it without hurting yourself.  But
you still have to decide which backups are "incremental" and which ones will
be "full" and so on, so even when it works, it's more error-prone than bup.)

bup divides the backup process into two major steps: a) indexing the
filesystem, and b) saving file contents into the repository.  Let's look at
those steps in detail.


Indexing the filesystem (cmd/drecurse, cmd/index, index.py)
-----------------------

Splitting the filesystem indexing phase into its own program is
nontraditional, but it gives us several advantages.

The first advantage is trivial, but might be the most important: you can
index files a lot faster than you can back them up.  That means we can
generate the index (.bup/bupindex) first, then have a nice, reliable,
non-lying completion bar that tells you how much of your filesystem remains
to be backed up.  The alternative would be annoying failures like counting
the number of *files* remaining (as rsync does), even though one of the
files is a virtual machine image of 80 GB, and the 1000 other files are each
under 10k.  With bup, the percentage complete is the *real* percentage
complete, which is very pleasant.

Secondly, it makes it easier to debug and test; you can play with the index
without actually backing up any files.

Thirdly, you can replace the 'bup index' command with something else and not
have to change anything about the 'bup save' command.  The current 'bup
index' implementation just blindly walks the whole filesystem looking for
files that have changed since the last time it was indexed; this works fine,
but something using inotify instead would be orders of magnitude faster.
Windows and MacOS both have inotify-like services too, but they're totally
different; if we want to support them, we can simply write new bup commands
that do the job, and they'll never interfere with each other.

And fourthly, git does it that way, and git is awesome, so who are we to
argue?

So let's look at how the index file works.

First of all, note that the ".bup/bupindex" file is not the same as git's
".git/index" file.  The latter isn't used in bup; as far as git is
concerned, your bup repository is a "bare" git repository and doesn't have a
working tree, and thus it doesn't have an index either.

However, the bupindex file actually serves exactly the same purpose as git's
index file, which is why we still call it "the index." We just had to
redesign it for the usual bup-vs-git reasons, mostly that git just isn't
designed to handle millions of files in a single repository.  (The only way
to find a file in git's index is to search it linearly; that's very fast in
git-sized repositories, but very slow in bup-sized ones.)

Let's not worry about the exact format of the bupindex file; it's still not
optimal, and will probably change again.  The most important things to know
about bupindex are:

 - You can iterate through it much faster than you can iterate through the
   "real" filesystem (using something like the 'find' command).

 - If you delete it, you can get it back just by reindexing your filesystem
   (although that can be annoying to wait for); it's not critical to the
   repository itself.

 - You can iterate through only particular subtrees if you want.

 - There is no need to have more than one index for a particular filesystem,
   since it doesn't store anything about backups; it just stores file
   metadata.  It's really just a cache (or 'index') of your filesystem's
   existing metadata.  You could share the bupindex between repositories, or
   between multiple users on the same computer.  If you back up your
   filesystem to multiple remote repositories to be extra safe, you can
   still use the same bupindex file across all of them, because it's the
   same filesystem every time.

 - Filenames in the bupindex are absolute paths, because that's the best way
   to ensure that you only need one bupindex file and that they're
   interchangeable.


A note on file "dirtiness"
--------------------------

The concept on which 'bup save' operates is simple enough; it reads through
the index and backs up any file that is "dirty," that is, doesn't already
exist in the repository.

Determination of dirtiness is a little more complicated than it sounds.  The
most dirtiness-relevant flag in the bupindex is IX_HASHVALID; if
this flag is reset, the file *definitely* is dirty and needs to be backed
up.  But a file may be dirty even if IX_HASHVALID is set, and that's the
confusing part.

The index stores a listing of files, their attributes, and
their git object ids (sha1 hashes), if known.  The "if known" is what
IX_HASHVALID is about.  When 'bup save' backs up a file, it sets
the sha1 and sets IX_HASHVALID; when 'bup index' sees that a file has
changed, it leaves the sha1 alone and resets IX_HASHVALID.

Remember that the index can be shared between users, repositories, and
backups.  So IX_HASHVALID doesn't mean your repository *has* that sha1 in
it; it only means that if you *do* have it, you don't need to back up
the file.  Thus, 'bup save' needs to check every file in the index to make
sure its hash exists, not just that it's valid.

There's an optimization possible, however: if you know a particular tree's
hash is valid and exists (say /usr), then you don't need to check the
validity of all its children; because of the way git trees and blobs work,
if your repository is valid and you have a tree object, then you have all
the blobs it points to.  You won't back up a tree object without backing up
its blobs first, so you don't need to double check it next time.  (If you
really want to double check this, it belongs in a tool like 'bup fsck' or
'git fsck'.) So in short, 'bup save' on a "clean" index (all files are
marked IX_HASHVALID) can be very fast; we just check our repository and see
if the top level IX_HASHVALID sha1 exists.  If it does, then we're done.

Similarly, even if the entire index isn't valid, you can still avoid
recursing into subtrees if those particular subtrees are IX_HASHVALID and
their sha1s are in the repository.  The net result is that, as long as you
never lose your index, 'bup save' can always run very fast.
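
Putting those rules together, the decision 'bup save' has to make for each
index entry boils down to something like this hedged sketch (the names are
hypothetical; the real logic is in cmd/save-cmd.py and index.py):

    # Hypothetical sketch, not bup's actual code.  An entry can be
    # skipped only if its hash is marked valid *and* already present in
    # the destination repository; for directories, that one check lets
    # us skip the entire subtree.

    def needs_backup(entry, repo_has):
        """entry.hashvalid is the IX_HASHVALID flag; entry.sha is the
        last known object id; repo_has(sha) asks whether the repository
        already contains that object."""
        return not (entry.hashvalid and repo_has(entry.sha))

    def can_skip_subtree(tree_entry, repo_has):
        # If a tree's hash is valid and present, git's object model
        # guarantees every blob and subtree beneath it is present too,
        # so there is no need to recurse into it at all.
        return not needs_backup(tree_entry, repo_has)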

Another interesting trick is that you can skip backing up files even if
IX_HASHVALID *isn't* set, as long as you have that file's sha1 in the
repository.  What that means is you've chosen not to back up the latest
version of that file; instead, your new backup set just contains the
most-recently-known valid version of that file.  This is a good trick if you
want to do frequent backups of smallish files and infrequent backups of
large ones.  Each of your backups will be "complete," in that they contain
all the small files and the large ones, but intermediate ones will just
contain out-of-date copies of the large files. Note that this isn't done
right now, and 'bup save --smaller' doesn't store bigger files _at all_.

A final game we can play with the bupindex involves restoring: when you
restore a directory from a previous backup, you can update the bupindex
right away.  Then, if you want to restore a different backup on top, you can
compare the files in the index against the ones in the backup set, and
update only the ones that have changed.  (Even more interesting things
happen if people are using the files on the restored system and you haven't
updated the index yet; the net result would be an automated merge of all
non-conflicting files.) This would be a poor man's distributed filesystem.
The only catch is that nobody has written this feature for 'bup restore'
yet.  Someday!


How 'bup save' works (cmd/save)
--------------------

This section is too boring and has been omitted.  Once you understand the
index, there's nothing special about bup save.


Retrieving backups: the bup vfs layer (vfs.py, cmd/ls, cmd/ftp, cmd/fuse)
=====================================

One of the neat things about bup's storage format, at least compared to most
backup tools, is that it's easy to read a particular file, or even part of a
file.  That means a read-only virtual filesystem is easy to generate and
it'll have good performance characteristics.  Because of git's commit
structure, you could even use branching and merging to make a transactional
read-write filesystem... but that's probably getting a little out of bup's
scope.  Who knows what the future might bring, though?

Read-only filesystems are well within our reach today, however.  The 'bup
ls', 'bup ftp', and 'bup fuse' commands all use a "VFS" (virtual filesystem)
layer to let you access your repositories.  Feel free to explore the source
code for these tools and vfs.py - they're pretty straightforward.  Some
things to note:

 - None of these use the bupindex for anything.

 - For user-friendliness, they present your refs/commits/trees as a single
   hierarchy (ie.  a filesystem), which isn't really how git repositories
   are formatted.  So don't get confused!


Handling Python 3's insistence on strings
=========================================

In Python 2 strings were bytes, and bup used them for all kinds of
data.  Python 3 made a pervasive backward-incompatible change to treat
all strings as Unicode, i.e. in Python 2 'foo' and b'foo' were the
same thing, while u'foo' was a Unicode string.  In Python 3 'foo'
became synonymous with u'foo', completely changing the type and
potential content, depending on the locale.

In addition, and particularly bad for bup, Python 3 also (initially)
insisted that all kinds of things were strings that just aren't (at
least not on many platforms), i.e. user names, groups, filesystem
paths, etc.  There's no guarantee that any of those are always
representable in Unicode.

Over the years, Python 3 has gradually backed away from that initial
position, adding alternate interfaces like os.environb or allowing
bytes arguments to many functions like open(b'foo'...), so that in
those cases it's at least possible to accurately retrieve the system
data.

After a while, they devised the concept of
[bytesmuggling](https://www.python.org/dev/peps/pep-0383/) as a more
comprehensive solution.  In theory, this might be sufficient, but our
initial randomized testing discovered that some binary arguments would
crash Python during startup[1].  Eventually Johannes Berg tracked down
the [cause](https://sourceware.org/bugzilla/show_bug.cgi?id=26034),
and we hope that the problem will be fixed eventually in glibc or
worked around by Python, but in either case, it will be a long time
before any fix is widely available.

Before we tracked down that bug we were pursuing an approach that
would let us side step the issue entirely by manipulating the
LC_CTYPE, but that approach was somewhat complicated, and once we
understood what was causing the crashes, we decided to just let Python
3 operate "normally", and work around the issues.

Consequently, we've had to wrap a number of things ourselves that
incorrectly return Unicode strings (libacl, libreadline, hostname,
etc.)  and we've had to come up with a way to avoid the fatal crashes
caused by some command line arguments (sys.argv) described above.  To
fix the latter, for the time being, we just use a trivial sh wrapper
to redirect all of the command line arguments through the environment
in BUP_ARGV_{0,1,2,...} variables, since the variables are unaffected,
and we can access them directly in Python 3 via environb.
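
The receiving end of that trick looks roughly like this (a sketch of the
idea only; the real wrapper and startup code are in the bup tree):

    import os

    # Because os.environb hands back raw bytes, the original arguments
    # survive untouched no matter what the locale thinks about their
    # encoding.  Sketch only; names and layout differ in the real code.

    def recover_argv():
        argv = []
        i = 0
        while True:
            arg = os.environb.get(b'BUP_ARGV_%d' % i)
            if arg is None:
                return argv
            argv.append(arg)
            i += 1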

[1] Our randomized argv testing found that the byte smuggling approach
    was not working correctly for some values (initially discovered in
    Python 3.7, and observed in other versions).  The interpreter
    would just crash while starting up like this:

    Fatal Python error: _PyMainInterpreterConfig_Read: memory allocation failed
    ValueError: character U+134bd2 is not in range [U+0000; U+10ffff]

    Current thread 0x00007f2f0e1d8740 (most recent call first):
    Traceback (most recent call last):
      File "t/test-argv", line 28, in <module>
        out = check_output(cmd)
      File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
        **kwargs).stdout
      File "/usr/lib/python3.7/subprocess.py", line 487, in run
        output=stdout, stderr=stderr)

We hope you'll enjoy bup.  Looking forward to your patches!

-- apenwarr and the rest of the bup team

Local Variables:
mode: markdown
End:
