• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..08-Jan-2016-

contrib/H08-Jan-2016-2214

djgpp/H08-Jan-2016-186130

doc/H08-Jan-2016-345275

libedsio/H08-Jan-2016-7,8275,715

test/H08-Jan-2016-706521

AUTHORSH A D08-Jan-2016166 43

COPYINGH A D08-Jan-201617.6 KiB340281

INSTALLH A D08-Jan-20168.3 KiB218163

Makefile.amH A D08-Jan-20161.1 KiB5126

NEWSH A D08-Jan-20168.3 KiB218156

READMEH A D08-Jan-20166.9 KiB181131

autogen.shH A D08-Jan-20161.4 KiB6047

configure.inH A D08-Jan-20161.9 KiB9271

getopt.cH A D08-Jan-201621.2 KiB750397

getopt.hH A D08-Jan-20164.3 KiB13048

getopt1.cH A D08-Jan-20164.1 KiB181118

runtestH A D08-Jan-2016936 4937

xd.serH A D08-Jan-20163.1 KiB139113

xdapply.cH A D08-Jan-20162.4 KiB9247

xdelta-0.13.READMEH A D08-Jan-201611.9 KiB312242

xdelta-config.inH A D08-Jan-20161.8 KiB116100

xdelta.cH A D08-Jan-201635.2 KiB1,5421,087

xdelta.hH A D08-Jan-20166.9 KiB19671

xdelta.m4H A D08-Jan-20168.2 KiB184176

xdelta.magicH A D08-Jan-2016404 109

xdelta.prjH A D08-Jan-20164.2 KiB124114

xdeltapriv.hH A D08-Jan-20163.9 KiB13282

xdmain.cH A D08-Jan-201642.6 KiB1,9661,469

xdrsync.cH A D08-Jan-20168.7 KiB444312

README

1 -*- Text -*-
2
3	       Xdelta -- A binary delta generator
4
5Announcing version 1.1.2 of Xdelta.  Xdelta is an application program
6designed to compute changes between files.  These changes (deltas) are
7similar to the output of the "diff" program in that they may be used
8to store and transmit only the changes between files.  However, unlike
9diff, the output of Xdelta is not expressed in a human-readable
10format--Xdelta can also also apply these deltas to a copy of the
11original file.  Xdelta uses a fast, linear algorithm and performs well
12on both binary and text files.
13
14Xdelta 1.1.2 is a stable, maintenence release.  New, ongoing work on
15Xdelta has focused on a new storage system with features similar to
16the RCS command set.  For more information on new development, see the
17Xdelta-2.0 release series at http://xdelta.sourceforge.net.
18
19Xdelta was designed and implemented by Joshua MacDonald.  The delta
20algorithm is based on the Rsync algorithm, though implementation and
21interface considerations leave the two programs quite distinct.  The
22Rsync algorithm is due to Andrew Tridgell and Paul Mackerras.
23
24To compile and install Xdelta, read the instructions in the INSTALL
25file.  Once you have done this, you should at least read the first few
26sections of the documentation.  It is available in info format.  All
27documentation is located in the doc/ subdirectory.
28
29This release, version 1.1.2, and future releases of Xdelta can be
30found at http://xdelta.sourceforge.net.
31
32Xdelta is released under the GNU Library Public License (GPL), see the
33file COPYING for details.
34
35There is mailing list for announcements:
36
37   xdelta-announce@lists.sourceforge.net
38
39you can subscribe to the mailing list or file bug reports through
40Sourceforge at:
41
42   http://sourceforge.net/projects/xdelta/
43
44Comments about Xdelta can be addressed to the following addresses:
45
46   jmacd@cs.berkeley.edu
47
48The man page describes how to use Xdelta in more detail:
49
50NAME
51       xdelta - Invoke Xdelta
52
53SYNOPSIS
54       xdelta subcommand [ option...  ] [ operand...  ]
55
56DESCRIPTION
57
58       Xdelta provides the ability to generate deltas between a pair
59       of files and later apply those deltas.  It operates similar to
60       the diff and patch commands, but works on binary files and does
61       not produce a human readable output.
62
63       Xdelta has three subcommands, delta, patch, and info.  Delta
64       accepts two file versions and produces a delta, while patch
65       accepts the original file version and delta and produces the
66       second version.  The info command prints useful information
67       about a delta.  Each subcommand will be detailed seperately.
68
69   Gzip processing
70
71       Attempting to compute a delta between compressed input files
72       usually results in poor compression.  This is because small
73       differences between the original contents causes changes in the
74       compression of whole blocks of data.  To simplify things,
75       Xdelta implements a special case for gzip(1) compressed files.
76       If any version input to the delta command is recognized as
77       having gzip compression, it will be automatically decompressed
78       into a temporary location prior to comparison.  This temporary
79       location is either the value of the TMPDIR environment
80       variable, if set, otherwise "/tmp".
81
82       The Xdelta patch header contains a flag indicating that the
83       reconstructed version should be recompressed after applying
84       the patch.  In general, this allows Xdelta to operate
85       transparently on gzip compressed inputs.
86
87       There is one potential problem when automatically processing
88       gzip compressed files, which is that the recompressed content
89       does not always match byte-for-byte with the original
90       compressed content.  The uncompressed content still matches,
91       but if there is an external integrity check such as
92       cryptographic signature verification, it may fail.  To prevent
93       this from happening, the --pristine option disables automatic
94       gzip processing.
95
96   MD5 integrity check
97
98       By default, Xdelta always verifies the MD5 checksum of the
99       files it reconstructs.  This prevents you from supplying an
100       incorrect input during patch, which would result in corrupt
101       output.  Because of this feature, you can feel confident that
102       patch has produced valid results.  The --noverify option
103       disables MD5 verification, but this is only recommended for
104       performance testing.
105
106   Compressed patch format
107
108       Xdelta uses a fairly simple encoding for its delta, then
109       applies zlib compression to the result.  You should not have to
110       post-compress an Xdelta delta.
111
112   Delta
113
114       The delta subcommand has the following synopsis:
115
116       xdelta delta [ option...  ] fromfile tofile patchout
117
118       Computes a delta from fromfile to tofile and writes it to patchout
119
120   Patch
121
122       The patch subcommand has the following synopsis:
123
124       xdelta patch [ option...  ] patchin [ fromfile [ tofile ]]
125
126       Applies patchin to fromfile and produces a reconstructed
127       version of tofile.
128
129       If fromfile was omitted, Xdelta attempts to use the original
130       fromfile name, which is stored in the delta.  The from file
131       must be identical to the one used to create the delta.  If its
132       length or MD5 checksum differs, patch will abort with an error
133       message.
134
135       If tofile was omitted, Xdelta attempts to use the original
136       tofile name, which is also stored in the delta.  If the
137       original tofile name already exists, a unique filename
138       extension will be added to avoid destroying any existing data.
139
140   Info
141       The info subcommand has the following synopsis:
142
143       xdelta info patchinfo
144
145       Prints information about patchinfo and the version it
146       reconstructs, including file names, lengths, and MD5 checksums.
147
148   Options
149       -0..9  Set  the  zlib compression level.  Zero indicates no
150              compression.  Nine indicates maximum compression.
151
152       -h, --help
153              Print a short help message and exit.
154
155       -q, --quiet
156              Quiet.  Surpresses several warning messages.
157
158       -v, --version
159              Print the Xdelta version number and exit.
160
161       -V, --verbose
162              Verbose.  Prints a bit of extra information.
163
164       -n, --noverify
165              No verify.  Turns off MD5 checksum verification of the
166	      input and output files.
167
168       -m=SIZE, --maxmem=SIZE
169              Set an upper bound on the size of an in-memory page
170              cache. For example, --maxmem=32M will use a 32 megabyte
171              page cache.
172
173       -s=BLOCK_SIZE
174              Set the block size, unless it was hard coded (20% speed
175              improvement).  Should be a power of 2.
176
177       -p, --pristine
178              Disable  the  automatic  decompression of gzipped
179              inputs,  to prevent unexpected differences in the
180              re-compressed content.
181

xdelta-0.13.README

1
2I have placed version 0.13 of XDELTA, new delta generator and
3prototype RCS-replacement program at:
4
5  ftp://www.xcf.berkeley.edu/pub/jmacd/xdelta-0.13.tar.gz
6
7======
8
9Hereafter, the algorithm and program are described.  A delta algorithm
10operates on two files, a FROM file and a TO file.  It computes enough
11informnation to reconstruct the TO file from the FROM file.  So the
12delta and it's inverse, patch, work as follows:
13
14P = delta (FROM, TO)
15TO = patch (FROM, P)
16
17The delta generator portion of this program is a delta algorithm which
18searches for substring matches between the files and then outputs
19instructions to reconstruct the new file from the old file.  It
20produces a set of copy/insert instructions that tell how to
21reconstruct the file as a sequence of copies from the FROM file and
22inserts from the delta itself.  In this regard, the program is much
23closer to a compression program than to a diff program.  However, the
24delta is not "compressed", in that the delta's entropy H(P) will be
25very similar to the entropy of the portions of the TO file not found
26within the FROM file.  The delta will compress just as well as the TO
27file will.  This is a fundamentally different method of computing
28deltas than in the traditional "diff" program.  The diff program and
29it's variants use a least-common-subsequence (LCS) algorithm to find a
30list of inserts and deletes that will modify the FROM file into the TO
31file.  LCS is more expensive to compute and is sometimes more useful,
32especially to the human reader.  Since LCS is a fairly expensive
33algorithm, diff programs usually divide the input files into
34newline-separated "atoms" before computing a delta.  This is a fine
35approximation for text files, but not binary files.
36
37I propose this delta generator as a replacement for "diff" in
38applications where the insert/delete delta is not required.  Since the
39copy/insert delta is easier to compute, it does not have to make the
40reduction in input size by breaking the file into lines.  The delta
41generator works well on binary files.
42
43The next question is whether it is actually desirable to compute
44deltas between binary files.  The answer is certainly yes, even though
45some binary file formats will not have a great degree of similarity
46between versions which were generated between minor modifications to
47their sources.  First, I have evidence that some file formats (notably
48FrameMaker documents) work very well.  Machine-dependant object files
49and executables don't work very well, but it is still worth doing.
50The reason it is still worthwhile is that compression takes longer
51than finding these deltas, so any space savings the delta generator
52produces it very likely to reduce the total archival time.  Even if
53the delta generator saves no space, the total time well end up less
54than twice the time of compression.  I will include some measurements
55at the end of this writing.
56
57With the delta generator as a building block, I have written a
58replacement for RCS.  It does not have all the user-level features
59someone would want in RCS, but these things only get in the way when
60the code is used by higher level version control programs such as
61PRCS.  For now, the prototype supports the following actions:
62
63register		create an empty version file
64checkin			create a new version
65checkout		extract an old version
66
67The program departs from the RCS model in that it does not branch it's
68versions.  The reason for doing this is to solve several problems with
69RCS (aside from it's use of diff which is solved by using my delta
70generator).
71
721. Development very often ends up off the "main trunk" of the RCS
73file, which causes an O(N) extract, where N deltas must be applied to
74the head of the trunk before getting the desired file.  Once this
75occurs, things get worse and worse.  Of course, it is possible to
76remedy the situation by checking it in once on the trunk, but that
77defeats the purpose of branching, because you lose the information
78about branch of development you are on.  Further, the notion of
79branching individual files conflicts with the model presented by
80higher level version control programs such as PRCS, which would rather
81not have to go to extra trouble to insure that development does not
82end up off the trunk.
83
842. Deltas are never used more than once in a particular version
85history, even if they contain the same data as another delta.
86
87Conceptually, to get rid of RCS-like branching and retain the "deltas
88are computed between a file and its parent", the xdelta version
89control library computes a reverse delta between the new file and the
90entire version file.  The most recently checked in version is always
91the "head", in RCS terms.  It is always the quickest and easiest to
92extract.  For a repository with N versions, version I requires the
93application of (N-I) deltas, regardless of how the set of versions are
94related.
95
96Each delta contains a set of instructions and all data which was
97originally part of a file version but missing in the FROM file.  This
98data is represented without compression or modification.  As a result,
99each delta may be used as an input for finding future deltas just as a
100regular file may be.
101
102Several equations describe the process, first some notation:
103
104	V0 is the version file with 1 version
105	V1 is the version file with 2 versions,
106	...
107	VN is the version file with N+1 versions
108
109	F0 is the 1st file checked in, upon checkin V0 is produced
110	F1 is the 2nd file checked in, upon checkin V1 is produced
111	...
112	FN is the Nth file checked in, upon checkin VN is produced
113
114	D0 is the 1st delta produced, upon checkin of F1
115	D1 is the 2nd delta produced, upon checkin of F2
116	...
117	DN is the Nth delta produced, upon checkin of FN+1
118
119Checkin goes as follows:
120
121	V0 <- F0
122
123	D0 <- delta (F1, F0)
124	V1 <- F1 + D0
125
126	D1 <- delta (F2 + D0, F1)
127	V2 <- F2 + D1 + D0
128
129	D2 <- delta (F3 + D1 + D0, F2)
130	V3 <- F3 + D2 + D1 + D0
131
132an so on.  To checkout:
133
134	F3 is immediate
135	F2 <- patch (F3 + D1 + D0, D2)
136	F1 <- patch (F2 + D0, D1)
137	F0 <- patch (F1, D0)
138
139Now I will describe why this is just as good as branching, even though
140the branches do not affect which files are used for computing deltas,
141and why it is just as good as the RCS model.
142
143Suppose the we have a sequence of checkins with RCS version numbers
144and their sequence numbers:
145
146F0: 1.1
147F1: 1.2
148F2: 1.3
149F3: 1.1.1.1
150F4: 1.1.1.2
151F5: 1.1.1.3
152F6: 1.4
153F7: 1.5
154
155Of interest are the 4th checkin, from 1.3 to 1.1.1.1, and the 7th
156checkin, from 1.1.1.3 to 1.4.  I will now use the '-' sign to denote
157the taking of a delta and '+' for patch.  Thus, delta(A,B) is
158equivalent to B-A, and patch(A,xdelta(A,B)) is equivalent to A+(B-A),
159or B, as it should be.  In the above sequence of checkins, these
160deltas are produced:
161
162D0=F0-F1
163D1=F1-F2-D0
164D2=F2-F3-D1-D0
165D3=F3-F4-D2-D1-D0
166D4=F4-F5-D3-D2-D1-D0
167D5=F5-F6-D4-D3-D2-D1-D0
168D6=F6-F7-D5-D4-D3-D2-D1-D0
169
170The checkin of F3 produces D2.  The concern is whether any deletions
171between 1.1 and 1.3 have to be inserted into D2.  They have not,
172because any deletions between 1.1 and 1.3 appear in D0 and D1.  Thus,
173using D0, D1, and F3 as an input for computing D2 achieves the desired
174results: the deletion is not reflected twice in separate deltas.
175
176The properties of this version format differ from the RCS format.  It
177ensures that recently checked-in versions are quick to check out, no
178matter where in the "tree" they occur.  One potential problem with
179this formulation is that according to the formulae above, the ammount
180of memory and computation required for checking in each new version
181grows with the number of previous versions.  To prevent this it only
182uses up to some limiting value (currently set at 10) of previous
183deltas in computing new deltas.  This means that deletions on a branch
184over 10 checkins old will be duplicated on a checkin.  Another
185property is that to delete a version all older versions must first be
186deleted or off-line processing may be done to reconstruct the delta
187log with the version removed (an expensive but not time-critical
188operation).  It also lends itself to keeping windowed histories, such
189as "only keep the last 20 revisions, anything earlier than that may be
190thrown away".
191
192---
193
194The current prototype implementation is complete.  The program,
195'xdelta', has a command syntax similar to PRCS:
196
197xdelta COMMAND [OPTIONS] ARG1 ARG2 ...
198
199There are 5 commands:
200
201 register  Create a versionfile
202 checkin   Checkin a new version
203 checkout  Checkout a version (latest or -r VERSION)
204 info      Show information on a version (all or -r VERSION)
205 delta     Produce a delta from ARG1 to ARG2 producing ARG3
206 patch     Patch file ARG1 with ARG2 producing ARG3
207
208The -r option specifies a version (the latest version is always the
209default).  Versions are sequentially numbered.  The file itself is a
210GNU DBM database which stores the deltas and latest version.  The file
211format is subject to change as I look for storage methods that are
212portable to Windows (otherwise, the delta generator has already been
213tested on Windows).
214
215The DELTA and PATCH commands are unrelated to the rest, and simply
216serve to generate deltas (and could be used as replacements for "diff"
217and "patch").  All operations are verified with MD5 checksums, which
218are saved for each version in the version file and the FROM and TO
219files when generating and applying deltas.
220
221---
222
223Some timing comparisons (remember all times include MD5 computations)
224on a 300MHz Alpha:
225
226Two kernels:
227
228   7875136 /genvmunix
229   6677320 /vmunix
230
231And the times to compress and delta them:
232
233					time(s)	size(bytes)
234
235xdelta delta /genvmunix /vmunix patch:	10.1	4338029
236gzip --best /genvmunix:			34.0	3413963
237gzip --best /vmunix:			63.7	2501766
238gzip --best patch:			51.9	1531163
239
240(I'm not sure why there's such a variation in gzip times--it was
241repeatable.)
242
243xdelta produces a patch which is 4338029 bytes, approximately 2.6
244megabytes smaller than its TO file.  The total space required for
245saving the two kernels compressed is 5915729 bytes, produced in 97.7
246seconds.  The total space for storing /genvmunix and the delta
247compressed is 4945126 bytes, produced in 96.0 seconds.  Using the
248delta saved approximately one megabyte and two seconds.
249
250---
251
252All the files in PRCS 1.0.0 and 1.2.0 concatenated (files ordered
253lexigraphically):
254
255   1678837 /tmp/1.0.0
256   2021819 /tmp/1.2.0
257
258							time(s)	space(bytes)
259
260xdelta delta /tmp/1.0.0 /tmp/1.2.0 /tmp/patch:		1.5	695893
261gdiff -a --rcs /tmp/1.0.0 /tmp/1.2.0 /tmp/patch:	3.7	1524705
262
263Even though both files were completely text, gdiff missed a lot of
264similar regions because a line had been partially modified.  Xdelta
265performs better in both dimensions.
266
267---
268
269All the .c and .h files in GNU gcc 2.7.2 and 2.7.2.3 concatenated (files
270ordered lexigraphically).  These files are highly similar:
271
272  15590954 /tmp/2.7.2
273  15599644 /tmp/2.7.2.3
274							time(s)	space(bytes)
275
276xdelta delta /tmp/2.7.2 /tmp/2.7.2.3 /tmp/patch:	7.3	11922
277gdiff -a --rcs /tmp/2.7.2 /tmp/2.7.2.3:			4.8	19477
278
279Here xdelta takes a about 50% longer and produces a slightly smaller
280diff.  I think the improved performance in the other two examples
281makes xdelta an all around winner.
282
283---
284
285A note on FrameMaker files, which I happen to know a fair bit about.
286Even making radical modifications to a Frame+SGML document leaves much
287of its content unchanged, due to the fact that the EDD and various
288catalogs do not change.  When only minimal modifications are performed
289
290Take a 24 page document, chapter 15 of the SGML developers guide for
291version 5.5. I replaced 38 occurances of "document" with
292"abcdefghijklmnop" and added the source of this README from here up at
293the end, making it 33 pages.
294
295  250880 graphics.before.fm
296  272384 graphics.after.fm
297
298The document grew by about 22 kbytes.  The patch 32780 bytes.  The
299same diff generated by GNU diff is 101097 bytes.  Clearly, Frame
300documents are very good for binary version control.
301
302---
303
304I also have evidence that the program works well on uncompressed image
305formats such as the GIMP's XCF file format.
306
307This program will be the basis of PRCS version 2, which will support
308client/server distributed version control.
309
310Please send any comments, questions, thoughts, or interest to me,
311jmacd@cs.berkeley.edu.
312