1<title>Fossil Delta Encoding Algorithm</title>
2<nowiki>
3<h2>Abstract</h2>
4
5<p>A key component for the efficient storage of multiple revisions of
6a file in fossil repositories is the use of delta-compression, i.e. to
7store only the changes between revisions instead of the whole
8file.</p>
9
10<p>This document describes the encoding algorithm used by Fossil to
11generate deltas. It is targeted at developers working on either
12<a href="index.wiki">fossil</a> itself, or on tools compatible with
13it. The exact format of the generated byte-sequences, while in general
14not necessary to understand encoder operation, can be found in the
15companion specification titled "<a href="delta_format.wiki">Fossil
16Delta Format</a>".
17</p>
18
19<p>The algorithm is inspired
20by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p>
21
22<h2 id="argresparam">1.0 Arguments, Results, and Parameters</h2>
23
24<p>The encoder takes two byte-sequences as input, the "original", and
25the "target", and returns a single byte-sequence containing the
26"delta" which transforms the original into the target upon its
27application.</p>
28
29<p>Note that the data of a "byte-sequence" includes its length,
30i.e. the number of bytes contained in the sequence.</p>
31
32<p>The algorithm has one parameter named "NHASH", the size of the
33"sliding window" for the "rolling hash", in bytes. These two terms are
34explained in the next section. The value of this parameter has to be a
35power of two for the algorithm to work. For Fossil the value of this
36parameter is set to "16".</p>
37
38<h2 id="operation">2.0 Operation</h2>
39
40<p>The algorithm is split into three phases which generate
41the <a href="delta_format.wiki#header">header</a>,
42<a href="delta_format.wiki#slist">segment list</a>,
43and <a href="delta_format.wiki#trailer">trailer</a> of the delta, per
44its general <a href="delta_format.wiki#structure">structure</a>.</p>
45
46<p>The two phases generating header and trailer are not covered here
47as their implementation trivially follows directly from the
48specification of the <a href="delta_format.wiki">delta format</a>.</p>
49
50<p>This leaves the segment-list. Its generation is done in two phases,
51a pre-processing step operating on the "original" byte-sequence,
52followed by the processing of the "target" byte-sequence using the
53information gathered by the first step.</p>
54
55<h3 id="preprocessing">2.1 Preprocessing the original</h3>
56
57<p>A major part of the processing of the "target" is to find a range
58in the "original" which contains the same content as found at the
59current location in the "target".</p>
60
61<p>A naive approach to this would be to search the whole "original"
62for such content. This however is very inefficient as it would search
63the same parts of the "original" over and over. What is done instead
64is to sample the "original" at regular intervals, compute signatures
65for the sampled locations and store them in a hash table keyed by
66these signatures.</p>
67
68<p>That is what happens in this step. The following processing step
69can then the compute signature for its current location and then has
70to search only a narrow set of locations in the "original" for
71possible matches, namely those which have the same signature.</p>
72
73<p>In detail:</p>
74
75<ol>
76<li>The "original" is split into chunks of NHASH bytes. Note that a
77partial chunk of less than NHASH bytes at the end of "original" is
78ignored.
79</li>
80<li>The <a href="#rollhash">rolling hash</a> of each chunk is
81computed.
82</li>
83<li>A hash table is filled, mapping from the hashes of the chunks to
84the list of chunk locations having this hash.
85</li>
86</ol>
87
88<h3 id="processing">2.1 Processing the target</h3>
89
90<p>This, the main phase of the encoder, processes the target in a loop
91from beginning to end. The state of the encoder is captured by two
92locations, the "base" and the "slide". "base" points to the first byte
93of the target for which no delta output has been generated yet, and
94"slide" is the location of the window used to look in the "origin" for
95commonalities. This window is NHASH bytes long.</p>
96
97<p>Initially both "base" and "slide" point to the beginning of the
98"target". In each iteration of the loop the encoder decides whether to
99<ul>
100<li>emit a single instruction
101to <a href="delta_format.wiki#copyrange">copy a range</a>, or
102</li>
103<li>emit two instructions, first
104to <a href="delta_format.wiki#insertlit">insert a literal</a>, then
105to <a href="delta_format.wiki#copyrange">copy a range</a>, or
106</li>
107<li>move the window forward one byte.
108</li>
109</ul>
110</p>
111
112<div style="float:right">
113<verbatim type="pikchr" style="float:right">
114TARGET: [
115       down
116       "Target" bold
117       box fill palegreen width 150% height 200% "Processed"
118   GI: box same as first box fill yellow height 25% "Gap → Insert"
119   CC: box same fill orange height 200% "Common → Copy"
120    W: box same as GI fill lightgray width 125% height 200% "Window" bold
121       box same as CC height 125% ""
122       box same fill white ""
123]
124
125[ "Base"  bold ; right ; arrow 33% ] with .e at TARGET.GI.nw
126[ "Slide" bold ; right ; arrow 33% ] with .e at TARGET.W.nw
127
128ORIGIN: [
129       down
130       "Origin" bold
131   B1: box fill white
132   B2: box fill orange height 200%
133   B3: box fill white  height 200%
134] with .nw at 0.75 right of TARGET.ne
135
136arrow from TARGET.W.e to ORIGIN.B2.w "Signature" aligned above
137</verbatim>
138</div>
139
140<p>To make this decision the encoder first computes the hash value for
141the NHASH bytes in the window and then looks at all the locations in
142the "origin" which have the same signature. This part uses the hash
143table created by the pre-processing step to efficiently find these
144locations.</p>
145
146<p>For each of the possible candidates the encoder finds the maximal
147range of bytes common to both "origin" and "target", going forward and
148backward from "slide" in the "target", and the candidate location in
149the "origin". This search is constrained on the side of the "target"
150by the "base" (backward search), and the end of the "target" (forward
151search), and on the side of the "origin" by the beginning and end of
152the "origin", respectively.</p>
153
154<p>There are input files for which the hash chains generated by the
155pre-processing step can become very long, leading to long search times
156and affecting the performance of the delta generator. To limit the
157effect such long chains can have the actual search for candidates is
158bounded, looking at most N candidates. Currently N is set to 250.</p>
159
160<p>From the ranges for all the candidates the best (= largest) common
161range is taken and it is determined how many bytes are needed to
162encode the bytes between the "base" and the end of that range. If the
163range extended back to the "base" then this can be done in a single
164copy instruction. Otherwise, i.e if there is a gap between the "base"
165and the beginning of the range then two instructions are needed, one
166to insert the bytes in the gap as a literal, and a copy instruction
167for the range itself. The general situation at this point can be seen
168in the picture to the right.</p>
169
170<p>If the number of bytes needed to encode both gap (if present), and
171range is less than the number of bytes we are encoding the encoder
172will emit the necessary instructions as described above, set "base"
173and "slide" to the end of the encoded range and start the next
174iteration at that point.</p>
175
176<p>If, on the other hand, the encoder either did not find candidate
177locations in the origin, or the best range coming out of the search
178needed more bytes to encode the range than there were bytes in the
179range, then no instructions are emitted and the window is moved one
180byte forward. The "base" is left unchanged in that case.</p>
181
182<p>The processing loop stops at one of two conditions:
183<ol>
184<li>The encoder decided to move the window forward, but the end of the
185window reached the end of the "target".
186</li>
187<li>After the emission of instructions the new "base" location is
188within NHASH bytes of end of the "target", i.e. there are no more than
189at most NHASH bytes left.
190</li>
191</ol>
192</p>
193
194<p>If the processing loop left bytes unencoded, i.e. "base" not
195exactly at the end of the "target", as is possible for both end
196conditions, then one last insert instruction is emitted to put these
197bytes into the delta.<p>
198
199<h2 id="exceptions">3.0 Exceptions</h2>
200
201<p>If the "original" is at most NHASH bytes long no compression of
202changes is possible, and the segment-list of the delta consists of a
203single literal which contains the entire "target".</p>
204
205<p>This is actually equivalent to the second end condition of the
206processing loop described in the previous section, just checked before
207actually entering the loop.</p>
208
209<h2 id="rollhash">4.0 The rolling hash</h2>
210
211<p>The rolling hash described below and used to compute content
212signatures was chosen not only for good hashing properties, but also
213to enable the easy (incremental) recalculation of its value for a
214sliding window, i.e. where the oldest byte is removed from the window
215and a new byte is shifted in.<p>
216
217<h3 id="rhdef">4.1 Definition</h3>
218
219<p>Assuming an array Z of NHASH bytes (indexing starting at 0) the
220hash V is computed via</p>
221
222<p align=center><table><tr><td>
223<p><img src="encode1.gif" align="center"></p>
224<p><img src="encode2.gif" align="center"></p>
225<p><img src="encode3.gif" align="center"></p>
226</td></tr></table></p>
227
228where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and
229V is a 32-bit unsigned integer with B as MSB, A as LSB.
230
231<h3 id="rhincr">4.2 Incremental recalculation</h3>
232
233<p>Assuming an array Z of NHASH bytes (indexing starting at 0) with
234hash V (and components A and B), the dropped
235byte <img src="encode4.gif" align="center">, and the new byte
236<img src="encode5.gif" align="center"> , the new hash can
237be computed incrementally via: </p>
238
239<p align=center><table><tr><td>
240<p><img src="encode6.gif" align="center"></p>
241<p><img src="encode7.gif" align="center"></p>
242<p><img src="encode8.gif" align="center"></p>
243</td></tr></table></p>
244
245<p>For A, the regular sum, it can be seen easily that this the correct
246way recomputing that component.</p>
247
248<p>For B, the weighted sum, note first that <img src="encode4.gif"
249align="center"> has the weight NHASH in the sum, so that is what has
250to be removed. Then adding in <img src="encode9.gif" align="center">
251adds one weight factor to all the other values of Z, and at last adds
252in <img src="encode5.gif" align="center"> with weight 1, also
253generating the correct new sum</p>
254