1<title>Fossil Delta Encoding Algorithm</title> 2<nowiki> 3<h2>Abstract</h2> 4 5<p>A key component for the efficient storage of multiple revisions of 6a file in fossil repositories is the use of delta-compression, i.e. to 7store only the changes between revisions instead of the whole 8file.</p> 9 10<p>This document describes the encoding algorithm used by Fossil to 11generate deltas. It is targeted at developers working on either 12<a href="index.wiki">fossil</a> itself, or on tools compatible with 13it. The exact format of the generated byte-sequences, while in general 14not necessary to understand encoder operation, can be found in the 15companion specification titled "<a href="delta_format.wiki">Fossil 16Delta Format</a>". 17</p> 18 19<p>The algorithm is inspired 20by <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</p> 21 22<h2 id="argresparam">1.0 Arguments, Results, and Parameters</h2> 23 24<p>The encoder takes two byte-sequences as input, the "original", and 25the "target", and returns a single byte-sequence containing the 26"delta" which transforms the original into the target upon its 27application.</p> 28 29<p>Note that the data of a "byte-sequence" includes its length, 30i.e. the number of bytes contained in the sequence.</p> 31 32<p>The algorithm has one parameter named "NHASH", the size of the 33"sliding window" for the "rolling hash", in bytes. These two terms are 34explained in the next section. The value of this parameter has to be a 35power of two for the algorithm to work. For Fossil the value of this 36parameter is set to "16".</p> 37 38<h2 id="operation">2.0 Operation</h2> 39 40<p>The algorithm is split into three phases which generate 41the <a href="delta_format.wiki#header">header</a>, 42<a href="delta_format.wiki#slist">segment list</a>, 43and <a href="delta_format.wiki#trailer">trailer</a> of the delta, per 44its general <a href="delta_format.wiki#structure">structure</a>.</p> 45 46<p>The two phases generating header and trailer are not covered here 47as their implementation trivially follows directly from the 48specification of the <a href="delta_format.wiki">delta format</a>.</p> 49 50<p>This leaves the segment-list. Its generation is done in two phases, 51a pre-processing step operating on the "original" byte-sequence, 52followed by the processing of the "target" byte-sequence using the 53information gathered by the first step.</p> 54 55<h3 id="preprocessing">2.1 Preprocessing the original</h3> 56 57<p>A major part of the processing of the "target" is to find a range 58in the "original" which contains the same content as found at the 59current location in the "target".</p> 60 61<p>A naive approach to this would be to search the whole "original" 62for such content. This however is very inefficient as it would search 63the same parts of the "original" over and over. What is done instead 64is to sample the "original" at regular intervals, compute signatures 65for the sampled locations and store them in a hash table keyed by 66these signatures.</p> 67 68<p>That is what happens in this step. The following processing step 69can then the compute signature for its current location and then has 70to search only a narrow set of locations in the "original" for 71possible matches, namely those which have the same signature.</p> 72 73<p>In detail:</p> 74 75<ol> 76<li>The "original" is split into chunks of NHASH bytes. Note that a 77partial chunk of less than NHASH bytes at the end of "original" is 78ignored. 79</li> 80<li>The <a href="#rollhash">rolling hash</a> of each chunk is 81computed. 82</li> 83<li>A hash table is filled, mapping from the hashes of the chunks to 84the list of chunk locations having this hash. 85</li> 86</ol> 87 88<h3 id="processing">2.1 Processing the target</h3> 89 90<p>This, the main phase of the encoder, processes the target in a loop 91from beginning to end. The state of the encoder is captured by two 92locations, the "base" and the "slide". "base" points to the first byte 93of the target for which no delta output has been generated yet, and 94"slide" is the location of the window used to look in the "origin" for 95commonalities. This window is NHASH bytes long.</p> 96 97<p>Initially both "base" and "slide" point to the beginning of the 98"target". In each iteration of the loop the encoder decides whether to 99<ul> 100<li>emit a single instruction 101to <a href="delta_format.wiki#copyrange">copy a range</a>, or 102</li> 103<li>emit two instructions, first 104to <a href="delta_format.wiki#insertlit">insert a literal</a>, then 105to <a href="delta_format.wiki#copyrange">copy a range</a>, or 106</li> 107<li>move the window forward one byte. 108</li> 109</ul> 110</p> 111 112<div style="float:right"> 113<verbatim type="pikchr" style="float:right"> 114TARGET: [ 115 down 116 "Target" bold 117 box fill palegreen width 150% height 200% "Processed" 118 GI: box same as first box fill yellow height 25% "Gap → Insert" 119 CC: box same fill orange height 200% "Common → Copy" 120 W: box same as GI fill lightgray width 125% height 200% "Window" bold 121 box same as CC height 125% "" 122 box same fill white "" 123] 124 125[ "Base" bold ; right ; arrow 33% ] with .e at TARGET.GI.nw 126[ "Slide" bold ; right ; arrow 33% ] with .e at TARGET.W.nw 127 128ORIGIN: [ 129 down 130 "Origin" bold 131 B1: box fill white 132 B2: box fill orange height 200% 133 B3: box fill white height 200% 134] with .nw at 0.75 right of TARGET.ne 135 136arrow from TARGET.W.e to ORIGIN.B2.w "Signature" aligned above 137</verbatim> 138</div> 139 140<p>To make this decision the encoder first computes the hash value for 141the NHASH bytes in the window and then looks at all the locations in 142the "origin" which have the same signature. This part uses the hash 143table created by the pre-processing step to efficiently find these 144locations.</p> 145 146<p>For each of the possible candidates the encoder finds the maximal 147range of bytes common to both "origin" and "target", going forward and 148backward from "slide" in the "target", and the candidate location in 149the "origin". This search is constrained on the side of the "target" 150by the "base" (backward search), and the end of the "target" (forward 151search), and on the side of the "origin" by the beginning and end of 152the "origin", respectively.</p> 153 154<p>There are input files for which the hash chains generated by the 155pre-processing step can become very long, leading to long search times 156and affecting the performance of the delta generator. To limit the 157effect such long chains can have the actual search for candidates is 158bounded, looking at most N candidates. Currently N is set to 250.</p> 159 160<p>From the ranges for all the candidates the best (= largest) common 161range is taken and it is determined how many bytes are needed to 162encode the bytes between the "base" and the end of that range. If the 163range extended back to the "base" then this can be done in a single 164copy instruction. Otherwise, i.e if there is a gap between the "base" 165and the beginning of the range then two instructions are needed, one 166to insert the bytes in the gap as a literal, and a copy instruction 167for the range itself. The general situation at this point can be seen 168in the picture to the right.</p> 169 170<p>If the number of bytes needed to encode both gap (if present), and 171range is less than the number of bytes we are encoding the encoder 172will emit the necessary instructions as described above, set "base" 173and "slide" to the end of the encoded range and start the next 174iteration at that point.</p> 175 176<p>If, on the other hand, the encoder either did not find candidate 177locations in the origin, or the best range coming out of the search 178needed more bytes to encode the range than there were bytes in the 179range, then no instructions are emitted and the window is moved one 180byte forward. The "base" is left unchanged in that case.</p> 181 182<p>The processing loop stops at one of two conditions: 183<ol> 184<li>The encoder decided to move the window forward, but the end of the 185window reached the end of the "target". 186</li> 187<li>After the emission of instructions the new "base" location is 188within NHASH bytes of end of the "target", i.e. there are no more than 189at most NHASH bytes left. 190</li> 191</ol> 192</p> 193 194<p>If the processing loop left bytes unencoded, i.e. "base" not 195exactly at the end of the "target", as is possible for both end 196conditions, then one last insert instruction is emitted to put these 197bytes into the delta.<p> 198 199<h2 id="exceptions">3.0 Exceptions</h2> 200 201<p>If the "original" is at most NHASH bytes long no compression of 202changes is possible, and the segment-list of the delta consists of a 203single literal which contains the entire "target".</p> 204 205<p>This is actually equivalent to the second end condition of the 206processing loop described in the previous section, just checked before 207actually entering the loop.</p> 208 209<h2 id="rollhash">4.0 The rolling hash</h2> 210 211<p>The rolling hash described below and used to compute content 212signatures was chosen not only for good hashing properties, but also 213to enable the easy (incremental) recalculation of its value for a 214sliding window, i.e. where the oldest byte is removed from the window 215and a new byte is shifted in.<p> 216 217<h3 id="rhdef">4.1 Definition</h3> 218 219<p>Assuming an array Z of NHASH bytes (indexing starting at 0) the 220hash V is computed via</p> 221 222<p align=center><table><tr><td> 223<p><img src="encode1.gif" align="center"></p> 224<p><img src="encode2.gif" align="center"></p> 225<p><img src="encode3.gif" align="center"></p> 226</td></tr></table></p> 227 228where A and B are unsigned 16-bit integers (hence the <u>mod</u>), and 229V is a 32-bit unsigned integer with B as MSB, A as LSB. 230 231<h3 id="rhincr">4.2 Incremental recalculation</h3> 232 233<p>Assuming an array Z of NHASH bytes (indexing starting at 0) with 234hash V (and components A and B), the dropped 235byte <img src="encode4.gif" align="center">, and the new byte 236<img src="encode5.gif" align="center"> , the new hash can 237be computed incrementally via: </p> 238 239<p align=center><table><tr><td> 240<p><img src="encode6.gif" align="center"></p> 241<p><img src="encode7.gif" align="center"></p> 242<p><img src="encode8.gif" align="center"></p> 243</td></tr></table></p> 244 245<p>For A, the regular sum, it can be seen easily that this the correct 246way recomputing that component.</p> 247 248<p>For B, the weighted sum, note first that <img src="encode4.gif" 249align="center"> has the weight NHASH in the sum, so that is what has 250to be removed. Then adding in <img src="encode9.gif" align="center"> 251adds one weight factor to all the other values of Z, and at last adds 252in <img src="encode5.gif" align="center"> with weight 1, also 253generating the correct new sum</p> 254