.EQ
delim $$
define <- ?< "\h'-0.5m'" up 10 "\(em" down 10 ?
define gtorder ?"\z>\d\~\u"?
define EXIST ?"\z\-\d\z\-\r\-\d\v'0.2m'\(br\v'-0.2m'"?
define ALL ?"\o'V-'"?
define 0M '0~...~M-1'
define LH 'lo~...~hi'
define RR 'bold R'
define HH 'bold H'
define KK 'bold K'
define or '"\fBor\fI"~'
define and '"\fBand\fI"~'
define if '"\fBif\fI"~'
define then '"\fBthen\fI"~'
define else '"\fBelse\fI"~'
define repeat '"\fBrepeat\fI"~'
define until '"\fBuntil\fI"~'
define while '"\fBwhile\fI"~'
define do '"\fBdo\fI"~'
define case '"\fBcase\fI"~'
define end '"\fBend\fI"~'
define begin '"\fBbegin\fI"~'
define elseif '"\fBelseif\fI"~'
define for '"\fBfor\fI"~'
define From '"\fBfrom\fI"~'
define To '"\fBto\fI"~'
define exit '"\fBexit\fI"~'
.EN
.ls 1
.ce
COMPACT HASH TABLES USING BIDIRECTIONAL LINEAR PROBING
.sp 3
.ce
John G. Cleary
.ce
The University of Calgary, Alberta, Canada.
.sp 3
.sp 20
\u1\dAuthor's present address: Man-Machine Systems Group, Department of
Computer Science, The University of Calgary, 2500 University Drive NW,
Calgary, Canada T2N 1N4.
.sp
\u2\dThis research was supported by
the Natural Sciences and Engineering Research Council of Canada.
.sp 2
.ls 2
.bp
Index Terms -- Searching, hash storage, open addressing,
bidirectional linear probing,
address calculation, information retrieval, scatter storage,
performance analysis, memory compaction.
.bp
.pp
Abstract -- An algorithm is developed which reduces the memory
requirements of hash tables.
This is achieved by storing only
a part of each key, along with a few extra bits needed to ensure that
all keys are stored unambiguously. The fraction of each key stored
decreases as the size of the hash table increases.
Significant reductions
in total memory usage can be achieved, especially when the key size is not
much larger than the size of a memory index and when only a small amount
of data is stored with each key.
The algorithm is based on
bidirectional linear probing.
Search and insertion times are shown by simulation
to be similar to those
for ordinary bidirectional linear probing.
.bp
.sh "1 Introduction"
.pp
The retrieval of a single item from among many others is a common problem
in computer science. I am particularly concerned here with the case where
the item is retrieved on the basis of a single label
or key attached to each entry and where the keys are not ordered in any
particular way.
There is a well-known solution
to this problem in the form of hash tables.
Knuth [8], Knott [7] and Maurer and Lewis [11] provide good introductions to
this subject.
.pp
An efficient version of hashing called
.ul
bidirectional linear probing
(BLP)
was developed by Amble and Knuth [1].
As it forms the basis of what follows, it is described in more detail in the
following section. Section 3 shows how it can be modified so as to
significantly reduce its memory requirements. This is done by storing only
a small part of each key -- a few extra bits are needed to ensure
that different keys, which look the same after truncation, are correctly
distinguished.
.pp
The execution time of this compact hashing algorithm is considered in
Section 4. It is shown by simulation to be
similar to that of ordinary BLP
for both successful searches and insertion. It is significantly
better for unsuccessful searches.
.pp
A hashing scheme similar to compact hashing, in that not all of the key is
stored, has been proposed by Andreae [2] (Chapter 1). However, his technique
has a small but finite probability of retrieving an incorrect key.
Although compact hashing
is not based on this earlier technique, it provided the impetus to
seek the current solution.
.pp
In hashing algorithms using an overflow area and a linked list of synonyms,
or in variations of this using buckets (see Maurer and Lewis [11]), only the
remainder of each key need be stored. This has been known since at least
1965 (Feldman and Low [6] and Knuth [8], sec. 6.4, exercise 13, p. 543).
However, each entry (including the original hash location) requires a pointer
to the next overflow record. This pointer will be about the same size as the
reduction in the key size, so there is no net memory saving over
open addressing techniques such as BLP.
.pp
Amongst the possible applications of compact hashing is the storage
of trees and tries without the use of pointers while still preserving
a $log N$ retrieval time.
It is hoped to report on this application in more detail later.
.pp
Pascal versions of the algorithms described below are available
from the author.
.sh "2 Bidirectional linear probing"
.pp
I will now introduce the open addressing technique which forms the basis
of compact hashing.
The
.ul
hash table
in which the keys will be stored is an array $T[ 0M ]$. I will
be concerned only with the keys themselves, as the
items associated with each key do not
significantly affect the algorithms. In order to compute the location
for each key I will use two functions: $t$, which randomises the original
keys, and $h$, which computes a value in the range $0M$.
.pp
Let $KK$ be the set of all possible keys and $HH$ be the set of all possible
transformed keys. Then $t: KK -> HH$ is an invertible function.
This function is introduced
to ensure that the keys stored are random and so, as a consequence,
the hashing
procedure has a satisfactory
average performance.
In what follows these transformed
keys will be used rather than the original keys. For example, it is the
transformed keys that are stored in $T$. ($-1$ is used to indicate an unoccupied
location in $T$.)
.pp
$h: HH ->"{" 0M "}"$ and has the
property that for
$H sub 1 ~, H sub 2 ~ "\(mo" HH$,
$H sub 1 ~<=~ H sub 2~~ "\fBiff\fP"~~h(H sub 1 ) ~<=~ h(H sub 2 )$.
As a consequence, the keys will be mapped
into the hash table in the same order as the values of their transformed
keys.
This ordering is essential to the compaction attained later.
Suitable functions $t$ and $h$ have been extensively discussed
(Carter and Wegman [3]; Knott [7]; Lum [9]; Lum, Yuen and Dodd [10]).
These authors show that there are functions which almost always make
the distribution of transformed keys random. I will not consider any
particular functions for $t$, although some examples of $h$ will be introduced
later.
.pp
To retrieve a key, $K$, from the hash table the transformed key and the
hash location are first computed. If the (transformed) key stored at the
hash location is less than $t(K)$ then the table is searched upward
until one of three things happens: an empty location is found,
$T[j]=-1$; the sought key is found, $T[j]=t(K)$; or a key greater
than the sought key is found, $T[j]>t(K)$. If the first key examined
is greater than $t(K)$ then an analogous search is done down the hash table.
The search is successful if the sought key is found, that is,
if the last location examined is equal to $t(K)$, and is unsuccessful
otherwise. (See Amble and Knuth [1] for the details of this algorithm.)
.pp
For a given set of keys there are many ways that they can be arranged in $T$
so that the search algorithm above will still work correctly.
There is thus
freedom, when designing an algorithm to insert new keys, to choose different
strategies for positioning the keys.
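.pp
As an illustration (this is my own Python sketch, not the author's Pascal programs; the function name $blp_search$ and the use of Python lists for $T$ are assumptions for the example), the BLP retrieval procedure just described can be written as:

```python
def blp_search(T, h, tk):
    """Search an ordered (BLP) hash table T for transformed key tk.

    T is the table with -1 marking empty locations; h maps a
    transformed key to its home index.  Returns the location of tk,
    or None if it is absent.  A sketch of the search in the text,
    not the author's implementation.
    """
    j = h(tk)
    if T[j] < tk:
        # The sought key can only lie at or above its home location.
        while j < len(T) and T[j] != -1 and T[j] < tk:
            j += 1
    else:
        # The sought key can only lie at or below its home location.
        while j >= 0 and T[j] != -1 and T[j] > tk:
            j -= 1
    if 0 <= j < len(T) and T[j] == tk:
        return j
    return None

# The table of Fig. 1 (keys ascending with address, h = value div 10).
T = [16, 18, 19, -1, 41, 65, 66, 67, 75, 76, 87, 101, -1]
h = lambda k: k // 10
print(blp_search(T, h, 75))   # 75 is found at location 8
```

The search for 75 starts at its home location 7, finds 67 there (less than 75), scans upward, and stops at location 8 where 75 is stored, matching the worked example of Fig. 1.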
There are two conditions that must be satisfied when a new key is inserted.
One is that all keys in the memory must remain in ascending order,
and the other is that there must be no empty locations between the original hash
location of any key and its actual storage position. These imply that all
keys sharing the same initial hash location must form a single unbroken group.
.pp
Within these constraints one would like to insert a new key so as to minimise
later retrieval times and the time to do the insertion itself. Intuitively,
keys which share the same initial hash location should be centred around that
initial address. There are two ways of inserting keys which cause little
disturbance to the memory. One is to find the position where the key should
be placed according to its ordering and then to create a gap for it by
moving
.ul
up
all entries from this position up to the next empty location. The second way is
symmetric to this and creates a gap by moving entries
.ul
down
one location.
The insertion algorithm given by Amble and Knuth [1] chooses which of these
two moves to make using a strategy which is guaranteed to minimise the number
of locations in $T$ which are examined during later successful or unsuccessful
searches, although it is not guaranteed to minimise the insertion time itself.
.pp
One consequence of this insertion strategy is that sometimes it is necessary
to move entries below 0 and above $M-1$ in the array $T$. One solution to this
would be to make the array circular and move entries from 0 to $M-1$ and
vice versa. However, following Amble and Knuth [1], I will instead extend
the array $T$ and the other arrays to be defined later at their top and bottom.
This gives 'breathing room' for the array to expand. An extra 20 entries
at the top and bottom were found to be quite sufficient for all
the simulation runs reported in Section 4.
Accordingly I will define
$lo ~=~-20$ and $hi~=~M+19$ and define the array $T$ over the range
$lo$ to $hi$.
.sh "3 Compact Hashing Using Bidirectional Linear Probing"
.pp
I will now show that the memory required to store the keys in BLP can be
significantly reduced. First consider the case where
the number of possible keys in $KK$ is less than $M$: every possible key
can then be assigned its own location in $T$ without possibility of collision.
In this case $T$ degenerates to an ordinary indexed array and the keys need
never be stored. At worst a single bit might be needed to say whether
a particular key has been stored or not. This raises the question of whether
it is necessary to hold the entire key in memory if the key space $KK$ is slightly
larger than $M$. For example, if $KK$ were, say, four times larger than $M$
then it might be possible to hold only two bits of the key rather than the entire
key. The reasoning here is that the main function of the stored keys is to
ensure that entries which collide at the same location can be correctly
separated.
Provided $h$ is suitably chosen, at most four keys can be mapped to a
single location. The two bits might then be sufficient to store four
different values for these four keys. It is in fact
possible to realise this
reduction in stored key size, although a fixed amount of extra information
is needed
at each location in order to correctly handle collisions.
.pp
So that I can talk about the part of the key which is in excess of the
address space I will now introduce a
.ul
remainder function
$r$. $r$ maps from the transformed keys $HH$ to a set of remainders
$RR~==~"{"0,~1,~2,~...,~Rm-1"}"$.
It is these remainders that will be stored in lieu
of the transformed keys.
The essential property
of $r$ is that $r(H)$ and $h(H)$ together are sufficient to uniquely
determine $H$.
.pp
.ne 9
Formally,
.sp
        $RR ~~==~~ "{"0,~1,~2,~...,~Rm-1"}"$
.sp
        $r: HH -> RR$
.sp
and     $h( H sub 1 )~=~h( H sub 2 )~and~r( H sub 1 )~=~r( H sub 2 )
~~ "\fBiff\fP" ~~ H sub 1 ~~=~~ H sub 2$ .
.sp
For a given function $h$ there are usually many possible functions $r$.
One particularly simple pair of functions, referred to by Maurer and Lewis [11]
as the
.ul
division method,
is $h(H)~~=~~ left floor^ H/Rm right floor$ and
$r(H)~~=~~ H~ "\fBmod\fP"~Rm$ .
.sp
When $r$ is defined as above and $Rm$ is between $2 sup d$ and $2 sup {d+1}$,
the number of bits needed to
specify a remainder is the number of bits in the key less $d$.
.pp
Consider a new array
$R [ LH ]$ in which the remainders will be stored.
In what follows $R$ will be kept in place of $T$, but it will be useful to
talk about $T$ as if it were still there. $R$ and the additional arrays to
be introduced shortly specify just the information in $T$, albeit
more compactly. Each value $R [i]$ will hold the value $r(T[i])$, with the
exception that when $T[i]$ is $-1$ (marking an empty location) then $R[i]$
is also set to $-1$. If
there have been no collisions then each $R[i]$ paired with the value $i$
unambiguously gives the transformed key that would have been stored in $T[i]$.
However, if there have been collisions it is not possible
to tell if a value of $R[i]$ is at its home location or if it has been moved
from, say, $i-1$ and corresponds to a key, $H$, where $r(H)~=~ R[i]$ and $h(H)~=~i-1$.
If there were some way to locate, for each $R[i]$, where it was originally
hashed then the original keys could all be unambiguously determined.
This can be done by maintaining two extra arrays of bits: the virgin array $V$
and the change array $C$.
.pp
The virgin array
$V[ LH ]$ marks those
locations which have been hashed to.
That is, $V[i]$ has a value of $1$
if any of the stored keys in the hash table has $i$ as its hash
address, and $0$ otherwise. $V$ is maintained by initialising it to $0$
and thereafter setting $V[h(H)] <-~1$ whenever a key $H$ is inserted in the
memory. The virginity of a location is unaffected by the move operations
during insertion.
The $V$ array is similar to the array of pass bits recommended in [1].
.pp
To understand the change array $C[ LH ]$ it is necessary to look more closely
at the distribution of values of $R[i]$. These remainders can be grouped
according to whether or not they share the same original hash address.
Recall also that the hash table, as in BLP, is ordered, so
all the remainders in a particular group will occur at
consecutive locations.
The change bits $C[i]$ are used to delimit the
boundaries of these groups. This is done by marking the first remainder
(the one stored at the lowest address) of each group with a $1$; all other
members of a group have $C[i]=0$. To simplify the search and insertion
algorithms it is also convenient to set $C[i]$ to 1 for all locations
which are empty ($R[i]=-1$).
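.pp
These maintenance rules can be illustrated by deriving all three arrays from a notional table $T$ (a Python sketch of my own; the helper name $derive_rvc$ is an assumption, and $h$ and $Rm$ are the division method of Fig. 1):

```python
def derive_rvc(T, h, Rm):
    """Derive the compact arrays R, V, C from a notional table T.

    R[i] is the remainder of the key at i (-1 if empty); V[i] is 1
    iff some stored key has i as its hash address; C[i] is 1 at the
    first (lowest) member of each group sharing a home location, and
    at every empty location.
    """
    n = len(T)
    # Remainders, with -1 carried through for empty locations.
    R = [(T[i] % Rm) if T[i] != -1 else -1 for i in range(n)]
    # Virgin bits: set V[h(key)] for every stored key.
    V = [0] * n
    for key in T:
        if key != -1:
            V[h(key)] = 1
    # Change bits: group boundaries and empty locations.
    C = []
    for i in range(n):
        if T[i] == -1:
            C.append(1)            # empty locations carry a 1
        elif i == 0 or T[i - 1] == -1 or h(T[i]) != h(T[i - 1]):
            C.append(1)            # first (lowest) member of its group
        else:
            C.append(0)
    return R, V, C

# The table of Fig. 1: h divides by 10, remainders are mod 10.
T = [16, 18, 19, -1, 41, 65, 66, 67, 75, 76, 87, 101, -1]
R, V, C = derive_rvc(T, lambda k: k // 10, 10)
```

Running this on the Fig. 1 table reproduces the $R$, $V$ and $C$ columns shown in that figure.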
Thus we have the formal definitions of the
values of $V$ and $C$ in terms of the now notional array $T$ (the array
$A$ is described later):
.bp
.nf
.ls 1
.ta 0.5i +0.75i +0.9i
        \(lt\|$r(T[i])$       $T[i] != -1$
        $R[i]~~==~~$    \(lk\|
        \(lb\|$-1$      $T[i]=-1$
.sp
        \(lt\|$1$       $EXIST~ j~h(T[j])=i$
        $V[i]~~==~~$    \(lk\|
        \(lb\|$0$       otherwise
.sp
        \(lt\|$1$       $h(T[i]) != h(T[i-1])~ roman or ~T[i]=-1$
        $C[i]~~==~~$    \(lk\|
        \(lb\|$0$       otherwise
.sp 2
        \(lt\|$a(i)$    $-Na <= a(i) <= Na$
        $A[i]~~==~~$    \(lk\|
        \(lb\|$inf$     otherwise
.sp
        where
.sp
        $Na ~>=~ 0$
.br
        $a(i)~==~ sum from j=lo to i |C[j]=1~"and"~R[j] != -1|~-~
sum from j=lo to i V[j]$
.fi
.ls 2
.ta 0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i
.rh "Searching."
For every group of remainders there will somewhere be a $V$ bit equal to $1$
and a $C$
bit at a non-empty location equal to $1$. That is,
for every $V$ bit which is $1$ there is a corresponding $C$ bit
which is also $1$.
.FC "Fig. 1."
This correspondence is indicated in
Fig. 1 by the dotted lines. When searching for a key $H$ in the table,
the location $h(H)$ is examined first. If the $V$ bit is $0$ then the search
can stop
immediately. Otherwise a search is made for the corresponding $C$ bit
which is $1$. To do this a search is made down (or up) the hash table until
an empty location is found. The number of $V$ bits which are $1$
from $h(H)$ to this empty
location is counted. The correct $C$ bit is then found by counting back
up (or down) the array from the empty location
for the same number of $C$ bits which are $1$. Details of this algorithm
follow.
.ls 1
.sp
.nf
.ta 1.5c +1.35c +1.35c +1.35c +1.35c +1.35c +1.35c +1.35c +1.35c
.sp
.ne 2
Step 1: {initialise variables}
        $H <-~ t(K);~~j <-~ h(H);~~rem <-~ r(H);~~i <-~ j;~~count <-~ 0;$
        {check virgin bit}
        $if~ V[j]=0~then$ {search unsuccessful} $exit ;$
.sp
.ne 3
Step 2: {find first empty location down the table}
        $while ~R[i] != -1~do~~begin~~count <-~count - V[i];~i <-~ i-1 ~end ;$
.sp
.ne 4
Step 3: {search back up to find the lowest member of the relevant group}
        $while count < 0 ~do~begin~ i <-~i+1;~count <-~count +C[i];~end ;$
        {$i$ now points at the first (lowest) member of the group associated}
        {with the original location $j$}
.sp
.ne 6
Step 4: {search group associated with $j$}
        $while R[i+1] <= rem ~and C[i+1]=0~do i <-~i+1 ;$
        {check last location to see if key found}
        $if R[i]=rem~ mark then$ {search successful}
        $lineup else$ {search unsuccessful} ;
.sp 2
.ls 2
.fi
.pp
An example search is illustrated in Fig. 1 for the key 75.
For this example $h$ is computed by dividing by 10 and rounding down, and
$r$ is computed by taking the remainder modulo 10.
.br
Step 1: The initial hash location
for 75 is 7 and its remainder is 5. The $V$ bit at location 7 is 1, so the
search continues.
.br
Step 2:
The first empty location found by searching down the table is at location 3.
There are three $V$ bits with a value of 1 between 7 and 3, at locations
4, 6 and 7.
.br
Step 3:
Counting back up from location 3, three $C$ bits are 1, at locations 4, 5 and 8.
So the $C$ bit at location 8 corresponds to the $V$ bit at the
original hash location 7.
.br
Step 4:
The group of remainders which share the same initial location 7 can then be
found in locations 8 and 9.
Thus the remainder 5 at location 8 can be
unambiguously associated with the original key 75, and so it can be
concluded that the information associated with the key 75 is present
at location 8 in the memory.
.pp
It still remains to specify the update
algorithm and to address some issues of efficiency. To this end a third
array will be added.
.rh "Speeding up search."
It was found during the simulations reported in Section 4
that the most time-consuming element of this search
is step 2, when the table is scanned for an empty location. The essential
role played by the empty locations here is to provide a synchronisation
between the 1 bits in the $V$ and $C$ arrays.
This lengthy search could be eliminated by maintaining two additional arrays,
$#C[ LH ]$ and $#V[ LH ]$, which count from the start of memory the number of
$C$ and $V$ bits which are 1. That is:
.br
        $#C[i] ~==~ sum from j=lo to i |C[j]=1~and~R[j] != -1 |$
.br
and     $#V[i] ~==~ sum from j=lo to i V[j]$ .
.br
.pp
In order to find the $C$ bit corresponding to some $V[i]=1$, all that
is necessary is to compute the difference $count <-~#C[i]-#V[i]$.
If $count$ is zero then the remainder stored at $i$ was originally
hashed there and has not been moved. If $count$ is positive then it is
necessary to scan down the memory until $count$ $C$ bits equal to 1 have been
found. If $count$ is negative then it is necessary to scan up the memory
until $-count$ $C$ bits which are 1 have been found. Fig. 2 shows some
examples of the various situations which can arise.
.FC "Fig. 2."
.pp
In fact, it is not necessary to store $#C$ and $#V$ explicitly; it is
sufficient merely to store the differences $#C[i]-#V[i]$. To do this the
.ul
at-home
array, $A[ LH ]$, will be used.
.pp
At this point it might seem that all earlier gains have been lost because
in the most extreme case $#C[i]-#V[i]~=~M$.
To store a value of $A$
will require as many bits as a memory index -- precisely the gain made by
storing remainders rather than keys!\ However, all is not lost. The values
of $A$ tend to cluster closely about 0. Simulation
shows that a hash memory which is 95% full has 99% of the $A$ values
in the range -15 to +15. Therefore the following strategy can be
adopted. Assign a fixed number of bits for storing each value of $A$, say
5 bits. Use these bits to represent the 31 values -15 to +15 and a 32nd
value for $inf$. Then anywhere that $#C[i]-#V[i]~<~-15~"or"~>~+15$ assign $inf$
to $A[i]$; otherwise assign the true difference.
.pp
When searching for a key a scan can now be done down (or up) the memory
until a location $i$ where $A[i] != inf$ is found. (At worst this will occur
at the first unoccupied location, where $A[i]$ will be zero.)\ From there
a count can be made up (or down) the memory for the appropriate number of
$C$ bits which are 1.
.pp
In the detailed algorithm given below some differences from the simpler search
can be noted.
In step 3, $count$ can be both
positive and negative. Therefore code is included to scan both up and down
the memory as appropriate. At the end of step 3, $i$ can be pointing at any
member of the group associated with the original hash location. (Above,
$i$ was always left pointing at the lowest member of the
group.)\ Therefore code is included for scanning both up and down the
members of the group. In order to prevent redundant checking of locations
by this code a flag $direction$ is used. It can take on three values
depending on the direction of the memory scan: $"up"$, $"down"$, and $here$
(no further searching need be done).
.ls 1
.sp
.nf
.ta 1.5c +1.45c +1.45c +1.35c +1.35c +1.35c +1.35c +1.35c +1.35c
.sp
.ne 2
{Search using at-home count}
Step 1: {initialise variables}
        $H <-~ t(K);~~j <-~ h(H);~~rem <-~ r(H);~~i <-~ j;~~count <-~ 0;$
        {check virgin bit}
        $if~ V[j]=0~then$ {search unsuccessful} $exit ;$
.sp
.ne 5
Step 2: {find first well-defined $A$ value down the memory}
        $while ~A[i] = inf~do~begin~count <-~count - V[i];~i <-~i-1 ~end ;$
        $count <-~count +A[i];$
.sp
.ne 16
Step 3: {Search either up or down until a member of the sought group is found}
        {Also ensure $direction$ is set for Step 4.}
        $if count < 0 ~then$
                $direction <-~"up";$
                $repeat i <-~i+1;~count <-~count +C[i]~ until count = 0 ;$
                $if R[i] ~>=~ rem ~then direction <-~here;$
        $else if count > 0 ~then$
                $direction <-~"down";$
                $repeat ~count <-~count -C[i];~i <-~i-1~ until count = 0 ;$
                $if R[i] ~<=~ rem ~then direction <-~here;$
        $else${$count = 0$}
                $if R[i] > rem ~then direction <-~"down"$
                $else if R[i] < rem ~then direction <-~"up"$
                $else direction <-~here ;$
.sp
.ne 16
Step 4: {search group associated with $j$}
        $case direction~ "\fBof\fP"$
                $here: ;${do nothing}
                $"down": repeat if C[i] = 1 ~then direction <-~here$
                        $else$
                        $begin$
                                $i <-~i-1;$
                                $if R[i] ~<=~ rem ~then direction <-~here;$
                        $end$
                        $until direction = here ;$
                $"up": repeat if C[i+1] = 1 ~then direction <-~here$
                        $else$
                        $begin$
                                $i <-~i+1;$
                                $if R[i] ~>=~ rem ~then direction <-~here;$
                        $end$
                        $until direction = here ;$
        $end ;$
.sp
.ne 4
Step 5: {check last location to see if key found}
        $if R[i]=rem~ mark then$ {search successful}
        $lineup else$ {search unsuccessful} ;
.sp 2
.ls 2
.fi
.FC "Fig. 3."
.pp
Fig. 3 gives an example of this searching algorithm.
The same memory and key (75) as in Fig. 1 are used.
For the
purposes of the example each $A$ value is allocated one bit. This allows
only two values, 0 and $inf$. The search proceeds as follows:
.br
Step 1: The initial hash location
for 75 is 7 and its remainder is 5. The $V$ bit at location 7 is 1, so the
search continues.
.br
Step 2:
The first $A$ value not equal to $inf$ found by searching down the table
is at location 6.
There is one $V$ bit with a value of 1 between 7 and 6, at location 7.
$count$ is then set to $A[6]-1~=~-1$. So on the next step one $C$
bit will be sought up the table.
.br
Step 3:
Counting back up from 6, the first $C$ bit equal to 1 is at location 8.
So the $C$ bit at location 8 corresponds to the $V$ bit at the
original hash location 7.
.br
Step 4:
The group of remainders which share the same initial location 7 can then be
found in locations 8 and 9. The remainder 5 at location 8 can thus be
unambiguously associated with the original key 75, and it can be
concluded that the information associated with the key 75 is present
at location 8 in the memory.
.rh "Insertion."
Insertion of a new key into the memory requires three distinct steps:
first, locating whereabouts in the memory the key is to be placed;
second, deciding how the memory is to be rearranged to make room for the new
key; and third, moving the remainders whilst correctly preserving the
$A$ and $C$ values. (The $V$ bits remain fixed during the move.)\
The initial search can be done as explained above, with the small addition that
the correct insertion point must still be located when the key is not present.
The second and third steps follow the algorithm in Amble and Knuth [1],
with the addition that the values of the $A$ array must be re-calculated
over the shifted memory locations and the $C$ (but not the $V$) bits must
be moved with the keys.
Details of this can be found in an earlier draft of this paper [4].
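.pp
To make the mechanism above concrete, the basic search of this section (Steps 1-4 of the first algorithm, without the at-home array $A$) can be sketched in Python (my own sketch under the Fig. 1 conventions; the name $compact_search$ is an assumption, and the arrays are assumed to carry empty sentinel locations so the downward scan terminates):

```python
def compact_search(R, V, C, h, r, K):
    """Search the compact arrays R, V, C for key K.

    A sketch of the basic search of Section 3 (no at-home array A).
    Returns the location holding K's remainder, or None.
    """
    j, rem = h(K), r(K)
    # Step 1: a 0 virgin bit means no key ever hashed to j.
    if V[j] == 0:
        return None
    # Step 2: scan down to the first empty location, counting V bits.
    i, count = j, 0
    while R[i] != -1:
        count -= V[i]
        i -= 1
    # Step 3: count back up the same number of C bits; i then points
    # at the lowest member of the group whose home location is j.
    while count < 0:
        i += 1
        count += C[i]
    # Step 4: scan the group for the sought remainder.
    while R[i + 1] <= rem and C[i + 1] == 0:
        i += 1
    return i if R[i] == rem else None

# Fig. 1 again: searching for 75 (home 7, remainder 5).
R = [6, 8, 9, -1, 1, 5, 6, 7, 5, 6, 7, 1, -1]
V = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
C = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1]
print(compact_search(R, V, C, lambda k: k // 10, lambda k: k % 10, 75))
```

Tracing this for 75 reproduces the worked example: the downward scan stops at the empty location 3 with three $V$ bits counted, the upward count lands on the $C$ bit at location 8, and the remainder 5 there matches.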
.sh "4 Performance"
.pp
Now I consider how long these algorithms will take to run. The measure of
run time that I will use is the number of
.ul
probes
that each algorithm makes, that is, the number of times locations in the
hash table are examined or updated.
CPU time measures were taken as well and correlate well with the empirical
counts of probes given below.
.FC "Table I"
.FC "Table II"
.rh "Searching."
Tables I and II list the results of simulations
for successful and unsuccessful searches respectively. Results are tabulated
for ordinary BLP and for compact hashing with
different memory loadings and different sizes for
the $A$ field. If the number of keys stored
in the memory is $N$ then the memory loading is measured by
$alpha ~==~N/M$, the fraction of locations in the memory which are full.
Values of
Na were chosen to correspond to $A$ field lengths of 1, 2, 3,
4, and 5 bits, that is, for Na equal to 0, 1, 3, 7, and 15 respectively,
and also for the case where no $A$ field was used.
Increasing the size of the $A$ field beyond 5 bits had no effect at
the memory loadings investigated, so Na equal to 15 is effectively the
same as an unbounded size for the $A$ values.
.pp
The insertion procedure is
guaranteed to be optimum only for BLP, not for compact hashing. If none
of the values in $A$ is $inf$ then the sequence of locations examined by
compact
hashing is the same as for BLP and so the strategy will still be optimum.
(This is easily seen by noting that in compact hashing
$A[h(t(K))]$ determines the direction
of the search depending on whether it is positive or negative. During the
subsequent search no
locations past the sought key will be probed.
This is exactly the same
probing behaviour as in BLP.)\
However, if no $A$ array is being used, or if some values of $A$ are $inf$,
then extra locations need to be probed to find an empty location or one which
is not equal to $inf$.
.pp
As expected, the figures in Table I show that for Na at 15 and using optimum
insertion the probes for a successful search are almost the same as for BLP.
(The small differences are accounted for by statistical fluctuations
in the simulation results.)\
.pp
As Na is decreased the number of probes needed for searching increases.
This
reflects the greater distances that must be traversed to find a value of
$A$ not equal to $inf$. It is notable, however, that even a single bit allocated
to the $A$ field dramatically improves the performance. Even at a
memory density of 0.95 some 25% of non-empty locations have $A$ values of 0.
.pp
The pattern for unsuccessful searches is broadly the same as sketched above
for successful searches, except that in general unsuccessful searches
are quicker than successful ones. This is a result of the $V$ bits,
which allow many unsuccessful searches to be stopped after a single probe.
For example, even at the maximum possible memory density of 1 some 36% of
$V$ bits are zero. This results in compact hashing being faster than
the reported values for ordinary BLP.
However, unsuccessful BLP searches could be
improved to a similar degree by incorporating $V$ bits.
.FC "Table III"
.rh "Insertion."
The probes to insert a new key can be broken down into three components:
those needed to locate the position where the key is to be inserted,
those to decide the direction of movement,
and those to effect the movement of the memory.
The first of these will be slightly larger than for
a successful search and so the results of Table I have not been repeated.
The second two are independent of Na as they depend only on
the lengths of blocks of adjacent non-empty locations. The values
for these Na-independent components are listed in Table III.
In most cases
this Na-independent component is much larger than the search component.
The exception occurs
where no $A$ values are being used, when the two components
are comparable.
.pp
Cleary [5] examines a random insertion strategy for ordinary BLP
where blocks of entries in the hash table are moved in a randomly chosen
direction
to accommodate a
new entry, rather than in the optimum way described by
Amble and Knuth [1].
It is shown that this strategy can
improve insertion times by a factor of 4 at the expense of small degradations
(at most 15%) in retrieval times. These
results are shown by simulation to extend to compact hashing.
Indeed, for small values of
Na the optimum and random strategies show no significant differences in
retrieval times.
.rh "Analytic approximation."
While analytic results are not available for the number of probes
needed for retrieval or insertion, an
approximation can be developed for some of the cases. It is shown by
Amble and Knuth [1] and Knuth [8] (problem 6.4-47) that the average
length of a block of consecutive non-empty locations when using
the optimum insertion strategy is approximately
$(1- alpha ) sup -2 ~-~1$.
Let this block length be $L$.
.pp
Consider the case of a successful search when no $A$ field is used.
A successful scan of a block from an arbitrary
position to the end takes on average $L/2~+~1/2$ probes.
During the initial scan down the memory in the simulations, the initial check of the
$V$ bit and the final empty location examined were each counted as a single probe.
This gives a total of $L/2~+~5/2$ probes for the initial scan down.
(This is not
exact because there will be a correlation between the position
of a key's home location within a block
and the number of keys hashing to that home location.)
The scan back up a block will take $L/2~+~1/2$ probes (exact for a successful search).
This gives $(1- alpha ) sup -2 ~+~2$ for the expected
number of probes during a successful search. These values are listed in Table I
and are consistently low by about 10%.
.pp
For an unsuccessful search with no $A$ field the initial scan down the
memory will take $L/2~+~5/2$ probes as above (again this will not be exact, because
the probability of a $V$ bit being one will be correlated with the
size of a block and its
position within the block).
An unsuccessful scan of a block takes $L/2~+~1$ probes. (This assumes
the keys in the block are distributed uniformly.
This gives the following probabilities that the search will stop at a
particular location in the block: the first location, $1/2L$; locations 2
through $L$, $1/L$; the empty $(L+1)$st location, $1/2L$.
This will not be true for compact hashing, because the probability of stopping at a key
which shares its home location with a large number of other keys will be smaller than
for one which shares it with few others.)\ \ Summing these two terms gives $L~+~7/2$
probes.
Given that the keys are distributed randomly there is a probability of
$e sup {- alpha}$ that a given $V$ bit will be zero. So the expected number
of probes overall for an unsuccessful search is
$e sup {- alpha}~+~(1-e sup {- alpha}) cdot ((1- alpha ) sup -2 ~+~ 5/2)$.
These values are listed in Table II and are consistently low by about 5%.
.pp
Considering only the insertion component which is independent of Na,
it is possible to derive an expression for the number of probes.
There is an initial
scan to move the memory down and insert the new key which will scan about half
the block ($L/2~+~5/2$ probes)
and a subsequent scan back up of the entire block ($L~+~1$ probes).
Empirically the probability
that the entire block will subsequently be moved back up is a half, which gives
an expected $1/2(L~+~1)$ probes.
Summing these three contributions gives $2(1- alpha ) sup -2 ~+~2$
as the expected number of probes for an insertion (excluding the search time).
Values for this expression are tabulated in Table III; they are in good
agreement with the empirical values.
.sh "Acknowledgements"
.pp
I would like to thank Ian Witten for careful checking of a draft version.
Thanks also to John Andreae for discussions which showed that something like compact
hashing might be possible.
.sh "References"
.ls 1
.LB "[6] "
.sp
.NI "[1] "
[1]\ \ O.\ Amble and D.\ E.\ Knuth, "Ordered Hash Tables,"
.ul
The Computer Journal,
vol. 17, pp. 135-142, 1974.
.sp
.NI "[2] "
[2]\ \ J.\ H.\ Andreae,
.ul
Thinking with the teachable machine.
London: Academic Press, 1977.
.sp
.NI "[3] "
[3]\ \ J.\ L.\ Carter and M.\ N.\ Wegman, "Universal classes of hash
functions,"
.ul
J. Computer System Sci.,
vol. 18, pp. 143-154, 1979.
.sp
.NI "[4] "
[4]\ \ J.\ G.\ Cleary, "Compact hash tables,"
Research Report 82/100/19,
Department of Computer Science, University of Calgary, July 1982.
.sp
.NI "[5] "
[5]\ \ J.\ G.\ Cleary, "Random insertion for bidirectional linear probing
can be better than optimum,"
Research Report 82/105/24,
Department of Computer Science, University of Calgary, September 1982.
.sp
.NI "[6] "
[6]\ \ J.\ A.\ Feldman and J.\ R.\ Low, "Comment on Brent's Scatter Storage
Algorithm,"
.ul
CACM,
vol. 16, p. 703, 1973.
.sp
.NI "[7] "
[7]\ \ G.\ D.\ Knott, "Hashing functions,"
.ul
The Computer Journal,
vol.
18, pp. 265-278, 1975.
.sp
.NI "[8] "
[8]\ \ D.\ E.\ Knuth,
.ul
The art of computer programming: Sorting and searching.
Vol. III.
Reading, Massachusetts: Addison-Wesley, 1973.
.sp
.NI "[9] "
[9]\ \ V.\ Y.\ Lum, "General performance analysis of key-to-address
transformation methods using an abstract file concept,"
.ul
CACM,
vol. 16, pp. 603-612, 1973.
.sp
.NI "[10] "
[10]\ \ V.\ Y.\ Lum,\ P.\ S.\ T.\ Yuen and M.\ Dodd, "Key-to-address
transformation techniques,"
.ul
CACM,
vol. 14, pp. 228-239, 1971.
.sp
.NI "[11] "
[11]\ \ W.\ D.\ Maurer and T.\ G.\ Lewis, "Hash table methods,"
.ul
Comp. Surveys,
vol. 7, pp. 5-19, 1975.
.ls 2
.in 0
.bp 0
\&\ 
.RF
.ta 0.5i +0.75i +0.75i +0.75i +0.75i +0.75i
.nf

 $i T[i] R[i] V[i] C[i]$
 \l'3.5i'
.br

 12 \0\ -1 \ -1 0 1
.br
 11 101 \01 0 1
.br
 10 \087 \07 1 1
.br
 \09 \076 \06 0 0
.br
 \08 \075 \05 1 1
.br
 \07 \067 \07 1 0
.br
 \06 \066 \06 1 0
.br
 \05 \065 \05 0 1
.br
 \04 \041 \01 1 1
.br
 \03 \0\ -1 \ -1 0 1
.br
 \02 \019 \09 0 0
.br
 \01 \018 \08 1 0
.br
 \00 \016 \06 0 1
.br
 Step 1 Step 2 Step 3 Step 4
.br

 $h(H)~=~ left floor^ H/10 right floor$
.br

 $r(H)~=~ H~ roman mod ~10$
.br

.FG ""

.bp 0
\&\ 
.RF
.nf
.ta 0.5i +0.75i +0.75i +0.75i +0.75i
 $count~=~A[i]~=~#C[i]-#V[i]$
.sp
 $count = 0$ $count = 0$
 $C$ $V$ $C$ $V$
 0\|\(rt 1 0\|\(rt 1
 0\|\(rk 0 0\|\(rk 1$<-~i$
 1\|\(rb 1$<-~i$ 1\|\(rb 0
.sp
 $count =1>0$ $count = 2 > 0$
 $C$ $V$ $C$ $V$
 0 1$<-~i$ 0 1$<-~i$
 1 0 1 0
 0\|\(rt 1 1 1
 0\|\(rk 0 0\|\(rt 0
 1\|\(rb 0 0\|\(rk 0
 1\|\(rb 0
.sp
 $count =-1<0$
 $C$ $V$
 0\|\(rt 0 \|\(rt
 0\|\(rk 0 \|\(rk\ \ Group of entries which hash to
 1\|\(rb 0 \|\(rb\ \ location i
 0 0
 1 1$<-~i$ \ \ \ Corresponding $C$ and $V$ bits
.FG ""
.bp 0
\&\ 
.RF
.ta 0.5i +0.5i +0.5i +0.5i +0.5i +0.9i +0.6i +0.4i
$i R[i] V[i] C[i] #V[i] #C[i]~~#C[i]-#V[i] A[i]$
.br
\l'4.5i'
.br
12 \ -1 0 1 6 6 \00 0
.br
11 \01 0 1 6 6 \00 0
.br
10 \07 1 1 6 5 \ -1 $inf$
.br
\09 \06 0 0 5 4 \ -1 $inf$
.br
\08 \05 1 1 5 4 \ -1 $inf$
.br
\07 \07 1 0 4 3 \ -1 $inf$
.br
\06 \06 1 0 3 3 \00 0
.br
\05 \05 0 1 2 3 \01 $inf$
.br
\04 \01 1 1 2 2 \00 0
.br
\03 \ -1 0 1 1 1 \00 0
.br
\02 \09 0 0 1 1 \00 0
.br
\01 \08 1 0 1 1 \00 0
.br
\00 \06 0 1 0 1 \01 $inf$
.br
 Step 1 Step 2 Step 3 Step 4
.sp
Note: Only one bit has been allowed for the values of $A$.
.br
 So the only two possible values are 0 and $inf$.

.FG ""
.bp 0
\&\ 
.RF
.ta 1.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i
.ce
Successful Search
 \l'4i'

 $alpha$ \0.25 \0.5 \0.75 \0.8 \0.85 \0.9 \0.95
 \l'4i'

 $BLP sup 1$ \0\01.1 \0\01.3 \0\01.7 \0\02.0 \0\02.3 \0\02.9 \0\04.2

 Na
 15 \0\01.1 \0\01.3 \0\01.7 \0\01.9 \0\02.2 \0\02.8 \0\04.6
 \07 \0\01.1 \0\01.3 \0\01.7 \0\01.9 \0\02.2 \0\02.8 \0\09.7
 \03 \0\01.1 \0\01.3 \0\01.7 \0\01.9 \0\02.4 \0\04.2 \025
 \01 \0\01.1 \0\01.3 \0\02.0 \0\02.5 \0\04.1 \0\08.8 \045
 \00 \0\01.1 \0\01.5 \0\03.3 \0\04.9 \0\07.9 \015 \061
 \0- \0\04.2 \0\07.1 \020 \030 \049 110 370
 \0* \0\03.77 \0\06.00 \018.0 \027.0 \046.4 102 402

 $\& sup 1~$Taken from Amble and Knuth [1].
 - No $A$ field used.
 * Analytic approximation to line above.
.FG ""
.bp 0
\&
.RF
.ta 1.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i
.ce
Unsuccessful Search
 \l'4i'

 $alpha$ \0.25 \0.5 \0.75 \0.8 \0.85 \0.9 \0.95
 \l'4i'

 $BLP sup 1$ \0\01.3 \0\01.5 \0\02.1 \0\02.3 \0\02.6 \0\03.1 \0\04.4

 Na
 15 \0\01.2 \0\01.4 \0\01.8 \0\01.9 \0\02.1 \0\02.4 \0\03.5
 \07 \0\01.2 \0\01.4 \0\01.8 \0\01.9 \0\02.1 \0\02.4 \0\09.7
 \03 \0\01.2 \0\01.4 \0\01.8 \0\01.9 \0\02.2 \0\03.3 \015
 \01 \0\01.2 \0\01.4 \0\01.9 \0\02.2 \0\03.2 \0\06.0 \028
 \00 \0\01.2 \0\01.5 \0\02.6 \0\03.4 \0\05.3 \0\09.9 \036
 \0- \0\01.7 \0\03.4 \011 \016 \028 \064 220
 \0* \0\01.72 \0\03.16 \010.2 \015.6 \027.3 \061.2 247

 $\& sup 1~$Taken from Amble and Knuth [1].
 - No $A$ field used.
 * Analytic approximation to line above.
.FG ""
.bp 0
\&
.RF
.ta 1.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i
 \l'4i'

 $alpha$ \0.25 \0.5 \0.75 \0.8 \0.85 \0.9 \0.95
 \l'4i'

 \0\04.3 \0\08.8 \032 \049 \086 200 700
 * \0\04.56 \0\09.00 \033.0 \051.0 \089.9 201 801

 * Analytic approximation to line above.
.FG ""
.bp 0
\&
.ce
List of Figures
.sp 2
Fig. 1. Example of compact hash memory and search for key.
.sp 2
Fig. 2. Examples showing different values of $#C[i]-#V[i]$.
.sp 2
Fig. 3. Example of calculation and use of array $A$.
.sp 2
.ce
List of Tables
.sp 2
Table I. Average number of probes during a successful search.
.sp 2
Table II. Average number of probes during an unsuccessful search.
.sp 2
Table III. Average number of probes to move block of memory.
.sp 2