src/backend/access/hash/README

Hash Indexing
=============

This directory contains an implementation of hash indexing for Postgres.
Most of the core ideas are taken from Margo Seltzer and Ozan Yigit,
A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference,
January 1991.  (Our in-memory hashtable implementation,
src/backend/utils/hash/dynahash.c, also relies on some of the same concepts;
it is derived from code written by Esmond Pitt and later improved by Margo
among others.)

A hash index consists of two or more "buckets", into which tuples are
placed whenever their hash key maps to the bucket number.  The
key-to-bucket-number mapping is chosen so that the index can be
incrementally expanded.  When a new bucket is to be added to the index,
exactly one existing bucket will need to be "split", with some of its
tuples being transferred to the new bucket according to the updated
key-to-bucket-number mapping.  This is essentially the same hash table
management technique embodied in src/backend/utils/hash/dynahash.c for
in-memory hash tables.

Each bucket in the hash index comprises one or more index pages.  The
bucket's first page is permanently assigned to it when the bucket is
created.  Additional pages, called "overflow pages", are added if the
bucket receives too many tuples to fit in the primary bucket page.
The pages of a bucket are chained together in a doubly-linked list
using fields in the index page special space.

There is currently no provision to shrink a hash index, other than by
rebuilding it with REINDEX.  Overflow pages can be recycled for reuse
in other buckets, but we never give them back to the operating system.
There is no provision for reducing the number of buckets, either.

As of PostgreSQL 8.4, hash index entries store only the hash code, not the
actual data value, for each indexed item.  This makes the index entries
smaller (perhaps very substantially so) and speeds up various operations.
In particular, we can speed searches by keeping the index entries in any
one index page sorted by hash code, thus allowing binary search to be used
within an index page.  Note however that there is *no* assumption about the
relative ordering of hash codes across different index pages of a bucket.


Page Addressing
---------------

There are four kinds of pages in a hash index: the meta page (page zero),
which contains statically allocated control information; primary bucket
pages; overflow pages; and bitmap pages, which keep track of overflow
pages that have been freed and are available for re-use.  For addressing
purposes, bitmap pages are regarded as a subset of the overflow pages.

Primary bucket pages and overflow pages are allocated independently (since
any given index might need more or fewer overflow pages relative to its
number of buckets).  The hash code uses an interesting set of addressing
rules to support a variable number of overflow pages while not having to
move primary bucket pages around after they are created.

Primary bucket pages (henceforth just "bucket pages") are allocated in
power-of-2 groups, called "split points" in the code.  That means that at
every new splitpoint we double the existing number of buckets.  Allocating
huge chunks of bucket pages all at once isn't optimal, because it would
take ages to consume them.
To avoid this exponential growth of index size, we use a trick to break up
the allocation of buckets at each splitpoint into 4 equal phases.  If
(2 ^ x) buckets in total are to be allocated at a splitpoint (from now on
we shall call this a splitpoint group), then we allocate 1/4th of them
(2 ^ (x - 2)) in each phase of the splitpoint group.  The next quarter is
allocated only once the buckets of the previous phase have been consumed.
For the initial splitpoint groups < 10 we allocate all of their buckets in
a single phase, since the number of buckets allocated in those groups is
small.  For groups >= 10 the allocation is distributed among four equal
phases.  At group 10 we allocate (2 ^ 9) buckets in 4 different phases
{2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, where the numbers in curly braces indicate
the number of buckets allocated within each phase of splitpoint group 10.
For splitpoint groups 11 and 12 the allocation phases are
{2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and {2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9}
respectively.  We can see that at each splitpoint group we still double the
total number of buckets from the previous group, but in incremental
phases.  The bucket pages allocated within one phase of a splitpoint group
appear consecutively in the index.  This addressing scheme allows the
physical location of a bucket page to be computed from the bucket number
relatively easily, using only a small amount of control information.  As
in the function _hash_spareindex, for a given bucket number we first
compute the splitpoint group it belongs to and then the phase within that
group to which the bucket belongs.  Adding them gives the global splitpoint
phase number S to which the bucket belongs, and then we simply add
"hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
metapage) to the given bucket number to compute its physical address.
hashm_spares[S] can be interpreted as the total number of overflow pages
that have been allocated before the bucket pages of splitpoint phase S.
hashm_spares[0] is always 0, so that buckets 0 and 1 always appear at
block numbers 1 and 2, just after the meta page.  We always have
hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
former.  The difference between the two represents the number of overflow
pages appearing between the bucket page groups of splitpoint phases N and
N+1.  (Note: the above describes what happens when filling an initially
minimally sized hash index.  In practice, we try to estimate the required
index size and allocate a suitable number of splitpoint phases
immediately, to avoid expensive re-splitting during the initial index
build.)

When S splitpoints exist altogether, the array entries hashm_spares[0]
through hashm_spares[S] are valid; hashm_spares[S] records the current
total number of overflow pages.  New overflow pages are created as needed
at the end of the index, and recorded by incrementing hashm_spares[S].
When it is time to create a new splitpoint phase's worth of bucket pages,
we copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
stored in the hashm_ovflpoint field of the meta page).  This has the
effect of reserving the correct number of bucket pages at the end of the
index, and preparing to allocate additional overflow pages after those
bucket pages.  hashm_spares[] entries before S cannot change anymore,
since that would require moving already-created bucket pages.
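
To make the addressing arithmetic concrete, here is a small self-contained
C sketch of the bucket-number-to-block-number computation.  It is modeled
on _hash_spareindex and the BUCKET_TO_BLKNO macro, but it uses illustrative
names and plain C types, so read it as a sketch of the scheme rather than
the server's actual code:

    #include <stdint.h>

    #define PHASE_BITS           2
    #define PHASES_PER_GROUP     (1 << PHASE_BITS)  /* 4 */
    #define PHASE_MASK           (PHASES_PER_GROUP - 1)
    #define SINGLE_PHASE_GROUPS  10

    /* ceil(log2(x)) for x >= 1 (illustrative helper) */
    static uint32_t
    ceil_log2(uint32_t x)
    {
        uint32_t    r = 0;

        while (((uint32_t) 1 << r) < x)
            r++;
        return r;
    }

    /* global splitpoint phase in which the num_bucket'th bucket, i.e.
     * bucket number num_bucket - 1, is allocated */
    static uint32_t
    spare_index(uint32_t num_bucket)
    {
        uint32_t    group = ceil_log2(num_bucket);
        uint32_t    phase;

        if (group < SINGLE_PHASE_GROUPS)
            return group;       /* early groups have one phase each */

        /* one phase per single-phase group ... */
        phase = SINGLE_PHASE_GROUPS;
        /* ... plus four phases per earlier multi-phase group ... */
        phase += (group - SINGLE_PHASE_GROUPS) << PHASE_BITS;
        /* ... plus the 0-based phase within the current group */
        phase += ((num_bucket - 1) >> (group - (PHASE_BITS + 1))) & PHASE_MASK;
        return phase;
    }

    /* physical block of a bucket, given the metapage's spares[] array */
    static uint32_t
    bucket_to_blkno(const uint32_t *spares, uint32_t bucket)
    {
        /* overflow pages physically located before this bucket's page
         * group, counted by the spares[] entry frozen when the group
         * was created */
        uint32_t    ovfl_before = (bucket == 0) ? 0 :
            spares[spare_index(bucket + 1) - 1];

        /* block 0 is the metapage, so buckets 0 and 1 land on blocks 1, 2 */
        return bucket + ovfl_before + 1;
    }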

The last page nominally used by the index is always determinable from
hashm_spares[S].  To avoid complaints from smgr, the logical EOF as seen by
the filesystem and smgr must always be greater than or equal to this page.
We have to allow the case "greater than" because it's possible that during
an index extension we crash after allocating filesystem space and before
updating the metapage.  Note that on filesystems that allow "holes" in
files, it's entirely likely that pages before the logical EOF are not yet
allocated: when we allocate a new splitpoint phase's worth of bucket pages,
we physically zero the last such page to force the EOF up, and the first
such page will be used immediately, but the intervening pages are not
written until needed.

Since overflow pages may be recycled if enough tuples are deleted from
their bucket, we need a way to keep track of currently-free overflow
pages.  The state of each overflow page (0 = available, 1 = not available)
is recorded in "bitmap" pages dedicated to this purpose.  The entries in
the bitmap are indexed by "bit number", a zero-based count in which every
overflow page has a unique entry.  We can convert between an overflow
page's physical block number and its bit number using the information in
hashm_spares[] (see hashovfl.c for details).  The bit number sequence
includes the bitmap pages, which is the reason for saying that bitmap
pages are a subset of the overflow pages.  It turns out in fact that each
bitmap page's first bit represents itself --- this is not an essential
property, but falls out of the fact that we only allocate another bitmap
page when we really need one.  Bit number zero always corresponds to the
first bitmap page, which is allocated during index creation just after all
the initially created buckets.


Lock Definitions
----------------

Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks.  Here as elsewhere in PostgreSQL,
cleanup lock means that we hold an exclusive lock on the buffer and have
observed at some point after acquiring the lock that we hold the only pin
on that buffer.  For hash indexes, a cleanup lock on a primary bucket page
represents the right to perform an arbitrary reorganization of the entire
bucket.  Therefore, scans retain a pin on the primary bucket page for the
bucket they are currently scanning.  Splitting a bucket requires a cleanup
lock on both the old and new primary bucket pages.  VACUUM therefore takes
a cleanup lock on every bucket page in order to remove tuples.  It can also
remove tuples copied to a new bucket by any previous split operation,
because the cleanup lock taken on the primary bucket page guarantees that
no scans which started prior to the most recent split can still be in
progress.  After cleaning each page individually, it attempts to take a
cleanup lock on the primary bucket page in order to "squeeze" the bucket
down to the minimum possible number of pages.

To avoid deadlocks, we must be consistent about the lock order in which we
lock buckets for operations that require locks on two different buckets.
We choose to always lock the lower-numbered bucket first.  The metapage is
only ever locked after all bucket locks have been taken.
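
To illustrate the ordering rule, here is a minimal self-contained C sketch
using one pthread mutex per bucket as a stand-in for the content locks on
primary bucket pages (purely an assumption for illustration; the server
uses its own buffer locks, not pthread mutexes):

    #include <pthread.h>
    #include <stdint.h>

    #define NBUCKETS 1024

    /* illustrative stand-ins for per-bucket primary-page content locks
     * (initialization with PTHREAD_MUTEX_INITIALIZER omitted for brevity) */
    static pthread_mutex_t bucket_lock[NBUCKETS];

    /* lock two distinct buckets deadlock-free: lower-numbered first; the
     * metapage lock (not shown) would be taken only after both */
    static void
    lock_bucket_pair(uint32_t b1, uint32_t b2)
    {
        uint32_t    lo = (b1 < b2) ? b1 : b2;
        uint32_t    hi = (b1 < b2) ? b2 : b1;

        pthread_mutex_lock(&bucket_lock[lo]);
        pthread_mutex_lock(&bucket_lock[hi]);
    }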


Metapage Caching
----------------

Both scanning the index and inserting tuples require locating the bucket
where a given tuple ought to be located.  To do this, we need the bucket
count, highmask, and lowmask from the metapage; however, it's undesirable
for performance reasons to have to lock and pin the metapage for every
such operation.  Instead, we retain a cached copy of the metapage in each
backend's relcache entry.  This will produce the correct bucket mapping as
long as the target bucket hasn't been split since the last cache refresh.

To guard against the possibility that such a split has occurred, the
primary page of each bucket chain stores the number of buckets that
existed as of the time the bucket was last split, or if never split as
of the time it was created, in the space normally used for the
previous block number (that is, hasho_prevblkno).  This doesn't cost
anything because the primary bucket page is always the first page in
the chain, and the previous block number is therefore always, in
reality, InvalidBlockNumber.

After computing the ostensibly-correct bucket number based on our cached
copy of the metapage, we lock the corresponding primary bucket page and
check whether the bucket count stored in hasho_prevblkno is greater than
the number of buckets stored in our cached copy of the metapage.  If so,
the bucket has certainly been split, because the count must originally
have been less than the number of buckets that existed at that time and
can't have increased except due to a split.  If not, the bucket can't have
been split, because a split would have created a new bucket with a higher
bucket number than any we'd seen previously.  In the latter case, we've
locked the correct bucket and can proceed; in the former case, we must
release the lock on this bucket, lock the metapage, update our cache,
unlock the metapage, and retry.

Needing to retry occasionally might seem expensive, but the number of
times any given bucket can be split is limited to a few dozen no matter
how many times the hash index is accessed, because the total number of
buckets is limited to less than 2^32.  On the other hand, the number of
times we access a bucket is unbounded and will be several orders of
magnitude larger even in unsympathetic cases.

(The metapage cache is new in v10.  Older hash indexes had the primary
bucket page's hasho_prevblkno initialized to InvalidBuffer.)
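
The bucket mapping and the staleness check are small enough to sketch in
full.  The struct and helpers below are illustrative stand-ins rather than
the server's declarations (the real mapping lives in _hash_hashkey2bucket),
but the comparisons are the ones described above:

    #include <stdbool.h>
    #include <stdint.h>

    /* illustrative cached copy of the relevant metapage fields */
    typedef struct CachedMeta
    {
        uint32_t    maxbucket;  /* cached hashm_maxbucket */
        uint32_t    highmask;   /* cached hashm_highmask */
        uint32_t    lowmask;    /* cached hashm_lowmask */
    } CachedMeta;

    /* hash-code-to-bucket mapping, as in _hash_hashkey2bucket */
    static uint32_t
    hashkey_to_bucket(uint32_t hashkey, const CachedMeta *m)
    {
        uint32_t    bucket = hashkey & m->highmask;

        if (bucket > m->maxbucket)
            bucket &= m->lowmask;
        return bucket;
    }

    /* after locking the computed bucket's primary page: true when the
     * bucket count saved in hasho_prevblkno proves a later split, in
     * which case we must refresh the cache and retry */
    static bool
    cache_is_stale(uint32_t count_in_prevblkno, const CachedMeta *m)
    {
        return count_in_prevblkno > m->maxbucket;
    }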

Pseudocode Algorithms
---------------------

Various flags used in hash index operations are described below.

The bucket-being-split and bucket-being-populated flags indicate that a
split operation is in progress for a bucket.  During a split operation,
the bucket-being-split flag is set on the old bucket and the
bucket-being-populated flag is set on the new bucket.  These flags are
cleared once the split operation is finished.

The split-cleanup flag indicates that a bucket which has been recently
split still contains tuples that were also copied to the new bucket; it
essentially marks the split as incomplete.  Once we're certain that no
scans which started before the new bucket was fully populated are still in
progress, we can remove the copies from the old bucket and clear the flag.
We insist that this flag must be clear before splitting a bucket; thus, a
bucket can't be split again until the previous split is totally complete.

The moved-by-split flag on a tuple indicates that the tuple was moved from
the old to the new bucket.  Concurrent scans will skip such tuples until
the split operation is finished.  Once a tuple is marked as moved-by-split
it will remain so forever, but that does no harm.  We intentionally don't
clear the flag, since clearing it would generate additional I/O for no
benefit.

The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the same
bucket); insertion of a new tuple into the correct bucket; enlarging the
hash table by splitting an existing bucket; and garbage collection
(deletion of dead tuples and compaction of buckets).  Bucket splitting is
done at the conclusion of any insertion that leaves the hash table more
full than the target load factor, but it is convenient to consider it as
an independent operation.  Note that we do not have a bucket-merge
operation --- the number of buckets never shrinks.  Insertion, splitting,
and garbage collection may all need access to freelist management, which
keeps track of available overflow pages.

The reader algorithm is:

    lock the primary bucket page of the target bucket
    if the target bucket is still being populated by a split:
        release the buffer content lock on current bucket page
        pin and acquire the buffer content lock on old bucket in shared mode
        release the buffer content lock on old bucket, but not pin
        retake the buffer content lock on new bucket
        arrange to scan the old bucket normally and the new bucket for
         tuples which are not moved-by-split
-- then, per read request:
    reacquire content lock on current page
    step to next page if necessary (no chaining of content locks, but keep
     the pin on the primary bucket throughout the scan)
    save all the matching tuples from current index page into an items array
    release pin and content lock (but if it is the primary bucket page,
     retain its pin till the end of the scan)
    get tuple from the items array
-- at scan shutdown:
    release all pins still held

Holding the buffer pin on the primary bucket page for the whole scan
prevents the reader's current-tuple pointer from being invalidated by
splits or compactions.  (Of course, other buckets can still be split or
compacted.)

To minimize lock/unlock traffic, a hash index scan always searches the
entire hash page to identify all the matching items at once, copying their
heap tuple IDs into backend-local storage.  The heap tuple IDs are then
processed while not holding any page lock within the index, thereby
allowing concurrent insertions to happen on the same index page without
any need to re-find the current scan position for the reader.  We do
continue to hold a pin on the bucket page, to protect against concurrent
deletions and bucket splits.

To allow for scans during a bucket split, if at the start of the scan the
bucket is marked as bucket-being-populated, the scan covers all the tuples
in that bucket except for those that are marked as moved-by-split.  Once
it finishes scanning all the tuples in the current bucket, it scans the
old bucket from which this bucket was formed by a split.
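
The "save all the matching tuples ... into an items array" step relies on
the entries within one page being sorted by hash code: a single binary
search finds the first match, and a sequential pass collects the rest.  A
self-contained C sketch of that per-page step, with a plain array of hash
codes standing in for the real page format (an assumption made for
illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* collect the offsets of all entries equal to "target" in a page
     * whose hash codes are sorted ascending; returns number of matches */
    static size_t
    collect_matches(const uint32_t *codes, size_t n, uint32_t target,
                    size_t *out, size_t outsz)
    {
        size_t      lo = 0, hi = n, nout = 0;

        while (lo < hi)             /* find first entry >= target */
        {
            size_t      mid = lo + (hi - lo) / 2;

            if (codes[mid] < target)
                lo = mid + 1;
            else
                hi = mid;
        }
        while (lo < n && codes[lo] == target && nout < outsz)
            out[nout++] = lo++;     /* remember each matching offset */

        return nout;
    }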

The insertion algorithm is rather similar:

    lock the primary bucket page of the target bucket
-- (so far same as reader, except for acquisition of buffer content lock in
   exclusive mode on primary bucket page)
    if the bucket-being-split flag is set for the bucket and the pin count
     on it is one, then finish the split:
        release the buffer content lock on current bucket
        get the "new" bucket which was being populated by the split
        scan the new bucket and form the hash table of TIDs
        conditionally get the cleanup lock on old and new buckets
        if we get the lock on both the buckets
            finish the split using the split algorithm mentioned below
        release the pin on old bucket and restart the insert from beginning
    if current page is full, first check if this page contains any dead
     tuples; if yes, remove the dead tuples from the current page and
     re-check for available space.  If enough space is found, insert the
     tuple, else release lock but not pin, read/exclusive-lock
     next page; repeat as needed
    >> see below if no space in any page of bucket
    take buffer content lock in exclusive mode on metapage
    insert tuple at appropriate place in page
    mark current page dirty
    increment tuple count, decide if split needed
    mark meta page dirty
    write WAL for insertion of tuple
    release the buffer content lock on metapage
    release buffer content lock on current page
    if current page is not a bucket page, release the pin on bucket page
    if split is needed, enter Split algorithm below
    release the pin on metapage

To speed searches, the index entries within any individual index page are
kept sorted by hash code; the insertion code must take care to insert new
entries in the right place.  It is okay for an insertion to take place in
a bucket that is being actively scanned, because readers can cope with
this as explained above.  We only need the short-term buffer locks to
ensure that readers do not see a partially-updated page.

To avoid deadlock between readers and inserters, whenever there is a need
to lock multiple buckets, we always take the locks in the order suggested
in Lock Definitions above.  This allows a very high degree of concurrency.
(The exclusive metapage lock taken to update the tuple count is stronger
than necessary, since readers do not care about the tuple count, but the
lock is held for such a short time that this is probably not an issue.)

When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
Details of that part of the algorithm appear later.
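
The "decide if split needed" step above (the same overfull test that the
split algorithm re-checks) is a simple load-factor comparison.  A hedged C
sketch with illustrative parameter names (the metapage stores a target
fill factor, hashm_ffactor, and the highest bucket number,
hashm_maxbucket):

    #include <stdbool.h>
    #include <stdint.h>

    /* the index is overfull when the average tuples per bucket exceed
     * the target fill factor; the bucket count is maxbucket + 1 */
    static bool
    split_needed(double ntuples, uint32_t maxbucket, uint16_t ffactor)
    {
        return ntuples > (double) ffactor * (maxbucket + 1);
    }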

The page split algorithm is entered whenever an inserter observes that the
index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:

    pin meta page and take buffer content lock in exclusive mode
    check split still needed
    if split not needed anymore, drop buffer content lock and pin and exit
    decide which bucket to split
    try to take a cleanup lock on that bucket; if fail, give up
    if that bucket is still being split or has split-cleanup work:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
    mark the old and new buckets indicating split is in progress
    mark both old and new buckets as dirty
    write WAL for allocation of new page for split
    copy the tuples that belong to the new bucket from the old bucket,
     marking them as moved-by-split
    write WAL record for moving tuples to new page once the new page is
     full or all the pages of the old bucket are finished
    release lock but not pin for primary bucket page of old bucket,
     read/shared-lock next page; repeat as needed
    clear the bucket-being-split and bucket-being-populated flags
    mark the old bucket indicating split-cleanup
    write WAL for changing the flags on both old and new buckets

The split operation's attempt to acquire a cleanup lock on the old bucket
could fail if another process holds any lock or pin on it.  We do not want
to wait if that happens, because we don't want to wait while holding the
metapage exclusive-lock.  So, this is a conditional LWLockAcquire
operation, and if it fails we just abandon the attempt to split.  This is
all right since the index is overfull but perfectly functional.  Every
subsequent inserter will try to split, and eventually one will succeed.
If multiple inserters failed to split, the index might still be overfull,
but eventually the index will not be overfull and split attempts will
stop.  (We could make a successful splitter loop to see if the index is
still overfull, but it seems better to distribute the split overhead
across successive insertions.)

If a split fails partway through (e.g. due to insufficient disk space or
an interrupt), the index will not be corrupted.  Instead, we'll retry the
split every time a tuple is inserted into the old bucket prior to
inserting the new tuple; eventually, we should succeed.  The fact that a
split is left unfinished doesn't prevent subsequent buckets from being
split, but we won't try to split the bucket again until the prior split is
finished.  In other words, a bucket can be in the middle of being split
for some time, but it can't be in the middle of two splits at the same
time.
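
Which tuples must move during a split follows directly from the
key-to-bucket mapping.  Continuing the illustrative CachedMeta /
hashkey_to_bucket() sketch from the Metapage Caching section (again a
sketch, not the server's code):

    /* the bucket that must be split to create bucket "new_bucket" is the
     * one numbered new_bucket with its top bit cleared, i.e. masked by
     * the pre-split lowmask */
    static uint32_t
    old_bucket_for(uint32_t new_bucket, uint32_t presplit_lowmask)
    {
        return new_bucket & presplit_lowmask;
    }

    /* true if a tuple with this hash code maps to the new bucket under
     * the post-split masks, and so must be copied there and marked
     * moved-by-split; otherwise it stays in the old bucket */
    static bool
    moves_to_new_bucket(uint32_t hashkey, uint32_t new_bucket,
                        const CachedMeta *post_split)
    {
        return hashkey_to_bucket(hashkey, post_split) == new_bucket;
    }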

The fourth operation is garbage collection (bulk deletion):

    next bucket := 0
    pin metapage and take buffer content lock in exclusive mode
    fetch current max bucket number
    release meta page buffer content lock and pin
    while next bucket <= max bucket do
        acquire cleanup lock on primary bucket page
        loop:
            scan and remove tuples
            mark the target page dirty
            write WAL for deleting tuples from target page
            if this is the last bucket page, break out of loop
            pin and x-lock next page
            release prior lock and pin (except keep pin on primary bucket
             page)
        if the page we have locked is not the primary bucket page:
            release lock and take exclusive lock on primary bucket page
        if there are no other pins on the primary bucket page:
            squeeze the bucket to remove free space
        release the pin on primary bucket page
        next bucket ++
    end loop
    pin metapage and take buffer content lock in exclusive mode
    check if number of buckets changed
    if so, release content lock and pin and return to for-each-bucket loop
    else update metapage tuple count
    mark meta page dirty and write WAL for update of metapage
    release buffer content lock and pin

Note that this is designed to allow concurrent splits and scans.  If a
split occurs, tuples relocated into the new bucket will be visited twice
by the scan, but that does no harm.  See also "Interlocking Between Scans
and VACUUM", below.

We must be careful about the statistics reported by the VACUUM operation.
What we can do is count the number of tuples scanned, and believe this in
preference to the stored tuple count if the stored tuple count and number
of buckets did *not* change at any time during the scan.  This provides a
way of correcting the stored tuple count if it gets out of sync for some
reason.  But if a split or insertion does occur concurrently, the scan
count is untrustworthy; instead, subtract the number of tuples deleted
from the stored tuple count and use that.


Interlocking Between Scans and VACUUM
-------------------------------------

Since we release the lock on a bucket page during a cleanup scan of a
bucket, a concurrent scan could start in that bucket before we've finished
vacuuming it.  If a scan gets ahead of cleanup, we could have the
following problem: (1) the scan sees heap TIDs that are about to be
removed before they are processed by VACUUM, (2) the scan decides that one
or more of those TIDs are dead, (3) VACUUM completes, (4) one or more of
the TIDs the scan decided were dead are reused for an unrelated tuple, and
finally (5) the scan wakes up and erroneously kills the new tuple.

Note that this requires VACUUM and a scan to be active in the same bucket
at the same time.  If VACUUM completes before the scan starts, the scan
never has a chance to see the dead tuples; if the scan completes before
the VACUUM starts, the heap TIDs can't have been reused meanwhile.
Furthermore, VACUUM can't start on a bucket that has an active scan,
because the scan holds a pin on the primary bucket page, and VACUUM must
take a cleanup lock on that page in order to begin cleanup.  Therefore,
the only way this problem can occur is for a scan to start after VACUUM
has released the cleanup lock on the bucket but before it has processed
the entire bucket, and then overtake the cleanup operation.

Currently, we prevent this using lock chaining: cleanup locks the next
page in the chain before releasing the lock and pin on the page just
processed.
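
Hand-over-hand lock chaining is easy to see in a self-contained sketch.
The pthread mutexes below stand in for the cleanup locks and pins on the
pages of one bucket chain, purely for illustration:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct ChainPage
    {
        pthread_mutex_t lock;       /* stand-in for the page lock */
        struct ChainPage *next;     /* NULL at the end of the chain */
    } ChainPage;

    /* process each page while never being unlocked "behind" a follower:
     * the next page is locked before the current one is released, so a
     * concurrent scan can never overtake the cleanup */
    static void
    walk_chain(ChainPage *head, void (*process)(ChainPage *))
    {
        if (head == NULL)
            return;
        pthread_mutex_lock(&head->lock);
        while (head != NULL)
        {
            ChainPage  *next = head->next;

            process(head);
            if (next != NULL)
                pthread_mutex_lock(&next->lock);
            pthread_mutex_unlock(&head->lock);
            head = next;
        }
    }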

Free Space Management
---------------------

(Question: why is this so complicated?  Why not just have a linked list
of free pages with the list head in the metapage?  It's not like we
avoid needing to modify the metapage with all this.)

Free space management consists of two sub-algorithms, one for reserving
an overflow page to add to a bucket chain, and one for returning an empty
overflow page to the free pool.

Obtaining an overflow page:

    take metapage content lock in exclusive mode
    determine next bitmap page number; if none, exit loop
    release meta page content lock
    pin bitmap page and take content lock in exclusive mode
    search for a free page (zero bit in bitmap)
    if found:
        set bit in bitmap
        mark bitmap page dirty
        take metapage buffer content lock in exclusive mode
        if first-free-bit value did not change,
            update it and mark meta page dirty
    else (not found):
        release bitmap page buffer content lock
        loop back to try next bitmap page, if any
-- here when we have checked all bitmap pages; we hold meta excl. lock
    extend index to add another overflow page; update meta information
    mark meta page dirty
    return page number

It is slightly annoying to release and reacquire the metapage lock
multiple times, but it seems best to do it that way to minimize loss of
concurrency against processes just entering the index.  We don't want
to hold the metapage exclusive lock while reading in a bitmap page.
(We can at least avoid repeated buffer pin/unpin here.)

The normal path for extending the index does not require doing I/O while
holding the metapage lock.  We do have to do I/O when the extension
requires adding a new bitmap page as well as the required overflow page
... but that is an infrequent case, so the loss of concurrency seems
acceptable.

The portion of tuple insertion that calls the above subroutine looks
like this:

    -- having determined that no space is free in the target bucket:
    remember last page of bucket, drop write lock on it
    re-write-lock last page of bucket
    if it is not last anymore, step to the last page
    execute free-page-acquire (obtaining an overflow page) mechanism
     described above
    update (former) last page to point to the new page and mark buffer
     dirty
    write-lock and initialize new page, with back link to former last page
    write WAL for addition of overflow page
    release the locks on meta page and bitmap page acquired in
     free-page-acquire algorithm
    release the lock on former last page
    release the lock on new overflow page
    insert tuple into new page
    -- etc.

Notice this handles the case where two concurrent inserters try to extend
the same bucket.  They will end up with a valid, though perhaps
space-inefficient, configuration: two overflow pages will be added to the
bucket, each containing one tuple.

The last part of this violates the rule about holding write lock on two
pages concurrently, but it should be okay to write-lock the previously
free page; there can be no other process holding a lock on it.
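
The bitmap search at the heart of the free-page-acquire path can be
sketched in a few lines of self-contained C; a flat uint32_t array stands
in for the real bitmap page format (an assumption made for illustration):

    #include <stdint.h>

    #define BITS_PER_WORD 32

    /* find and claim the first zero bit (free overflow page) among the
     * nbits bits of one bitmap page; returns the claimed bit number, or
     * -1 if every page covered by this bitmap page is in use */
    static int32_t
    claim_free_bit(uint32_t *bitmap, uint32_t nbits)
    {
        for (uint32_t bit = 0; bit < nbits; bit++)
        {
            uint32_t   *word = &bitmap[bit / BITS_PER_WORD];
            uint32_t    mask = (uint32_t) 1 << (bit % BITS_PER_WORD);

            if ((*word & mask) == 0)
            {
                *word |= mask;      /* 1 = not available */
                return (int32_t) bit;
            }
        }
        return -1;      /* caller moves on to the next bitmap page */
    }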

Bucket splitting uses a similar algorithm if it has to extend the new
bucket, but it need not worry about concurrent extension since it has
buffer content lock in exclusive mode on the new bucket.

Freeing an overflow page requires the process to hold buffer content lock
in exclusive mode on the containing bucket, so it need not worry about
other accessors of pages in the bucket.  The algorithm is:

    delink overflow page from bucket chain
    (this requires read/update/write/release of fore and aft siblings)
    pin meta page and take buffer content lock in shared mode
    determine which bitmap page contains the free space bit for page
    release meta page buffer content lock
    pin bitmap page and take buffer content lock in exclusive mode
    retake meta page buffer content lock in exclusive mode
    move (insert) tuples that belong to the overflow page being freed
    update bitmap bit
    mark bitmap page dirty
    if page number is still less than first-free-bit,
        update first-free-bit field and mark meta page dirty
    write WAL for delinking overflow page operation
    release buffer content lock and pin
    release meta page buffer content lock and pin

We have to do it this way because we must clear the bitmap bit before
changing the first-free-bit field (hashm_firstfree).  It is possible that
we set first-free-bit too small (because someone has already reused the
page we just freed), but that is okay; the only cost is that the next
overflow page acquirer will scan more bitmap bits than it needs to.  What
must be avoided is having first-free-bit greater than the actual first
free bit, because then that free page would never be found by searchers.

The reason for moving tuples out of the overflow page while delinking the
latter is to make that a single atomic operation.  Not doing so could lead
to spurious reads on a standby; basically, the user might see the same
tuple twice.
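
A one-function sketch of the hashm_firstfree rule discussed above: the
hint may only move downward when a page is freed, because an underestimate
merely lengthens the next bitmap scan, while an overestimate would hide a
free page forever (illustrative code, not the server's):

    #include <stdint.h>

    /* freeing bit "freed_bit": lowering the first-free hint is always
     * safe; it must never end up beyond the true first free bit */
    static void
    note_freed_bit(uint32_t *firstfree, uint32_t freed_bit)
    {
        if (freed_bit < *firstfree)
            *firstfree = freed_bit;
    }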


WAL Considerations
------------------

Hash index operations like create index, insert, delete, bucket split,
allocate overflow page, and squeeze do not in themselves guarantee hash
index consistency after a crash.  To provide robustness, we write WAL for
each of these operations.

CREATE INDEX writes multiple WAL records.  First, we write a record to
cover the initialization of the metapage, followed by one for each new
bucket created, followed by one for the initial bitmap page.  It's not
important for index creation to appear atomic, because the index isn't yet
visible to any other transaction, and the creating transaction will roll
back in the event of a crash.  It would be difficult to cover the whole
operation with a single write-ahead log record anyway, because we can log
only a fixed number of pages, as given by XLR_MAX_BLOCK_ID (32), with the
current XLog machinery.

Ordinary item insertions (that don't force a page split or need a new
overflow page) are single WAL entries.  They touch a single bucket page
and the metapage.  The metapage is updated during replay just as it is
updated during the original operation.

If an insertion causes the addition of an overflow page, there will be one
WAL entry for the new overflow page and a second entry for the insert
itself.

If an insertion causes a bucket split, there will be one WAL entry for the
insert itself, followed by a WAL entry for allocating a new bucket,
followed by a WAL entry for each overflow bucket page in the new bucket to
which tuples are moved from the old bucket, followed by a WAL entry to
indicate that the split is complete for both old and new buckets.  A split
operation which requires overflow pages to complete will also need to
write a WAL record for each new allocation of an overflow page.

As splitting involves multiple atomic actions, it's possible that the
system crashes while tuples are being moved from the old bucket's pages to
the new bucket.  In such a case, after recovery, the old and new buckets
will be marked with the bucket-being-split and bucket-being-populated
flags respectively, which indicates that a split is in progress for those
buckets.  The reader algorithm works correctly, as it will scan both the
old and new buckets when a split is in progress, as explained in the
reader algorithm section above.

We finish the split at the next insert or split operation on the old
bucket, as explained in the insert and split algorithms above.  It could
be done during searches, too, but it seems best not to put any extra
updates in what would otherwise be a read-only operation (updating is not
possible in hot standby mode anyway).  It would seem natural to complete
the split in VACUUM, but since splitting a bucket might require allocating
a new page, it might fail if you run out of disk space.  That would be bad
during VACUUM --- the reason for running VACUUM in the first place might
be that you ran out of disk space, and now VACUUM won't finish because
you're out of disk space.  In contrast, an insertion can require enlarging
the physical file anyway.

Deletion of tuples from a bucket is performed for two reasons: to remove
dead tuples, and to remove tuples that were moved by a bucket split.  A
WAL entry is made for each bucket page from which tuples are removed, and
then another WAL entry is made when we clear the needs-split-cleanup flag.
If dead tuples are removed, a separate WAL entry is made to update the
metapage.

As deletion involves multiple atomic operations, it is quite possible that
the system crashes (a) after removing tuples from some of the bucket
pages, (b) before clearing the garbage flag, or (c) before updating the
metapage.  If the system crashes before completing (b), it will again try
to clean the bucket during the next vacuum or insert after recovery, which
can have some performance impact, but it will work fine.  If the system
crashes before completing (c), after recovery there could be some
additional splits until the next vacuum updates the metapage, but the
other operations like insert, delete and scan will work correctly.  We
could fix this problem by actually updating the metapage based on the
delete operation during replay, but it's not clear whether that's worth
the complication.

A squeeze operation moves tuples from pages later in the bucket's chain to
pages earlier in the chain, and writes a WAL record whenever either the
page it is writing tuples into fills up or the page it is removing tuples
from becomes empty.

As a squeeze operation involves multiple atomic operations, it is quite
possible that the system crashes before completing the operation on the
entire bucket.
After recovery, the operations will work correctly, but the index will
remain bloated, which can impact the performance of read and insert
operations until the next vacuum squeezes the bucket completely.


Other Notes
-----------

Cleanup locks prevent a split from occurring while *another* process is
stopped in a given bucket.  They also ensure that one of our *own*
backend's scans is not stopped in the bucket.