src/backend/access/hash/README

Hash Indexing
=============

This directory contains an implementation of hash indexing for Postgres.
Most of the core ideas are taken from Margo Seltzer and Ozan Yigit,
A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference,
January 1991.  (Our in-memory hashtable implementation,
src/backend/utils/hash/dynahash.c, also relies on some of the same concepts;
it is derived from code written by Esmond Pitt and later improved by Margo
among others.)

A hash index consists of two or more "buckets", into which tuples are
placed whenever their hash key maps to the bucket number.  The
key-to-bucket-number mapping is chosen so that the index can be
incrementally expanded.  When a new bucket is to be added to the index,
exactly one existing bucket will need to be "split", with some of its
tuples being transferred to the new bucket according to the updated
key-to-bucket-number mapping.  This is essentially the same hash table
management technique embodied in src/backend/utils/hash/dynahash.c for
in-memory hash tables.

Each bucket in the hash index comprises one or more index pages.  The
bucket's first page is permanently assigned to it when the bucket is
created.  Additional pages, called "overflow pages", are added if the
bucket receives too many tuples to fit in the primary bucket page.
The pages of a bucket are chained together in a doubly-linked list
using fields in the index page special space.

There is currently no provision to shrink a hash index, other than by
rebuilding it with REINDEX.  Overflow pages can be recycled for reuse
in other buckets, but we never give them back to the operating system.
There is no provision for reducing the number of buckets, either.

As of PostgreSQL 8.4, hash index entries store only the hash code, not the
actual data value, for each indexed item.  This makes the index entries
smaller (perhaps very substantially so) and speeds up various operations.
In particular, we can speed searches by keeping the index entries in any
one index page sorted by hash code, thus allowing binary search to be used
within an index page.  Note however that there is *no* assumption about the
relative ordering of hash codes across different index pages of a bucket.


Page Addressing
---------------

There are four kinds of pages in a hash index: the meta page (page zero),
which contains statically allocated control information; primary bucket
pages; overflow pages; and bitmap pages, which keep track of overflow
pages that have been freed and are available for re-use.  For addressing
purposes, bitmap pages are regarded as a subset of the overflow pages.

Primary bucket pages and overflow pages are allocated independently (since
any given index might need more or fewer overflow pages relative to its
number of buckets).  The hash code uses an interesting set of addressing
rules to support a variable number of overflow pages while not having to
move primary bucket pages around after they are created.

Primary bucket pages (henceforth just "bucket pages") are allocated in
power-of-2 groups, called "split points" in the code.  That means at every new
splitpoint we double the existing number of buckets.  Allocating huge chunks
of bucket pages all at once would not be optimal, since it could take a very
long time to consume them.  To avoid this exponential growth of index size, we
instead break up the allocation of buckets at a splitpoint into 4 equal
phases.  If (2 ^ x) is the total number of buckets to be allocated at a
splitpoint (from now on we shall call this a splitpoint group), then we
allocate 1/4th (2 ^ (x - 2)) of the total buckets in each phase of the
splitpoint group.  The next quarter is allocated only once the buckets of the
previous phase have been consumed.  For the initial splitpoint groups < 10 we
allocate all of their buckets in a single phase, since the number of buckets
allocated in those groups is small.  For groups >= 10 the allocation is
distributed among four equal phases.  At group 10 we allocate (2 ^ 9) buckets
in 4 phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, where the numbers in curly braces
indicate the number of buckets allocated within each phase of splitpoint group
10.  For splitpoint groups 11 and 12 the allocation phases are
{2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and {2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively.  We
can see that at each splitpoint group we double the total number of buckets
from the previous group, but incrementally, phase by phase.  The bucket pages
allocated within one phase of a splitpoint group appear consecutively in
the index.  This addressing scheme allows the physical location of a bucket
page to be computed from the bucket number relatively easily, using only a
small amount of control information.  In the function _hash_spareindex, for a
given bucket number we first compute the splitpoint group it belongs to and
then the phase within that group to which the bucket belongs.  Adding them
gives the global splitpoint phase number S to which the bucket belongs; we
then simply add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored
in the metapage) to the given bucket number to compute its physical address.
The hashm_spares[S] can be interpreted as the total number of overflow pages
that have been allocated before the bucket pages of splitpoint phase S.
hashm_spares[0] is always 0, so that buckets 0 and 1 always appear at block
numbers 1 and 2, just after the meta page.  We always have hashm_spares[N] <=
hashm_spares[N+1], since the latter count includes the former.  The difference
between the two represents the number of overflow pages appearing between the
bucket page groups of splitpoint phases N and N+1.  (Note: the above describes
what happens when filling an initially minimally sized hash index.  In
practice, we try to estimate the required index size and allocate a suitable
number of splitpoint phases immediately, to avoid expensive re-splitting
during the initial index build.)

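As an illustration of this addressing arithmetic, here is a standalone sketch
(simplified, and not the actual backend code -- the real logic lives in
_hash_spareindex() and the BUCKET_TO_BLKNO() macro) of how a bucket number can
be turned into a physical block number, given a spares[] array maintained as
described above:

    #include <stdint.h>

    #define GROUPS_WITH_ONE_PHASE  10   /* splitpoint groups below this use one phase */
    #define PHASE_BITS             2    /* later groups are divided into 4 phases */

    /* smallest n such that 2^n >= x, for x >= 1 */
    static uint32_t
    ceil_log2(uint32_t x)
    {
        uint32_t n = 0;

        while (((uint32_t) 1 << n) < x)
            n++;
        return n;
    }

    /*
     * Global splitpoint phase in which the given bucket (0-based) is
     * allocated: early groups consist of a single phase each, while
     * groups >= 10 are divided into four equal phases.
     */
    static uint32_t
    splitpoint_phase(uint32_t bucket)
    {
        uint32_t group = ceil_log2(bucket + 1);
        uint32_t phase;

        if (group < GROUPS_WITH_ONE_PHASE)
            return group;

        /* one phase per early group, then four per later group before ours */
        phase = GROUPS_WITH_ONE_PHASE +
            ((group - GROUPS_WITH_ONE_PHASE) << PHASE_BITS);

        /* which quarter of the current group holds this bucket? */
        phase += (bucket >> (group - (PHASE_BITS + 1))) & ((1 << PHASE_BITS) - 1);

        return phase;
    }

    /*
     * Physical block number of a bucket's primary page: one block for the
     * metapage, one block per lower-numbered bucket, plus the overflow
     * pages lying physically in front of this bucket's page group.  Since
     * spares[P] also counts overflow pages added after phase P's bucket
     * pages were created, the pages preceding this group are the ones
     * counted by spares[P - 1].
     */
    static uint32_t
    bucket_to_block(const uint32_t *spares, uint32_t bucket)
    {
        uint32_t phase = splitpoint_phase(bucket);
        uint32_t ovfl_before = (phase == 0) ? 0 : spares[phase - 1];

        return 1 + bucket + ovfl_before;
    }

For example, with no overflow pages at all (every spares[] entry zero), bucket
N simply lives at block N + 1.
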
When S splitpoints exist altogether, the array entries hashm_spares[0]
through hashm_spares[S] are valid; hashm_spares[S] records the current
total number of overflow pages.  New overflow pages are created as needed
at the end of the index, and recorded by incrementing hashm_spares[S].
When it is time to create a new splitpoint phase's worth of bucket pages, we
copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
stored in the hashm_ovflpoint field of the meta page).  This has the
effect of reserving the correct number of bucket pages at the end of the
index, and preparing to allocate additional overflow pages after those
bucket pages.  hashm_spares[] entries before S cannot change anymore,
since that would require moving already-created bucket pages.

The last page nominally used by the index is always determinable from
hashm_spares[S].  To avoid complaints from smgr, the logical EOF as seen by
the filesystem and smgr must always be greater than or equal to this page.
We have to allow the case "greater than" because it's possible that during
an index extension we crash after allocating filesystem space and before
updating the metapage.  Note that on filesystems that allow "holes" in
files, it's entirely likely that pages before the logical EOF are not yet
allocated: when we allocate a new splitpoint phase's worth of bucket pages, we
physically zero the last such page to force the EOF up, and the first such
page will be used immediately, but the intervening pages are not written
until needed.

Since overflow pages may be recycled if enough tuples are deleted from
their bucket, we need a way to keep track of currently-free overflow
pages.  The state of each overflow page (0 = available, 1 = not available)
is recorded in "bitmap" pages dedicated to this purpose.  The entries in
the bitmap are indexed by "bit number", a zero-based count in which every
overflow page has a unique entry.  We can convert between an overflow
page's physical block number and its bit number using the information in
hashm_spares[] (see hashovfl.c for details).  The bit number sequence
includes the bitmap pages, which is the reason for saying that bitmap
pages are a subset of the overflow pages.  It turns out in fact that each
bitmap page's first bit represents itself --- this is not an essential
property, but falls out of the fact that we only allocate another bitmap
page when we really need one.  Bit number zero always corresponds to the
first bitmap page, which is allocated during index creation just after all
the initially created buckets.

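To make the bit-number indexing concrete, here is a minimal sketch (purely
illustrative; the backend's real conversion macros live in hash.h and
hashovfl.c, and the bits-per-page figure there is derived from the page size)
of locating a given bit number within the sequence of bitmap pages:

    #include <stdint.h>

    #define BITS_PER_BITMAP_PAGE  (4096 * 8)    /* assumed value, for illustration only */

    typedef struct
    {
        uint32_t bitmap_page_index;     /* which bitmap page, in order of creation */
        uint32_t bit_within_page;       /* which bit inside that page's bitmap */
    } BitAddress;

    static BitAddress
    locate_overflow_bit(uint32_t bitno)
    {
        BitAddress addr;

        addr.bitmap_page_index = bitno / BITS_PER_BITMAP_PAGE;
        addr.bit_within_page = bitno % BITS_PER_BITMAP_PAGE;
        return addr;
    }

Translating a position in the bit number sequence back into a physical block
number again requires hashm_spares[], since bitmap pages and ordinary overflow
pages are interleaved in that same sequence.
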

Lock Definitions
----------------

Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks.  Here as elsewhere in PostgreSQL,
cleanup lock means that we hold an exclusive lock on the buffer and have
observed at some point after acquiring the lock that we hold the only pin
on that buffer.  For hash indexes, a cleanup lock on a primary bucket page
represents the right to perform an arbitrary reorganization of the entire
bucket.  Therefore, scans retain a pin on the primary bucket page for the
bucket they are currently scanning.  Splitting a bucket requires a cleanup
lock on both the old and new primary bucket pages.  VACUUM therefore takes
a cleanup lock on every bucket page in order to remove tuples.  It can also
remove tuples copied to a new bucket by any previous split operation, because
the cleanup lock taken on the primary bucket page guarantees that no scans
which started prior to the most recent split can still be in progress.  After
cleaning each page individually, it attempts to take a cleanup lock on the
primary bucket page in order to "squeeze" the bucket down to the minimum
possible number of pages.

To avoid deadlocks, we must be consistent about the lock order in which we
lock the buckets for operations that require locks on two different buckets.
We choose to always lock the lower-numbered bucket first.  The metapage is
only ever locked after all bucket locks have been taken.

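A trivial sketch of that ordering rule (illustrative only; lock_bucket() and
lock_metapage() are hypothetical stand-ins for taking the relevant buffer
content locks):

    #include <stdio.h>
    #include <stdint.h>

    static void lock_bucket(uint32_t bucket) { printf("lock bucket %u\n", bucket); }
    static void lock_metapage(void)          { printf("lock metapage\n"); }

    /* lock two distinct buckets for an operation that needs both (a != b) */
    static void
    lock_two_buckets(uint32_t a, uint32_t b)
    {
        /* always take the lower-numbered bucket first */
        lock_bucket(a < b ? a : b);
        lock_bucket(a < b ? b : a);

        /* the metapage is locked only after all bucket locks are held */
        lock_metapage();
    }

    int
    main(void)
    {
        lock_two_buckets(5, 3);     /* locks bucket 3, then bucket 5, then the metapage */
        return 0;
    }
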

Metapage Caching
----------------

Both scanning the index and inserting tuples require locating the bucket
where a given tuple ought to be located.  To do this, we need the bucket
count, highmask, and lowmask from the metapage; however, it's undesirable
for performance reasons to have to lock and pin the metapage for every such
operation.  Instead, we retain a cached copy of the metapage in each
backend's relcache entry.  This will produce the correct bucket mapping as
long as the target bucket hasn't been split since the last cache refresh.

To guard against the possibility that such a split has occurred, the
primary page of each bucket chain stores the number of buckets that
existed as of the time the bucket was last split, or if never split as
of the time it was created, in the space normally used for the
previous block number (that is, hasho_prevblkno).  This doesn't cost
anything because the primary bucket page is always the first page in
the chain, and the previous block number is therefore always, in
reality, InvalidBlockNumber.

After computing the ostensibly-correct bucket number based on our cached
copy of the metapage, we lock the corresponding primary bucket page and
check whether the bucket count stored in hasho_prevblkno is greater than
the number of buckets stored in our cached copy of the metapage.  If so,
the bucket has certainly been split, because the count must originally
have been less than the number of buckets that existed at that time and
can't have increased except due to a split.  If not, the bucket can't have
been split, because a split would have created a new bucket with a higher
bucket number than any we'd seen previously.  In the latter case, we've
locked the correct bucket and can proceed; in the former case, we must
release the lock on this bucket, lock the metapage, update our cache,
unlock the metapage, and retry.

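The two ingredients of that check -- the classic linear-hashing key-to-bucket
mapping computed from the cached masks, and the staleness test against
hasho_prevblkno -- can be sketched as follows (type and field names here are
illustrative; the backend's versions are _hash_hashkey2bucket() and the cache
validation logic in _hash_getbucketbuf_from_hashkey()):

    #include <stdint.h>
    #include <stdbool.h>

    /* cached copy of the metapage fields we need */
    typedef struct
    {
        uint32_t maxbucket;     /* highest bucket number we believe exists */
        uint32_t highmask;      /* mask spanning the next power-of-2 table size */
        uint32_t lowmask;       /* mask spanning half that size */
    } CachedMeta;

    /* map a hash code to a bucket number using the cached fields */
    static uint32_t
    hashkey_to_bucket(uint32_t hashkey, const CachedMeta *m)
    {
        uint32_t bucket = hashkey & m->highmask;

        if (bucket > m->maxbucket)
            bucket = bucket & m->lowmask;
        return bucket;
    }

    /*
     * After locking the chosen bucket's primary page: the cache is still
     * usable if the value recorded in hasho_prevblkno does not exceed our
     * cached bucket count; otherwise we must refresh the cache and retry.
     */
    static bool
    cached_meta_still_valid(uint32_t prevblkno_value, const CachedMeta *m)
    {
        return prevblkno_value <= m->maxbucket;
    }
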
Needing to retry occasionally might seem expensive, but the number of times
any given bucket can be split is limited to a few dozen no matter how
many times the hash index is accessed, because the total number of
buckets is limited to less than 2^32.  On the other hand, the number of
times we access a bucket is unbounded and will be several orders of
magnitude larger even in unsympathetic cases.

(The metapage cache is new in v10.  Older hash indexes had the primary
bucket page's hasho_prevblkno initialized to InvalidBlockNumber.)

Pseudocode Algorithms
---------------------

Various flags that are used in hash index operations are described below:

The bucket-being-split and bucket-being-populated flags indicate that a split
operation is in progress for a bucket.  During a split, the bucket-being-split
flag is set on the old bucket and the bucket-being-populated flag is set on
the new bucket.  These flags are cleared once the split operation is finished.

The split-cleanup flag indicates that a bucket which has been recently split
still contains tuples that were also copied to the new bucket; it essentially
marks the split as incomplete.  Once we're certain that no scans which
started before the new bucket was fully populated are still in progress, we
can remove the copies from the old bucket and clear the flag.  We insist that
this flag must be clear before splitting a bucket; thus, a bucket can't be
split again until the previous split is totally complete.

The moved-by-split flag on a tuple indicates that the tuple was moved from the
old to the new bucket.  Concurrent scans skip such tuples until the split
operation is finished.  Once a tuple is marked as moved-by-split, it remains
so forever, but that does no harm.  We intentionally do not clear the flag,
since clearing it would generate additional I/O that is not necessary.

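For concreteness, the per-bucket flags can be pictured as bits in the bucket
page's special space, and the moved-by-split flag as a bit on the index tuple
itself.  The sketch below is illustrative only -- the real flag bits
(LH_BUCKET_BEING_SPLIT and friends, plus the reserved moved-by-split bit on
index tuples) are defined in src/include/access/hash.h with different values:

    #include <stdint.h>
    #include <stdbool.h>

    #define BUCKET_BEING_POPULATED      (1 << 0)    /* new bucket of an in-flight split */
    #define BUCKET_BEING_SPLIT          (1 << 1)    /* old bucket of an in-flight split */
    #define BUCKET_NEEDS_SPLIT_CLEANUP  (1 << 2)    /* old bucket still holds copied tuples */

    /* a split is in progress for the bucket if either split flag is set */
    static bool
    split_in_progress(uint16_t bucket_flags)
    {
        return (bucket_flags & (BUCKET_BEING_POPULATED | BUCKET_BEING_SPLIT)) != 0;
    }

    /* a bucket may be split again only once the previous split is fully cleaned up */
    static bool
    can_start_new_split(uint16_t old_bucket_flags)
    {
        return (old_bucket_flags &
                (BUCKET_BEING_SPLIT | BUCKET_NEEDS_SPLIT_CLEANUP)) == 0;
    }
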
The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the same
bucket); insertion of a new tuple into the correct bucket; enlarging the
hash table by splitting an existing bucket; and garbage collection
(deletion of dead tuples and compaction of buckets).  Bucket splitting is
done at the conclusion of any insertion that leaves the hash table more full
than the target load factor, but it is convenient to consider it as an
independent operation.  Note that we do not have a bucket-merge operation
--- the number of buckets never shrinks.  Insertion, splitting, and
garbage collection may all need access to freelist management, which keeps
track of available overflow pages.

The reader algorithm is:

    lock the primary bucket page of the target bucket
	if the target bucket is still being populated by a split:
		release the buffer content lock on current bucket page
		pin and acquire the buffer content lock on old bucket in shared mode
		release the buffer content lock on old bucket, but not pin
		retake the buffer content lock on new bucket
		arrange to scan the old bucket normally and the new bucket for
         tuples which are not moved-by-split
-- then, per read request:
	reacquire content lock on current page
	step to next page if necessary (no chaining of content locks, but keep
     the pin on the primary bucket throughout the scan; we also maintain
     a pin on the page currently being scanned)
	get tuple
	release content lock
-- at scan shutdown:
	release all pins still held

Holding the buffer pin on the primary bucket page for the whole scan prevents
the reader's current-tuple pointer from being invalidated by splits or
compactions.  (Of course, other buckets can still be split or compacted.)

To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
their current scan position after re-acquiring the buffer content lock on
the page.  Since deletion is not possible while a reader holds the pin on
the bucket, and we assume that heap tuple TIDs are unique, this can be
implemented by searching for the same heap tuple TID previously returned.
Insertion does not move index entries across pages, so the previously-returned
index entry should always be on the same page, at the same or higher offset
number, as it was before.

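A sketch of that re-find step (types and page representation are illustrative,
not the backend's): starting from the remembered offset, walk forward on the
same page until the previously-returned heap TID reappears.

    #include <stdint.h>

    typedef struct
    {
        uint32_t block;         /* heap block number */
        uint16_t offset;        /* line pointer number within the heap block */
    } HeapTid;

    /*
     * page_tids[] holds, in offset order, the heap TIDs of the index
     * entries currently on the page.  Concurrent insertions can only push
     * our previously-returned entry to a higher offset on the same page,
     * so a forward walk from the remembered offset is sufficient.
     */
    static int
    refind_scan_position(const HeapTid *page_tids, int nentries,
                         int remembered_offset, HeapTid prev_returned)
    {
        for (int off = remembered_offset; off < nentries; off++)
        {
            if (page_tids[off].block == prev_returned.block &&
                page_tids[off].offset == prev_returned.offset)
                return off;
        }
        return -1;              /* not expected, given the rules above */
    }
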
To allow for scans during a bucket split, if, at the start of the scan, the
bucket is marked as bucket-being-populated, the scan visits all the tuples in
that bucket except those marked as moved-by-split.  Once it finishes scanning
all the tuples in the current bucket, it scans the old bucket from which this
bucket was formed by the split.

The insertion algorithm is rather similar:

    lock the primary bucket page of the target bucket
-- (so far same as reader, except for acquisition of buffer content lock in
	exclusive mode on primary bucket page)
	if the bucket-being-split flag is set for a bucket and pin count on it is
	 one, then finish the split
		release the buffer content lock on current bucket
		get the "new" bucket which was being populated by the split
		scan the new bucket and form the hash table of TIDs
		conditionally get the cleanup lock on old and new buckets
		if we get the lock on both the buckets
			finish the split using algorithm mentioned below for split
		release the pin on old bucket and restart the insert from beginning.
	if current page is full, first check if this page contains any dead tuples.
	if yes, remove dead tuples from the current page and again check for the
	availability of the space. If enough space found, insert the tuple else
	release lock but not pin, read/exclusive-lock
     next page; repeat as needed
	>> see below if no space in any page of bucket
	take buffer content lock in exclusive mode on metapage
	insert tuple at appropriate place in page
	mark current page dirty
	increment tuple count, decide if split needed
	mark meta page dirty
	write WAL for insertion of tuple
	release the buffer content lock on metapage
	release buffer content lock on current page
	if current page is not a bucket page, release the pin on bucket page
	if split is needed, enter Split algorithm below
	release the pin on metapage

To speed searches, the index entries within any individual index page are
kept sorted by hash code; the insertion code must take care to insert new
entries in the right place.  It is okay for an insertion to take place in a
bucket that is being actively scanned, because readers can cope with this
as explained above.  We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.

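The per-page ordering makes both lookup and insertion a lower-bound binary
search over the page's hash codes.  A minimal sketch (the backend's equivalent
helper is _hash_binsearch() in hashutil.c; the flat array of codes below is
purely illustrative):

    #include <stdint.h>

    /*
     * Return the first offset whose hash code is >= 'want'.  That offset is
     * both where a scan for 'want' should start and where a new entry with
     * that hash code can be inserted to keep the page sorted.
     */
    static int
    hash_lower_bound(const uint32_t *codes, int nitems, uint32_t want)
    {
        int lo = 0;
        int hi = nitems;            /* search the half-open range [lo, hi) */

        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;

            if (codes[mid] < want)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }
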
To avoid deadlock between readers and inserters, whenever there is a need
to lock multiple buckets, we always take them in the order suggested in Lock
Definitions above.  This algorithm allows them a very high degree of
concurrency.  (The exclusive metapage lock taken to update the tuple count
is stronger than necessary, since readers do not care about the tuple count,
but the lock is held for such a short time that this is probably not an
issue.)

When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
Details of that part of the algorithm appear later.

The page split algorithm is entered whenever an inserter observes that the
index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:

    pin meta page and take buffer content lock in exclusive mode
    check split still needed
    if split not needed anymore, drop buffer content lock and pin and exit
    decide which bucket to split
    try to take a cleanup lock on that bucket; if fail, give up
    if that bucket is still being split or has split-cleanup work:
       try to finish the split and the cleanup work
       if that succeeds, start over; if it fails, give up
	mark the old and new buckets indicating split is in progress
	mark both old and new buckets as dirty
	write WAL for allocation of new page for split
	copy the tuples that belong to the new bucket from the old bucket, marking
     them as moved-by-split
	write WAL record for moving tuples to new page once the new page is full
	or all the pages of old bucket are finished
	release lock but not pin for primary bucket page of old bucket,
	 read/shared-lock next page; repeat as needed
	clear the bucket-being-split and bucket-being-populated flags
	mark the old bucket indicating split-cleanup
	write WAL for changing the flags on both old and new buckets

The split operation's attempt to acquire a cleanup lock on the old bucket
could fail if another process holds any lock or pin on it.  We do not want to
wait if that happens, because we don't want to wait while holding the metapage
exclusive-lock.  So, this is a conditional LWLockAcquire operation, and if
it fails we just abandon the attempt to split.  This is all right since the
index is overfull but perfectly functional.  Every subsequent inserter will
try to split, and eventually one will succeed.  If multiple inserters failed
to split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop.  (We could make a successful
splitter loop to see if the index is still overfull, but it seems better to
distribute the split overhead across successive insertions.)

If a split fails partway through (e.g. due to insufficient disk space or an
interrupt), the index will not be corrupted.  Instead, we'll retry the split
every time a tuple is inserted into the old bucket prior to inserting the new
tuple; eventually, we should succeed.  The fact that a split is left
unfinished doesn't prevent subsequent buckets from being split, but we won't
try to split the bucket again until the prior split is finished.  In other
words, a bucket can be in the middle of being split for some time, but it can't
be in the middle of two splits at the same time.

The fourth operation is garbage collection (bulk deletion):

	next bucket := 0
	pin metapage and take buffer content lock in exclusive mode
	fetch current max bucket number
	release meta page buffer content lock and pin
	while next bucket <= max bucket do
		acquire cleanup lock on primary bucket page
		loop:
			scan and remove tuples
			mark the target page dirty
			write WAL for deleting tuples from target page
			if this is the last bucket page, break out of loop
			pin and x-lock next page
			release prior lock and pin (except keep pin on primary bucket page)
		if the page we have locked is not the primary bucket page:
			release lock and take exclusive lock on primary bucket page
		if there are no other pins on the primary bucket page:
			squeeze the bucket to remove free space
		release the pin on primary bucket page
		next bucket ++
	end loop
	pin metapage and take buffer content lock in exclusive mode
	check if number of buckets changed
	if so, release content lock and pin and return to for-each-bucket loop
	else update metapage tuple count
	mark meta page dirty and write WAL for update of metapage
	release buffer content lock and pin

Note that this is designed to allow concurrent splits and scans.  If a split
occurs, tuples relocated into the new bucket will be visited twice by the
scan, but that does no harm.  Because we release the lock on a bucket page
during the cleanup scan of a bucket, a concurrent scan can start in that
bucket, but it will always remain behind the cleanup.  Scans must be kept
behind cleanup; otherwise, vacuum could move index entries that the scan
still needs to positions the scan has already passed, and since a scan that
returns multiple tuples from the same bucket page always expects the next
valid TID to be greater than or equal to the current one, it might miss those
tuples.  This holds true for backward scans as well (backward scans first
traverse each bucket starting from the first bucket page to the last overflow
page in the chain).  We must be careful about the statistics reported by the
VACUUM operation.  What we can do is count the number of tuples scanned, and
believe this in preference to the stored tuple count if the stored tuple
count and number of buckets did *not* change at any time during the scan.
This provides a way of correcting the stored tuple count if it gets out of
sync for some reason.  But if a split or insertion does occur concurrently,
the scan count is untrustworthy; instead, subtract the number of tuples
deleted from the stored tuple count and use that.


Free Space Management
---------------------

(Question: why is this so complicated?  Why not just have a linked list
of free pages with the list head in the metapage?  It's not like we
avoid needing to modify the metapage with all this.)

Free space management consists of two sub-algorithms, one for reserving
an overflow page to add to a bucket chain, and one for returning an empty
overflow page to the free pool.

Obtaining an overflow page:

	take metapage content lock in exclusive mode
	determine next bitmap page number; if none, exit loop
	release meta page content lock
	pin bitmap page and take content lock in exclusive mode
	search for a free page (zero bit in bitmap)
	if found:
		set bit in bitmap
		mark bitmap page dirty
		take metapage buffer content lock in exclusive mode
		if first-free-bit value did not change,
			update it and mark meta page dirty
	else (not found):
	release bitmap page buffer content lock
	loop back to try next bitmap page, if any
-- here when we have checked all bitmap pages; we hold meta excl. lock
	extend index to add another overflow page; update meta information
	mark meta page dirty
	return page number

It is slightly annoying to release and reacquire the metapage lock
multiple times, but it seems best to do it that way to minimize loss of
concurrency against processes just entering the index.  We don't want
to hold the metapage exclusive lock while reading in a bitmap page.
(We can at least avoid repeated buffer pin/unpin here.)

The normal path for extending the index does not require doing I/O while
holding the metapage lock.  We do have to do I/O when the extension
requires adding a new bitmap page as well as the required overflow page
... but that is an infrequent case, so the loss of concurrency seems
acceptable.

The portion of tuple insertion that calls the above subroutine looks
like this:

	-- having determined that no space is free in the target bucket:
	remember last page of bucket, drop write lock on it
	re-write-lock last page of bucket
	if it is not last anymore, step to the last page
	execute free-page-acquire (obtaining an overflow page) mechanism
      described above
	update (former) last page to point to the new page and mark buffer dirty
	write-lock and initialize new page, with back link to former last page
	write WAL for addition of overflow page
	release the locks on meta page and bitmap page acquired in
      free-page-acquire algorithm
	release the lock on former last page
	release the lock on new overflow page
	insert tuple into new page
	-- etc.

Notice this handles the case where two concurrent inserters try to extend
the same bucket.  They will end up with a valid, though perhaps
space-inefficient, configuration: two overflow pages will be added to the
bucket, each containing one tuple.

The last part of this violates the rule about holding write lock on two
pages concurrently, but it should be okay to write-lock the previously
free page; there can be no other process holding lock on it.

Bucket splitting uses a similar algorithm if it has to extend the new
bucket, but it need not worry about concurrent extension since it has
buffer content lock in exclusive mode on the new bucket.

Freeing an overflow page requires the process to hold buffer content lock in
exclusive mode on the containing bucket, so it need not worry about other
accessors of pages in the bucket.  The algorithm is:

	delink overflow page from bucket chain
	(this requires read/update/write/release of fore and aft siblings)
	pin meta page and take buffer content lock in shared mode
	determine which bitmap page contains the free space bit for page
	release meta page buffer content lock
	pin bitmap page and take buffer content lock in exclusive mode
	retake meta page buffer content lock in exclusive mode
	move (insert) tuples that belong to the overflow page being freed
	update bitmap bit
	mark bitmap page dirty
	if page number is still less than first-free-bit,
		update first-free-bit field and mark meta page dirty
	write WAL for delinking overflow page operation
	release buffer content lock and pin
	release meta page buffer content lock and pin

We have to do it this way because we must clear the bitmap bit before
changing the first-free-bit field (hashm_firstfree).  It is possible that
we set first-free-bit too small (because someone has already reused the
page we just freed), but that is okay; the only cost is the next overflow
page acquirer will scan more bitmap bits than he needs to.  What must be
avoided is having first-free-bit greater than the actual first free bit,
because then that free page would never be found by searchers.

The reason for moving tuples out of the overflow page at the same time as
delinking it is to make the whole thing a single atomic operation.  Not doing
so could lead to spurious reads on a standby; basically, a user might see the
same tuple twice.


WAL Considerations
------------------

The hash index operations like create index, insert, delete, bucket split,
allocate overflow page, and squeeze in themselves don't guarantee hash index
consistency after a crash.  To provide robustness, we write WAL for each of
these operations.

CREATE INDEX writes multiple WAL records.  First, we write a record to cover
the initialization of the metapage, followed by one for each new bucket
created, followed by one for the initial bitmap page.  It's not important for
index creation to appear atomic, because the index isn't yet visible to any
other transaction, and the creating transaction will roll back in the event of
a crash.  It would be difficult to cover the whole operation with a single
write-ahead log record anyway, because we can log only a fixed number of
pages, as given by XLR_MAX_BLOCK_ID (32), with current XLog machinery.

Ordinary item insertions (that don't force a page split or need a new overflow
page) are single WAL entries.  They touch a single bucket page and the
metapage.  The metapage is updated during replay as it is updated during
the original operation.

If an insertion causes the addition of an overflow page, there will be one
WAL entry for the new overflow page and a second entry for the insert itself.

If an insertion causes a bucket split, there will be one WAL entry for the
insert itself, followed by a WAL entry for allocating a new bucket, followed
by a WAL entry for each overflow bucket page in the new bucket to which
tuples are moved from the old bucket, followed by a WAL entry to indicate
that the split is complete for both old and new buckets.  A split operation
which requires overflow pages to complete the operation will need to write a
WAL record for each new allocation of an overflow page.

As splitting involves multiple atomic actions, it's possible that the system
crashes while tuples are being moved from pages of the old bucket to the new
bucket.  In such a case, after recovery, the old and new buckets will be
marked with the bucket-being-split and bucket-being-populated flags
respectively, which indicates that a split is in progress for those buckets.
The reader algorithm works correctly, as it will scan both the old and new
buckets when the split is in progress, as explained in the reader algorithm
section above.

We finish the split at the next insert or split operation on the old bucket,
as explained in the insert and split algorithms above.  It could be done
during searches, too, but it seems best not to put any extra updates in what
would otherwise be a read-only operation (updating is not possible in hot
standby mode anyway).  It would seem natural to complete the split in VACUUM,
but since splitting a bucket might require allocating a new page, it might
fail if you run out of disk space.  That would be bad during VACUUM - the
reason for running VACUUM in the first place might be that you ran out of
disk space, and now VACUUM won't finish because you're out of disk space.
In contrast, an insertion can require enlarging the physical file anyway.

Deletion of tuples from a bucket is performed for two reasons: to remove dead
tuples, and to remove tuples that were moved by a bucket split.  A WAL entry
is made for each bucket page from which tuples are removed, and then another
WAL entry is made when we clear the needs-split-cleanup flag.  If dead tuples
are removed, a separate WAL entry is made to update the metapage.

As deletion involves multiple atomic operations, it is quite possible that
the system crashes (a) after removing tuples from some of the bucket pages,
(b) before clearing the garbage flag, or (c) before updating the metapage.
If the system crashes before completing (b), it will again try to clean the
bucket during the next vacuum or insert after recovery, which can have some
performance impact, but it will work fine.  If the system crashes before
completing (c), after recovery there could be some additional splits until
the next vacuum updates the metapage, but the other operations like insert,
delete and scan will work correctly.  We can fix this problem by actually
updating the metapage based on the delete operation during replay, but it's
not clear whether it's worth the complication.

A squeeze operation moves tuples from pages later in a bucket's chain to
pages earlier in the chain, and writes a WAL record when either the page to
which it is writing tuples becomes full or the page from which it is removing
tuples becomes empty.

As a squeeze operation involves multiple atomic operations, it is quite
possible that the system crashes before completing the operation on the
entire bucket.  After recovery, the operations will work correctly, but the
index will remain bloated, and this can impact the performance of read and
insert operations until the next vacuum squeezes the bucket completely.


Other Notes
-----------

Cleanup locks prevent a split from occurring while *another* process is
stopped in a given bucket.  They also ensure that one of our *own* backend's
scans is not stopped in the bucket.
633