src/backend/access/hash/README

Hash Indexing
=============

This directory contains an implementation of hash indexing for Postgres.
Most of the core ideas are taken from Margo Seltzer and Ozan Yigit,
A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference,
January 1991.  (Our in-memory hashtable implementation,
src/backend/utils/hash/dynahash.c, also relies on some of the same concepts;
it is derived from code written by Esmond Pitt and later improved by Margo
among others.)

A hash index consists of two or more "buckets", into which tuples are
placed whenever their hash key maps to the bucket number.  The
key-to-bucket-number mapping is chosen so that the index can be
incrementally expanded.  When a new bucket is to be added to the index,
exactly one existing bucket will need to be "split", with some of its
tuples being transferred to the new bucket according to the updated
key-to-bucket-number mapping.  This is essentially the same hash table
management technique embodied in src/backend/utils/hash/dynahash.c for
in-memory hash tables.
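
As a concrete illustration, the key-to-bucket-number mapping can be sketched
in C roughly as follows.  This mirrors the logic of _hash_hashkey2bucket()
in hashutil.c, but uses standalone types and names, so treat it as a sketch
rather than the actual backend code.  Here maxbucket is the highest bucket
number currently in use, and highmask/lowmask are all-ones bit masks
bracketing it:

    #include <stdint.h>

    /*
     * Map a hash value to a bucket number.  Masking with highmask may yield
     * a bucket that has not been created yet; in that case the tuple still
     * belongs in the "parent" bucket obtained by masking with lowmask,
     * because that bucket has not yet been split.
     */
    static uint32_t
    hash_key_to_bucket(uint32_t hashkey, uint32_t maxbucket,
                       uint32_t highmask, uint32_t lowmask)
    {
        uint32_t    bucket = hashkey & highmask;

        if (bucket > maxbucket)
            bucket = bucket & lowmask;

        return bucket;
    }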

Each bucket in the hash index comprises one or more index pages.  The
bucket's first page is permanently assigned to it when the bucket is
created.  Additional pages, called "overflow pages", are added if the
bucket receives too many tuples to fit in the primary bucket page.
The pages of a bucket are chained together in a doubly-linked list
using fields in the index page special space.

There is currently no provision to shrink a hash index, other than by
rebuilding it with REINDEX.  Overflow pages can be recycled for reuse
in other buckets, but we never give them back to the operating system.
There is no provision for reducing the number of buckets, either.

As of PostgreSQL 8.4, hash index entries store only the hash code, not the
actual data value, for each indexed item.  This makes the index entries
smaller (perhaps very substantially so) and speeds up various operations.
In particular, we can speed searches by keeping the index entries in any
one index page sorted by hash code, thus allowing binary search to be used
within an index page.  Note however that there is *no* assumption about the
relative ordering of hash codes across different index pages of a bucket.


Page Addressing
---------------

There are four kinds of pages in a hash index: the meta page (page zero),
which contains statically allocated control information; primary bucket
pages; overflow pages; and bitmap pages, which keep track of overflow
pages that have been freed and are available for re-use.  For addressing
purposes, bitmap pages are regarded as a subset of the overflow pages.

Primary bucket pages and overflow pages are allocated independently (since
any given index might need more or fewer overflow pages relative to its
number of buckets).  The hash code uses an interesting set of addressing
rules to support a variable number of overflow pages while not having to
move primary bucket pages around after they are created.

Primary bucket pages (henceforth just "bucket pages") are allocated in
power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
are created when the index is initialized.  At the first split, buckets 2
and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
pages of a power-of-2 group appear consecutively in the index.  This
addressing scheme allows the physical location of a bucket page to be
computed from the bucket number relatively easily, using only a small
amount of control information.  We take the log2() of the bucket number
to determine which split point S the bucket belongs to, and then simply
add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
metapage) to the bucket number to compute the physical block number.
hashm_spares[S] can be interpreted as the total number of overflow pages
that have been allocated before the bucket pages of splitpoint S.
hashm_spares[0] is always 0, so that buckets 0 and 1 (which belong to
splitpoint 0) always appear at block numbers 1 and 2, just after the meta
page.  We always have hashm_spares[N] <= hashm_spares[N+1], since the
latter count includes the former.  The difference between the two
represents the number of overflow pages appearing between the bucket page
groups of splitpoints N and N+1.
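
Written out in C, the address computation might look like the sketch below
(compare the BUCKET_TO_BLKNO() macro in the hash AM headers; the function
and variable names here are illustrative).  The splitpoint of bucket B is 0
for buckets 0 and 1, and floor(log2(B)) otherwise:

    #include <stdint.h>

    /*
     * Compute the physical block number of a bucket's primary page, given
     * the metapage's hashm_spares[] array.  Block 0 is the metapage, so
     * buckets 0 and 1 land on blocks 1 and 2 when no overflow pages exist.
     */
    static uint32_t
    bucket_to_block(uint32_t bucket, const uint32_t *hashm_spares)
    {
        uint32_t    splitpoint = 0;

        /* splitpoint = floor(log2(bucket)) for bucket >= 2, else 0 */
        while ((((uint32_t) 1) << (splitpoint + 1)) <= bucket)
            splitpoint++;

        return bucket + hashm_spares[splitpoint] + 1;
    }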

(Note: the above describes what happens when filling an initially minimally
sized hash index.  In practice, we try to estimate the required index size
and allocate a suitable number of splitpoints immediately, to avoid
expensive re-splitting during initial index build.)

When S splitpoints exist altogether, the array entries hashm_spares[0]
through hashm_spares[S] are valid; hashm_spares[S] records the current
total number of overflow pages.  New overflow pages are created as needed
at the end of the index, and recorded by incrementing hashm_spares[S].
When it is time to create a new splitpoint's worth of bucket pages, we
copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
stored in the hashm_ovflpoint field of the meta page).  This has the
effect of reserving the correct number of bucket pages at the end of the
index, and preparing to allocate additional overflow pages after those
bucket pages.  hashm_spares[] entries before S cannot change anymore,
since that would require moving already-created bucket pages.
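
In code form, starting a new splitpoint amounts to the following sketch
(field names follow the metapage fields described above; the surrounding
locking and page allocation are omitted):

    #include <stdint.h>

    /*
     * Begin a new splitpoint: copying the spare count forward reserves the
     * next group of bucket pages at the current end of the index, and any
     * overflow pages allocated from now on are counted in the new entry.
     */
    static void
    start_new_splitpoint(uint32_t *hashm_spares, uint32_t *hashm_ovflpoint)
    {
        uint32_t    s = *hashm_ovflpoint;

        hashm_spares[s + 1] = hashm_spares[s];
        *hashm_ovflpoint = s + 1;
    }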

The last page nominally used by the index is always determinable from
hashm_spares[S].  To avoid complaints from smgr, the logical EOF as seen by
the filesystem and smgr must always be greater than or equal to this page.
We have to allow the case "greater than" because it's possible that during
an index extension we crash after allocating filesystem space and before
updating the metapage.  Note that on filesystems that allow "holes" in
files, it's entirely likely that pages before the logical EOF are not yet
allocated: when we allocate a new splitpoint's worth of bucket pages, we
physically zero the last such page to force the EOF up, and the first such
page will be used immediately, but the intervening pages are not written
until needed.

Since overflow pages may be recycled if enough tuples are deleted from
their bucket, we need a way to keep track of currently-free overflow
pages.  The state of each overflow page (0 = available, 1 = not available)
is recorded in "bitmap" pages dedicated to this purpose.  The entries in
the bitmap are indexed by "bit number", a zero-based count in which every
overflow page has a unique entry.  We can convert between an overflow
page's physical block number and its bit number using the information in
hashm_spares[] (see hashovfl.c for details).  The bit number sequence
includes the bitmap pages, which is the reason for saying that bitmap
pages are a subset of the overflow pages.  It turns out in fact that each
bitmap page's first bit represents itself --- this is not an essential
property, but falls out of the fact that we only allocate another bitmap
page when we really need one.  Bit number zero always corresponds to the
first bitmap page, which is allocated during index creation just after all
the initially created buckets.
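
Locating the bitmap bit for a given bit number is a simple division, as in
the sketch below.  The structure and names are illustrative only; the real
conversions (which also map bit numbers to and from physical block numbers
using hashm_spares[]) are in hashovfl.c:

    #include <stdint.h>

    /*
     * Find which bitmap page, and which bit within it, records the state of
     * a given overflow-page bit number.  bits_per_page is the number of bits
     * that fit in one bitmap page; the metapage's hashm_mapp[] array gives
     * the block number of each bitmap page.
     */
    typedef struct
    {
        uint32_t    bitmap_page_index;  /* index into hashm_mapp[] */
        uint32_t    bit_within_page;    /* bit position inside that page */
    } OvflBitAddr;

    static OvflBitAddr
    ovfl_bit_address(uint32_t bitno, uint32_t bits_per_page)
    {
        OvflBitAddr addr;

        addr.bitmap_page_index = bitno / bits_per_page;
        addr.bit_within_page = bitno % bits_per_page;
        return addr;
    }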


Lock Definitions
----------------

We use both lmgr locks ("heavyweight" locks) and buffer context locks
(LWLocks) to control access to a hash index.  lmgr locks are needed for
long-term locking since there is a (small) risk of deadlock, which we must
be able to detect.  Buffer context locks are used for short-term access
control to individual pages of the index.

LockPage(rel, page), where page is the page number of a hash bucket page,
represents the right to split or compact an individual bucket.  A process
splitting a bucket must exclusive-lock both old and new halves of the
bucket until it is done.  A process doing VACUUM must exclusive-lock the
bucket it is currently purging tuples from.  Processes doing scans or
insertions must share-lock the bucket they are scanning or inserting into.
(It is okay to allow concurrent scans and insertions.)

The lmgr lock IDs corresponding to overflow pages are currently unused.
These are available for possible future refinements.  LockPage(rel, 0)
is also currently undefined (it was previously used to represent the right
to modify the hash-code-to-bucket mapping, but it is no longer needed for
that purpose).

Note that these lock definitions are conceptually distinct from any sort
of lock on the pages whose numbers they share.  A process must also obtain
read or write buffer lock on the metapage or bucket page before accessing
said page.

Processes performing hash index scans must hold share lock on the bucket
they are scanning throughout the scan.  This seems to be essential, since
there is no reasonable way for a scan to cope with its bucket being split
underneath it.  This creates a possibility of deadlock external to the
hash index code, since a process holding one of these locks could block
waiting for an unrelated lock held by another process.  If that process
then does something that requires exclusive lock on the bucket, we have
deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
can be detected and recovered from.

Processes must obtain read (share) buffer context lock on any hash index
page while reading it, and write (exclusive) lock while modifying it.
To prevent deadlock we enforce these coding rules: no buffer lock may be
held long term (across index AM calls), nor may any buffer lock be held
while waiting for an lmgr lock, nor may more than one buffer lock
be held at a time by any one process.  (The third restriction is probably
stronger than necessary, but it makes the proof of no deadlock obvious.)


Pseudocode Algorithms
---------------------

The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the same
bucket); insertion of a new tuple into the correct bucket; enlarging the
hash table by splitting an existing bucket; and garbage collection
(deletion of dead tuples and compaction of buckets).  Bucket splitting is
done at conclusion of any insertion that leaves the hash table more full
than the target load factor, but it is convenient to consider it as an
independent operation.  Note that we do not have a bucket-merge operation
--- the number of buckets never shrinks.  Insertion, splitting, and
garbage collection may all need access to freelist management, which keeps
track of available overflow pages.

The reader algorithm is:

    pin meta page and take buffer content lock in shared mode
    loop:
        compute bucket number for target hash key
        release meta page buffer content lock
        if (correct bucket page is already locked)
            break
        release any existing bucket page lock (if a concurrent split happened)
        take heavyweight bucket lock
        retake meta page buffer content lock in shared mode
-- then, per read request:
    release pin on metapage
    read current page of bucket and take shared buffer content lock
        step to next page if necessary (no chaining of locks)
    get tuple
    release buffer content lock and pin on current page
-- at scan shutdown:
    release bucket share-lock

We can't hold the metapage lock while acquiring a lock on the target bucket,
because that might result in an undetected deadlock (LWLocks do not participate
in deadlock detection).  Instead, we relock the metapage after acquiring the
bucket page lock and check whether the bucket has been split.  If not, we're
done.  If so, we release our previously-acquired lock and repeat the process
using the new bucket number.  Holding the bucket sharelock for
the remainder of the scan prevents the reader's current-tuple pointer from
being invalidated by splits or compactions.  Notice that the reader's lock
does not prevent other buckets from being split or compacted.

To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
their current scan position after re-acquiring the page sharelock.  Since
deletion is not possible while a reader holds the bucket sharelock, and
we assume that heap tuple TIDs are unique, this can be implemented by
searching for the same heap tuple TID previously returned.  Insertion does
not move index entries across pages, so the previously-returned index entry
should always be on the same page, at the same or higher offset number,
as it was before.
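
A sketch of that re-find step, using illustrative structures rather than
the actual scan-opaque and page data structures:

    #include <stdint.h>

    typedef struct
    {
        uint64_t    heap_tid;       /* stand-in for the entry's heap TID */
    } IndexEntry;

    /*
     * Re-find the previously returned entry after re-locking the page.
     * Insertions can only push an existing entry to a higher offset on the
     * same page, so scanning forward from the old offset and comparing heap
     * TIDs is sufficient; deletions cannot happen while we hold the bucket
     * share-lock.
     */
    static int
    refind_scan_position(const IndexEntry *entries, int nentries,
                         int prev_offset, uint64_t prev_heap_tid)
    {
        for (int off = prev_offset; off < nentries; off++)
        {
            if (entries[off].heap_tid == prev_heap_tid)
                return off;
        }
        return -1;                  /* not expected while the share-lock is held */
    }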

The insertion algorithm is rather similar:

    pin meta page and take buffer content lock in shared mode
    loop:
        compute bucket number for target hash key
        release meta page buffer content lock
        if (correct bucket page is already locked)
            break
        release any existing bucket page lock (if a concurrent split happened)
        take heavyweight bucket lock in shared mode
        retake meta page buffer content lock in shared mode
-- (so far same as reader)
    release pin on metapage
    pin current page of bucket and take exclusive buffer content lock
    if full, release, read/exclusive-lock next page; repeat as needed
    >> see below if no space in any page of bucket
    insert tuple at appropriate place in page
    mark current page dirty and release buffer content lock and pin
    release heavyweight share-lock
    pin meta page and take buffer content lock in exclusive mode
    increment tuple count, decide if split needed
    mark meta page dirty and release buffer content lock and pin
    done if no split needed, else enter Split algorithm below

To speed searches, the index entries within any individual index page are
kept sorted by hash code; the insertion code must take care to insert new
entries in the right place.  It is okay for an insertion to take place in a
bucket that is being actively scanned, because readers can cope with this
as explained above.  We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.
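
The within-page search can be illustrated by an ordinary lower-bound binary
search over hash codes, as sketched below on a plain array (the backend's
version works directly on the index tuples of a page; compare
_hash_binsearch() in hashutil.c):

    #include <stdint.h>

    /*
     * Return the first position whose hash code is >= value.  This is both
     * where a scan for that hash code should start and where a new entry
     * with that hash code may be inserted to keep the page sorted.
     */
    static int
    hashcode_lower_bound(const uint32_t *hashcodes, int n, uint32_t value)
    {
        int         lo = 0;
        int         hi = n;         /* half-open search range [lo, hi) */

        while (lo < hi)
        {
            int         mid = lo + (hi - lo) / 2;

            if (hashcodes[mid] < value)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }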

It is clearly impossible for readers and inserters to deadlock, and in
fact this algorithm allows them a very high degree of concurrency.
(The exclusive metapage lock taken to update the tuple count is stronger
than necessary, since readers do not care about the tuple count, but the
lock is held for such a short time that this is probably not an issue.)

When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
Details of that part of the algorithm appear later.

The page split algorithm is entered whenever an inserter observes that the
index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:

    pin meta page and take buffer content lock in exclusive mode
    check split still needed
    if split not needed anymore, drop buffer content lock and pin and exit
    decide which bucket to split
    Attempt to X-lock old bucket number (definitely could fail)
    Attempt to X-lock new bucket number (shouldn't fail, but...)
    if above fail, drop locks and pin and exit
    update meta page to reflect new number of buckets
    mark meta page dirty and release buffer content lock and pin
    -- now, accesses to all other buckets can proceed.
    Perform actual split of bucket, moving tuples as needed
    >> see below about acquiring needed extra space
    Release X-locks of old and new buckets

Note the metapage lock is not held while the actual tuple rearrangement is
performed, so accesses to other buckets can proceed in parallel; in fact,
it's possible for multiple bucket splits to proceed in parallel.

Split's attempt to X-lock the old bucket number could fail if another
process holds S-lock on it.  We do not want to wait if that happens, first
because we don't want to wait while holding the metapage exclusive-lock,
and second because it could very easily result in deadlock.  (The other
process might be out of the hash AM altogether, and could do something
that blocks on another lock this process holds; so even if the hash
algorithm itself is deadlock-free, a user-induced deadlock could occur.)
So, this is a conditional LockAcquire operation, and if it fails we just
abandon the attempt to split.  This is all right since the index is
overfull but perfectly functional.  Every subsequent inserter will try to
split, and eventually one will succeed.  If multiple inserters failed to
split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop.  (We could make a successful
splitter loop to see if the index is still overfull, but it seems better to
distribute the split overhead across successive insertions.)
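
The shape of that "try to lock, give up rather than wait" step is sketched
below.  The helper functions are hypothetical stand-ins for conditional
(no-wait) lmgr lock acquisition on the bucket page locks, not the actual
backend API:

    #include <stdbool.h>
    #include <stdint.h>

    extern bool try_exclusive_lock_bucket(uint32_t bucket);    /* hypothetical */
    extern void unlock_bucket(uint32_t bucket);                 /* hypothetical */

    /*
     * Try to take both bucket locks needed for a split without waiting.
     * On any failure, release whatever was acquired and report failure so
     * the caller can simply abandon the split attempt.
     */
    static bool
    try_lock_buckets_for_split(uint32_t old_bucket, uint32_t new_bucket)
    {
        if (!try_exclusive_lock_bucket(old_bucket))
            return false;           /* a scan holds S-lock; don't wait */

        if (!try_exclusive_lock_bucket(new_bucket))
        {
            unlock_bucket(old_bucket);
            return false;           /* shouldn't fail, but be safe */
        }

        return true;                /* caller may now perform the split */
    }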

A problem is that if a split fails partway through (eg due to insufficient
disk space) the index is left corrupt.  The probability of that could be
made quite low if we grab a free page or two before we update the meta
page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action.  I'm not planning to teach hash about WAL in this
go-round.

The fourth operation is garbage collection (bulk deletion):

    next bucket := 0
    pin metapage and take buffer content lock in exclusive mode
    fetch current max bucket number
    release meta page buffer content lock and pin
    while next bucket <= max bucket do
        Acquire X lock on target bucket
        Scan and remove tuples, compact free space as needed
        Release X lock
        next bucket ++
    end loop
    pin metapage and take buffer content lock in exclusive mode
    check if number of buckets changed
    if so, release content lock and pin and return to for-each-bucket loop
    else update metapage tuple count
    mark meta page dirty and release buffer content lock and pin

Note that this is designed to allow concurrent splits.  If a split occurs,
tuples relocated into the new bucket will be visited twice by the scan,
but that does no harm.  (We must however be careful about the statistics
reported by the VACUUM operation.  What we can do is count the number of
tuples scanned, and believe this in preference to the stored tuple count
if the stored tuple count and number of buckets did *not* change at any
time during the scan.  This provides a way of correcting the stored tuple
count if it gets out of sync for some reason.  But if a split or insertion
does occur concurrently, the scan count is untrustworthy; instead,
subtract the number of tuples deleted from the stored tuple count and
use that.)
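
That statistics rule reduces to the following sketch (names are
illustrative; "tuples counted by the scan" means the tuples the bulk-delete
pass saw and kept):

    #include <stdbool.h>

    /*
     * Decide what tuple count to store back into the metapage after a
     * bulk-delete pass.  If neither the stored tuple count nor the number
     * of buckets changed during the scan, our own count is trustworthy and
     * also corrects any accumulated drift; otherwise fall back to adjusting
     * the stored count by the number of tuples we removed.
     */
    static double
    reconcile_tuple_count(bool meta_changed_during_scan,
                          double tuples_counted_by_scan,
                          double stored_tuple_count,
                          double tuples_deleted)
    {
        if (!meta_changed_during_scan)
            return tuples_counted_by_scan;
        return stored_tuple_count - tuples_deleted;
    }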

The exclusive lock request could deadlock in some strange scenarios, but
we can just error out without any great harm being done.


Free Space Management
---------------------

(Question: why is this so complicated?  Why not just have a linked list
of free pages with the list head in the metapage?  It's not like we
avoid needing to modify the metapage with all this.)

Free space management consists of two sub-algorithms, one for reserving
an overflow page to add to a bucket chain, and one for returning an empty
overflow page to the free pool.

Obtaining an overflow page:

    take metapage content lock in exclusive mode
    determine next bitmap page number; if none, exit loop
    release meta page content lock
    pin bitmap page and take content lock in exclusive mode
    search for a free page (zero bit in bitmap)
    if found:
        set bit in bitmap
        mark bitmap page dirty and release content lock
        take metapage buffer content lock in exclusive mode
        if first-free-bit value did not change,
            update it and mark meta page dirty
        release meta page buffer content lock
        return page number
    else (not found):
        release bitmap page buffer content lock
        loop back to try next bitmap page, if any
-- here when we have checked all bitmap pages; we hold meta excl. lock
    extend index to add another overflow page; update meta information
    mark meta page dirty and release buffer content lock
    return page number

It is slightly annoying to release and reacquire the metapage lock
multiple times, but it seems best to do it that way to minimize loss of
concurrency against processes just entering the index.  We don't want
to hold the metapage exclusive lock while reading in a bitmap page.
(We can at least avoid repeated buffer pin/unpin here.)

The normal path for extending the index does not require doing I/O while
holding the metapage lock.  We do have to do I/O when the extension
requires adding a new bitmap page as well as the required overflow page
... but that is an infrequent case, so the loss of concurrency seems
acceptable.

The portion of tuple insertion that calls the above subroutine looks
like this:

    -- having determined that no space is free in the target bucket:
    remember last page of bucket, drop write lock on it
    call free-page-acquire routine
    re-write-lock last page of bucket
    if it is not last anymore, step to the last page
    update (former) last page to point to new page
    write-lock and initialize new page, with back link to former last page
    write and release former last page
    insert tuple into new page
    -- etc.

Notice this handles the case where two concurrent inserters try to extend
the same bucket.  They will end up with a valid, though perhaps
space-inefficient, configuration: two overflow pages will be added to the
bucket, each containing one tuple.

The last part of this violates the rule about holding write lock on two
pages concurrently, but it should be okay to write-lock the previously
free page; there can be no other process holding lock on it.

Bucket splitting uses a similar algorithm if it has to extend the new
bucket, but it need not worry about concurrent extension since it has
exclusive lock on the new bucket.

Freeing an overflow page is done by garbage collection and by bucket
splitting (the old bucket may contain no-longer-needed overflow pages).
In both cases, the process holds exclusive lock on the containing bucket,
so need not worry about other accessors of pages in the bucket.  The
algorithm is:

    delink overflow page from bucket chain
    (this requires read/update/write/release of fore and aft siblings)
    pin meta page and take buffer content lock in shared mode
    determine which bitmap page contains the free space bit for page
    release meta page buffer content lock
    pin bitmap page and take buffer content lock in exclusive mode
    update bitmap bit
    mark bitmap page dirty and release buffer content lock and pin
    if page number is less than what we saw as first-free-bit in meta:
    retake meta page buffer content lock in exclusive mode
    if page number is still less than first-free-bit,
        update first-free-bit field and mark meta page dirty
    release meta page buffer content lock and pin

We have to do it this way because we must clear the bitmap bit before
changing the first-free-bit field (hashm_firstfree).  It is possible that
we set first-free-bit too small (because someone has already reused the
page we just freed), but that is okay; the only cost is the next overflow
page acquirer will scan more bitmap bits than he needs to.  What must be
avoided is having first-free-bit greater than the actual first free bit,
because then that free page would never be found by searchers.
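
The first-free-bit rule itself is tiny; a sketch (field name as described
above, everything else illustrative):

    #include <stdint.h>

    /*
     * After clearing a freed overflow page's bitmap bit, lower
     * hashm_firstfree if the freed bit is below it; never raise it here.
     * A too-small value only costs extra bitmap scanning later, whereas a
     * too-large value would hide free pages from future searches.
     */
    static void
    maybe_lower_firstfree(uint32_t *hashm_firstfree, uint32_t freed_bitno)
    {
        if (freed_bitno < *hashm_firstfree)
            *hashm_firstfree = freed_bitno;
    }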

All the freespace operations should be called while holding no buffer
locks.  Since they need no lmgr locks, deadlock is not possible.


Other Notes
-----------

All the shenanigans with locking prevent a split occurring while *another*
process is stopped in a given bucket.  They do not ensure that one of
our *own* backend's scans is not stopped in the bucket, because lmgr
doesn't consider a process's own locks to conflict.  So the Split
algorithm must check for that case separately before deciding it can go
ahead with the split.  VACUUM does not have this problem since nothing
else can be happening within the vacuuming backend.
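
The check amounts to walking the backend-local list of active hash scans,
roughly as sketched below (structure and names are illustrative; the real
bookkeeping lives in hashscan.c):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct ActiveHashScan
    {
        uint32_t                 bucket;    /* bucket the scan is stopped in */
        struct ActiveHashScan   *next;
    } ActiveHashScan;

    /*
     * Report whether any of this backend's own scans is currently stopped
     * in the given bucket; if so, the split must be abandoned.
     */
    static bool
    own_scan_active_in_bucket(const ActiveHashScan *scans, uint32_t bucket)
    {
        for (const ActiveHashScan *s = scans; s != NULL; s = s->next)
        {
            if (s->bucket == bucket)
                return true;
        }
        return false;
    }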

Should we instead try to fix the state of any conflicting local scan?
Seems mighty ugly --- got to move the held bucket S-lock as well as lots
of other messiness.  For now, just punt and don't split.