src/backend/access/hash/README

Hash Indexing
=============

This directory contains an implementation of hash indexing for Postgres.
Most of the core ideas are taken from Margo Seltzer and Ozan Yigit,
A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference,
January 1991.  (Our in-memory hashtable implementation,
src/backend/utils/hash/dynahash.c, also relies on some of the same concepts;
it is derived from code written by Esmond Pitt and later improved by Margo
among others.)

A hash index consists of two or more "buckets", into which tuples are
placed whenever their hash key maps to the bucket number.  The
key-to-bucket-number mapping is chosen so that the index can be
incrementally expanded.  When a new bucket is to be added to the index,
exactly one existing bucket will need to be "split", with some of its
tuples being transferred to the new bucket according to the updated
key-to-bucket-number mapping.  This is essentially the same hash table
management technique embodied in src/backend/utils/hash/dynahash.c for
in-memory hash tables.
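
As a concrete illustration, the key-to-bucket-number mapping can be sketched
in C roughly as follows.  This mirrors the logic of _hash_hashkey2bucket()
in hashutil.c, but uses standalone types and names, so treat it as a sketch
rather than the actual backend code.  Here maxbucket is the highest bucket
number currently in use, and highmask/lowmask are all-ones bit masks
bracketing it:

    #include <stdint.h>

    /*
     * Map a hash value to a bucket number.  Masking with highmask may yield
     * a bucket that has not been created yet; in that case the tuple still
     * belongs in the "parent" bucket obtained by masking with lowmask,
     * because that bucket has not yet been split.
     */
    static uint32_t
    hash_key_to_bucket(uint32_t hashkey, uint32_t maxbucket,
                       uint32_t highmask, uint32_t lowmask)
    {
        uint32_t    bucket = hashkey & highmask;

        if (bucket > maxbucket)
            bucket = bucket & lowmask;

        return bucket;
    }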

Each bucket in the hash index comprises one or more index pages.  The
bucket's first page is permanently assigned to it when the bucket is
created.  Additional pages, called "overflow pages", are added if the
bucket receives too many tuples to fit in the primary bucket page.
The pages of a bucket are chained together in a doubly-linked list
using fields in the index page special space.

There is currently no provision to shrink a hash index, other than by
rebuilding it with REINDEX.  Overflow pages can be recycled for reuse
in other buckets, but we never give them back to the operating system.
There is no provision for reducing the number of buckets, either.

As of PostgreSQL 8.4, hash index entries store only the hash code, not the
actual data value, for each indexed item.  This makes the index entries
smaller (perhaps very substantially so) and speeds up various operations.
In particular, we can speed searches by keeping the index entries in any
one index page sorted by hash code, thus allowing binary search to be used
within an index page.  Note however that there is *no* assumption about the
relative ordering of hash codes across different index pages of a bucket.


Page Addressing
---------------

There are four kinds of pages in a hash index: the meta page (page zero),
which contains statically allocated control information; primary bucket
pages; overflow pages; and bitmap pages, which keep track of overflow
pages that have been freed and are available for re-use.  For addressing
purposes, bitmap pages are regarded as a subset of the overflow pages.

Primary bucket pages and overflow pages are allocated independently (since
any given index might need more or fewer overflow pages relative to its
number of buckets).  The hash code uses an interesting set of addressing
rules to support a variable number of overflow pages while not having to
move primary bucket pages around after they are created.

Primary bucket pages (henceforth just "bucket pages") are allocated in
power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
are created when the index is initialized.  At the first split, buckets 2
and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
pages of a power-of-2 group appear consecutively in the index.  This
addressing scheme allows the physical location of a bucket page to be
computed from the bucket number relatively easily, using only a small
amount of control information.  We take the log2() of the bucket number
to determine which split point S the bucket belongs to, and then simply
add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
metapage) to the bucket number to compute the physical block number.
hashm_spares[S] can be interpreted as the total number of overflow pages
that have been allocated before the bucket pages of splitpoint S.
hashm_spares[0] is always 0, so that buckets 0 and 1 (which belong to
splitpoint 0) always appear at block numbers 1 and 2, just after the meta
page.  We always have hashm_spares[N] <= hashm_spares[N+1], since the
latter count includes the former.  The difference between the two
represents the number of overflow pages appearing between the bucket page
groups of splitpoints N and N+1.
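
Written out in C, the address computation might look like the sketch below
(compare the BUCKET_TO_BLKNO() macro in the hash AM headers; the function
and variable names here are illustrative).  The splitpoint of bucket B is 0
for buckets 0 and 1, and floor(log2(B)) otherwise:

    #include <stdint.h>

    /*
     * Compute the physical block number of a bucket's primary page, given
     * the metapage's hashm_spares[] array.  Block 0 is the metapage, so
     * buckets 0 and 1 land on blocks 1 and 2 when no overflow pages exist.
     */
    static uint32_t
    bucket_to_block(uint32_t bucket, const uint32_t *hashm_spares)
    {
        uint32_t    splitpoint = 0;

        /* splitpoint = floor(log2(bucket)) for bucket >= 2, else 0 */
        while ((((uint32_t) 1) << (splitpoint + 1)) <= bucket)
            splitpoint++;

        return bucket + hashm_spares[splitpoint] + 1;
    }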

(Note: the above describes what happens when filling an initially minimally
sized hash index.  In practice, we try to estimate the required index size
and allocate a suitable number of splitpoints immediately, to avoid
expensive re-splitting during initial index build.)

When S splitpoints exist altogether, the array entries hashm_spares[0]
through hashm_spares[S] are valid; hashm_spares[S] records the current
total number of overflow pages.  New overflow pages are created as needed
at the end of the index, and recorded by incrementing hashm_spares[S].
When it is time to create a new splitpoint's worth of bucket pages, we
copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
stored in the hashm_ovflpoint field of the meta page).  This has the
effect of reserving the correct number of bucket pages at the end of the
index, and preparing to allocate additional overflow pages after those
bucket pages.  hashm_spares[] entries before S cannot change anymore,
since that would require moving already-created bucket pages.
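
In code form, starting a new splitpoint amounts to the following sketch
(field names follow the metapage fields described above; the surrounding
locking and page allocation are omitted):

    #include <stdint.h>

    /*
     * Begin a new splitpoint: copying the spare count forward reserves the
     * next group of bucket pages at the current end of the index, and any
     * overflow pages allocated from now on are counted in the new entry.
     */
    static void
    start_new_splitpoint(uint32_t *hashm_spares, uint32_t *hashm_ovflpoint)
    {
        uint32_t    s = *hashm_ovflpoint;

        hashm_spares[s + 1] = hashm_spares[s];
        *hashm_ovflpoint = s + 1;
    }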

The last page nominally used by the index is always determinable from
hashm_spares[S].  To avoid complaints from smgr, the logical EOF as seen by
the filesystem and smgr must always be greater than or equal to this page.
We have to allow the case "greater than" because it's possible that during
an index extension we crash after allocating filesystem space and before
updating the metapage.  Note that on filesystems that allow "holes" in
files, it's entirely likely that pages before the logical EOF are not yet
allocated: when we allocate a new splitpoint's worth of bucket pages, we
physically zero the last such page to force the EOF up, and the first such
page will be used immediately, but the intervening pages are not written
until needed.

Since overflow pages may be recycled if enough tuples are deleted from
their bucket, we need a way to keep track of currently-free overflow
pages.  The state of each overflow page (0 = available, 1 = not available)
is recorded in "bitmap" pages dedicated to this purpose.  The entries in
the bitmap are indexed by "bit number", a zero-based count in which every
overflow page has a unique entry.  We can convert between an overflow
page's physical block number and its bit number using the information in
hashm_spares[] (see hashovfl.c for details).  The bit number sequence
includes the bitmap pages, which is the reason for saying that bitmap
pages are a subset of the overflow pages.  It turns out in fact that each
bitmap page's first bit represents itself --- this is not an essential
property, but falls out of the fact that we only allocate another bitmap
page when we really need one.  Bit number zero always corresponds to the
first bitmap page, which is allocated during index creation just after all
the initially created buckets.
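
Locating the bitmap bit for a given bit number is a simple division, as in
the sketch below.  The structure and names are illustrative only; the real
conversions (which also map bit numbers to and from physical block numbers
using hashm_spares[]) are in hashovfl.c:

    #include <stdint.h>

    /*
     * Find which bitmap page, and which bit within it, records the state of
     * a given overflow-page bit number.  bits_per_page is the number of bits
     * that fit in one bitmap page; the metapage's hashm_mapp[] array gives
     * the block number of each bitmap page.
     */
    typedef struct
    {
        uint32_t    bitmap_page_index;  /* index into hashm_mapp[] */
        uint32_t    bit_within_page;    /* bit position inside that page */
    } OvflBitAddr;

    static OvflBitAddr
    ovfl_bit_address(uint32_t bitno, uint32_t bits_per_page)
    {
        OvflBitAddr addr;

        addr.bitmap_page_index = bitno / bits_per_page;
        addr.bit_within_page = bitno % bits_per_page;
        return addr;
    }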


Lock Definitions
----------------

We use both lmgr locks ("heavyweight" locks) and buffer context locks
(LWLocks) to control access to a hash index.  lmgr locks are needed for
long-term locking since there is a (small) risk of deadlock, which we must
be able to detect.  Buffer context locks are used for short-term access
control to individual pages of the index.

LockPage(rel, page), where page is the page number of a hash bucket page,
represents the right to split or compact an individual bucket.  A process
splitting a bucket must exclusive-lock both old and new halves of the
bucket until it is done.  A process doing VACUUM must exclusive-lock the
bucket it is currently purging tuples from.  Processes doing scans or
insertions must share-lock the bucket they are scanning or inserting into.
(It is okay to allow concurrent scans and insertions.)

The lmgr lock IDs corresponding to overflow pages are currently unused.
These are available for possible future refinements.  LockPage(rel, 0)
is also currently undefined (it was previously used to represent the right
to modify the hash-code-to-bucket mapping, but it is no longer needed for
that purpose).

Note that these lock definitions are conceptually distinct from any sort
of lock on the pages whose numbers they share.  A process must also obtain
read or write buffer lock on the metapage or bucket page before accessing
said page.

Processes performing hash index scans must hold share lock on the bucket
they are scanning throughout the scan.  This seems to be essential, since
there is no reasonable way for a scan to cope with its bucket being split
underneath it.  This creates a possibility of deadlock external to the
hash index code, since a process holding one of these locks could block
waiting for an unrelated lock held by another process.  If that process
then does something that requires exclusive lock on the bucket, we have
deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
can be detected and recovered from.

Processes must obtain read (share) buffer context lock on any hash index
page while reading it, and write (exclusive) lock while modifying it.
To prevent deadlock we enforce these coding rules: no buffer lock may be
held long term (across index AM calls), nor may any buffer lock be held
while waiting for an lmgr lock, nor may more than one buffer lock
be held at a time by any one process.  (The third restriction is probably
stronger than necessary, but it makes the proof of no deadlock obvious.)


Pseudocode Algorithms
---------------------

The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the same
bucket); insertion of a new tuple into the correct bucket; enlarging the
hash table by splitting an existing bucket; and garbage collection
(deletion of dead tuples and compaction of buckets).  Bucket splitting is
done at conclusion of any insertion that leaves the hash table more full
than the target load factor, but it is convenient to consider it as an
independent operation.  Note that we do not have a bucket-merge operation
--- the number of buckets never shrinks.  Insertion, splitting, and
garbage collection may all need access to freelist management, which keeps
track of available overflow pages.

The reader algorithm is:

    pin meta page and take buffer content lock in shared mode
    loop:
        compute bucket number for target hash key
        release meta page buffer content lock
        if (correct bucket page is already locked)
            break
        release any existing bucket page lock (if a concurrent split happened)
        take heavyweight bucket lock
        retake meta page buffer content lock in shared mode
-- then, per read request:
    release pin on metapage
    read current page of bucket and take shared buffer content lock
        step to next page if necessary (no chaining of locks)
    get tuple
    release buffer content lock and pin on current page
-- at scan shutdown:
    release bucket share-lock

We can't hold the metapage lock while acquiring a lock on the target bucket,
because that might result in an undetected deadlock (LWLocks do not participate
in deadlock detection).  Instead, we relock the metapage after acquiring the
bucket page lock and check whether the bucket has been split.  If not, we're
done.  If so, we release our previously-acquired lock and repeat the process
using the new bucket number.  Holding the bucket sharelock for
the remainder of the scan prevents the reader's current-tuple pointer from
being invalidated by splits or compactions.  Notice that the reader's lock
does not prevent other buckets from being split or compacted.

To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
their current scan position after re-acquiring the page sharelock.  Since
deletion is not possible while a reader holds the bucket sharelock, and
we assume that heap tuple TIDs are unique, this can be implemented by
searching for the same heap tuple TID previously returned.  Insertion does
not move index entries across pages, so the previously-returned index entry
should always be on the same page, at the same or higher offset number,
as it was before.
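
A sketch of that re-find step, using illustrative structures rather than
the actual scan-opaque and page data structures:

    #include <stdint.h>

    typedef struct
    {
        uint64_t    heap_tid;       /* stand-in for the entry's heap TID */
    } IndexEntry;

    /*
     * Re-find the previously returned entry after re-locking the page.
     * Insertions can only push an existing entry to a higher offset on the
     * same page, so scanning forward from the old offset and comparing heap
     * TIDs is sufficient; deletions cannot happen while we hold the bucket
     * share-lock.
     */
    static int
    refind_scan_position(const IndexEntry *entries, int nentries,
                         int prev_offset, uint64_t prev_heap_tid)
    {
        for (int off = prev_offset; off < nentries; off++)
        {
            if (entries[off].heap_tid == prev_heap_tid)
                return off;
        }
        return -1;                  /* not expected while the share-lock is held */
    }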

The insertion algorithm is rather similar:

    pin meta page and take buffer content lock in shared mode
    loop:
        compute bucket number for target hash key
        release meta page buffer content lock
        if (correct bucket page is already locked)
            break
        release any existing bucket page lock (if a concurrent split happened)
        take heavyweight bucket lock in shared mode
        retake meta page buffer content lock in shared mode
-- (so far same as reader)
    release pin on metapage
    pin current page of bucket and take exclusive buffer content lock
    if full, release, read/exclusive-lock next page; repeat as needed
    >> see below if no space in any page of bucket
    insert tuple at appropriate place in page
    mark current page dirty and release buffer content lock and pin
    release heavyweight share-lock
    pin meta page and take buffer content lock in exclusive mode
    increment tuple count, decide if split needed
    mark meta page dirty and release buffer content lock and pin
    done if no split needed, else enter Split algorithm below

To speed searches, the index entries within any individual index page are
kept sorted by hash code; the insertion code must take care to insert new
entries in the right place.  It is okay for an insertion to take place in a
bucket that is being actively scanned, because readers can cope with this
as explained above.  We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.
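
The within-page search can be illustrated by an ordinary lower-bound binary
search over hash codes, as sketched below on a plain array (the backend's
version works directly on the index tuples of a page; compare
_hash_binsearch() in hashutil.c):

    #include <stdint.h>

    /*
     * Return the first position whose hash code is >= value.  This is both
     * where a scan for that hash code should start and where a new entry
     * with that hash code may be inserted to keep the page sorted.
     */
    static int
    hashcode_lower_bound(const uint32_t *hashcodes, int n, uint32_t value)
    {
        int         lo = 0;
        int         hi = n;         /* half-open search range [lo, hi) */

        while (lo < hi)
        {
            int         mid = lo + (hi - lo) / 2;

            if (hashcodes[mid] < value)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }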

It is clearly impossible for readers and inserters to deadlock, and in
fact this algorithm allows them a very high degree of concurrency.
(The exclusive metapage lock taken to update the tuple count is stronger
than necessary, since readers do not care about the tuple count, but the
lock is held for such a short time that this is probably not an issue.)

When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
Details of that part of the algorithm appear later.

The page split algorithm is entered whenever an inserter observes that the
index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:

    pin meta page and take buffer content lock in exclusive mode
    check split still needed
    if split not needed anymore, drop buffer content lock and pin and exit
    decide which bucket to split
    Attempt to X-lock old bucket number (definitely could fail)
    Attempt to X-lock new bucket number (shouldn't fail, but...)
    if above fail, drop locks and pin and exit
    update meta page to reflect new number of buckets
    mark meta page dirty and release buffer content lock and pin
    -- now, accesses to all other buckets can proceed.
    Perform actual split of bucket, moving tuples as needed
    >> see below about acquiring needed extra space
    Release X-locks of old and new buckets

Note the metapage lock is not held while the actual tuple rearrangement is
performed, so accesses to other buckets can proceed in parallel; in fact,
it's possible for multiple bucket splits to proceed in parallel.

Split's attempt to X-lock the old bucket number could fail if another
process holds S-lock on it.  We do not want to wait if that happens, first
because we don't want to wait while holding the metapage exclusive-lock,
and second because it could very easily result in deadlock.  (The other
process might be out of the hash AM altogether, and could do something
that blocks on another lock this process holds; so even if the hash
algorithm itself is deadlock-free, a user-induced deadlock could occur.)
So, this is a conditional LockAcquire operation, and if it fails we just
abandon the attempt to split.  This is all right since the index is
overfull but perfectly functional.  Every subsequent inserter will try to
split, and eventually one will succeed.  If multiple inserters failed to
split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop.  (We could make a successful
splitter loop to see if the index is still overfull, but it seems better to
distribute the split overhead across successive insertions.)
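
The shape of that "try to lock, give up rather than wait" step is sketched
below.  The helper functions are hypothetical stand-ins for conditional
(no-wait) lmgr lock acquisition on the bucket page locks, not the actual
backend API:

    #include <stdbool.h>
    #include <stdint.h>

    extern bool try_exclusive_lock_bucket(uint32_t bucket);    /* hypothetical */
    extern void unlock_bucket(uint32_t bucket);                 /* hypothetical */

    /*
     * Try to take both bucket locks needed for a split without waiting.
     * On any failure, release whatever was acquired and report failure so
     * the caller can simply abandon the split attempt.
     */
    static bool
    try_lock_buckets_for_split(uint32_t old_bucket, uint32_t new_bucket)
    {
        if (!try_exclusive_lock_bucket(old_bucket))
            return false;           /* a scan holds S-lock; don't wait */

        if (!try_exclusive_lock_bucket(new_bucket))
        {
            unlock_bucket(old_bucket);
            return false;           /* shouldn't fail, but be safe */
        }

        return true;                /* caller may now perform the split */
    }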

A problem is that if a split fails partway through (eg due to insufficient
disk space) the index is left corrupt.  The probability of that could be
made quite low if we grab a free page or two before we update the meta
page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action.  I'm not planning to teach hash about WAL in this
go-round.

The fourth operation is garbage collection (bulk deletion):

    next bucket := 0
    pin metapage and take buffer content lock in exclusive mode
    fetch current max bucket number
    release meta page buffer content lock and pin
    while next bucket <= max bucket do
        Acquire X lock on target bucket
        Scan and remove tuples, compact free space as needed
        Release X lock
        next bucket ++
    end loop
    pin metapage and take buffer content lock in exclusive mode
    check if number of buckets changed
    if so, release content lock and pin and return to for-each-bucket loop
    else update metapage tuple count
    mark meta page dirty and release buffer content lock and pin

Note that this is designed to allow concurrent splits.  If a split occurs,
tuples relocated into the new bucket will be visited twice by the scan,
but that does no harm.  (We must however be careful about the statistics
reported by the VACUUM operation.  What we can do is count the number of
tuples scanned, and believe this in preference to the stored tuple count
if the stored tuple count and number of buckets did *not* change at any
time during the scan.  This provides a way of correcting the stored tuple
count if it gets out of sync for some reason.  But if a split or insertion
does occur concurrently, the scan count is untrustworthy; instead,
subtract the number of tuples deleted from the stored tuple count and
use that.)
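
That statistics rule reduces to the following sketch (names are
illustrative; "tuples counted by the scan" means the tuples the bulk-delete
pass saw and kept):

    #include <stdbool.h>

    /*
     * Decide what tuple count to store back into the metapage after a
     * bulk-delete pass.  If neither the stored tuple count nor the number
     * of buckets changed during the scan, our own count is trustworthy and
     * also corrects any accumulated drift; otherwise fall back to adjusting
     * the stored count by the number of tuples we removed.
     */
    static double
    reconcile_tuple_count(bool meta_changed_during_scan,
                          double tuples_counted_by_scan,
                          double stored_tuple_count,
                          double tuples_deleted)
    {
        if (!meta_changed_during_scan)
            return tuples_counted_by_scan;
        return stored_tuple_count - tuples_deleted;
    }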

The exclusive lock request could deadlock in some strange scenarios, but
we can just error out without any great harm being done.


Free Space Management
---------------------

(Question: why is this so complicated?  Why not just have a linked list
of free pages with the list head in the metapage?  It's not like we
avoid needing to modify the metapage with all this.)

Free space management consists of two sub-algorithms, one for reserving
an overflow page to add to a bucket chain, and one for returning an empty
overflow page to the free pool.

Obtaining an overflow page:

    take metapage content lock in exclusive mode
    determine next bitmap page number; if none, exit loop
    release meta page content lock
    pin bitmap page and take content lock in exclusive mode
    search for a free page (zero bit in bitmap)
    if found:
        set bit in bitmap
        mark bitmap page dirty and release content lock
        take metapage buffer content lock in exclusive mode
        if first-free-bit value did not change,
            update it and mark meta page dirty
        release meta page buffer content lock
        return page number
    else (not found):
        release bitmap page buffer content lock
        loop back to try next bitmap page, if any
-- here when we have checked all bitmap pages; we hold meta excl. lock
    extend index to add another overflow page; update meta information
    mark meta page dirty and release buffer content lock
    return page number

It is slightly annoying to release and reacquire the metapage lock
multiple times, but it seems best to do it that way to minimize loss of
concurrency against processes just entering the index.  We don't want
to hold the metapage exclusive lock while reading in a bitmap page.
(We can at least avoid repeated buffer pin/unpin here.)

The normal path for extending the index does not require doing I/O while
holding the metapage lock.  We do have to do I/O when the extension
requires adding a new bitmap page as well as the required overflow page
... but that is an infrequent case, so the loss of concurrency seems
acceptable.

The portion of tuple insertion that calls the above subroutine looks
like this:

    -- having determined that no space is free in the target bucket:
    remember last page of bucket, drop write lock on it
    call free-page-acquire routine
    re-write-lock last page of bucket
    if it is not last anymore, step to the last page
    update (former) last page to point to new page
    write-lock and initialize new page, with back link to former last page
    write and release former last page
    insert tuple into new page
    -- etc.

Notice this handles the case where two concurrent inserters try to extend
the same bucket.  They will end up with a valid, though perhaps
space-inefficient, configuration: two overflow pages will be added to the
bucket, each containing one tuple.

The last part of this violates the rule about holding write lock on two
pages concurrently, but it should be okay to write-lock the previously
free page; there can be no other process holding lock on it.

Bucket splitting uses a similar algorithm if it has to extend the new
bucket, but it need not worry about concurrent extension since it has
exclusive lock on the new bucket.

Freeing an overflow page is done by garbage collection and by bucket
splitting (the old bucket may contain no-longer-needed overflow pages).
In both cases, the process holds exclusive lock on the containing bucket,
so need not worry about other accessors of pages in the bucket.  The
algorithm is:

    delink overflow page from bucket chain
    (this requires read/update/write/release of fore and aft siblings)
    pin meta page and take buffer content lock in shared mode
    determine which bitmap page contains the free space bit for page
    release meta page buffer content lock
    pin bitmap page and take buffer content lock in exclusive mode
    update bitmap bit
    mark bitmap page dirty and release buffer content lock and pin
    if page number is less than what we saw as first-free-bit in meta:
    retake meta page buffer content lock in exclusive mode
    if page number is still less than first-free-bit,
        update first-free-bit field and mark meta page dirty
    release meta page buffer content lock and pin

We have to do it this way because we must clear the bitmap bit before
changing the first-free-bit field (hashm_firstfree).  It is possible that
we set first-free-bit too small (because someone has already reused the
page we just freed), but that is okay; the only cost is the next overflow
page acquirer will scan more bitmap bits than he needs to.  What must be
avoided is having first-free-bit greater than the actual first free bit,
because then that free page would never be found by searchers.
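
The first-free-bit rule itself is tiny; a sketch (field name as described
above, everything else illustrative):

    #include <stdint.h>

    /*
     * After clearing a freed overflow page's bitmap bit, lower
     * hashm_firstfree if the freed bit is below it; never raise it here.
     * A too-small value only costs extra bitmap scanning later, whereas a
     * too-large value would hide free pages from future searches.
     */
    static void
    maybe_lower_firstfree(uint32_t *hashm_firstfree, uint32_t freed_bitno)
    {
        if (freed_bitno < *hashm_firstfree)
            *hashm_firstfree = freed_bitno;
    }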

All the freespace operations should be called while holding no buffer
locks.  Since they need no lmgr locks, deadlock is not possible.


Other Notes
-----------

All the shenanigans with locking prevent a split occurring while *another*
process is stopped in a given bucket.  They do not ensure that one of
our *own* backend's scans is not stopped in the bucket, because lmgr
doesn't consider a process's own locks to conflict.  So the Split
algorithm must check for that case separately before deciding it can go
ahead with the split.  VACUUM does not have this problem since nothing
else can be happening within the vacuuming backend.
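
The check amounts to walking the backend-local list of active hash scans,
roughly as sketched below (structure and names are illustrative; the real
bookkeeping lives in hashscan.c):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct ActiveHashScan
    {
        uint32_t                 bucket;    /* bucket the scan is stopped in */
        struct ActiveHashScan   *next;
    } ActiveHashScan;

    /*
     * Report whether any of this backend's own scans is currently stopped
     * in the given bucket; if so, the split must be abandoned.
     */
    static bool
    own_scan_active_in_bucket(const ActiveHashScan *scans, uint32_t bucket)
    {
        for (const ActiveHashScan *s = scans; s != NULL; s = s->next)
        {
            if (s->bucket == bucket)
                return true;
        }
        return false;
    }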

Should we instead try to fix the state of any conflicting local scan?
Seems mighty ugly --- got to move the held bucket S-lock as well as lots
of other messiness.  For now, just punt and don't split.