src/backend/storage/buffer/README

Notes About Shared Buffer Access Rules
======================================

There are two separate access control mechanisms for shared disk buffers:
reference counts (a/k/a pin counts) and buffer content locks.  (Actually,
there's a third level of access control: one must hold the appropriate kind
of lock on a relation before one can legally access any page belonging to
the relation.  Relation-level locks are not discussed here.)

Pins: one must "hold a pin on" a buffer (increment its reference count)
before being allowed to do anything at all with it.  An unpinned buffer is
subject to being reclaimed and reused for a different page at any instant,
so touching it is unsafe.  Normally a pin is acquired via ReadBuffer and
released via ReleaseBuffer.  It is OK and indeed common for a single
backend to pin a page more than once concurrently; the buffer manager
handles this efficiently.  It is considered OK to hold a pin for long
intervals --- for example, sequential scans hold a pin on the current page
until done processing all the tuples on the page, which could be quite a
while if the scan is the outer scan of a join.  Similarly, a btree index
scan may hold a pin on the current index page.  This is OK because normal
operations never wait for a page's pin count to drop to zero.  (Anything
that might need to do such a wait is instead handled by waiting to obtain
the relation-level lock, which is why you'd better hold one first.)  Pins
may not be held across transaction boundaries, however.
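
For illustration only, here is a minimal sketch of that pin/unpin pattern.
The function name is made up, and the caller is assumed to already hold a
suitable relation-level lock on "rel":

    #include "postgres.h"
    #include "storage/bufmgr.h"

    static void
    touch_page(Relation rel, BlockNumber blkno)
    {
        Buffer      buf;

        buf = ReadBuffer(rel, blkno);   /* increments the buffer's refcount */

        /*
         * While the pin is held, the buffer cannot be reclaimed for another
         * page; inspecting its contents still requires a content lock (see
         * below).
         */

        ReleaseBuffer(buf);             /* drops the pin again */
    }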

Buffer content locks: there are two kinds of buffer lock, shared and exclusive,
which act just as you'd expect: multiple backends can hold shared locks on
the same buffer, but an exclusive lock prevents anyone else from holding
either shared or exclusive lock.  (These can alternatively be called READ
and WRITE locks.)  These locks are intended to be short-term: they should not
be held for long.  Buffer locks are acquired and released by LockBuffer().
It will *not* work for a single backend to try to acquire multiple locks on
the same buffer.  One must pin a buffer before trying to lock it.
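
A sketch of the required ordering (pin first, then lock, and release the
lock promptly), using the lock-mode constants declared in bufmgr.h; the
function name is illustrative only:

    #include "postgres.h"
    #include "storage/bufmgr.h"

    static void
    scan_page_shared(Relation rel, BlockNumber blkno)
    {
        Buffer      buf = ReadBuffer(rel, blkno);  /* pin before locking */

        LockBuffer(buf, BUFFER_LOCK_SHARE);    /* or BUFFER_LOCK_EXCLUSIVE */
        /* ... examine tuples on the page ... */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);   /* short-term: release soon */

        ReleaseBuffer(buf);                    /* finally drop the pin */
    }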

Buffer access rules:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive content lock.  To examine the commit status (XIDs and status bits)
of a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

2. Once one has determined that a tuple is interesting (visible to the
current transaction) one may drop the content lock, yet continue to access
the tuple's data for as long as one holds the buffer pin.  This is what is
typically done by heap scans, since the tuple returned by heap_fetch
contains a pointer to tuple data in the shared buffer.  Therefore the
tuple cannot go away while the pin is held (see rule #5).  Its state could
change, but that is assumed not to matter after the initial determination
of visibility is made.

3. To add a tuple or change the xmin/xmax fields of an existing tuple,
one must hold a pin and an exclusive content lock on the containing buffer.
This ensures that no one else might see a partially-updated state of the
tuple while they are doing visibility checks.

4. It is considered OK to update tuple commit status bits (ie, OR the
values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
pin on a buffer.  This is OK because another backend looking at the tuple
at about the same time would OR the same bits into the field, so there
is little or no risk of conflicting update; what's more, if there did
manage to be a conflict it would merely mean that one bit-update would
be lost and need to be done again later.  These four bits are only hints
(they cache the results of transaction status lookups in pg_xact), so no
great harm is done if they get reset to zero by conflicting updates.
Note, however, that a tuple is frozen by setting both HEAP_XMIN_INVALID
and HEAP_XMIN_COMMITTED; this is a critical update and accordingly requires
an exclusive buffer lock (and it must also be WAL-logged).
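
A rough sketch of such a hint-bit update; in the real code this is done by
SetHintBits() in heapam_visibility.c, which applies additional WAL-related
checks omitted here, so treat the function below as illustrative only:

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "storage/bufmgr.h"

    static void
    hint_xmin_committed(HeapTupleHeader tuple, Buffer buf)
    {
        /* OR the hint bit in; a concurrent backend would set the same bit */
        tuple->t_infomask |= HEAP_XMIN_COMMITTED;

        /* mark the buffer dirty for hint-bit purposes only */
        MarkBufferDirtyHint(buf, true);
    }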

5. To physically remove a tuple or compact free space on a page, one
must hold a pin and an exclusive lock, *and* observe while holding the
exclusive lock that the buffer's shared reference count is one (ie,
no other backend holds a pin).  If these conditions are met then no other
backend can perform a page scan until the exclusive lock is dropped, and
no other backend can be holding a reference to an existing tuple that it
might expect to examine again.  Note that another backend might pin the
buffer (increment the refcount) while one is performing the cleanup, but
it won't be able to actually examine the page until it acquires shared
or exclusive content lock.


Obtaining the lock needed under rule #5 is done by the bufmgr routines
LockBufferForCleanup() or ConditionalLockBufferForCleanup().  They first get
an exclusive lock and then check to see if the shared pin count is currently
1.  If not, ConditionalLockBufferForCleanup() releases the exclusive lock and
then returns false, while LockBufferForCleanup() releases the exclusive lock
(but not the caller's pin) and waits until signaled by another backend,
whereupon it tries again.  The signal will occur when UnpinBuffer decrements
the shared pin count to 1.  As indicated above, this operation might have to
wait a good while before it acquires the lock, but that shouldn't matter much
for concurrent VACUUM.  The current implementation only supports a single
waiter for pin-count-1 on any particular shared buffer.  This is enough for
VACUUM's use, since we don't allow multiple VACUUMs concurrently on a single
relation anyway.  Anyone wishing to obtain a cleanup lock outside of recovery
or a VACUUM must use the conditional variant of the function.
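
For illustration, a sketch of using the conditional variant; the function
name is made up, and the caller is assumed to already hold a pin on the
buffer:

    #include "postgres.h"
    #include "storage/bufmgr.h"

    static bool
    try_page_cleanup(Buffer buf)
    {
        if (!ConditionalLockBufferForCleanup(buf))
            return false;       /* some other backend still holds a pin */

        /* we now hold an exclusive content lock and the only pin */
        /* ... remove dead tuples, compact free space ... */

        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        return true;
    }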


Buffer Manager's Internal Locking
---------------------------------

Before PostgreSQL 8.1, all operations of the shared buffer manager itself
were protected by a single system-wide lock, the BufMgrLock, which
unsurprisingly proved to be a source of contention.  The new locking scheme
avoids grabbing system-wide exclusive locks in common code paths.  It works
like this:

* There is a system-wide LWLock, the BufMappingLock, that notionally
protects the mapping from buffer tags (page identifiers) to buffers.
(Physically, it can be thought of as protecting the hash table maintained
by buf_table.c.)  To look up whether a buffer exists for a tag, it is
sufficient to obtain share lock on the BufMappingLock.  Note that one
must pin the found buffer, if any, before releasing the BufMappingLock.
To alter the page assignment of any buffer, one must hold exclusive lock
on the BufMappingLock.  This lock must be held across adjusting the buffer's
header fields and changing the buf_table hash table.  The only common
operation that needs exclusive lock is reading in a page that was not
in shared buffers already, which will require at least a kernel call
and usually a wait for I/O, so it will be slow anyway.

* As of PG 8.2, the BufMappingLock has been split into NUM_BUFFER_PARTITIONS
separate locks, each guarding a portion of the buffer tag space.  This allows
further reduction of contention in the normal code paths.  The partition
that a particular buffer tag belongs to is determined from the low-order
bits of the tag's hash value.  The rules stated above apply to each partition
independently.  If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.

* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
exclusion for operations that access the buffer free list or select
buffers for replacement.  A spinlock is used here rather than a lightweight
lock for efficiency; no other locks of any sort should be acquired while
buffer_strategy_lock is held.  This is essential to allow buffer replacement
to happen in multiple backends with reasonable concurrency.

* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header.  This allows operations such as
ReleaseBuffer to make local state changes without taking any system-wide
lock.  We use a spinlock, not an LWLock, since there are no cases where
the lock needs to be held for more than a few instructions.
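
As an illustration of that rule, a sketch of checking one header flag under
the header spinlock, using the LockBufHdr()/UnlockBufHdr() helpers and the
BM_DIRTY flag from buf_internals.h (the function itself is made up):

    #include "postgres.h"
    #include "storage/buf_internals.h"

    static bool
    buffer_is_dirty(BufferDesc *bufHdr)
    {
        uint32      buf_state = LockBufHdr(bufHdr);     /* spinlock the header */
        bool        dirty = (buf_state & BM_DIRTY) != 0;

        UnlockBufHdr(bufHdr, buf_state);                /* release immediately */
        return dirty;
    }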

Note that a buffer header's spinlock does not control access to the data
held within the buffer.  Each buffer header also contains an LWLock, the
"buffer content lock", that *does* represent the right to access the data
in the buffer.  It is used per the rules above.

There is yet another set of per-buffer LWLocks, the io_in_progress locks,
that are used to wait for I/O on a buffer to complete.  The process doing
a read or write takes exclusive lock for the duration, and processes that
need to wait for completion try to take shared locks (which they release
immediately upon obtaining).  XXX on systems where an LWLock represents
nontrivial resources, it's fairly annoying to need so many locks.  Possibly
we could use per-backend LWLocks instead (a buffer header would then contain
a field to show which backend is doing its I/O).


Normal Buffer Replacement Strategy
----------------------------------

There is a "free list" of buffers that are prime candidates for replacement.
In particular, buffers that are completely free (contain no valid page) are
always in this list.  We could also throw buffers into this list if we
consider their pages unlikely to be needed soon; however, the current
algorithm never does that.  The list is singly-linked using fields in the
buffer headers; we maintain head and tail pointers in global variables.
(Note: although the list links are in the buffer headers, they are
considered to be protected by the buffer_strategy_lock, not the buffer-header
spinlocks.)  To choose a victim buffer to recycle when there are no free
buffers available, we use a simple clock-sweep algorithm, which avoids the
need to take system-wide locks during common operations.  It works like
this:

Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned.  (This requires only the
buffer header spinlock, which would have to be taken anyway to increment the
buffer reference count, so it's nearly free.)

The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
through all the available buffers.  nextVictimBuffer is protected by the
buffer_strategy_lock.

The algorithm for a process that needs to obtain a victim buffer is:

1. Obtain buffer_strategy_lock.

2. If the buffer free list is nonempty, remove its head buffer.  Release
buffer_strategy_lock.  If the buffer is pinned or has a nonzero usage count,
it cannot be used; ignore it and go back to step 1.  Otherwise, pin the
buffer, and return it.

3. Otherwise, the buffer free list is empty.  Select the buffer pointed to by
nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
Release buffer_strategy_lock.

4. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used.  Decrement its usage count (if nonzero), reacquire
buffer_strategy_lock, and return to step 3 to examine the next buffer.

5. Pin the selected buffer, and return.

(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
have to give up and try another buffer.  This however is not a concern
of the basic select-a-victim-buffer algorithm.)
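
A much-simplified sketch of steps 3-5; the real logic lives in
StrategyGetBuffer() in freelist.c, and clock_sweep_tick() below merely
stands in for freelist.c's internal advance of nextVictimBuffer:

    #include "postgres.h"
    #include "storage/buf_internals.h"

    /* hypothetical stand-in for advancing nextVictimBuffer circularly */
    extern uint32 clock_sweep_tick(void);

    static BufferDesc *
    sweep_for_victim(void)
    {
        for (;;)
        {
            BufferDesc *buf = GetBufferDescriptor(clock_sweep_tick());
            uint32      buf_state = LockBufHdr(buf);

            if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
                BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
                return buf;     /* victim found; header spinlock still held */

            if (BUF_STATE_GET_REFCOUNT(buf_state) == 0)
                buf_state -= BUF_USAGECOUNT_ONE;    /* unpinned: age it */

            UnlockBufHdr(buf, buf_state);           /* pinned or aged: move on */
        }
    }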


Buffer Ring Replacement Strategy
---------------------------------

When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
again soon, so instead of running the normal clock sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
using the normal clock sweep algorithm and those buffers are reused for the
whole scan.  This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.

For sequential scans, a 256KB ring is used.  That's small enough to fit in L2
cache, which makes transferring pages from OS cache to shared buffer cache
efficient.  Even less would often be enough, but the ring must be big enough
to accommodate all pages in the scan that are pinned concurrently.  256KB
should also be enough to leave a small cache trail for other backends to
join in a synchronized seq scan.  If a ring buffer is dirtied and its LSN
updated, we would normally have to write and flush WAL before we could
re-use the buffer; in this case we instead discard the buffer from the ring
and (later) choose a replacement using the normal clock-sweep algorithm.
Hence this strategy works best for scans that are read-only (or at worst
update hint bits).  In a scan that modifies every page in the scan, like a
bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and
the ring strategy effectively degrades to the normal strategy.

VACUUM uses a 256KB ring like sequential scans, but dirty pages are not
removed from the ring.  Instead, WAL is flushed if needed to allow reuse of
the buffers.  Before introducing the buffer ring strategy in 8.3, VACUUM's
buffers were sent to the freelist, which was effectively a buffer ring of 1
buffer, resulting in excessive WAL flushing.  Allowing VACUUM to update
256KB between WAL flushes should be more efficient.

Bulk writes work similarly to VACUUM.  Currently this applies only to
COPY IN and CREATE TABLE AS SELECT.  (Might it be interesting to make
seqscan UPDATE and DELETE use the bulkwrite strategy?)  For bulk writes
we use a ring size of 16MB (but not more than 1/8th of shared_buffers).
Smaller sizes have been shown to result in the COPY blocking too often
for WAL flushes.  While it's okay for a background vacuum to be slowed by
doing its own WAL flushing, we'd prefer that COPY not be subject to that,
so we let it use up a bit more of the buffer arena.
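
For reference, a sketch of how a scan obtains and uses one of these rings.
GetAccessStrategy() in freelist.c sizes the ring (256KB for BAS_BULKREAD and
BAS_VACUUM, 16MB for BAS_BULKWRITE, in each case capped at 1/8th of
shared_buffers), so with the default 8KB block size that is 32 or 2048
buffers; the function below is illustrative only:

    #include "postgres.h"
    #include "storage/bufmgr.h"

    static void
    bulkread_one_page(Relation rel, BlockNumber blkno)
    {
        BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
        Buffer      buf;

        /* reads through the strategy recycle buffers from the small ring */
        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, strategy);
        /* ... process the page ... */
        ReleaseBuffer(buf);

        FreeAccessStrategy(strategy);
    }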


Background Writer's Processing
------------------------------

The background writer is designed to write out pages that are likely to be
recycled soon, thereby offloading the writing work from active backends.
To do this, it scans forward circularly from the current position of
nextVictimBuffer (which it does not change!), looking for buffers that are
dirty and not pinned nor marked with a positive usage count.  It pins,
writes, and releases any such buffer.

If we can assume that reading nextVictimBuffer is an atomic action, then
the writer doesn't even need to take buffer_strategy_lock in order to look
for buffers to write; it needs only to spinlock each buffer header for long
enough to check the dirtybit.  Even without that assumption, the writer
only needs to take the lock long enough to read the variable value, not
while scanning the buffers.  (This is a very substantial improvement in
the contention cost of the writer compared to PG 8.0.)

The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
This ensures that the page image transferred to disk is reasonably consistent.
We might miss a hint-bit update or two but that isn't a problem, for the same
reasons mentioned under buffer access rules.

As of 8.4, the background writer starts during recovery mode when there is
some form of potentially extended recovery to perform.  It performs an
identical service to normal processing, except that the checkpoints it
writes are technically restartpoints.