src/backend/utils/mmgr/README

Memory Context System Design Overview
=====================================

Background
----------

We do most of our memory allocation in "memory contexts", which are usually
AllocSets as implemented by src/backend/utils/mmgr/aset.c.  The key to
successful memory management without lots of overhead is to define a useful
set of contexts with appropriate lifespans.

The basic operations on a memory context are:

* create a context

* allocate a chunk of memory within a context (equivalent of standard
  C library's malloc())

* delete a context (including freeing all the memory allocated therein)

* reset a context (free all memory allocated in the context, but not the
  context object itself)

Given a chunk of memory previously allocated from a context, one can
free it or reallocate it larger or smaller (corresponding to standard C
library's free() and realloc() routines).  These operations return memory
to or get more memory from the same context the chunk was originally
allocated in.

At all times there is a "current" context denoted by the
CurrentMemoryContext global variable.  palloc() implicitly allocates space
in that context.  The MemoryContextSwitchTo() operation selects a new current
context (and returns the previous context, so that the caller can restore the
previous context before exiting).

The main advantage of memory contexts over plain use of malloc/free is
that the entire contents of a memory context can be freed easily, without
having to request freeing of each individual chunk within it.  This is
both faster and more reliable than per-chunk bookkeeping.  We use this
fact to clean up at transaction end: by resetting all the active contexts
of transaction or shorter lifespan, we can reclaim all transient memory.
Similarly, we can clean up at the end of each query, or after each tuple
is processed during a query.
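
The operations above can be illustrated with a deliberately tiny model.
This is only a sketch, not the aset.c implementation: every toy_* name is
hypothetical, each chunk is an individual malloc() call, and toy_current
merely stands in for CurrentMemoryContext.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Chunk header; the union keeps the user data suitably aligned. */
typedef union ToyChunk
{
    union ToyChunk *next;
    max_align_t align;
} ToyChunk;

/* A toy context is just a list of chunks that can be freed together. */
typedef struct ToyContext
{
    ToyChunk   *chunks;
    size_t      nchunks;
} ToyContext;

ToyContext *toy_current = NULL; /* stand-in for CurrentMemoryContext */

ToyContext *toy_create(void)
{
    ToyContext *cxt = malloc(sizeof(ToyContext));

    if (cxt == NULL)
        abort();
    cxt->chunks = NULL;
    cxt->nchunks = 0;
    return cxt;
}

/* Select a new current context and return the previous one. */
ToyContext *toy_switch_to(ToyContext *cxt)
{
    ToyContext *old = toy_current;

    toy_current = cxt;
    return old;
}

/* Allocate from the current context, in the spirit of palloc(). */
void *toy_alloc(size_t size)
{
    ToyChunk   *chunk;

    assert(toy_current != NULL);
    chunk = malloc(sizeof(ToyChunk) + size);
    if (chunk == NULL)
        abort();
    chunk->next = toy_current->chunks;
    toy_current->chunks = chunk;
    toy_current->nchunks++;
    return (void *) (chunk + 1);
}

/* Free every chunk at once, but keep the context object itself. */
void toy_reset(ToyContext *cxt)
{
    while (cxt->chunks != NULL)
    {
        ToyChunk   *next = cxt->chunks->next;

        free(cxt->chunks);
        cxt->chunks = next;
    }
    cxt->nchunks = 0;
}

/* Free the chunks and the context object. */
void toy_delete(ToyContext *cxt)
{
    toy_reset(cxt);
    if (toy_current == cxt)
        toy_current = NULL;
    free(cxt);
}
```

A caller would save the result of toy_switch_to(), allocate freely, restore
the saved context, and later call toy_reset() once instead of freeing each
chunk.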


Some Notes About the palloc API Versus Standard C Library
---------------------------------------------------------

The behavior of palloc and friends is similar to the standard C library's
malloc and friends, but there are some deliberate differences too.  Here
are some notes to clarify the behavior.

* If out of memory, palloc and repalloc exit via elog(ERROR).  They
  never return NULL, and it is not necessary or useful to test for such
  a result.  With palloc_extended() that behavior can be overridden
  using the MCXT_ALLOC_NO_OOM flag.

* palloc(0) is explicitly a valid operation.  It does not return a NULL
  pointer, but a valid chunk of which no bytes may be used.  However, the
  chunk might later be repalloc'd larger; it can also be pfree'd without
  error.  Similarly, repalloc allows realloc'ing to zero size.

* pfree and repalloc do not accept a NULL pointer.  This is intentional.
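
The differences above can be mimicked on top of malloc().  The following
is a sketch under stated assumptions, not the real palloc code: the toy_*
names and TOY_ALLOC_NO_OOM are hypothetical, requests above an arbitrary
TOY_MAX_REQUEST are treated as "out of memory", and the error path simply
exits where the real code would raise elog(ERROR).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical analogue of MCXT_ALLOC_NO_OOM. */
#define TOY_ALLOC_NO_OOM 0x01

/* Pretend that any larger request exhausts memory. */
#define TOY_MAX_REQUEST ((size_t) 1 << 20)

void *toy_alloc_extended(size_t size, int flags)
{
    void       *p = NULL;

    /* Ask for one extra byte so a zero-size request still gets a chunk. */
    if (size <= TOY_MAX_REQUEST)
        p = malloc(size + 1);

    if (p == NULL)
    {
        if (flags & TOY_ALLOC_NO_OOM)
            return NULL;        /* caller asked to handle failure itself */
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);     /* stand-in for elog(ERROR) */
    }
    return p;
}

/* Default variant: never returns NULL. */
void *toy_alloc(size_t size)
{
    return toy_alloc_extended(size, 0);
}

/* Unlike free(), a NULL argument is treated as a caller bug. */
void toy_free(void *p)
{
    assert(p != NULL);
    free(p);
}
```

Callers of toy_alloc() therefore need no NULL check; only callers that
pass TOY_ALLOC_NO_OOM must test the result.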


The Current Memory Context
--------------------------

Because it would be too much notational overhead to always pass an
appropriate memory context to called routines, there always exists the
notion of the current memory context CurrentMemoryContext.  Without it,
for example, the copyObject routines would need to be passed a context, as
would function execution routines that return a pass-by-reference
datatype.  The same goes for routines that temporarily allocate space
internally but don't return it to their caller.  We certainly don't
want to clutter every call in the system with "here is a context to
use for any temporary memory allocation you might want to do".

The upshot of that reasoning, though, is that CurrentMemoryContext should
generally point at a short-lifespan context if at all possible.  During
query execution it usually points to a context that gets reset after each
tuple.  Only in *very* circumscribed code should it ever point at a
context having greater than transaction lifespan, since doing so risks
permanent memory leaks.


pfree/repalloc Do Not Depend On CurrentMemoryContext
----------------------------------------------------

pfree() and repalloc() can be applied to any chunk whether it belongs
to CurrentMemoryContext or not --- the chunk's owning context will be
invoked to handle the operation, regardless.


"Parent" and "Child" Contexts
-----------------------------

If all contexts were independent, it'd be hard to keep track of them,
especially in error cases.  That is solved by creating a tree of
"parent" and "child" contexts.  When creating a memory context, the
new context can be specified to be a child of some existing context.
A context can have many children, but only one parent.  In this way
the contexts form a forest (not necessarily a single tree, since there
could be more than one top-level context; although in current practice
there is only one top context, TopMemoryContext).

Deleting a context deletes all its direct and indirect children as
well.  When resetting a context it's almost always more useful to
delete child contexts, thus MemoryContextReset() means that, and if
you really do want a tree of empty contexts you need to call
MemoryContextResetOnly() plus MemoryContextResetChildren().
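
The delete and reset rules just described can be sketched with a toy
context tree.  This is an illustration only, not the mcxt.c code: the
toy_* names are hypothetical, live_chunks stands in for the memory a
context holds, and toy_contexts_alive exists purely so the effect can be
observed.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ToyContext
{
    struct ToyContext *parent;
    struct ToyContext *firstchild;
    struct ToyContext *nextsibling;
    int         live_chunks;    /* stand-in for memory held here */
} ToyContext;

int toy_contexts_alive = 0;     /* bookkeeping for the demonstration */

/* Create a context, optionally as a child of an existing one. */
ToyContext *toy_create(ToyContext *parent)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));

    if (cxt == NULL)
        abort();
    cxt->parent = parent;
    if (parent != NULL)
    {
        cxt->nextsibling = parent->firstchild;
        parent->firstchild = cxt;
    }
    toy_contexts_alive++;
    return cxt;
}

/* Remove a context from its parent's list of children. */
static void toy_unlink(ToyContext *cxt)
{
    ToyContext **link;

    if (cxt->parent == NULL)
        return;
    link = &cxt->parent->firstchild;
    while (*link != cxt)
        link = &(*link)->nextsibling;
    *link = cxt->nextsibling;
}

/* Delete a context and, first, all its direct and indirect children. */
void toy_delete(ToyContext *cxt)
{
    while (cxt->firstchild != NULL)
        toy_delete(cxt->firstchild);    /* each call unlinks that child */
    toy_unlink(cxt);
    free(cxt);
    toy_contexts_alive--;
}

/* Reset: release this context's memory and delete its children. */
void toy_reset(ToyContext *cxt)
{
    while (cxt->firstchild != NULL)
        toy_delete(cxt->firstchild);
    cxt->live_chunks = 0;
}
```

Resetting a per-transaction toy context thus removes any per-statement or
per-scan contexts below it, while the context itself stays usable.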

These features allow us to manage a lot of contexts without fear that
some will be leaked; we only need to keep track of one top-level
context that we are going to delete at transaction end, and make sure
that any shorter-lived contexts we create are descendants of that
context.  Since the tree can have multiple levels, we can deal easily
with nested lifetimes of storage, such as per-transaction,
per-statement, per-scan, per-tuple.  Storage lifetimes that only
partially overlap can be handled by allocating from different trees of
the context forest (there are some examples in the next section).

For convenience we also provide operations like "reset/delete all children
of a given context, but don't reset or delete that context itself".


Memory Context Reset/Delete Callbacks
-------------------------------------

A feature introduced in Postgres 9.5 allows memory contexts to be used
for managing more resources than just plain palloc'd memory.  This is
done by registering a "reset callback function" for a memory context.
Such a function will be called, once, just before the context is next
reset or deleted.  It can be used to give up resources that are in some
sense associated with an object allocated within the context.  Possible
use-cases include:

* closing open files associated with a tuplesort object;

* releasing reference counts on long-lived cache objects that are held
  by some object within the context being reset;

* freeing malloc-managed memory associated with some palloc'd object.

That last case would just represent bad programming practice for pure
Postgres code; better to have made all the allocations using palloc,
in the target context or some child context.  However, it could well
come in handy for code that interfaces to non-Postgres libraries.

Any number of reset callbacks can be established for a memory context;
they are called in reverse order of registration.  Also, callbacks
attached to child contexts are called before callbacks attached to
parent contexts, if a tree of contexts is being reset or deleted.

The API for this requires the caller to provide a MemoryContextCallback
memory chunk to hold the state for a callback.  Typically this should be
allocated in the same context it is logically attached to, so that it
will be released automatically after use.  The reason for asking the
caller to provide this memory is that in most usage scenarios, the caller
will be creating some larger struct within the target context, and the
MemoryContextCallback struct can be made "for free" without a separate
palloc() call by including it in this larger struct.
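
A minimal sketch of the callback mechanism follows.  It is a model under
assumptions rather than the real API: all toy_* names are hypothetical,
ToySortState plays the role of the "larger struct" that embeds its
callback, and toy_sequence exists only to make the call order visible.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyCallback
{
    void        (*func) (void *arg);
    void       *arg;
    struct ToyCallback *next;
} ToyCallback;

typedef struct ToyContext
{
    ToyCallback *reset_cbs;     /* most recently registered first */
} ToyContext;

/* Push onto the head, so later registrations run earlier. */
void toy_register_reset_callback(ToyContext *cxt, ToyCallback *cb)
{
    cb->next = cxt->reset_cbs;
    cxt->reset_cbs = cb;
}

/* Run each callback once, unregistering it before the call. */
void toy_reset(ToyContext *cxt)
{
    ToyCallback *cb;

    while ((cb = cxt->reset_cbs) != NULL)
    {
        cxt->reset_cbs = cb->next;
        cb->func(cb->arg);
    }
}

int toy_sequence = 0;           /* counts callback invocations */

/* A larger object that carries its callback struct "for free". */
typedef struct ToySortState
{
    int         file_open;
    int         closed_at;      /* position in the overall call order */
    ToyCallback cb;
} ToySortState;

/* Callback that gives up the resource tied to the sort state. */
void toy_close_sort_file(void *arg)
{
    ToySortState *state = (ToySortState *) arg;

    state->file_open = 0;
    state->closed_at = ++toy_sequence;
}
```

Registering two sort states and resetting the context closes the second
one first; a further reset calls nothing, because each callback fires
only once.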


Memory Contexts in Practice
===========================

Globally Known Contexts
-----------------------

There are a few widely-known contexts that are typically referenced
through global variables.  At any instant the system may contain many
additional contexts, but all other contexts should be direct or indirect
children of one of these contexts to ensure they are not leaked in the
event of an error.

TopMemoryContext --- this is the actual top level of the context tree;
every other context is a direct or indirect child of this one.  Allocating
here is essentially the same as "malloc", because this context will never
be reset or deleted.  This is for stuff that should live forever, or for
stuff that the controlling module will take care of deleting at the
appropriate time.  An example is fd.c's tables of open files, as well as
the context management nodes for memory contexts themselves.  Avoid
allocating stuff here unless really necessary, and especially avoid
running with CurrentMemoryContext pointing here.

PostmasterContext --- this is the postmaster's normal working context.
After a backend is spawned, it can delete PostmasterContext to free its
copy of memory the postmaster was using that it doesn't need.
Note that in non-EXEC_BACKEND builds, the postmaster's copy of pg_hba.conf
and pg_ident.conf data is used directly during authentication in backend
processes; so backends can't delete PostmasterContext until that's done.
(The postmaster has only TopMemoryContext, PostmasterContext, and
ErrorContext --- the remaining top-level contexts are set up in each
backend during startup.)

CacheMemoryContext --- permanent storage for relcache, catcache, and
related modules.  This will never be reset or deleted, either, so it's
not truly necessary to distinguish it from TopMemoryContext.  But it
seems worthwhile to maintain the distinction for debugging purposes.
(Note: CacheMemoryContext has child contexts with shorter lifespans.
For example, a child context is the best place to keep the subsidiary
storage associated with a relcache entry; that way we can free rule
parsetrees and so forth easily, without having to depend on constructing
a reliable version of freeObject().)

MessageContext --- this context holds the current command message from the
frontend, as well as any derived storage that need only live as long as
the current message (for example, in simple-Query mode the parse and plan
trees can live here).  This context will be reset, and any children
deleted, at the top of each cycle of the outer loop of PostgresMain.  This
is kept separate from per-transaction and per-portal contexts because a
query string might need to live either a longer or shorter time than any
single transaction or portal.

TopTransactionContext --- this holds everything that lives until end of the
top-level transaction.  This context will be reset, and all its children
deleted, at conclusion of each top-level transaction cycle.  In most cases
you don't want to allocate stuff directly here, but in CurTransactionContext;
what does belong here is control information that exists explicitly to manage
status across multiple subtransactions.  Note: this context is NOT cleared
immediately upon error; its contents will survive until the transaction block
is exited by COMMIT/ROLLBACK.

CurTransactionContext --- this holds data that has to survive until the end
of the current transaction, and in particular will be needed at top-level
transaction commit.  When we are in a top-level transaction this is the same
as TopTransactionContext, but in subtransactions it points to a child context.
It is important to understand that if a subtransaction aborts, its
CurTransactionContext is thrown away after finishing the abort processing;
but a committed subtransaction's CurTransactionContext is kept until top-level
commit (unless of course one of the intermediate levels of subtransaction
aborts).  This ensures that we do not keep data from a failed subtransaction
longer than necessary.  Because of this behavior, you must be careful to clean
up properly during subtransaction abort --- the subtransaction's state must be
delinked from any pointers or lists kept in upper transactions, or you will
have dangling pointers leading to a crash at top-level commit.  An example of
data kept here is pending NOTIFY messages, which are sent at top-level commit,
but only if the generating subtransaction did not abort.

PortalContext --- this is not actually a separate context, but a
global variable pointing to the per-portal context of the currently active
execution portal.  This can be used if it's necessary to allocate storage
that will live just as long as the execution of the current portal requires.

ErrorContext --- this permanent context is switched into for error
recovery processing, and then reset on completion of recovery.  We arrange
to have a few KB of memory available in it at all times.  In this way, we
can ensure that some memory is available for error recovery even if the
backend has run out of memory otherwise.  This allows out-of-memory to be
treated as a normal ERROR condition, not a FATAL error.


Contexts For Prepared Statements And Portals
--------------------------------------------

A prepared-statement object has an associated private context, in which
the parse and plan trees for its query are stored.  Because these trees
are read-only to the executor, the prepared statement can be re-used many
times without further copying of these trees.

An execution-portal object has a private context that is referenced by
PortalContext when the portal is active.  In the case of a portal created
by DECLARE CURSOR, this private context contains the query parse and plan
trees (there being no other object that can hold them).  Portals created
from prepared statements simply reference the prepared statements' trees,
and don't actually need any storage allocated in their private contexts.


Logical Replication Worker Contexts
-----------------------------------

ApplyContext --- permanent during the whole lifetime of the apply worker.
It would be possible to use TopMemoryContext here as well, but to simplify
memory usage analysis we spin up a separate context.

ApplyMessageContext --- short-lived context that is reset after each
logical replication protocol message is processed.


Transient Contexts During Execution
-----------------------------------

When creating a prepared statement, the parse and plan trees will be built
in a temporary context that's a child of MessageContext (so that it will
go away automatically upon error).  On success, the finished plan is
copied to the prepared statement's private context, and the temp context
is released; this allows planner temporary space to be recovered before
execution begins.  (In simple-Query mode we don't bother with the extra
copy step, so the planner temp space stays around till end of query.)

The top-level executor routines, as well as most of the "plan node"
execution code, will normally run in a context that is created by
ExecutorStart and destroyed by ExecutorEnd; this context also holds the
"plan state" tree built during ExecutorStart.  Most of the memory
allocated in these routines is intended to live until end of query,
so this is appropriate for those purposes.  The executor's top context
is a child of PortalContext, that is, the per-portal context of the
portal that represents the query's execution.

The main memory-management consideration in the executor is that
expression evaluation --- both for qual testing and for computation of
targetlist entries --- needs to not leak memory.  To do this, each
ExprContext (expression-eval context) created in the executor has a
private memory context associated with it, and we switch into that context
when evaluating expressions in that ExprContext.  The plan node that owns
the ExprContext is responsible for resetting the private context to empty
when it no longer needs the results of expression evaluations.  Typically
the reset is done at the start of each tuple-fetch cycle in the plan node.

Note that this design gives each plan node its own expression-eval memory
context.  This appears necessary to handle nested joins properly, since
an outer plan node might need to retain expression results it has computed
while obtaining the next tuple from an inner node --- but the inner node
might execute many tuple cycles and many expressions before returning a
tuple.  The inner node must be able to reset its own expression context
more often than once per outer tuple cycle.  Fortunately, memory contexts
are cheap enough that giving one to each plan node doesn't seem like a
problem.

A problem with running index accesses and sorts in a query-lifespan context
is that these operations invoke datatype-specific comparison functions,
and if the comparators leak any memory then that memory won't be recovered
till end of query.  The comparator functions all return bool or int32,
so there's no problem with their result data, but there can be a problem
with leakage of internal temporary data.  In particular, comparator
functions that operate on TOAST-able data types need to be careful
not to leak detoasted versions of their inputs.  This is annoying, but
it appeared a lot easier to make the comparators conform than to fix the
index and sort routines, so that's what was done for 7.1.  This remains
the state of affairs in btree and hash indexes, so btree and hash support
functions still need to not leak memory.  Most of the other index AMs
have been modified to run opclass support functions in short-lived
contexts, so that leakage is not a problem; this is necessary in view
of the fact that their support functions tend to be far more complex.

There are some special cases, such as aggregate functions.  nodeAgg.c
needs to remember the results of evaluation of aggregate transition
functions from one tuple cycle to the next, so it can't just discard
all per-tuple state in each cycle.  The easiest way to handle this seems
to be to have two per-tuple contexts in an aggregate node, and to
ping-pong between them, so that at each tuple one is the active allocation
context and the other holds any results allocated by the prior cycle's
transition function.
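
The ping-pong idea can be sketched as follows.  This only illustrates
the scheme described above and makes no claim about what nodeAgg.c
actually does: the toy_* names are hypothetical, each "context" is a
small fixed buffer, and the transition function simply appends one
character to a string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A tiny bump allocator standing in for a per-tuple context. */
typedef struct ToyArena
{
    char        buf[256];
    size_t      used;
} ToyArena;

void *toy_arena_alloc(ToyArena *a, size_t size)
{
    void       *p;

    assert(size <= sizeof(a->buf) - a->used);
    p = a->buf + a->used;
    a->used += size;
    return p;
}

void toy_arena_reset(ToyArena *a)
{
    a->used = 0;
}

typedef struct ToyAgg
{
    ToyArena    arenas[2];
    int         active;     /* arena used for this cycle's allocations */
    char       *state;      /* prior result, held in the other arena */
} ToyAgg;

void toy_agg_init(ToyAgg *agg)
{
    memset(agg, 0, sizeof(*agg));
}

/* Transition step: build the new state from the old one plus c. */
void toy_agg_advance(ToyAgg *agg, char c)
{
    ToyArena   *cur = &agg->arenas[agg->active];
    size_t      oldlen = agg->state ? strlen(agg->state) : 0;
    char       *newstate;

    /* Anything left in the active arena is two cycles old: discard it. */
    toy_arena_reset(cur);

    newstate = toy_arena_alloc(cur, oldlen + 2);
    if (oldlen > 0)
        memcpy(newstate, agg->state, oldlen);
    newstate[oldlen] = c;
    newstate[oldlen + 1] = '\0';

    agg->state = newstate;
    agg->active = 1 - agg->active;  /* the other arena is used next */
}
```

The previous result is never in the arena being reset, so it stays
readable while the new result is built.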

Executor routines that switch the active CurrentMemoryContext may need
to copy data into their caller's current memory context before returning.
However, we have minimized the need for that, because of the convention
of resetting the per-tuple context at the *start* of an execution cycle
rather than at its end.  With that rule, an execution node can return a
tuple that is palloc'd in its per-tuple context, and the tuple will remain
good until the node is called for another tuple or told to end execution.
This parallels the situation with pass-by-reference values at the table
scan level, since a scan node can return a direct pointer to a tuple in a
disk buffer that is only guaranteed to remain good that long.
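
The reset-at-start convention can be sketched with a toy scan node.
This is an illustration under assumptions, not executor code: the toy_*
names are hypothetical, a "tuple" is a single int, and the per-tuple
context is a small array of slots.

```c
#include <assert.h>
#include <stddef.h>

/* A tiny per-tuple "context" that hands out int slots. */
typedef struct ToyArena
{
    int         slots[16];
    size_t      used;
} ToyArena;

int *toy_arena_alloc_int(ToyArena *a)
{
    assert(a->used < sizeof(a->slots) / sizeof(a->slots[0]));
    return &a->slots[a->used++];
}

void toy_arena_reset(ToyArena *a)
{
    a->used = 0;
}

typedef struct ToyScan
{
    ToyArena    per_tuple;
    int         next_value;
    int         limit;
} ToyScan;

/*
 * Return the next "tuple", or NULL at end of scan.  The result lives in
 * the per-tuple arena and stays valid until the next call.
 */
int *toy_scan_next(ToyScan *scan)
{
    int        *tuple;

    /* Reset at the start of the cycle, not at its end. */
    toy_arena_reset(&scan->per_tuple);

    if (scan->next_value >= scan->limit)
        return NULL;

    tuple = toy_arena_alloc_int(&scan->per_tuple);
    *tuple = scan->next_value++;
    return tuple;
}
```

Had the reset been done just before returning, the caller would receive
a pointer into already-released storage; with the reset at the start, no
copy into the caller's context is needed.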

A more common reason for copying data is to transfer a result from
per-tuple context to per-query context; for example, a Unique node will
save the last distinct tuple value in its per-query context, requiring a
copy step.


Mechanisms to Allow Multiple Types of Contexts
----------------------------------------------

To efficiently allow for different allocation patterns, and for
experimentation, we allow for different types of memory contexts with
different allocation policies but similar external behavior.  To
handle this, memory allocation functions are accessed via function
pointers, and we require all context types to obey the conventions
given here.

A memory context is represented by struct MemoryContextData (see
memnodes.h).  This struct identifies the exact type of the context, and
contains information common between the different types of
MemoryContext like the parent and child contexts, and the name of the
context.

This is essentially an abstract superclass, and the behavior is
determined by the "methods" pointer, which is its virtual function table
(struct MemoryContextMethods).  Specific memory context types will use
derived structs having these fields as their first fields.  All the
contexts of a specific type will have methods pointers that point to
the same static table of function pointers.

While operations like allocating from and resetting a context take the
relevant MemoryContext as a parameter, operations like free and
realloc are trickier.  To make those work, we require all memory
context types to produce allocated chunks that are immediately,
without any padding, preceded by a pointer to the corresponding
MemoryContext.

If a type of allocator needs additional information about its chunks,
such as the size of the allocation, that information can in turn
precede the MemoryContext.  This means the only overhead implied by
the memory context mechanism is a pointer to its context, so we're not
constraining context-type designers very much.

Given this, routines like pfree determine their corresponding context
with an operation like the following (usually encapsulated in
GetMemoryChunkContext()):

    MemoryContext context = *(MemoryContext *) (((char *) pointer) - sizeof(void *));

and then invoke the corresponding method for the context:

    (*context->methods->free_p) (context, pointer);
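
A self-contained model of this chunk-header convention is sketched below.
It is not the PostgreSQL code: all toy_* names are hypothetical, there is
a single malloc-backed context type, and alignment is only as good as
that of a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct ToyContext ToyContext;

/* Per-type virtual function table. */
typedef struct ToyMethods
{
    void       *(*alloc) (ToyContext *cxt, size_t size);
    void        (*free_p) (ToyContext *cxt, void *pointer);
} ToyMethods;

/* Fields common to every context type. */
struct ToyContext
{
    const ToyMethods *methods;
    int         live_chunks;
};

/* Chunk layout: [ToyContext *owner][user data ...] */
static void *toy_malloc_alloc(ToyContext *cxt, size_t size)
{
    ToyContext **header = malloc(sizeof(ToyContext *) + size);

    if (header == NULL)
        abort();
    *header = cxt;              /* owner sits right before the data */
    cxt->live_chunks++;
    return (void *) (header + 1);
}

static void toy_malloc_free(ToyContext *cxt, void *pointer)
{
    cxt->live_chunks--;
    free((char *) pointer - sizeof(ToyContext *));
}

/* One static table shared by all contexts of this type. */
const ToyMethods toy_malloc_methods = {toy_malloc_alloc, toy_malloc_free};

/* Recover the owning context from the chunk pointer alone. */
ToyContext *toy_chunk_context(void *pointer)
{
    return *(ToyContext **) ((char *) pointer - sizeof(ToyContext *));
}

void *toy_alloc(ToyContext *cxt, size_t size)
{
    return cxt->methods->alloc(cxt, size);
}

/* Needs no context argument and ignores any "current" context. */
void toy_free(void *pointer)
{
    ToyContext *cxt = toy_chunk_context(pointer);

    cxt->methods->free_p(cxt, pointer);
}
```

Because the owner is found through the header, toy_free() returns each
chunk to the context it came from, whichever context the caller is
currently working in.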


More Control Over aset.c Behavior
---------------------------------

By default aset.c always allocates an 8K block upon the first
allocation in a context, and doubles that size for each successive
block request.  That's good behavior for a context that might hold
*lots* of data.  But if there are dozens if not hundreds of smaller
contexts in the system, we need to be able to fine-tune things a
little better.

The creator of a context is able to specify an initial block size and
a maximum block size.  Selecting smaller values can prevent wastage of
space in contexts that aren't expected to hold very much (an example
is the relcache's per-relation contexts).

Also, it is possible to specify a minimum context size.  If this
value is greater than zero then a block of that size will be grabbed
immediately upon context creation, and cleared but not released during
context resets.  This feature is needed for ErrorContext (see above),
but will most likely not be used for other contexts.

We expect that per-tuple contexts will be reset frequently and typically
will not allocate very much space per tuple cycle.  To make this usage
pattern cheap, the first block allocated in a context is not given
back to malloc() during reset, but just cleared.  This avoids malloc
thrashing.
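
The block-size progression described in this section can be sketched as
below.  This is a simplified model, not the aset.c logic: the toy_* names
are hypothetical, and it covers only the "start at the initial size,
double per block, stop at the maximum" rule, ignoring the minimum context
size and the kept first block.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyBlockPolicy
{
    size_t      init_block_size;    /* size of the first block */
    size_t      max_block_size;     /* blocks never grow beyond this */
    size_t      next_block_size;    /* size the next block will get */
} ToyBlockPolicy;

void toy_policy_init(ToyBlockPolicy *p, size_t init_size, size_t max_size)
{
    assert(init_size > 0 && init_size <= max_size);
    p->init_block_size = init_size;
    p->max_block_size = max_size;
    p->next_block_size = init_size;
}

/* Return the size for the block being requested, then double for next. */
size_t toy_next_block_size(ToyBlockPolicy *p)
{
    size_t      size = p->next_block_size;

    if (size > p->max_block_size / 2)
        p->next_block_size = p->max_block_size;     /* cap at the maximum */
    else
        p->next_block_size = size * 2;
    return size;
}
```

With an initial size of 8192 and a maximum of 32768, successive requests
yield 8192, 16384, 32768 and then 32768 again; a small context created
with lower values never grows past its own maximum.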