src/backend/utils/mmgr/README

Memory Context System Design Overview
=====================================

Background
----------

We do most of our memory allocation in "memory contexts", which are usually
AllocSets as implemented by src/backend/utils/mmgr/aset.c. The key to
successful memory management without lots of overhead is to define a useful
set of contexts with appropriate lifespans.

The basic operations on a memory context are:

* create a context

* allocate a chunk of memory within a context (equivalent of standard
  C library's malloc())

* delete a context (including freeing all the memory allocated therein)

* reset a context (free all memory allocated in the context, but not the
  context object itself)

Given a chunk of memory previously allocated from a context, one can
free it or reallocate it larger or smaller (corresponding to standard C
library's free() and realloc() routines). These operations return memory
to or get more memory from the same context the chunk was originally
allocated in.

At all times there is a "current" context denoted by the
CurrentMemoryContext global variable. palloc() implicitly allocates space
in that context. The MemoryContextSwitchTo() operation selects a new current
context (and returns the previous context, so that the caller can restore the
previous context before exiting).

The main advantage of memory contexts over plain use of malloc/free is
that the entire contents of a memory context can be freed easily, without
having to request freeing of each individual chunk within it. This is
both faster and more reliable than per-chunk bookkeeping. We use this
fact to clean up at transaction end: by resetting all the active contexts
of transaction or shorter lifespan, we can reclaim all transient memory.
Similarly, we can clean up at the end of each query, or after each tuple
is processed during a query.
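
The operations described above can be illustrated with a small standalone
model. This is only a sketch, not the aset.c implementation: every name
below (ToyContext, toy_alloc, toy_switch_to and so on) is hypothetical, each
chunk is a separate malloc() call, and out-of-memory simply aborts. It is
meant to show how a "current" context plus a whole-context reset removes the
need to free chunks one at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: hypothetical names, not the PostgreSQL API. */
typedef union ToyChunk
{
    union ToyChunk *next;       /* next chunk owned by the same context */
    max_align_t align;          /* pads the header so the payload is aligned */
} ToyChunk;

typedef struct ToyContext
{
    ToyChunk *chunks;           /* every chunk allocated in this context */
    size_t nchunks;             /* number of live chunks */
} ToyContext;

/* Stand-in for the CurrentMemoryContext global variable. */
static ToyContext *toy_current = NULL;

ToyContext *toy_context_create(void)
{
    ToyContext *cxt = malloc(sizeof(ToyContext));

    if (cxt == NULL)
        abort();
    cxt->chunks = NULL;
    cxt->nchunks = 0;
    return cxt;
}

/* Select a new current context and return the previous one. */
ToyContext *toy_switch_to(ToyContext *cxt)
{
    ToyContext *old = toy_current;

    toy_current = cxt;
    return old;
}

/* Allocate in whatever context is current, as palloc() does. */
void *toy_alloc(size_t size)
{
    ToyChunk *chunk;

    assert(toy_current != NULL);
    chunk = malloc(sizeof(ToyChunk) + size);
    if (chunk == NULL)
        abort();
    chunk->next = toy_current->chunks;
    toy_current->chunks = chunk;
    toy_current->nchunks++;
    return chunk + 1;           /* payload starts right after the header */
}

/* Free everything allocated in the context, but keep the context itself. */
void toy_context_reset(ToyContext *cxt)
{
    while (cxt->chunks != NULL)
    {
        ToyChunk *next = cxt->chunks->next;

        free(cxt->chunks);
        cxt->chunks = next;
    }
    cxt->nchunks = 0;
}

/* Free the context's contents and the context object. */
void toy_context_delete(ToyContext *cxt)
{
    toy_context_reset(cxt);
    free(cxt);
}

/* Returns how many chunks were live just before the reset. */
size_t toy_demo(void)
{
    ToyContext *per_query = toy_context_create();
    ToyContext *old = toy_switch_to(per_query);
    size_t live;

    for (int i = 0; i < 3; i++)
        (void) toy_alloc(64);   /* no per-chunk free is ever issued */
    live = per_query->nchunks;

    toy_switch_to(old);         /* restore the caller's context */
    toy_context_reset(per_query);
    assert(per_query->nchunks == 0);
    toy_context_delete(per_query);
    return live;
}
```

Calling toy_demo() allocates three chunks, restores the previous current
context, and then releases all three with a single reset.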


Some Notes About the palloc API Versus Standard C Library
---------------------------------------------------------

The behavior of palloc and friends is similar to the standard C library's
malloc and friends, but there are some deliberate differences too. Here
are some notes to clarify the behavior.

* If out of memory, palloc and repalloc exit via elog(ERROR). They
never return NULL, and it is not necessary or useful to test for such
a result. With palloc_extended() that behavior can be overridden
using the MCXT_ALLOC_NO_OOM flag.

* palloc(0) is explicitly a valid operation. It does not return a NULL
pointer, but a valid chunk of which no bytes may be used. However, the
chunk might later be repalloc'd larger; it can also be pfree'd without
error. Similarly, repalloc allows realloc'ing to zero size.

* pfree and repalloc do not accept a NULL pointer. This is intentional.


The Current Memory Context
--------------------------

Because it would be too much notational overhead to always pass an
appropriate memory context to called routines, there always exists the
notion of the current memory context CurrentMemoryContext. Without it,
for example, the copyObject routines would need to be passed a context, as
would function execution routines that return a pass-by-reference
datatype. The same goes for routines that temporarily allocate space
internally, but don't return it to their caller. We certainly don't
want to clutter every call in the system with "here is a context to
use for any temporary memory allocation you might want to do".

The upshot of that reasoning, though, is that CurrentMemoryContext should
generally point at a short-lifespan context if at all possible. During
query execution it usually points to a context that gets reset after each
tuple.
Only in *very* circumscribed code should it ever point at a
context having greater than transaction lifespan, since doing so risks
permanent memory leaks.


pfree/repalloc Do Not Depend On CurrentMemoryContext
----------------------------------------------------

pfree() and repalloc() can be applied to any chunk whether it belongs
to CurrentMemoryContext or not --- the chunk's owning context will be
invoked to handle the operation, regardless.


"Parent" and "Child" Contexts
-----------------------------

If all contexts were independent, it'd be hard to keep track of them,
especially in error cases. That is solved by creating a tree of
"parent" and "child" contexts. When creating a memory context, the
new context can be specified to be a child of some existing context.
A context can have many children, but only one parent. In this way
the contexts form a forest (not necessarily a single tree, since there
could be more than one top-level context; although in current practice
there is only one top context, TopMemoryContext).

Deleting a context deletes all its direct and indirect children as
well. When resetting a context it's almost always more useful to
delete child contexts, thus MemoryContextReset() means that, and if
you really do want a tree of empty contexts you need to call
MemoryContextResetOnly() plus MemoryContextResetChildren().

These features allow us to manage a lot of contexts without fear that
some will be leaked; we only need to keep track of one top-level
context that we are going to delete at transaction end, and make sure
that any shorter-lived contexts we create are descendants of that
context. Since the tree can have multiple levels, we can deal easily
with nested lifetimes of storage, such as per-transaction,
per-statement, per-scan, per-tuple.
Storage lifetimes that only
partially overlap can be handled by allocating from different trees of
the context forest (there are some examples in the next section).

For convenience we also provide operations like "reset/delete all children
of a given context, but don't reset or delete that context itself".


Memory Context Reset/Delete Callbacks
-------------------------------------

A feature introduced in Postgres 9.5 allows memory contexts to be used
for managing more resources than just plain palloc'd memory. This is
done by registering a "reset callback function" for a memory context.
Such a function will be called, once, just before the context is next
reset or deleted. It can be used to give up resources that are in some
sense associated with an object allocated within the context. Possible
use-cases include
* closing open files associated with a tuplesort object;
* releasing reference counts on long-lived cache objects that are held
  by some object within the context being reset;
* freeing malloc-managed memory associated with some palloc'd object.
That last case would just represent bad programming practice for pure
Postgres code; better to have made all the allocations using palloc,
in the target context or some child context. However, it could well
come in handy for code that interfaces to non-Postgres libraries.

Any number of reset callbacks can be established for a memory context;
they are called in reverse order of registration. Also, callbacks
attached to child contexts are called before callbacks attached to
parent contexts, if a tree of contexts is being reset or deleted.

The API for this requires the caller to provide a MemoryContextCallback
memory chunk to hold the state for a callback. Typically this should be
allocated in the same context it is logically attached to, so that it
will be released automatically after use.
The reason for asking the
caller to provide this memory is that in most usage scenarios, the caller
will be creating some larger struct within the target context, and the
MemoryContextCallback struct can be made "for free" without a separate
palloc() call by including it in this larger struct.


Memory Contexts in Practice
===========================

Globally Known Contexts
-----------------------

There are a few widely-known contexts that are typically referenced
through global variables. At any instant the system may contain many
additional contexts, but all other contexts should be direct or indirect
children of one of these contexts to ensure they are not leaked in event
of an error.

TopMemoryContext --- this is the actual top level of the context tree;
every other context is a direct or indirect child of this one. Allocating
here is essentially the same as "malloc", because this context will never
be reset or deleted. This is for stuff that should live forever, or for
stuff that the controlling module will take care of deleting at the
appropriate time. An example is fd.c's tables of open files. Avoid
allocating stuff here unless really necessary, and especially avoid
running with CurrentMemoryContext pointing here.

PostmasterContext --- this is the postmaster's normal working context.
After a backend is spawned, it can delete PostmasterContext to free its
copy of memory the postmaster was using that it doesn't need.
Note that in non-EXEC_BACKEND builds, the postmaster's copy of pg_hba.conf
and pg_ident.conf data is used directly during authentication in backend
processes; so backends can't delete PostmasterContext until that's done.
(The postmaster has only TopMemoryContext, PostmasterContext, and
ErrorContext --- the remaining top-level contexts are set up in each
backend during startup.)

CacheMemoryContext --- permanent storage for relcache, catcache, and
related modules. This will never be reset or deleted, either, so it's
not truly necessary to distinguish it from TopMemoryContext. But it
seems worthwhile to maintain the distinction for debugging purposes.
(Note: CacheMemoryContext has child contexts with shorter lifespans.
For example, a child context is the best place to keep the subsidiary
storage associated with a relcache entry; that way we can free rule
parsetrees and so forth easily, without having to depend on constructing
a reliable version of freeObject().)

MessageContext --- this context holds the current command message from the
frontend, as well as any derived storage that need only live as long as
the current message (for example, in simple-Query mode the parse and plan
trees can live here). This context will be reset, and any children
deleted, at the top of each cycle of the outer loop of PostgresMain. This
is kept separate from per-transaction and per-portal contexts because a
query string might need to live either a longer or shorter time than any
single transaction or portal.

TopTransactionContext --- this holds everything that lives until end of the
top-level transaction. This context will be reset, and all its children
deleted, at conclusion of each top-level transaction cycle. In most cases
you don't want to allocate stuff directly here, but in CurTransactionContext;
what does belong here is control information that exists explicitly to manage
status across multiple subtransactions. Note: this context is NOT cleared
immediately upon error; its contents will survive until the transaction block
is exited by COMMIT/ROLLBACK.

CurTransactionContext --- this holds data that has to survive until the end
of the current transaction, and in particular will be needed at top-level
transaction commit.
When we are in a top-level transaction this is the same
as TopTransactionContext, but in subtransactions it points to a child context.
It is important to understand that if a subtransaction aborts, its
CurTransactionContext is thrown away after finishing the abort processing;
but a committed subtransaction's CurTransactionContext is kept until top-level
commit (unless of course one of the intermediate levels of subtransaction
aborts). This ensures that we do not keep data from a failed subtransaction
longer than necessary. Because of this behavior, you must be careful to clean
up properly during subtransaction abort --- the subtransaction's state must be
delinked from any pointers or lists kept in upper transactions, or you will
have dangling pointers leading to a crash at top-level commit. An example of
data kept here is pending NOTIFY messages, which are sent at top-level commit,
but only if the generating subtransaction did not abort.

PortalContext --- this is not actually a separate context, but a
global variable pointing to the per-portal context of the currently active
execution portal. This can be used if it's necessary to allocate storage
that will live just as long as the execution of the current portal requires.

ErrorContext --- this permanent context is switched into for error
recovery processing, and then reset on completion of recovery. We arrange
to have a few KB of memory available in it at all times. In this way, we
can ensure that some memory is available for error recovery even if the
backend has run out of memory otherwise. This allows out-of-memory to be
treated as a normal ERROR condition, not a FATAL error.


Contexts For Prepared Statements And Portals
--------------------------------------------

A prepared-statement object has an associated private context, in which
the parse and plan trees for its query are stored.
Because these trees
are read-only to the executor, the prepared statement can be re-used many
times without further copying of these trees.

An execution-portal object has a private context that is referenced by
PortalContext when the portal is active. In the case of a portal created
by DECLARE CURSOR, this private context contains the query parse and plan
trees (there being no other object that can hold them). Portals created
from prepared statements simply reference the prepared statements' trees,
and don't actually need any storage allocated in their private contexts.


Logical Replication Worker Contexts
-----------------------------------

ApplyContext --- permanent during the whole lifetime of an apply worker.
It would be possible to use TopMemoryContext here as well, but to simplify
memory usage analysis we spin up a separate context.

ApplyMessageContext --- short-lived context that is reset after each
logical replication protocol message is processed.


Transient Contexts During Execution
-----------------------------------

When creating a prepared statement, the parse and plan trees will be built
in a temporary context that's a child of MessageContext (so that it will
go away automatically upon error). On success, the finished plan is
copied to the prepared statement's private context, and the temp context
is released; this allows planner temporary space to be recovered before
execution begins. (In simple-Query mode we don't bother with the extra
copy step, so the planner temp space stays around till end of query.)

The top-level executor routines, as well as most of the "plan node"
execution code, will normally run in a context that is created by
ExecutorStart and destroyed by ExecutorEnd; this context also holds the
"plan state" tree built during ExecutorStart.
Most of the memory
allocated in these routines is intended to live until end of query,
so this is appropriate for those purposes. The executor's top context
is a child of PortalContext, that is, the per-portal context of the
portal that represents the query's execution.

The main memory-management consideration in the executor is that
expression evaluation --- both for qual testing and for computation of
targetlist entries --- needs to not leak memory. To do this, each
ExprContext (expression-eval context) created in the executor has a
private memory context associated with it, and we switch into that context
when evaluating expressions in that ExprContext. The plan node that owns
the ExprContext is responsible for resetting the private context to empty
when it no longer needs the results of expression evaluations. Typically
the reset is done at the start of each tuple-fetch cycle in the plan node.

Note that this design gives each plan node its own expression-eval memory
context. This appears necessary to handle nested joins properly, since
an outer plan node might need to retain expression results it has computed
while obtaining the next tuple from an inner node --- but the inner node
might execute many tuple cycles and many expressions before returning a
tuple. The inner node must be able to reset its own expression context
more often than once per outer tuple cycle. Fortunately, memory contexts
are cheap enough that giving one to each plan node doesn't seem like a
problem.

A problem with running index accesses and sorts in a query-lifespan context
is that these operations invoke datatype-specific comparison functions,
and if the comparators leak any memory then that memory won't be recovered
till end of query.
The comparator functions all return bool or int32,
so there's no problem with their result data, but there can be a problem
with leakage of internal temporary data. In particular, comparator
functions that operate on TOAST-able data types need to be careful
not to leak detoasted versions of their inputs. This is annoying, but
it appeared a lot easier to make the comparators conform than to fix the
index and sort routines, so that's what was done for 7.1. This remains
the state of affairs in btree and hash indexes, so btree and hash support
functions still need to not leak memory. Most of the other index AMs
have been modified to run opclass support functions in short-lived
contexts, so that leakage is not a problem; this is necessary in view
of the fact that their support functions tend to be far more complex.

There are some special cases, such as aggregate functions. nodeAgg.c
needs to remember the results of evaluation of aggregate transition
functions from one tuple cycle to the next, so it can't just discard
all per-tuple state in each cycle. The easiest way to handle this seems
to be to have two per-tuple contexts in an aggregate node, and to
ping-pong between them, so that at each tuple one is the active allocation
context and the other holds any results allocated by the prior cycle's
transition function.

Executor routines that switch the active CurrentMemoryContext may need
to copy data into their caller's current memory context before returning.
However, we have minimized the need for that, because of the convention
of resetting the per-tuple context at the *start* of an execution cycle
rather than at its end. With that rule, an execution node can return a
tuple that is palloc'd in its per-tuple context, and the tuple will remain
good until the node is called for another tuple or told to end execution.
This parallels the situation with pass-by-reference values at the table
scan level, since a scan node can return a direct pointer to a tuple in a
disk buffer that is only guaranteed to remain good that long.

A more common reason for copying data is to transfer a result from
per-tuple context to per-query context; for example, a Unique node will
save the last distinct tuple value in its per-query context, requiring a
copy step.


Mechanisms to Allow Multiple Types of Contexts
----------------------------------------------

To efficiently allow for different allocation patterns, and for
experimentation, we allow for different types of memory contexts with
different allocation policies but similar external behavior. To
handle this, memory allocation functions are accessed via function
pointers, and we require all context types to obey the conventions
given here.

A memory context is represented by struct MemoryContextData (see
memnodes.h). This struct identifies the exact type of the context, and
contains information common between the different types of
MemoryContext like the parent and child contexts, and the name of the
context.

This is essentially an abstract superclass, and the behavior is
determined by the "methods" pointer, which points to its virtual function
table (struct MemoryContextMethods). Specific memory context types will use
derived structs having these fields as their first fields. All the
contexts of a specific type will have methods pointers that point to
the same static table of function pointers.

While operations like allocating from and resetting a context take the
relevant MemoryContext as a parameter, operations like free and
realloc are trickier. To make those work, we require all memory
context types to produce allocated chunks that are immediately,
without any padding, preceded by a pointer to the corresponding
MemoryContext.
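
The chunk layout convention just described can be sketched in a few lines of
standalone C. This is a toy model under stated assumptions, not the real
memnodes.h definitions: ToyContext, ToyMethods, toy_alloc_in and the other
names are hypothetical, the context carries only one method, and the payload
is aligned only as well as a pointer. It shows a chunk being handed to a
free routine that finds the owning context from the word stored just before
the chunk, then dispatches through that context's method table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: hypothetical names, not the PostgreSQL structs. */
typedef struct ToyContext ToyContext;

typedef struct ToyMethods
{
    void (*free_p)(void *pointer);  /* per-context-type free routine */
} ToyMethods;

struct ToyContext
{
    const ToyMethods *methods;      /* shared by all contexts of one type */
    const char *name;
    int nfreed;                     /* chunks freed so far, for the demo */
};

/* Allocate a chunk whose payload is directly preceded by its owner. */
void *toy_alloc_in(ToyContext *cxt, size_t size)
{
    char *raw = malloc(sizeof(ToyContext *) + size);

    if (raw == NULL)
        abort();
    *(ToyContext **) raw = cxt;     /* back-pointer, no padding after it */
    return raw + sizeof(ToyContext *);
}

/* Recover the owning context from the word just before the chunk. */
ToyContext *toy_chunk_context(void *pointer)
{
    return *(ToyContext **) ((char *) pointer - sizeof(ToyContext *));
}

/* Free routine for this toy context type. */
static void toy_simple_free(void *pointer)
{
    ToyContext *cxt = toy_chunk_context(pointer);

    cxt->nfreed++;
    free((char *) pointer - sizeof(ToyContext *));
}

static const ToyMethods toy_simple_methods = { toy_simple_free };

/* Generic free: needs no context argument and ignores any "current" one. */
void toy_free(void *pointer)
{
    ToyContext *cxt = toy_chunk_context(pointer);

    cxt->methods->free_p(pointer);
}

/* Returns 1 if each chunk was routed back to the context that owns it. */
int toy_header_demo(void)
{
    ToyContext a = { &toy_simple_methods, "a", 0 };
    ToyContext b = { &toy_simple_methods, "b", 0 };
    void *p = toy_alloc_in(&a, 32);
    void *q = toy_alloc_in(&b, 32);
    int ok = (toy_chunk_context(p) == &a) && (toy_chunk_context(q) == &b);

    toy_free(q);
    toy_free(p);
    return ok && a.nfreed == 1 && b.nfreed == 1;
}
```

Because the owner is found from the chunk itself, toy_free() behaves the same
no matter which context the caller happens to be running in.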

If a type of allocator needs additional information about its chunks,
such as the size of the allocation, that information can in turn
precede the MemoryContext. This means the only overhead implied by
the memory context mechanism is a pointer to its context, so we're not
constraining context-type designers very much.

Given this, routines like pfree determine their corresponding context
with an operation like the following (usually encapsulated in
GetMemoryChunkContext()):

    MemoryContext context = *(MemoryContext*) (((char *) pointer) - sizeof(void *));

and then invoke the corresponding method for the context:

    context->methods->free_p(pointer);


More Control Over aset.c Behavior
---------------------------------

By default aset.c always allocates an 8K block upon the first
allocation in a context, and doubles that size for each successive
block request. That's good behavior for a context that might hold
*lots* of data. But if there are dozens if not hundreds of smaller
contexts in the system, we need to be able to fine-tune things a
little better.

The creator of a context is able to specify an initial block size and
a maximum block size. Selecting smaller values can prevent wastage of
space in contexts that aren't expected to hold very much (an example
is the relcache's per-relation contexts).

Also, it is possible to specify a minimum context size, in case for some
reason that should be different from the initial size for additional
blocks. An aset.c context will always contain at least one block,
of size minContextSize if that is specified, otherwise initBlockSize.

We expect that per-tuple contexts will be reset frequently and typically
will not allocate very much space per tuple cycle. To make this usage
pattern cheap, the first block allocated in a context is not given
back to malloc() during reset, but just cleared.
This avoids malloc
thrashing.


Alternative Memory Context Implementations
------------------------------------------

aset.c is our default general-purpose implementation, working fine
in most situations. We also have two implementations optimized for
special use cases, providing either better performance or lower memory
usage compared to aset.c (or both).

* slab.c (SlabContext) is designed for allocations of fixed-length
  chunks, and does not allow allocations of chunks of any other size.

* generation.c (GenerationContext) is designed for cases when chunks
  are allocated in groups with similar lifespan (generations), or
  roughly in FIFO order.

Both memory contexts aim to free memory back to the operating system
(unlike aset.c, which keeps the freed chunks in a freelist, and only
returns the memory when reset/deleted).

These memory contexts were initially developed for ReorderBuffer, but
may be useful elsewhere as long as the allocation patterns match.
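
To give a feel for why a fixed chunk size simplifies an allocator, here is a
minimal standalone sketch of a fixed-size pool. It is not slab.c: the names
(ToySlab, toy_slab_alloc and so on) are hypothetical, the pool has a single
block that is only released when the pool is deleted, a request of the wrong
size returns NULL rather than raising an error, and chunks are aligned only
as well as a pointer. It shows only that free slots of equal size can be
chained through their own storage and reused without any size bookkeeping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: hypothetical names, not the slab.c API. */
typedef struct ToySlab
{
    size_t chunk_size;      /* the one request size this pool accepts */
    size_t slot_size;       /* chunk_size rounded up to pointer multiples */
    size_t nfree;           /* slots currently on the free list */
    char *block;            /* single backing block holding all slots */
    void *free_head;        /* free list threaded through unused slots */
} ToySlab;

ToySlab *toy_slab_create(size_t chunk_size, size_t nslots)
{
    ToySlab *slab = malloc(sizeof(ToySlab));
    size_t ptr = sizeof(void *);

    assert(chunk_size > 0 && nslots > 0);
    if (slab == NULL)
        abort();
    slab->chunk_size = chunk_size;
    slab->slot_size = ((chunk_size + ptr - 1) / ptr) * ptr;
    slab->nfree = nslots;
    slab->block = malloc(slab->slot_size * nslots);
    if (slab->block == NULL)
        abort();
    slab->free_head = NULL;

    /* Push every slot onto the free list, last slot first. */
    for (size_t i = nslots; i-- > 0;)
    {
        void *slot = slab->block + i * slab->slot_size;

        *(void **) slot = slab->free_head;
        slab->free_head = slot;
    }
    return slab;
}

/* Pop a free slot; refuse any size other than the pool's chunk size. */
void *toy_slab_alloc(ToySlab *slab, size_t size)
{
    void *slot = slab->free_head;

    if (size != slab->chunk_size || slot == NULL)
        return NULL;
    slab->free_head = *(void **) slot;
    slab->nfree--;
    return slot;
}

/* Push the slot back; it becomes the next one handed out. */
void toy_slab_free(ToySlab *slab, void *chunk)
{
    *(void **) chunk = slab->free_head;
    slab->free_head = chunk;
    slab->nfree++;
}

void toy_slab_delete(ToySlab *slab)
{
    free(slab->block);
    free(slab);
}

/* Returns 1 if wrong sizes are refused and a freed slot is reused. */
int toy_slab_demo(void)
{
    ToySlab *slab = toy_slab_create(24, 4);
    void *a = toy_slab_alloc(slab, 24);
    void *b = toy_slab_alloc(slab, 24);
    int ok = (a != NULL && b != NULL && a != b);

    ok = ok && toy_slab_alloc(slab, 16) == NULL;    /* wrong size */
    toy_slab_free(slab, a);
    ok = ok && toy_slab_alloc(slab, 24) == a;       /* slot reused */
    ok = ok && slab->nfree == 2;
    toy_slab_delete(slab);
    return ok;
}
```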