src/backend/utils/mmgr/README

Memory Context System Design Overview
=====================================

Background
----------

We do most of our memory allocation in "memory contexts", which are usually
AllocSets as implemented by src/backend/utils/mmgr/aset.c. The key to
successful memory management without lots of overhead is to define a useful
set of contexts with appropriate lifespans.

The basic operations on a memory context are:

* create a context

* allocate a chunk of memory within a context (equivalent of standard
  C library's malloc())

* delete a context (including freeing all the memory allocated therein)

* reset a context (free all memory allocated in the context, but not the
  context object itself)

* inquire about the total amount of memory allocated to the context
  (the raw memory from which the context allocates chunks; not the
  chunks themselves)

Given a chunk of memory previously allocated from a context, one can
free it or reallocate it larger or smaller (corresponding to standard C
library's free() and realloc() routines). These operations return memory
to or get more memory from the same context the chunk was originally
allocated in.

At all times there is a "current" context denoted by the
CurrentMemoryContext global variable. palloc() implicitly allocates space
in that context. The MemoryContextSwitchTo() operation selects a new current
context (and returns the previous context, so that the caller can restore the
previous context before exiting).

The main advantage of memory contexts over plain use of malloc/free is
that the entire contents of a memory context can be freed easily, without
having to request freeing of each individual chunk within it. This is
both faster and more reliable than per-chunk bookkeeping.
We use this fact to clean up at transaction end: by resetting all the
active contexts of transaction or shorter lifespan, we can reclaim all
transient memory. Similarly, we can clean up at the end of each query,
or after each tuple is processed during a query.


Some Notes About the palloc API Versus Standard C Library
---------------------------------------------------------

The behavior of palloc and friends is similar to the standard C library's
malloc and friends, but there are some deliberate differences too. Here
are some notes to clarify the behavior.

* If out of memory, palloc and repalloc exit via elog(ERROR). They
never return NULL, and it is not necessary or useful to test for such
a result. With palloc_extended() that behavior can be overridden
using the MCXT_ALLOC_NO_OOM flag.

* palloc(0) is explicitly a valid operation. It does not return a NULL
pointer, but a valid chunk of which no bytes may be used. However, the
chunk might later be repalloc'd larger; it can also be pfree'd without
error. Similarly, repalloc allows realloc'ing to zero size.

* pfree and repalloc do not accept a NULL pointer. This is intentional.


The Current Memory Context
--------------------------

Because it would be too much notational overhead to always pass an
appropriate memory context to called routines, there always exists the
notion of the current memory context CurrentMemoryContext. Without it,
for example, the copyObject routines would need to be passed a context, as
would function execution routines that return a pass-by-reference
datatype. The same goes for routines that temporarily allocate space
internally, but don't return it to their caller. We certainly don't
want to clutter every call in the system with "here is a context to
use for any temporary memory allocation you might want to do".

The upshot of that reasoning, though, is that CurrentMemoryContext should
generally point at a short-lifespan context if at all possible. During
query execution it usually points to a context that gets reset after each
tuple. Only in *very* circumscribed code should it ever point at a
context having greater than transaction lifespan, since doing so risks
permanent memory leaks.


pfree/repalloc Do Not Depend On CurrentMemoryContext
----------------------------------------------------

pfree() and repalloc() can be applied to any chunk whether it belongs
to CurrentMemoryContext or not --- the chunk's owning context will be
invoked to handle the operation, regardless.


"Parent" and "Child" Contexts
-----------------------------

If all contexts were independent, it'd be hard to keep track of them,
especially in error cases. That is solved by creating a tree of
"parent" and "child" contexts. When creating a memory context, the
new context can be specified to be a child of some existing context.
A context can have many children, but only one parent. In this way
the contexts form a forest (not necessarily a single tree, since there
could be more than one top-level context; although in current practice
there is only one top context, TopMemoryContext).

Deleting a context deletes all its direct and indirect children as
well. When resetting a context it's almost always more useful to
delete child contexts, thus MemoryContextReset() means that, and if
you really do want a tree of empty contexts you need to call
MemoryContextResetOnly() plus MemoryContextResetChildren().

These features allow us to manage a lot of contexts without fear that
some will be leaked; we only need to keep track of one top-level
context that we are going to delete at transaction end, and make sure
that any shorter-lived contexts we create are descendants of that
context.
Since the tree can have multiple levels, we can deal easily with nested
lifetimes of storage, such as per-transaction, per-statement, per-scan,
per-tuple. Storage lifetimes that only partially overlap can be handled
by allocating from different trees of the context forest (there are some
examples in the next section).

For convenience we also provide operations like "reset/delete all children
of a given context, but don't reset or delete that context itself".


Memory Context Reset/Delete Callbacks
-------------------------------------

A feature introduced in Postgres 9.5 allows memory contexts to be used
for managing more resources than just plain palloc'd memory. This is
done by registering a "reset callback function" for a memory context.
Such a function will be called, once, just before the context is next
reset or deleted. It can be used to give up resources that are in some
sense associated with an object allocated within the context. Possible
use-cases include
* closing open files associated with a tuplesort object;
* releasing reference counts on long-lived cache objects that are held
  by some object within the context being reset;
* freeing malloc-managed memory associated with some palloc'd object.
That last case would just represent bad programming practice for pure
Postgres code; better to have made all the allocations using palloc,
in the target context or some child context. However, it could well
come in handy for code that interfaces to non-Postgres libraries.

Any number of reset callbacks can be established for a memory context;
they are called in reverse order of registration. Also, callbacks
attached to child contexts are called before callbacks attached to
parent contexts, if a tree of contexts is being reset or deleted.

The API for this requires the caller to provide a MemoryContextCallback
memory chunk to hold the state for a callback.
Typically this should be allocated in the same context it is logically
attached to, so that it will be released automatically after use. The
reason for asking the caller to provide this memory is that in most usage
scenarios, the caller will be creating some larger struct within the
target context, and the MemoryContextCallback struct can be made "for
free" without a separate palloc() call by including it in this larger
struct.


Memory Contexts in Practice
===========================

Globally Known Contexts
-----------------------

There are a few widely-known contexts that are typically referenced
through global variables. At any instant the system may contain many
additional contexts, but all other contexts should be direct or indirect
children of one of these contexts to ensure they are not leaked in the
event of an error.

TopMemoryContext --- this is the actual top level of the context tree;
every other context is a direct or indirect child of this one. Allocating
here is essentially the same as "malloc", because this context will never
be reset or deleted. This is for stuff that should live forever, or for
stuff that the controlling module will take care of deleting at the
appropriate time. An example is fd.c's tables of open files. Avoid
allocating stuff here unless really necessary, and especially avoid
running with CurrentMemoryContext pointing here.

PostmasterContext --- this is the postmaster's normal working context.
After a backend is spawned, it can delete PostmasterContext to free its
copy of memory the postmaster was using that it doesn't need.
Note that in non-EXEC_BACKEND builds, the postmaster's copy of pg_hba.conf
and pg_ident.conf data is used directly during authentication in backend
processes; so backends can't delete PostmasterContext until that's done.
(The postmaster has only TopMemoryContext, PostmasterContext, and
ErrorContext --- the remaining top-level contexts are set up in each
backend during startup.)

CacheMemoryContext --- permanent storage for relcache, catcache, and
related modules. This will never be reset or deleted, either, so it's
not truly necessary to distinguish it from TopMemoryContext. But it
seems worthwhile to maintain the distinction for debugging purposes.
(Note: CacheMemoryContext has child contexts with shorter lifespans.
For example, a child context is the best place to keep the subsidiary
storage associated with a relcache entry; that way we can free rule
parsetrees and so forth easily, without having to depend on constructing
a reliable version of freeObject().)

MessageContext --- this context holds the current command message from the
frontend, as well as any derived storage that need only live as long as
the current message (for example, in simple-Query mode the parse and plan
trees can live here). This context will be reset, and any children
deleted, at the top of each cycle of the outer loop of PostgresMain. This
is kept separate from per-transaction and per-portal contexts because a
query string might need to live either a longer or shorter time than any
single transaction or portal.

TopTransactionContext --- this holds everything that lives until end of the
top-level transaction. This context will be reset, and all its children
deleted, at conclusion of each top-level transaction cycle. In most cases
you don't want to allocate stuff directly here, but in CurTransactionContext;
what does belong here is control information that exists explicitly to manage
status across multiple subtransactions. Note: this context is NOT cleared
immediately upon error; its contents will survive until the transaction block
is exited by COMMIT/ROLLBACK.

CurTransactionContext --- this holds data that has to survive until the end
of the current transaction, and in particular will be needed at top-level
transaction commit. When we are in a top-level transaction this is the same
as TopTransactionContext, but in subtransactions it points to a child context.
It is important to understand that if a subtransaction aborts, its
CurTransactionContext is thrown away after finishing the abort processing;
but a committed subtransaction's CurTransactionContext is kept until top-level
commit (unless of course one of the intermediate levels of subtransaction
aborts). This ensures that we do not keep data from a failed subtransaction
longer than necessary. Because of this behavior, you must be careful to clean
up properly during subtransaction abort --- the subtransaction's state must be
delinked from any pointers or lists kept in upper transactions, or you will
have dangling pointers leading to a crash at top-level commit. An example of
data kept here is pending NOTIFY messages, which are sent at top-level commit,
but only if the generating subtransaction did not abort.

PortalContext --- this is not actually a separate context, but a
global variable pointing to the per-portal context of the currently active
execution portal. This can be used if it's necessary to allocate storage
that will live just as long as the execution of the current portal requires.

ErrorContext --- this permanent context is switched into for error
recovery processing, and then reset on completion of recovery. We arrange
to have a few KB of memory available in it at all times. In this way, we
can ensure that some memory is available for error recovery even if the
backend has run out of memory otherwise. This allows out-of-memory to be
treated as a normal ERROR condition, not a FATAL error.


Contexts For Prepared Statements And Portals
--------------------------------------------

A prepared-statement object has an associated private context, in which
the parse and plan trees for its query are stored. Because these trees
are read-only to the executor, the prepared statement can be re-used many
times without further copying of these trees.

An execution-portal object has a private context that is referenced by
PortalContext when the portal is active. In the case of a portal created
by DECLARE CURSOR, this private context contains the query parse and plan
trees (there being no other object that can hold them). Portals created
from prepared statements simply reference the prepared statements' trees,
and don't actually need any storage allocated in their private contexts.


Logical Replication Worker Contexts
-----------------------------------

ApplyContext --- permanent during the whole lifetime of the apply worker.
It is possible to use TopMemoryContext here as well, but to simplify
memory usage analysis we spin up a different context.

ApplyMessageContext --- short-lived context that is reset after each
logical replication protocol message is processed.


Transient Contexts During Execution
-----------------------------------

When creating a prepared statement, the parse and plan trees will be built
in a temporary context that's a child of MessageContext (so that it will
go away automatically upon error). On success, the finished plan is
copied to the prepared statement's private context, and the temp context
is released; this allows planner temporary space to be recovered before
execution begins. (In simple-Query mode we don't bother with the extra
copy step, so the planner temp space stays around till end of query.)

The top-level executor routines, as well as most of the "plan node"
execution code, will normally run in a context that is created by
ExecutorStart and destroyed by ExecutorEnd; this context also holds the
"plan state" tree built during ExecutorStart. Most of the memory
allocated in these routines is intended to live until end of query,
so this is appropriate for those purposes. The executor's top context
is a child of PortalContext, that is, the per-portal context of the
portal that represents the query's execution.

The main memory-management consideration in the executor is that
expression evaluation --- both for qual testing and for computation of
targetlist entries --- needs to not leak memory. To do this, each
ExprContext (expression-eval context) created in the executor has a
private memory context associated with it, and we switch into that context
when evaluating expressions in that ExprContext. The plan node that owns
the ExprContext is responsible for resetting the private context to empty
when it no longer needs the results of expression evaluations. Typically
the reset is done at the start of each tuple-fetch cycle in the plan node.

Note that this design gives each plan node its own expression-eval memory
context. This appears necessary to handle nested joins properly, since
an outer plan node might need to retain expression results it has computed
while obtaining the next tuple from an inner node --- but the inner node
might execute many tuple cycles and many expressions before returning a
tuple. The inner node must be able to reset its own expression context
more often than once per outer tuple cycle. Fortunately, memory contexts
are cheap enough that giving one to each plan node doesn't seem like a
problem.
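
As a rough illustration of the reset-at-start convention, and of why each
plan node wants its own context, here is a minimal self-contained C sketch.
It does not use the real allocator or executor APIs: ToyContext, ToyNode,
toy_alloc, toy_reset and toy_node_next are hypothetical names, and the
"context" is just a fixed buffer with a bump pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for a per-tuple memory context (an assumption, not aset.c):
 * a fixed buffer plus a bump pointer, so "reset" is just rewinding. */
typedef struct ToyContext
{
    char    buf[256];
    size_t  used;       /* bytes handed out since the last reset */
    int     resets;     /* number of resets, for illustration only */
} ToyContext;

/* Hand out "size" bytes; only suitable for char data (no alignment). */
void *
toy_alloc(ToyContext *cxt, size_t size)
{
    void   *p;

    assert(cxt->used + size <= sizeof(cxt->buf));
    p = cxt->buf + cxt->used;
    cxt->used += size;
    return p;
}

/* Release everything allocated in the context in one step. */
void
toy_reset(ToyContext *cxt)
{
    cxt->used = 0;
    cxt->resets++;
}

/* Hypothetical plan node that owns its own expression-eval context. */
typedef struct ToyNode
{
    ToyContext  per_tuple;
    int         next_value;
} ToyNode;

/* Fetch one "tuple".  The reset happens at the START of the cycle, so the
 * result of the previous call stays valid until this node is called again. */
const char *
toy_node_next(ToyNode *node)
{
    char   *result;

    toy_reset(&node->per_tuple);
    result = toy_alloc(&node->per_tuple, 32);
    snprintf(result, 32, "tuple-%d", node->next_value++);
    return result;
}
```

With one ToyNode acting as an outer node and another as an inner node, the
string returned by the outer node's last toy_node_next() call stays intact
however many times the inner node is called, because each node only ever
resets its own context.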

A problem with running index accesses and sorts in a query-lifespan context
is that these operations invoke datatype-specific comparison functions,
and if the comparators leak any memory then that memory won't be recovered
till end of query. The comparator functions all return bool or int32,
so there's no problem with their result data, but there can be a problem
with leakage of internal temporary data. In particular, comparator
functions that operate on TOAST-able data types need to be careful
not to leak detoasted versions of their inputs. This is annoying, but
it appeared a lot easier to make the comparators conform than to fix the
index and sort routines, so that's what was done for 7.1. This remains
the state of affairs in btree and hash indexes, so btree and hash support
functions still need to not leak memory. Most of the other index AMs
have been modified to run opclass support functions in short-lived
contexts, so that leakage is not a problem; this is necessary in view
of the fact that their support functions tend to be far more complex.

There are some special cases, such as aggregate functions. nodeAgg.c
needs to remember the results of evaluation of aggregate transition
functions from one tuple cycle to the next, so it can't just discard
all per-tuple state in each cycle. The easiest way to handle this seems
to be to have two per-tuple contexts in an aggregate node, and to
ping-pong between them, so that at each tuple one is the active allocation
context and the other holds any results allocated by the prior cycle's
transition function.

Executor routines that switch the active CurrentMemoryContext may need
to copy data into their caller's current memory context before returning.
However, we have minimized the need for that, because of the convention
of resetting the per-tuple context at the *start* of an execution cycle
rather than at its end.
With that rule, an execution node can return a tuple that is palloc'd in
its per-tuple context, and the tuple will remain good until the node is
called for another tuple or told to end execution. This parallels the
situation with pass-by-reference values at the table scan level, since a
scan node can return a direct pointer to a tuple in a disk buffer that is
only guaranteed to remain good that long.

A more common reason for copying data is to transfer a result from
per-tuple context to per-query context; for example, a Unique node will
save the last distinct tuple value in its per-query context, requiring a
copy step.


Mechanisms to Allow Multiple Types of Contexts
----------------------------------------------

To efficiently allow for different allocation patterns, and for
experimentation, we allow for different types of memory contexts with
different allocation policies but similar external behavior. To
handle this, memory allocation functions are accessed via function
pointers, and we require all context types to obey the conventions
given here.

A memory context is represented by struct MemoryContextData (see
memnodes.h). This struct identifies the exact type of the context, and
contains information common between the different types of
MemoryContext like the parent and child contexts, and the name of the
context.

This is essentially an abstract superclass, and the behavior is
determined by the "methods" pointer, which is its virtual function table
(struct MemoryContextMethods). Specific memory context types will use
derived structs having these fields as their first fields. All the
contexts of a specific type will have methods pointers that point to
the same static table of function pointers.

While operations like allocating from and resetting a context take the
relevant MemoryContext as a parameter, operations like free and
realloc are trickier.
To make those work, we require all memory context types to produce
allocated chunks that are immediately, without any padding, preceded by
a pointer to the corresponding MemoryContext.

If a type of allocator needs additional information about its chunks,
such as the size of the allocation, that information can in turn
precede the MemoryContext. This means the only overhead implied by
the memory context mechanism is a pointer to its context, so we're not
constraining context-type designers very much.

Given this, routines like pfree determine their corresponding context
with an operation like (although that is usually encapsulated in
GetMemoryChunkContext())

    MemoryContext context = *(MemoryContext*) (((char *) pointer) - sizeof(void *));

and then invoke the corresponding method for the context

    context->methods->free_p(pointer);


More Control Over aset.c Behavior
---------------------------------

By default aset.c always allocates an 8K block upon the first
allocation in a context, and doubles that size for each successive
block request. That's good behavior for a context that might hold
*lots* of data. But if there are dozens if not hundreds of smaller
contexts in the system, we need to be able to fine-tune things a
little better.

The creator of a context is able to specify an initial block size and
a maximum block size. Selecting smaller values can prevent wastage of
space in contexts that aren't expected to hold very much (an example
is the relcache's per-relation contexts).

Also, it is possible to specify a minimum context size, in case for some
reason that should be different from the initial size for additional
blocks. An aset.c context will always contain at least one block,
of size minContextSize if that is specified, otherwise initBlockSize.
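
To get a feel for how the initial and maximum block sizes interact, the
doubling policy described above can be modeled in a few lines of C. This
is a simplified sketch under stated assumptions, not the code in aset.c:
toy_next_block_size and toy_total_after are hypothetical helpers, and the
model ignores everything except "double until the maximum is reached".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: size of the block requested after one of size
 * "current", doubling each time but never exceeding max_block_size. */
size_t
toy_next_block_size(size_t current, size_t max_block_size)
{
    assert(current > 0 && current <= max_block_size);
    if (current > max_block_size / 2)
        return max_block_size;  /* doubling would overshoot the cap */
    return current * 2;
}

/* Total bytes obtained after "nblocks" successive block requests,
 * starting from init_block_size. */
size_t
toy_total_after(size_t init_block_size, size_t max_block_size, int nblocks)
{
    size_t  total = 0;
    size_t  size = init_block_size;
    int     i;

    for (i = 0; i < nblocks; i++)
    {
        total += size;
        size = toy_next_block_size(size, max_block_size);
    }
    return total;
}
```

In this model a context starting at 8K with a 64K cap has obtained
8K + 16K + 32K + 64K + 64K = 184K after five block requests, whereas one
starting at 1K with an 8K cap has obtained only 7K after three. That is
the kind of saving smaller settings buy for contexts that hold little.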

We expect that per-tuple contexts will be reset frequently and typically
will not allocate very much space per tuple cycle. To make this usage
pattern cheap, the first block allocated in a context is not given
back to malloc() during reset, but just cleared. This avoids malloc
thrashing.


Alternative Memory Context Implementations
------------------------------------------

aset.c is our default general-purpose implementation, working fine
in most situations. We also have two implementations optimized for
special use cases, providing either better performance or lower memory
usage compared to aset.c (or both).

* slab.c (SlabContext) is designed for allocations of fixed-length
  chunks, and does not allow allocations of chunks of different sizes.

* generation.c (GenerationContext) is designed for cases when chunks
  are allocated in groups with similar lifespan (generations), or
  roughly in FIFO order.

Both memory contexts aim to free memory back to the operating system
(unlike aset.c, which keeps the freed chunks in a freelist, and only
returns the memory when reset/deleted).

These memory contexts were initially developed for ReorderBuffer, but
may be useful elsewhere as long as the allocation patterns match.


Memory Accounting
-----------------

One of the basic memory context operations is determining the amount of
memory used in the context (and its children). We have multiple places
that implement their own ad hoc memory accounting, and this is meant to
provide a unified approach. Ad hoc accounting solutions work for places
with tight control over the allocations or when it's easy to determine
sizes of allocated chunks (e.g. places that only work with tuples).

The accounting built into the memory contexts works transparently for
all allocations, as long as they end up in the right memory context
subtree.

Consider for example aggregate functions - the aggregate state is often
represented by an arbitrary structure, allocated from the transition
function, so the ad hoc accounting is unlikely to work. The built-in
accounting will however handle such cases just fine.

To minimize overhead, the accounting is done at the block level, not for
individual allocation chunks.

The accounting is lazy - after a block is allocated (or freed), only the
context owning that block is updated. This means that when inquiring
about the memory usage in a given context, we have to walk all child
contexts recursively. Consequently, the memory accounting is not intended
for cases with too many memory contexts (in the relevant subtree).
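
The trade-off of lazy accounting can be sketched with a toy context tree in
C. This is only a model under stated assumptions, not the real
MemoryContextData: ToyAcctContext and the toy_* functions are hypothetical,
and the struct keeps just a byte counter plus first-child and next-sibling
links.

```c
#include <assert.h>
#include <stddef.h>

/* Toy context node (an assumption for illustration only). */
typedef struct ToyAcctContext
{
    size_t  mem_allocated;              /* bytes in blocks owned directly */
    struct ToyAcctContext *firstchild;  /* head of the list of children */
    struct ToyAcctContext *nextchild;   /* next sibling under same parent */
} ToyAcctContext;

/* Attach "child" under "parent". */
void
toy_link_child(ToyAcctContext *parent, ToyAcctContext *child)
{
    child->nextchild = parent->firstchild;
    parent->firstchild = child;
}

/* Block-level bookkeeping: only the owning context is touched, so the
 * cost per block is constant regardless of tree depth. */
void
toy_block_allocated(ToyAcctContext *cxt, size_t blksize)
{
    cxt->mem_allocated += blksize;
}

void
toy_block_freed(ToyAcctContext *cxt, size_t blksize)
{
    assert(cxt->mem_allocated >= blksize);
    cxt->mem_allocated -= blksize;
}

/* Inquiry: walk the whole subtree recursively and sum the counters.
 * The cost grows with the number of contexts in the subtree. */
size_t
toy_total_allocated(const ToyAcctContext *cxt)
{
    size_t  total = cxt->mem_allocated;
    const ToyAcctContext *child;

    for (child = cxt->firstchild; child != NULL; child = child->nextchild)
        total += toy_total_allocated(child);
    return total;
}
```

Updates stay cheap because no ancestor is ever touched, while each inquiry
pays for a full walk of the subtree; that is why the approach suits
subtrees with a moderate number of contexts.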