src/backend/utils/mmgr/README

Notes About Memory Allocation Redesign
======================================

Up through version 7.0, Postgres had serious problems with memory leakage
during large queries that process a lot of pass-by-reference data. There
was no provision for recycling memory until end of query. This needed to be
fixed, even more so with the advent of TOAST which allows very large chunks
of data to be passed around in the system. This document describes the new
memory management system implemented in 7.1.


Background
----------

We do most of our memory allocation in "memory contexts", which are usually
AllocSets as implemented by src/backend/utils/mmgr/aset.c. The key to
successful memory management without lots of overhead is to define a useful
set of contexts with appropriate lifespans.

The basic operations on a memory context are:

* create a context

* allocate a chunk of memory within a context (equivalent of standard
  C library's malloc())

* delete a context (including freeing all the memory allocated therein)

* reset a context (free all memory allocated in the context, but not the
  context object itself)

Given a chunk of memory previously allocated from a context, one can
free it or reallocate it larger or smaller (corresponding to standard C
library's free() and realloc() routines). These operations return memory
to or get more memory from the same context the chunk was originally
allocated in.

At all times there is a "current" context denoted by the
CurrentMemoryContext global variable. The backend macro palloc()
implicitly allocates space in that context. The MemoryContextSwitchTo()
operation selects a new current context (and returns the previous context,
so that the caller can restore the previous context before exiting).
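
The operations described above can be illustrated with a small standalone
model. This is only a sketch, not the backend's code: every name beginning
with "Toy" or "toy_" is hypothetical, chunks are obtained directly from
malloc(), and out-of-memory simply aborts where the backend would raise an
error. It shows the switch-in, allocate, switch-back, reset pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: all Toy and toy_ names are hypothetical. */
typedef union ToyChunk
{
    union ToyChunk *next;       /* next chunk owned by the same context */
    long double force_align;    /* keeps the payload generously aligned */
} ToyChunk;

typedef struct ToyContext
{
    ToyChunk   *chunks;         /* all live chunks in this context */
    int         nchunks;        /* number of live chunks */
} ToyContext;

ToyContext *toy_current = NULL; /* stands in for CurrentMemoryContext */

ToyContext *
toy_context_create(void)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));

    if (cxt == NULL)
        abort();
    return cxt;
}

/* Select a new current context and hand back the previous one. */
ToyContext *
toy_switch_to(ToyContext *cxt)
{
    ToyContext *old = toy_current;

    toy_current = cxt;
    return old;
}

/* Allocate in the current context; never returns NULL (aborts instead). */
void *
toy_alloc(size_t size)
{
    ToyChunk   *chunk;

    assert(toy_current != NULL);
    chunk = malloc(sizeof(ToyChunk) + size);
    if (chunk == NULL)
        abort();
    chunk->next = toy_current->chunks;
    toy_current->chunks = chunk;
    toy_current->nchunks++;
    return chunk + 1;
}

/* Free every chunk at once; the context object itself survives. */
void
toy_context_reset(ToyContext *cxt)
{
    while (cxt->chunks != NULL)
    {
        ToyChunk   *next = cxt->chunks->next;

        free(cxt->chunks);
        cxt->chunks = next;
    }
    cxt->nchunks = 0;
}

/* Returns 10 * (chunks live before reset) + (chunks live after reset). */
int
toy_demo(void)
{
    ToyContext *per_tuple = toy_context_create();
    ToyContext *old = toy_switch_to(per_tuple);
    int         i;
    int         before;

    for (i = 0; i < 3; i++)
        (void) toy_alloc(16);
    toy_switch_to(old);         /* restore the caller's context */

    before = per_tuple->nchunks;
    toy_context_reset(per_tuple);   /* one call releases all three chunks */
    i = before * 10 + per_tuple->nchunks;
    free(per_tuple);
    return i;
}
```

Note that toy_demo() never frees an individual chunk; the single reset
call is what reclaims the memory, which is the property the real contexts
are designed around.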

The main advantage of memory contexts over plain use of malloc/free is
that the entire contents of a memory context can be freed easily, without
having to request freeing of each individual chunk within it. This is
both faster and more reliable than per-chunk bookkeeping. We use this
fact to clean up at transaction end: by resetting all the active contexts
of transaction or shorter lifespan, we can reclaim all transient memory.
Similarly, we can clean up at the end of each query, or after each tuple
is processed during a query.


Some Notes About the palloc API Versus Standard C Library
---------------------------------------------------------

The behavior of palloc and friends is similar to the standard C library's
malloc and friends, but there are some deliberate differences too. Here
are some notes to clarify the behavior.

* If out of memory, palloc and repalloc exit via elog(ERROR). They never
return NULL, and it is not necessary or useful to test for such a result.

* palloc(0) is explicitly a valid operation. It does not return a NULL
pointer, but a valid chunk of which no bytes may be used. However, the
chunk might later be repalloc'd larger; it can also be pfree'd without
error. Similarly, repalloc allows realloc'ing to zero size.

* pfree and repalloc do not accept a NULL pointer. This is intentional.


pfree/repalloc No Longer Depend On CurrentMemoryContext
-------------------------------------------------------

Since Postgres 7.1, pfree() and repalloc() can be applied to any chunk
whether it belongs to CurrentMemoryContext or not --- the chunk's owning
context will be invoked to handle the operation, regardless. This is a
change from the old requirement that CurrentMemoryContext must be set
to the same context the memory was allocated from before one can use
pfree() or repalloc().
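
One way to picture why freeing does not depend on the current context is
a per-chunk header that records the owning context. The following is a
minimal standalone sketch under that assumption; the Owner and owner_ names
are hypothetical and the bookkeeping is reduced to a counter, so it should
not be read as the backend's actual layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only; none of these names are PostgreSQL's. */
typedef struct OwnerContext
{
    int         live_chunks;    /* chunks currently allocated from here */
} OwnerContext;

typedef union OwnerHeader
{
    OwnerContext *owner;        /* context this chunk was allocated from */
    long double force_align;    /* keeps the payload generously aligned */
} OwnerHeader;

OwnerContext *owner_current = NULL; /* plays the role of CurrentMemoryContext */

/* Allocate from the current context and remember the owner in the header. */
void *
owner_alloc(size_t size)
{
    OwnerHeader *hdr;

    assert(owner_current != NULL);
    hdr = malloc(sizeof(OwnerHeader) + size);
    if (hdr == NULL)
        abort();
    hdr->owner = owner_current;
    owner_current->live_chunks++;
    return hdr + 1;
}

/* Free a chunk; the owner comes from the header, not from owner_current. */
void
owner_free(void *ptr)
{
    OwnerHeader *hdr = (OwnerHeader *) ptr - 1;

    hdr->owner->live_chunks--;
    free(hdr);
}

/* Returns 1 if the free was charged to the allocating context. */
int
owner_demo(void)
{
    OwnerContext a = {0};
    OwnerContext b = {0};
    void       *p;
    int         result;

    owner_current = &a;
    p = owner_alloc(32);        /* chunk belongs to a */
    owner_current = &b;         /* current context is now unrelated */
    owner_free(p);              /* still charged back to a */
    result = (a.live_chunks == 0 && b.live_chunks == 0);
    owner_current = NULL;
    return result;
}
```

Had owner_free() consulted owner_current instead of the header, the demo
would leave a with one live chunk and b with minus one.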

There was some consideration of getting rid of CurrentMemoryContext entirely,
instead requiring the target memory context for allocation to be specified
explicitly. But we decided that would be too much notational overhead ---
we'd have to pass an appropriate memory context to called routines in
many places. For example, the copyObject routines would need to be passed
a context, as would function execution routines that return a
pass-by-reference datatype. And what of routines that temporarily
allocate space internally, but don't return it to their caller? We
certainly don't want to clutter every call in the system with "here is
a context to use for any temporary memory allocation you might want to
do". So there'd still need to be a global variable specifying a suitable
temporary-allocation context. That might as well be CurrentMemoryContext.

The upshot of that reasoning, though, is that CurrentMemoryContext should
generally point at a short-lifespan context if at all possible. During
query execution it usually points to a context that gets reset after each
tuple. Only in *very* circumscribed code should it ever point at a
context having greater than transaction lifespan, since doing so risks
permanent memory leaks.


Additions to the Memory-Context Mechanism
-----------------------------------------

Before 7.1 memory contexts were all independent, but it was too hard to
keep track of them; with lots of contexts there needs to be an explicit
mechanism for that.

We solved this by creating a tree of "parent" and "child" contexts. When
creating a memory context, the new context can be specified to be a child
of some existing context. A context can have many children, but only one
parent.
In this way the contexts form a forest (not necessarily a single
tree, since there could be more than one top-level context; although in
current practice there is only one top context, TopMemoryContext).

We then say that resetting or deleting any particular context resets or
deletes all its direct and indirect children as well. This feature allows
us to manage a lot of contexts without fear that some will be leaked; we
only need to keep track of one top-level context that we are going to
delete at transaction end, and make sure that any shorter-lived contexts
we create are descendants of that context. Since the tree can have
multiple levels, we can deal easily with nested lifetimes of storage,
such as per-transaction, per-statement, per-scan, per-tuple. Storage
lifetimes that only partially overlap can be handled by allocating
from different trees of the context forest (there are some examples
in the next section).

Actually, it turns out that resetting a given context should almost
always imply deleting, not just resetting, any child contexts it has.
So MemoryContextReset() means that, and if you really do want a tree of
empty contexts you need to call MemoryContextResetOnly() plus
MemoryContextResetChildren().

For convenience we also provide operations like "reset/delete all children
of a given context, but don't reset or delete that context itself".


Globally Known Contexts
-----------------------

There are a few widely-known contexts that are typically referenced
through global variables. At any instant the system may contain many
additional contexts, but all other contexts should be direct or indirect
children of one of these contexts to ensure they are not leaked in the
event of an error.

TopMemoryContext --- this is the actual top level of the context tree;
every other context is a direct or indirect child of this one.
Allocating here is essentially the same as "malloc", because this context
will never be reset or deleted. This is for stuff that should live forever,
or for stuff that the controlling module will take care of deleting at the
appropriate time. An example is fd.c's tables of open files, as well as
the context management nodes for memory contexts themselves. Avoid
allocating stuff here unless really necessary, and especially avoid
running with CurrentMemoryContext pointing here.

PostmasterContext --- this is the postmaster's normal working context.
After a backend is spawned, it can delete PostmasterContext to free its
copy of memory the postmaster was using that it doesn't need.
Note that in non-EXEC_BACKEND builds, the postmaster's copy of pg_hba.conf
and pg_ident.conf data is used directly during authentication in backend
processes; so backends can't delete PostmasterContext until that's done.
(The postmaster has only TopMemoryContext, PostmasterContext, and
ErrorContext --- the remaining top-level contexts are set up in each
backend during startup.)

CacheMemoryContext --- permanent storage for relcache, catcache, and
related modules. This will never be reset or deleted, either, so it's
not truly necessary to distinguish it from TopMemoryContext. But it
seems worthwhile to maintain the distinction for debugging purposes.
(Note: CacheMemoryContext has child contexts with shorter lifespans.
For example, a child context is the best place to keep the subsidiary
storage associated with a relcache entry; that way we can free rule
parsetrees and so forth easily, without having to depend on constructing
a reliable version of freeObject().)

MessageContext --- this context holds the current command message from the
frontend, as well as any derived storage that need only live as long as
the current message (for example, in simple-Query mode the parse and plan
trees can live here). This context will be reset, and any children
deleted, at the top of each cycle of the outer loop of PostgresMain. This
is kept separate from per-transaction and per-portal contexts because a
query string might need to live either a longer or shorter time than any
single transaction or portal.

TopTransactionContext --- this holds everything that lives until end of the
top-level transaction. This context will be reset, and all its children
deleted, at conclusion of each top-level transaction cycle. In most cases
you don't want to allocate stuff directly here, but in CurTransactionContext;
what does belong here is control information that exists explicitly to manage
status across multiple subtransactions. Note: this context is NOT cleared
immediately upon error; its contents will survive until the transaction block
is exited by COMMIT/ROLLBACK.

CurTransactionContext --- this holds data that has to survive until the end
of the current transaction, and in particular will be needed at top-level
transaction commit. When we are in a top-level transaction this is the same
as TopTransactionContext, but in subtransactions it points to a child context.
It is important to understand that if a subtransaction aborts, its
CurTransactionContext is thrown away after finishing the abort processing;
but a committed subtransaction's CurTransactionContext is kept until top-level
commit (unless of course one of the intermediate levels of subtransaction
aborts). This ensures that we do not keep data from a failed subtransaction
longer than necessary.
Because of this behavior, you must be careful to clean
up properly during subtransaction abort --- the subtransaction's state must be
delinked from any pointers or lists kept in upper transactions, or you will
have dangling pointers leading to a crash at top-level commit. An example of
data kept here is pending NOTIFY messages, which are sent at top-level commit,
but only if the generating subtransaction did not abort.

PortalContext --- this is not actually a separate context, but a
global variable pointing to the per-portal context of the currently active
execution portal. This can be used if it's necessary to allocate storage
that will live just as long as the execution of the current portal requires.

ErrorContext --- this permanent context is switched into for error
recovery processing, and then reset on completion of recovery. We arrange
to have a few KB of memory available in it at all times. In this way, we
can ensure that some memory is available for error recovery even if the
backend has run out of memory otherwise. This allows out-of-memory to be
treated as a normal ERROR condition, not a FATAL error.


Contexts For Prepared Statements And Portals
--------------------------------------------

A prepared-statement object has an associated private context, in which
the parse and plan trees for its query are stored. Because these trees
are read-only to the executor, the prepared statement can be re-used many
times without further copying of these trees.

An execution-portal object has a private context that is referenced by
PortalContext when the portal is active. In the case of a portal created
by DECLARE CURSOR, this private context contains the query parse and plan
trees (there being no other object that can hold them).
Portals created from prepared statements simply reference the prepared
statements' trees, and don't actually need any storage allocated in their
private contexts.


Transient Contexts During Execution
-----------------------------------

When creating a prepared statement, the parse and plan trees will be built
in a temporary context that's a child of MessageContext (so that it will
go away automatically upon error). On success, the finished plan is
copied to the prepared statement's private context, and the temp context
is released; this allows planner temporary space to be recovered before
execution begins. (In simple-Query mode we don't bother with the extra
copy step, so the planner temp space stays around till end of query.)

The top-level executor routines, as well as most of the "plan node"
execution code, will normally run in a context that is created by
ExecutorStart and destroyed by ExecutorEnd; this context also holds the
"plan state" tree built during ExecutorStart. Most of the memory
allocated in these routines is intended to live until end of query,
so this is appropriate for those purposes. The executor's top context
is a child of PortalContext, that is, the per-portal context of the
portal that represents the query's execution.

The main memory-management consideration in the executor is that
expression evaluation --- both for qual testing and for computation of
targetlist entries --- needs to not leak memory. To do this, each
ExprContext (expression-eval context) created in the executor has a
private memory context associated with it, and we switch into that context
when evaluating expressions in that ExprContext. The plan node that owns
the ExprContext is responsible for resetting the private context to empty
when it no longer needs the results of expression evaluations.
Typically the reset is done at the start of each tuple-fetch cycle in the
plan node.

Note that this design gives each plan node its own expression-eval memory
context. This appears necessary to handle nested joins properly, since
an outer plan node might need to retain expression results it has computed
while obtaining the next tuple from an inner node --- but the inner node
might execute many tuple cycles and many expressions before returning a
tuple. The inner node must be able to reset its own expression context
more often than once per outer tuple cycle. Fortunately, memory contexts
are cheap enough that giving one to each plan node doesn't seem like a
problem.

A problem with running index accesses and sorts in a query-lifespan context
is that these operations invoke datatype-specific comparison functions,
and if the comparators leak any memory then that memory won't be recovered
till end of query. The comparator functions all return bool or int32,
so there's no problem with their result data, but there can be a problem
with leakage of internal temporary data. In particular, comparator
functions that operate on TOAST-able data types need to be careful
not to leak detoasted versions of their inputs. This is annoying, but
it appeared a lot easier to make the comparators conform than to fix the
index and sort routines, so that's what was done for 7.1. This remains
the state of affairs in btree and hash indexes, so btree and hash support
functions still need to not leak memory. Most of the other index AMs
have been modified to run opclass support functions in short-lived
contexts, so that leakage is not a problem; this is necessary in view
of the fact that their support functions tend to be far more complex.

There are some special cases, such as aggregate functions.
nodeAgg.c needs to remember the results of evaluation of aggregate
transition functions from one tuple cycle to the next, so it can't just
discard all per-tuple state in each cycle. The easiest way to handle this
seems to be to have two per-tuple contexts in an aggregate node, and to
ping-pong between them, so that at each tuple one is the active allocation
context and the other holds any results allocated by the prior cycle's
transition function.

Executor routines that switch the active CurrentMemoryContext may need
to copy data into their caller's current memory context before returning.
However, we have minimized the need for that, because of the convention
of resetting the per-tuple context at the *start* of an execution cycle
rather than at its end. With that rule, an execution node can return a
tuple that is palloc'd in its per-tuple context, and the tuple will remain
good until the node is called for another tuple or told to end execution.
This parallels the situation with pass-by-reference values at the table
scan level, since a scan node can return a direct pointer to a tuple in a
disk buffer that is only guaranteed to remain good that long.

A more common reason for copying data is to transfer a result from
per-tuple context to per-query context; for example, a Unique node will
save the last distinct tuple value in its per-query context, requiring a
copy step.


Mechanisms to Allow Multiple Types of Contexts
----------------------------------------------

We may want several different types of memory contexts with different
allocation policies but similar external behavior. To handle this,
memory allocation functions will be accessed via function pointers,
and we will require all context types to obey the conventions given here.
(As of 2015, there's actually still just one context type; but interest in
creating other types has never gone away entirely, so we retain this API.)

A memory context is represented by an object like

typedef struct MemoryContextData
{
    NodeTag     type;               /* identifies exact kind of context */
    MemoryContextMethods methods;
    struct MemoryContextData *parent;       /* NULL if no parent (toplevel context) */
    struct MemoryContextData *firstchild;   /* head of linked list of children */
    struct MemoryContextData *nextchild;    /* next child of same parent */
    char       *name;               /* context name (just for debugging) */
} MemoryContextData, *MemoryContext;

This is essentially an abstract superclass, and the "methods" pointer is
its virtual function table. Specific memory context types will use
derived structs having these fields as their first fields. All the
contexts of a specific type will have methods pointers that point to the
same static table of function pointers, which look like

typedef struct MemoryContextMethodsData
{
    Pointer     (*alloc) (MemoryContext c, Size size);
    void        (*free_p) (Pointer chunk);
    Pointer     (*realloc) (Pointer chunk, Size newsize);
    void        (*reset) (MemoryContext c);
    void        (*delete) (MemoryContext c);
} MemoryContextMethodsData, *MemoryContextMethods;

Alloc, reset, and delete requests will take a MemoryContext pointer
as parameter, so they'll have no trouble finding the method pointer
to call. Free and realloc are trickier.
To make those work, we
require all memory context types to produce allocated chunks that
are immediately preceded by a standard chunk header, which has the
layout

typedef struct StandardChunkHeader
{
    MemoryContext mycontext;    /* Link to owning context object */
    Size        size;           /* Allocated size of chunk */
} StandardChunkHeader;

It turns out that the pre-existing aset.c memory context type did this
already, and probably any other kind of context would need to have the
same data available to support realloc, so this is not really creating
any additional overhead. (Note that if a context type needs more
per-allocated-chunk information than this, it can make an additional
nonstandard header that precedes the standard header. So we're not
constraining context-type designers very much.)

Given this, the pfree routine looks something like

    StandardChunkHeader *header =
        (StandardChunkHeader *) ((char *) p - sizeof(StandardChunkHeader));

    (*header->mycontext->methods->free_p) (p);


More Control Over aset.c Behavior
---------------------------------

Previously, aset.c always allocated an 8K block upon the first allocation
in a context, and doubled that size for each successive block request.
That's good behavior for a context that might hold *lots* of data, and
the overhead wasn't bad when we had only a few contexts in existence.
With dozens if not hundreds of smaller contexts in the system, we need
to be able to fine-tune things a little better.

The creator of a context is now able to specify an initial block size
and a maximum block size. Selecting smaller values can prevent wastage
of space in contexts that aren't expected to hold very much (an example is
the relcache's per-relation contexts).

Also, it is possible to specify a minimum context size.
If this value is greater than zero then a block of that size will be
grabbed immediately upon context creation, and cleared but not released
during context resets. This feature is needed for ErrorContext (see above),
but will most likely not be used for other contexts.

We expect that per-tuple contexts will be reset frequently and typically
will not allocate very much space per tuple cycle. To make this usage
pattern cheap, the first block allocated in a context is not given
back to malloc() during reset, but just cleared. This avoids malloc
thrashing.


Memory Context Reset/Delete Callbacks
-------------------------------------

A feature introduced in Postgres 9.5 allows memory contexts to be used
for managing more resources than just plain palloc'd memory. This is
done by registering a "reset callback function" for a memory context.
Such a function will be called, once, just before the context is next
reset or deleted. It can be used to give up resources that are in some
sense associated with an object allocated within the context. Possible
use-cases include
* closing open files associated with a tuplesort object;
* releasing reference counts on long-lived cache objects that are held
  by some object within the context being reset;
* freeing malloc-managed memory associated with some palloc'd object.
That last case would just represent bad programming practice for pure
Postgres code; better to have made all the allocations using palloc,
in the target context or some child context. However, it could well
come in handy for code that interfaces to non-Postgres libraries.

Any number of reset callbacks can be established for a memory context;
they are called in reverse order of registration. Also, callbacks
attached to child contexts are called before callbacks attached to
parent contexts, if a tree of contexts is being reset or deleted.

The API for this requires the caller to provide a MemoryContextCallback
memory chunk to hold the state for a callback. Typically this should be
allocated in the same context it is logically attached to, so that it
will be released automatically after use. The reason for asking the
caller to provide this memory is that in most usage scenarios, the caller
will be creating some larger struct within the target context, and the
MemoryContextCallback struct can be made "for free" without a separate
palloc() call by including it in this larger struct.
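
The callback rules above (called once, in reverse order of registration,
with caller-provided state embedded in a larger struct) can be sketched
with a standalone model. Everything here is hypothetical: the Toy names
are not the backend's API, the objects live on the stack rather than in a
context, and a character log stands in for real resource cleanup.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; these are not PostgreSQL's types or functions. */
typedef struct ToyCallback
{
    void        (*func) (void *arg);    /* what to run at reset/delete */
    void       *arg;                    /* state handed to func */
    struct ToyCallback *next;           /* next callback in the list */
} ToyCallback;

typedef struct ToyCbContext
{
    ToyCallback *reset_cbs;     /* most recently registered first */
} ToyCbContext;

/* Push onto the head of the list, so later registrations run first. */
void
toy_register_reset_callback(ToyCbContext *cxt, ToyCallback *cb)
{
    cb->next = cxt->reset_cbs;
    cxt->reset_cbs = cb;
}

/* Run each pending callback once, unlinking it before the call. */
void
toy_run_reset_callbacks(ToyCbContext *cxt)
{
    ToyCallback *cb;

    while ((cb = cxt->reset_cbs) != NULL)
    {
        cxt->reset_cbs = cb->next;
        cb->func(cb->arg);
    }
}

/* A larger object that embeds its callback state "for free". */
typedef struct ToySortState
{
    int         file_open;      /* stands in for an open temp file */
    ToyCallback cb;             /* no separate allocation needed */
} ToySortState;

char        toy_log[8];         /* records the order callbacks ran in */
int         toy_log_len = 0;

void
toy_close_sort_file(void *arg)
{
    ((ToySortState *) arg)->file_open = 0;
    toy_log[toy_log_len++] = 'S';
}

void
toy_note_other(void *arg)
{
    (void) arg;
    toy_log[toy_log_len++] = 'O';
}

/* Returns 1 if callbacks ran once each, newest first. */
int
toy_callback_demo(void)
{
    ToyCbContext cxt = {NULL};
    ToySortState sort = {1, {toy_close_sort_file, NULL, NULL}};
    ToyCallback other = {toy_note_other, NULL, NULL};

    toy_log_len = 0;
    sort.cb.arg = &sort;
    toy_register_reset_callback(&cxt, &sort.cb);    /* registered first */
    toy_register_reset_callback(&cxt, &other);      /* registered second */
    toy_run_reset_callbacks(&cxt);  /* runs other, then the sort cleanup */
    toy_run_reset_callbacks(&cxt);  /* list is empty, nothing runs again */
    return toy_log_len == 2 && toy_log[0] == 'O' && toy_log[1] == 'S'
        && sort.file_open == 0;
}
```

Embedding the ToyCallback inside ToySortState mirrors the reasoning given
above for having the caller supply the callback memory: the state that the
callback needs and the callback record itself share one allocation.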