Carrier Migration
=================

Introduction
------------

The ERTS memory allocators manage memory blocks in two types of raw memory chunks. We call these chunks of raw memory *carriers*. Single-block carriers contain only one large block, while multi-block carriers contain multiple blocks. A carrier is typically created using `mmap()` on Unix systems. However, how a carrier is created is of minor importance. An allocator instance typically manages a mixture of single- and multi-block carriers.

Problem
-------

When a carrier is empty, i.e. contains only one large free block, it is deallocated. Since multi-block carriers can contain both allocated blocks and free blocks at the same time, an allocator instance might be stuck with a large amount of poorly utilized carriers if the memory load decreases. After a peak in memory usage it is expected that not all memory can be returned, since the blocks still allocated are likely to be dispersed over multiple carriers. Such poorly utilized carriers can usually be reused if the memory load increases again. However, since each scheduler thread manages its own set of allocator instances, and memory load is not necessarily correlated to CPU load, we might get into a situation where there are lots of poorly utilized multi-block carriers on some allocator instances while we need to allocate new multi-block carriers on other allocator instances. In scenarios like this, the demand for multi-block carriers in the system might increase at the same time as the actual memory demand in the system has decreased, which is both unwanted and quite unexpected for the end user.

Solution
--------

In order to prevent scenarios like this we have implemented support for migration of multi-block carriers between allocator instances.

### Management of Free Blocks ###

In order to be able to remove a carrier from one allocator instance and add it to another, we need to be able to move references to the free blocks of the carrier between the allocator instances. The allocator-instance-specific data structure referring to the free blocks it manages often refers to the same carrier from multiple places. For example, when the address order best-fit strategy is used, this data structure is a binary search tree spanning all carriers that the allocator instance manages. Free blocks in one specific carrier can be referred to from potentially every other carrier that is managed, and the amount of such references can be huge. That is, the work of removing the free blocks of such a carrier from the search tree would be huge. One way of solving this could be to not migrate carriers that contain lots of free blocks, but this would prevent us from migrating carriers that potentially need to be migrated in order to solve the problem we set out to solve.

By using one data structure of free blocks in each carrier, and an allocator-instance-wide data structure of carriers managed by the allocator instance, the work needed in order to remove and add carriers can be kept to a minimum. When migration of carriers is enabled on a specific allocator type, we require that an allocation strategy with such an implementation is used. Currently we have implemented this for three different allocation strategies. All of these strategies use a search tree of carriers sorted so that we can find the carrier with the lowest address that can satisfy the request. Internally in carriers we use yet another search tree that implements either address order first fit, address order best fit, or best fit. The abbreviations used for these different allocation strategies are `aoff`, `aoffcaobf`, and `aoffcbf`.

### Carrier Pool ###

In order to migrate carriers between allocator instances we move them through a pool of carriers. In order for a carrier migration to complete, one scheduler needs to move the carrier into the pool, and another scheduler needs to take the carrier out of the pool.

The pool is implemented as a lock-free, circular, doubly linked list. The list contains a sentinel which is used as the starting point when inserting to, or fetching from, the pool. Carriers in the pool are elements in this list.

The list can be modified by all scheduler threads simultaneously. During modifications the doubly linked list is allowed to get a bit "out of shape". For example, following the `next` pointer to the next element and then following the `prev` pointer does not always take you back to where you started. The following is however always true:

* Repeatedly following `next` pointers will eventually take you to the sentinel.
* Repeatedly following `prev` pointers will eventually take you to the sentinel.
* Following a `next` or a `prev` pointer will take you to either an element in the pool, or an element that used to be in the pool.

When inserting a new element we search for a place to insert the element by only following `next` pointers, and we always begin by skipping the first element encountered. When trying to fetch an element we do the same thing, but instead only follow `prev` pointers.

By going in different directions when inserting and fetching, we avoid contention between threads inserting and threads fetching as much as possible. By skipping one element when we begin searching, we preserve the sentinel unmodified as much as possible. This is beneficial since all search operations need to read the content of the sentinel. If we were to modify the sentinel, the cache line containing the sentinel would unnecessarily be bounced between processors.

The `prev` and `next` fields in the elements of the list contain the value of the pointer, a modification marker, and a deletion marker. Memory operations on these fields are done using atomic memory operations. When a thread has set the modification marker in a field, no one except the thread that set the marker is allowed to modify the field. If multiple modification markers need to be set, we always begin with `next` fields followed by `prev` fields, in the order following the actual pointers. This guarantees that no deadlocks will occur.

When a carrier is being removed from the pool, we mark it with a thread progress value that needs to be reached before we are allowed to modify the `next` and `prev` fields. That is, until we reach this thread progress value we are not allowed to insert the carrier into the pool again, and we are not allowed to deallocate the carrier. This ensures that threads inspecting the pool will always be able to traverse the pool and reach valid elements. Once we have reached the thread progress value that the carrier was tagged with, we know that no threads may have references to it via the pool.

### Migration ###

Each allocator instance keeps track of the current utilization of its multi-block carriers. When the total utilization falls below the "abandon carrier utilization limit", it starts to inspect the utilization of the current carrier when deallocations are made. If the utilization of the carrier also falls below the "abandon carrier utilization limit", it unlinks the carrier from its data structure of available free blocks and inserts the carrier into the pool.

Since the carrier has been unlinked from the data structure of available free blocks, no more allocations will be made in the carrier.

The allocator instance that created a carrier is called its *owner*. Ownership never changes.

The allocator instance that has the responsibility to perform deallocations in a carrier is called its *employer*. The employer may also perform allocations if the carrier is not in the pool. Employment may change when a carrier is fetched from or inserted into the pool.

Deallocations in a carrier, while it remains in the pool, are always performed by the owner. That is, all pooled carriers are employed by their owners.

Each carrier has an atomic word containing a pointer to the employing allocator instance and three bit flags: IN\_POOL, BUSY and HOMECOMING.

When fetching a carrier from the pool, employment may change, and further deallocations in the carrier will be redirected to the new employer using the delayed dealloc functionality.

When a foreign allocator instance abandons a carrier back into the pool, it will also pass it back to its *owner* using the delayed dealloc queue. When doing this it will set the HOMECOMING bit flag to mark it as "enqueued". The owner will later clear the HOMECOMING bit when the carrier is dequeued. This mechanism prevents a carrier from being enqueued again before it has been dequeued.

When a carrier becomes empty, it will be deallocated. Carrier deallocation is always done by the owner that allocated the carrier. By doing this, the underlying functionality of allocating and deallocating carriers can remain simple and does not have to bother about multiple threads. In a NUMA system we will also not mix carriers originating from multiple NUMA nodes.

If a carrier in the pool becomes empty, it will be withdrawn from the pool and deallocated by the owner, which already employs it.

If a carrier employed by a foreign allocator becomes empty, it will be passed back to the owner for deallocation using the delayed dealloc functionality.

In short:

* The allocator instance that created a carrier *owns* it.
* An empty carrier is always deallocated by its *owner*.
* *Ownership* never changes.
* The allocator instance that uses a carrier *employs* it.
* An *employer* can abandon a carrier into the pool.
* Pooled carriers are not allocated from.
* Pooled carriers are always *employed* by their *owner*.
* *Employment* can only change from *owner* to a foreign allocator when a carrier is fetched from the pool.


### Searching the pool ###

When an allocator instance needs more carrier space, it inspects the pool. If no carrier can be fetched from the pool, it will allocate a new carrier. Regardless of where the allocator instance gets the carrier from, it just links the carrier into its data structure of free blocks.

To preserve real-time characteristics, searching the pool is limited: we only inspect a limited number of carriers. If none of those carriers has a free block large enough to satisfy the allocation request, the search will fail. A carrier in the pool can also be BUSY if another thread is currently doing block deallocation work on the carrier. A BUSY carrier will also be skipped by the search, as it cannot satisfy the request. The pool is lock-free and we do not want to block, waiting for the other thread to finish.

### The bad cluster problem ###

Before OTP 17.4 the search algorithm had a problem, as the search always started at the same position in the pool: the sentinel. This could lead to contention between concurrent searching processes. But even worse, it could lead to a "bad" state where searches fail at a high rate, leading to new carriers being allocated instead. These new carriers may later be inserted into the pool due to bad utilization. If the frequency of insertions into the pool is higher than that of successful fetches from the pool, memory will eventually be exhausted.

This "bad" state consists of a cluster of small and/or highly fragmented carriers located at the sentinel in the pool. The largest free block in such a "bad" carrier is rather small, making it unable to satisfy most allocation requests. As the search always started at the sentinel, any such "bad" carriers that had been left in the pool would eventually cluster together at the sentinel. All searches would first have to skip past this cluster of "bad" carriers to reach a "good" carrier. When the cluster grows to the same size as the search limit, essentially all searches will fail.

To counter the "bad cluster" problem and also ease the contention, the search now always starts by first looking at the allocator's *own* carriers, that is, carriers that were initially created by the allocator itself and were later abandoned to the pool. If none of our own abandoned carriers will do, the search continues into the pool, as before, to look for carriers created by other allocators. However, if we have at least one abandoned carrier of our own that could not satisfy the request, we can use that as the entry point into the pool.

The result is that we prefer carriers created by the thread itself, which is good for NUMA performance. And we get more entry points when searching the pool, which will ease contention and clustering.

### Our own pooled tree ###

To do the first search among our own carriers, every allocator instance has a `pooled_tree` of carriers. This tree is only accessed by the allocator itself and can only contain its own carriers. When a carrier is abandoned and put in the pool, it is also inserted into `pooled_tree`. This is either done directly, if the carrier was already employed by its owner, or by first passing it back to the owner via the delayed dealloc queue.

When we search our `pooled_tree` and find a carrier that is no longer in the pool, we remove that carrier from `pooled_tree` and mark it as TRAITOR, as it is now employed by a foreign allocator. We will not find any carriers in `pooled_tree` that are marked as BUSY by other threads.

If no carrier in `pooled_tree` has a large enough free block, we search it again to find any carrier that may act as an entry point into the shared list of all pooled carriers. This is done in order to, if possible, avoid starting at the sentinel and thereby ease the "bad clustering" problem.

Furthermore, the search for our own carriers that are scheduled for deallocation is done as the last search option. The idea is that it is better to reuse a poorly utilized carrier than to resurrect an empty carrier that was just about to be released back to the OS.

### Result ###

The use of this strategy of abandoning carriers with poor utilization and reusing them in allocator instances with an increased carrier demand is extremely effective and completely eliminates the problems that otherwise sometimes occurred when CPU load dropped while memory load did not.

When using the `aoffcaobf` or `aoff` strategies compared to `gf` or `bf`, we lose some performance since we get more modifications in the data structure of free blocks. This performance penalty is however reduced using the `aoffcbf` strategy. A trade-off between memory consumption and performance is however inevitable, and it is up to the user to decide what is most important.