Carrier Migration
=================

Introduction
------------

The ERTS memory allocators manage memory blocks in two types of raw
memory chunks. We call these chunks of raw memory *carriers*:
single-block carriers, which contain only one large block, and
multi-block carriers, which contain multiple blocks. A carrier is
typically created using `mmap()` on Unix systems; however, how a
carrier is created is of minor importance. An allocator instance
typically manages a mixture of single- and multi-block carriers.
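
As a rough illustration, the distinction might be sketched like this
in C (a minimal sketch with illustrative names, not the actual ERTS
structures):

```c
#include <stddef.h>

/* Illustrative sketch only -- the names are not the actual
 * ERTS structures. */
enum carrier_kind {
    SB_CARRIER,   /* single-block: contains exactly one large block */
    MB_CARRIER    /* multi-block: contains multiple blocks          */
};

typedef struct carrier_sketch {
    enum carrier_kind kind;
    size_t size;          /* total size of the raw memory chunk     */
    /* the blocks themselves follow inside the chunk                */
} carrier_sketch_t;
```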

Problem
-------

When a carrier is empty, i.e. contains only one large free block, it
is deallocated. Since multi-block carriers can contain both allocated
blocks and free blocks at the same time, an allocator instance might
be stuck with a large number of poorly utilized carriers if the memory
load decreases. After a peak in memory usage it is expected that not
all memory can be returned, since the blocks still allocated are likely
to be dispersed over multiple carriers. Such poorly utilized carriers
can usually be reused if the memory load increases again. However,
since each scheduler thread manages its own set of allocator
instances, and memory load is not necessarily correlated to CPU load, we
might get into a situation where there are lots of poorly utilized
multi-block carriers on some allocator instances while we need to
allocate new multi-block carriers on other allocator instances. In
scenarios like this, the demand for multi-block carriers in the system
might increase at the same time as the actual memory demand in the
system has decreased, which is both unwanted and quite unexpected for
the end user.

Solution
--------

In order to prevent scenarios like this we've implemented support for
migration of multi-block carriers between allocator instances.

### Management of Free Blocks ###

In order to be able to remove a carrier from one allocator instance
and add it to another, we need to be able to move references to the
free blocks of the carrier between the allocator instances. The
allocator-instance-specific data structure referring to the free
blocks it manages often refers to the same carrier from multiple
places. For example, when the address order best-fit strategy is used,
this data structure is a binary search tree spanning all carriers that
the allocator instance manages. Free blocks in one specific carrier
can be referred to from potentially every other carrier that is
managed, and the number of such references can be huge, so the work
of removing the free blocks of such a carrier from the search tree
would be correspondingly large. One way of solving this could be to
never migrate carriers that contain lots of free blocks, but this
would prevent us from migrating carriers that may need to be migrated
in order to solve the problem we set out to solve.

By using one data structure of free blocks in each carrier, and an
allocator-instance-wide data structure of the carriers managed by the
allocator instance, the work needed in order to remove and add
carriers can be kept to a minimum. When migration of carriers is
enabled on a specific allocator type, we require that an allocation
strategy with such an implementation is used. Currently we've
implemented this for three different allocation strategies. All of
these strategies use a search tree of carriers sorted so that we can
find the carrier with the lowest address that can satisfy the
request. Internally in carriers we use yet another search tree that
implements either address order first fit, address order best fit,
or best fit. The abbreviations used for these different allocation
strategies are `aoff`, `aoffcaobf`, and `aoffcbf`.
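
A sketch of how such a two-level structure keeps carrier removal
cheap: each carrier has its own free-block tree, and the
allocator-wide tree holds only one node per carrier, so removing a
carrier means unlinking a single node. The code below is a
hypothetical illustration (the names and the cached `tree_max` field
are assumptions, not the actual ERTS implementation) of how the
lowest-addressed carrier that can satisfy a request could be found:

```c
#include <stddef.h>

typedef struct free_block {
    size_t sz;
    struct free_block *left, *right;     /* per-carrier search tree */
} free_block_t;

typedef struct carrier_node {
    void *base;                          /* carrier start address   */
    size_t own_max;     /* largest free block in this carrier       */
    size_t tree_max;    /* max of own_max over this node's subtree  */
    free_block_t *free_tree;             /* this carrier's free blocks */
    struct carrier_node *left, *right;   /* ordered by base address */
} carrier_node_t;

/* Find the lowest-addressed carrier whose largest free block can
 * satisfy a request of `sz` bytes.  That carrier's own free-block
 * tree is then searched with the configured strategy (address order
 * first fit, address order best fit, or best fit). */
static carrier_node_t *
find_carrier(carrier_node_t *t, size_t sz)
{
    while (t) {
        if (t->left && t->left->tree_max >= sz)
            t = t->left;        /* a fit exists at a lower address   */
        else if (t->own_max >= sz)
            return t;           /* lowest-addressed carrier that fits */
        else
            t = t->right;       /* only higher addresses can fit     */
    }
    return NULL;
}
```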

### Carrier Pool ###

In order to migrate carriers between allocator instances, we move them
through a pool of carriers. In order for a carrier migration to
complete, one scheduler needs to move the carrier into the pool, and
another scheduler needs to take the carrier out of the pool.

The pool is implemented as a lock-free, circular, doubly linked
list. The list contains a sentinel which is used as the starting point
when inserting to, or fetching from, the pool. Carriers in the pool are
elements in this list.
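
The list element and the sentinel might look like this (a sketch with
illustrative names; the tagged-pointer representation of the link
fields is described further below):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Each link field holds a pointer value plus low marker bits (see
 * below), so it is stored as a single atomic word. */
typedef struct pool_elem {
    _Atomic(uintptr_t) next;    /* tagged pointer to next element */
    _Atomic(uintptr_t) prev;    /* tagged pointer to prev element */
} pool_elem_t;

/* The sentinel is an ordinary element that is never removed.  An
 * empty pool is the sentinel linked to itself. */
static pool_elem_t sentinel;

static void pool_init(void)
{
    atomic_store(&sentinel.next, (uintptr_t)&sentinel);
    atomic_store(&sentinel.prev, (uintptr_t)&sentinel);
}
```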

The list can be modified by all scheduler threads
simultaneously. During modifications the doubly linked list is allowed
to get a bit "out of shape". For example, following the `next` pointer
to the next element and then following the `prev` pointer does not
always take you back to where you started. The following is however
always true:

*   Repeatedly following `next` pointers will eventually take you to the
    sentinel.
*   Repeatedly following `prev` pointers will eventually take you to the
    sentinel.
*   Following a `next` or a `prev` pointer will take you to either an
    element in the pool, or an element that used to be in the pool.

When inserting a new element we search for a place to insert the
element by only following `next` pointers, and we always begin by
skipping the first element encountered. When trying to fetch an
element we do the same thing, but instead only follow `prev` pointers.

By going in different directions when inserting and fetching, we avoid
contention between threads inserting and threads fetching as much as
possible. By skipping one element when we begin searching, we preserve
the sentinel unmodified as much as possible. This is beneficial since
all search operations need to read the content of the sentinel. If we
were to modify the sentinel, the cache line containing the sentinel
would unnecessarily be bounced between processors.
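
In sketch form, picking the starting points could look as follows
(marker bits stripped with a mask; simplified, ignoring concurrent
retries, and with illustrative names):

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct pool_elem {
    _Atomic(uintptr_t) next;
    _Atomic(uintptr_t) prev;
} pool_elem_t;

extern pool_elem_t sentinel;

/* Strip the marker bits to recover the element pointer. */
static pool_elem_t *elem_ptr(uintptr_t tagged)
{
    return (pool_elem_t *)(tagged & ~(uintptr_t)3);
}

/* Inserting scans forward: start at the sentinel, follow `next`,
 * and skip the first element encountered. */
static pool_elem_t *insert_start(void)
{
    pool_elem_t *first = elem_ptr(atomic_load(&sentinel.next));
    return elem_ptr(atomic_load(&first->next));     /* skip one */
}

/* Fetching scans backward: the same thing, following `prev`. */
static pool_elem_t *fetch_start(void)
{
    pool_elem_t *first = elem_ptr(atomic_load(&sentinel.prev));
    return elem_ptr(atomic_load(&first->prev));     /* skip one */
}
```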

The `prev` and `next` fields in the elements of the list contain the
value of the pointer, a modification marker, and a deleted
marker. Memory operations on these fields are done using atomic memory
operations. When a thread has set the modification marker in a field,
no one except the thread that set the marker is allowed to modify the
field. If multiple modification markers need to be set, we always
begin with `next` fields followed by `prev` fields, in the order the
actual pointers are followed. This guarantees that no deadlocks will
occur.
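
A sketch of how one such field could be claimed with a
compare-and-swap; the bit layout and names are assumptions for
illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MOD_MARKER ((uintptr_t)1)   /* field is being modified  */
#define DEL_MARKER ((uintptr_t)2)   /* element has been deleted */

/* Try to claim exclusive modification rights on one link field.
 * After a successful claim, only this thread may change the field
 * until it clears the marker again. */
static bool try_mark(_Atomic(uintptr_t) *field, uintptr_t *observed)
{
    uintptr_t old = atomic_load(field);
    if (old & MOD_MARKER)
        return false;               /* another thread owns the field */
    *observed = old;
    return atomic_compare_exchange_strong(field, &old, old | MOD_MARKER);
}

/* When claiming several fields, always claim the needed `next`
 * fields first, then the `prev` fields, in the order the pointers
 * are followed.  A fixed global claiming order means no two threads
 * can wait on each other in a cycle, so no deadlock can occur. */
```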

When a carrier is being removed from a pool, we mark it with a thread
progress value that needs to be reached before we are allowed to
modify the `next` and `prev` fields. That is, until we reach this
thread progress value we are not allowed to insert the carrier into
the pool again, and we are not allowed to deallocate the carrier. This
ensures that threads inspecting the pool will always be able to
traverse the pool and reach valid elements. Once we have reached the
thread progress value that the carrier was tagged with, we know that
no threads can have references to it via the pool.
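
The guard could be sketched as follows; the thread-progress API names
here are hypothetical stand-ins for the ERTS thread progress
functionality:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t tp_val_t;

/* Hypothetical thread-progress API.  The returned value is reached
 * only after every scheduler thread has passed a progress point,
 * i.e. after any thread that was traversing the pool is done. */
extern tp_val_t thread_progress_later(void);
extern bool     thread_progress_has_reached(tp_val_t val);

typedef struct {
    tp_val_t links_frozen_until;
} carrier_pool_tag_t;

/* On removal from the pool: tag the carrier. */
static void on_pool_remove(carrier_pool_tag_t *tag)
{
    tag->links_frozen_until = thread_progress_later();
}

/* The carrier's `next`/`prev` fields may only be reused -- for
 * re-insertion into the pool or for deallocation -- once no thread
 * can still reach the carrier through the pool. */
static bool may_modify_links(carrier_pool_tag_t *tag)
{
    return thread_progress_has_reached(tag->links_frozen_until);
}
```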

### Migration ###

Each allocator instance keeps track of the current utilization of its
multi-block carriers. When the total utilization falls below the "abandon
carrier utilization limit", it starts to inspect the utilization of the
current carrier when deallocations are made. If the utilization of that
carrier also falls below the "abandon carrier utilization limit", it
unlinks the carrier from its data structure of available free blocks
and inserts the carrier into the pool.
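
The decision could be sketched like this (illustrative names; `acul`
mirrors the "abandon carrier utilization limit" expressed as a
percentage):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t allocated;   /* bytes currently in allocated blocks */
    size_t total;       /* total usable bytes                  */
} util_t;

/* True when utilization is below the abandon carrier utilization
 * limit, given as a percentage (e.g. acul = 60 means 60%). */
static bool below_acul(util_t u, unsigned acul)
{
    return u.allocated * 100 < (size_t)acul * u.total;
}

/* Sketch of the check made on each deallocation:
 *
 *   if (below_acul(instance_mbc_util, acul)
 *       && below_acul(carrier_util, acul)) {
 *       unlink_free_blocks(carrier);  // no more allocations here
 *       pool_insert(carrier);         // abandon into the pool
 *   }
 */
```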

Since the carrier has been unlinked from the data structure of
available free blocks, no more allocations will be made in the
carrier.

The allocator instance that created a carrier is called its *owner*.
Ownership never changes.

The allocator instance that has the responsibility to perform deallocations in a
carrier is called its *employer*. The employer may also perform allocations if
the carrier is not in the pool. Employment may change when a carrier is fetched from
or inserted into the pool.

Deallocations in a carrier, while it remains in the pool, are always performed
by the owner. That is, all pooled carriers are employed by their owners.

Each carrier has an atomic word containing a pointer to the employing allocator
instance and three bit flags: IN\_POOL, BUSY, and HOMECOMING.
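
Since allocator instance pointers are word aligned, the flags fit in
the low bits of that same atomic word. A sketch (the flag bit values
and helper names are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLG_IN_POOL     ((uintptr_t)1)  /* carrier is in the pool     */
#define FLG_BUSY        ((uintptr_t)2)  /* a thread is doing dealloc
                                           work in the carrier        */
#define FLG_HOMECOMING  ((uintptr_t)4)  /* enqueued towards its owner */
#define FLG_MASK        ((uintptr_t)7)

typedef struct allctr allctr_t;          /* allocator instance, opaque */

typedef struct {
    _Atomic(uintptr_t) allctr;           /* employer pointer | flags   */
} carrier_hdr_t;

static allctr_t *employer(carrier_hdr_t *h)
{
    return (allctr_t *)(atomic_load(&h->allctr) & ~FLG_MASK);
}

static bool in_pool(carrier_hdr_t *h)
{
    return (atomic_load(&h->allctr) & FLG_IN_POOL) != 0;
}
```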

When fetching a carrier from the pool, employment may change and further
deallocations in the carrier will be redirected to the new
employer using the delayed dealloc functionality.

When a foreign allocator instance abandons a carrier back into the pool, it will
also pass it back to its *owner* using the delayed dealloc queue. When doing
this it will set the HOMECOMING bit flag to mark it as "enqueued". The owner
will later clear the HOMECOMING bit when the carrier is dequeued. This mechanism
prevents a carrier from being enqueued again before it has been dequeued.
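
The enqueue guard could be sketched like this, reusing the atomic
word from the previous sketch (illustrative, not the actual ERTS
code): setting HOMECOMING only when it is clear means the carrier can
never be enqueued twice before the owner has dequeued it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLG_HOMECOMING ((uintptr_t)4)

typedef struct {
    _Atomic(uintptr_t) allctr;   /* employer pointer | flags */
} carrier_hdr_t;

/* Returns true if this call won the right to enqueue the carrier
 * onto its owner's delayed dealloc queue; the owner clears the bit
 * again when it dequeues the carrier. */
static bool try_enqueue_home(carrier_hdr_t *h)
{
    uintptr_t old = atomic_load(&h->allctr);
    do {
        if (old & FLG_HOMECOMING)
            return false;        /* already on its way home */
    } while (!atomic_compare_exchange_weak(&h->allctr, &old,
                                           old | FLG_HOMECOMING));
    return true;
}
```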

When a carrier becomes empty, it will be deallocated. Carrier deallocation is
always done by the owner that allocated the carrier. By doing this, the
underlying functionality of allocating and deallocating carriers can
remain simple and doesn't have to worry about multiple threads. In a
NUMA system we will also not mix carriers originating from multiple
NUMA nodes.

If a carrier in the pool becomes empty, it will be withdrawn from the
pool and deallocated by the owner, which already employs it.

If a carrier employed by a foreign allocator becomes empty, it will be passed
back to the owner for deallocation using the delayed dealloc functionality.

In short:

* The allocator instance that created a carrier *owns* it.
* An empty carrier is always deallocated by its *owner*.
* *Ownership* never changes.
* The allocator instance that uses a carrier *employs* it.
* An *employer* can abandon a carrier into the pool.
* Pooled carriers are not allocated from.
* Pooled carriers are always *employed* by their *owner*.
* *Employment* can only change from *owner* to a foreign allocator
  when a carrier is fetched from the pool.


### Searching the pool ###

When an allocator instance needs more carrier space, it inspects the pool. If no
carrier can be fetched from the pool, it will allocate a new
carrier. Regardless of where the allocator instance gets the carrier from, it
simply links the carrier into its data structure of free blocks.

To preserve real-time characteristics, searching the pool is
limited: we only inspect a limited number of carriers. If none of
those carriers has a free block large enough to satisfy the allocation
request, the search will fail. A carrier in the pool can also be BUSY
if another thread is currently doing block deallocation work on the
carrier. A BUSY carrier will also be skipped by the search, as it cannot
satisfy the request. The pool is lock-free and we do not want to
block, waiting for the other thread to finish.
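
The bounded search could be sketched like this (all helper names are
hypothetical; `pool_try_fetch` may fail without blocking, since the
pool is lock-free):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct pool_carrier pool_carrier_t;

/* Hypothetical helpers for illustration. */
extern pool_carrier_t *pool_prev(pool_carrier_t *c);  /* fetch direction */
extern bool carrier_is_busy(pool_carrier_t *c);
extern bool carrier_can_satisfy(pool_carrier_t *c, size_t sz);
extern bool pool_try_fetch(pool_carrier_t *c);        /* lock-free, may fail */

/* Inspect at most `limit` carriers; skip BUSY ones; never block. */
static pool_carrier_t *
pool_search(pool_carrier_t *start, size_t want, int limit)
{
    pool_carrier_t *c = start;
    for (int i = 0; i < limit && c != NULL; i++) {
        if (!carrier_is_busy(c)
            && carrier_can_satisfy(c, want)
            && pool_try_fetch(c))
            return c;
        c = pool_prev(c);
    }
    return NULL;   /* search failed; caller creates a new carrier */
}
```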

### The bad cluster problem ###

Before OTP 17.4 the search algorithm had a problem, as the search always started
at the same position in the pool: the sentinel. This could lead to
contention between concurrent searching processes. But even worse, it
could lead to a "bad" state when searches fail at a high rate,
leading to new carriers being allocated instead. These new carriers
may later be inserted into the pool due to bad utilization. If the
frequency of insertions into the pool is higher than the frequency of
successful fetches from the pool, memory will eventually be exhausted.

This "bad" state consists of a cluster of small and/or highly
fragmented carriers located at the sentinel in the pool. The largest free
block in such a "bad" carrier is rather small, making it unable to satisfy
most allocation requests. As the search always started at the
sentinel, any such "bad" carriers that had been left in the pool would
eventually cluster together at the sentinel. All searches then first
have to skip past this cluster of "bad" carriers to reach a "good"
carrier. When the cluster grows to the same size as the search limit,
all searches will essentially fail.

To counter the "bad cluster" problem and also ease the contention, the
search now always starts by first looking at the allocator's *own*
carriers, that is, carriers that were initially created by the
allocator itself and later abandoned to the pool. If none of
our own abandoned carriers will do, then the search continues into the
pool, as before, to look for carriers created by other
allocators. However, if we have at least one abandoned carrier of our
own that could not satisfy the request, we can use that as the entry point
into the pool.

The result is that we prefer carriers created by the thread itself,
which is good for NUMA performance. And we get more entry points when
searching the pool, which will ease contention and clustering.

### Our own pooled tree ###

To do the first search among its own carriers, every allocator instance
has a `pooled_tree` of carriers. This tree is only accessed by the allocator
itself and can only contain its own carriers. When a carrier is
abandoned and put in the pool, it is also inserted into `pooled_tree`. This is
either done directly, if the carrier was already employed by its owner, or by
first passing it back to the owner via the delayed dealloc queue.

When we search our `pooled_tree` and find a carrier that is no longer in the
pool, we remove that carrier from `pooled_tree` and mark it as TRAITOR, as it is
now employed by a foreign allocator. We will not find any carriers in
`pooled_tree` that are marked as BUSY by other threads.

If no carrier in `pooled_tree` has a large enough free block, we search it again
to find any carrier that may act as an entry point into the shared list of all
pooled carriers. This is done in order to avoid, if possible, starting at the
sentinel and thereby ease the "bad clustering" problem.

Furthermore, the search for own carriers that are scheduled
for deallocation is done as the last search option. The idea is
that it is better to reuse a poorly utilized carrier than to
resurrect an empty carrier that was just about to be released back to
the OS.
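
Putting the pieces together, the overall search order might be
sketched as follows (pseudo-C with hypothetical helper names; it
mirrors the priorities described above, not the actual ERTS code):

```c
#include <stddef.h>

typedef struct carrier carrier_t;
typedef struct allctr allctr_t;

/* Hypothetical helpers for illustration. */
extern carrier_t *pooled_tree_search(allctr_t *a, size_t want);
extern carrier_t *pooled_tree_any(allctr_t *a);   /* pool entry point */
extern carrier_t *pool_search_from(carrier_t *entry, size_t want);
extern carrier_t *dealloc_scheduled_search(allctr_t *a, size_t want);
extern carrier_t *create_carrier(allctr_t *a, size_t want);

static carrier_t *get_carrier_space(allctr_t *a, size_t want)
{
    carrier_t *c, *entry;

    /* 1. Our own abandoned carriers first: NUMA friendly and free
     *    of contention, since pooled_tree is private to us. */
    if ((c = pooled_tree_search(a, want)) != NULL)
        return c;

    /* 2. Search the shared pool, preferably entering at one of our
     *    own pooled carriers; a NULL entry falls back to the
     *    sentinel as before. */
    entry = pooled_tree_any(a);
    if ((c = pool_search_from(entry, want)) != NULL)
        return c;

    /* 3. Last search option: reuse one of our own carriers that
     *    was scheduled for deallocation. */
    if ((c = dealloc_scheduled_search(a, want)) != NULL)
        return c;

    /* 4. Nothing fit: allocate a brand new carrier. */
    return create_carrier(a, want);
}
```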

### Result ###

The use of this strategy of abandoning carriers with poor utilization
and reusing them in allocator instances with an increased carrier
demand is extremely effective and completely eliminates the problems
that otherwise sometimes occurred when CPU load dropped while memory
load did not.

When using the `aoffcaobf` or `aoff` strategies compared to `gf` or
`bf`, we lose some performance since we get more modifications in the
data structure of free blocks. This performance penalty is however
reduced using the `aoffcbf` strategy. A trade-off between memory
consumption and performance is however inevitable, and it is up to
the user to decide what is most important.