Delayed Dealloc
===============

Problem
-------

An easy way to handle memory allocation in a multi-threaded
environment is to protect the memory allocator with a global lock that
threads performing memory allocations or deallocations have to hold
during the whole operation. This solution of course scales very
poorly, due to heavy lock contention. An improvement on this scheme is
to use multiple thread-specific instances of such an allocator; that
is, each thread allocates from its own allocator instance, which is
protected by a lock. In the general case, references to memory need to
be passed between threads, and whenever a thread deallocates memory
that originates from another thread's allocator instance, a lock
conflict is possible. In a system such as the Erlang VM, where memory
allocation and deallocation are frequent and references to memory are
frequently passed between threads, this solution will also scale
poorly due to lock contention.

Functionality Used to Address This Problem
------------------------------------------

In order to reduce contention due to locking of allocator instances,
we introduced completely lock-free instances tied to each scheduler
thread, plus an extra locked instance for other threads. The scheduler
threads in the system are expected to do the major part of the work;
other threads may still be needed, but should not perform any major
and/or time-critical work. The limited amount of contention that
appears on the locked allocator instance can more or less be
disregarded.

Since we still need to be able to pass references to memory between
scheduler threads, we need some way to manage this. An allocator
instance belonging to one scheduler thread is only allowed to be
manipulated by that scheduler thread. When other threads need to
deallocate memory originating from a foreign allocator instance, they
simply pass the memory block to a "message box" containing
deallocation jobs attached to the originating allocator instance. When
the owning scheduler thread detects such a deallocation job it
performs the actual deallocation.

The "message box" is implemented as a lock-free singly linked list
through the memory blocks to deallocate. The order of the elements in
this list is not important. Insertion of new free blocks is made
somewhere near the end of this list. Requiring that new blocks be
inserted exactly at the end would cause unnecessary contention when
large amounts of memory blocks are inserted simultaneously by multiple
threads.

The data structure referring to this singly linked list covers two
cache lines: one cache line containing information about the head of
the list, and one cache line containing information about the tail of
the list. This separation reduces cache line ping-ponging of the data
structure. The head of the list is only manipulated by the thread
owning the allocator instance, and the tail is manipulated by other
threads inserting deallocation jobs.
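
The layout described above can be sketched in C as follows. This is an
illustrative sketch, not the actual erts code: the type and field names
(`ddq`, `ddq_block`) are invented, and a 64-byte cache line is assumed.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache line size */

typedef struct ddq_block {
    struct ddq_block *_Atomic next;  /* link through blocks to deallocate */
} ddq_block;

typedef struct {
    /* Cache line 1: touched only by the owning scheduler thread. */
    alignas(CACHE_LINE) struct {
        ddq_block *first;      /* oldest element in the list */
        ddq_block *unref_end;  /* first block others may still refer to */
    } head;
    /* Cache line 2: touched by threads inserting deallocation jobs. */
    alignas(CACHE_LINE) struct {
        ddq_block *_Atomic last;  /* points at, or near, the list end */
    } tail;
} ddq;
```

The `alignas` qualifiers are what put the head and tail parts on
separate cache lines, so that inserting threads and the owning thread
do not invalidate each other's cache lines.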

### Tail ###

In the tail part of the data structure we find a pointer to the last
element of the list, or at least to something near the end of the
list. In the uncontended case it points to the actual end of the list,
but when simultaneous insert operations are performed it points to
something near the end.

When inserting an element, a thread tries to write a pointer to the
new element into the next pointer of the element pointed to by the
last pointer. This is done using an atomic compare-and-swap that
expects the next pointer to be `NULL`. If this succeeds, the thread
performing the operation moves the last pointer to point to the newly
inserted element.

If the atomic compare-and-swap described above fails, the last pointer
did not point to the last element. In this case we need to insert the
new element somewhere between the element that the last pointer
pointed to and the actual last element. Done this way, the last
pointer will eventually end up at the last element once threads stop
adding new elements. When an insert near the end fails, the inserting
thread sometimes moves on to the next element and sometimes retries
with the same element. This spreads the inserted elements out during
heavy contention; that is, we try to spread the modifications over
different memory locations instead of letting all threads keep trying
to modify the same location in memory.
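
The insert path can be sketched with C11 atomics roughly as below.
Again the names are invented for illustration; memory-order tuning is
omitted, and where the real implementation sometimes retries the same
node to spread contended writes, this sketch always steps forward.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct blk { struct blk *_Atomic next; } blk;
typedef struct { blk *_Atomic last; } ddq_tail;

/* Insert new_blk somewhere near the end of the list. */
static void ddq_enqueue(ddq_tail *t, blk *new_blk)
{
    blk *node = atomic_load(&t->last);
    atomic_store(&new_blk->next, NULL);
    for (;;) {
        blk *expected = NULL;
        /* Try to hook new_blk in after 'node' (hopefully the last
         * element): CAS its next pointer from NULL to new_blk. */
        if (atomic_compare_exchange_strong(&node->next, &expected,
                                           new_blk)) {
            /* Linked in; try to advance 'last'. Failure is fine: it
             * just means another thread already moved it. */
            atomic_compare_exchange_strong(&t->last, &node, new_blk);
            return;
        }
        /* 'node' was not last: step towards the end and retry. The
         * failed CAS loaded node->next into 'expected' for us. */
        node = expected;
    }
}
```

Note that `last` is only ever advanced, never required to be exact,
which is what makes lagging behind under contention harmless.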

### Head ###

The head contains pointers to the beginning of the list (`head.first`)
and to the first block which other threads may refer to
(`head.unref_end`). Blocks between these pointers are referred to only
by the head part of the data structure, which is only used by the
thread owning the allocator instance. When these two pointers are not
equal, the thread owning the allocator instance deallocates block
after block until `head.first` reaches `head.unref_end`.
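
The owner-side loop is essentially the following sketch (hypothetical
names; the allocator-specific free call is left as a comment to keep
the sketch self-contained):

```c
#include <stdatomic.h>

typedef struct blk { struct blk *_Atomic next; } blk;

/* Head part: only ever touched by the owning scheduler thread. */
typedef struct {
    blk *first;      /* oldest element */
    blk *unref_end;  /* first element other threads may still see */
} ddq_head;

/* Deallocate every block that no other thread can reference, i.e.
 * everything from head.first up to (but not including) head.unref_end.
 * Returns the number of blocks handled. */
static int ddq_handle_delayed(ddq_head *h)
{
    int n = 0;
    while (h->first != h->unref_end) {
        blk *b = h->first;
        h->first = atomic_load(&b->next);
        /* ... hand b back to the owning allocator instance here ... */
        n++;
    }
    return n;
}
```

No atomic read-modify-write operations are needed here, since no other
thread can touch the blocks before `head.unref_end`.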

We of course periodically need to move `head.unref_end` closer to the
end in order to be able to continue deallocating memory blocks. Since
all threads inserting new elements in the linked list enter the list
via the last pointer, we can use this knowledge: if we call
`erts_thr_progress_later()` and wait until we have reached that thread
progress, we know that no managed threads can refer to elements up to
the element pointed to by the last pointer at the time when we called
`erts_thr_progress_later()`. This is because all managed threads must
have left the code implementing this functionality at least once, and
they always enter the list via the last pointer. The `tail.next` field
contains information about the next `head.unref_end` pointer and the
thread progress that needs to be reached before we can move
`head.unref_end`.
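
The gating logic can be illustrated as follows. This is a strongly
simplified sketch: thread progress is modeled by a bare counter and
invented `fake_progress_*` helpers, whereas the real VM uses the
`erts_thr_progress_*` API; the struct and function names are likewise
illustrative.

```c
#include <stdatomic.h>

/* Stand-in for the thread progress functionality. */
typedef unsigned long thr_prgr_val;
_Atomic thr_prgr_val fake_progress;

static thr_prgr_val fake_progress_later(void) {
    return atomic_load(&fake_progress) + 2;  /* some future value */
}
static int fake_progress_has_reached(thr_prgr_val v) {
    return atomic_load(&fake_progress) >= v;
}

typedef struct blk { struct blk *next; } blk;

/* What tail.next conceptually stores: the candidate new unref_end and
 * the thread progress that must be reached before committing it. */
typedef struct {
    blk *next_unref_end;
    thr_prgr_val later;
} unref_move;

/* Record a candidate taken from the last pointer, and the progress
 * point after which no managed thread can still refer to it. */
static void plan_unref_move(unref_move *m, blk *last_seen) {
    m->next_unref_end = last_seen;
    m->later = fake_progress_later();
}

/* Returns the new unref_end: moved if safe, unchanged otherwise. */
static blk *try_commit_unref_move(unref_move *m, blk *cur_unref_end) {
    if (fake_progress_has_reached(m->later))
        return m->next_unref_end;
    return cur_unref_end;
}
```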

Unfortunately it is not only threads managed by the thread progress
functionality that may insert memory blocks; other threads also need
to be taken care of. Other threads will not use this functionality as
frequently as managed threads, so using a less efficient scheme for
them is not that big of a problem. In order to handle unmanaged
threads we use two reference counters. When an unmanaged thread enters
this implementation it increments the reference counter currently
used, and when it leaves it decrements the same reference counter.
When the consumer thread calls `erts_thr_progress_later()` in order to
determine when it is safe to move `head.unref_end`, it also swaps
reference counters for unmanaged threads. The previously current
counter represents outstanding references from the time up to this
point; the newly current counter represents future references
following this point. When the consumer thread detects both that the
desired thread progress has been reached and that the previously
current reference counter has reached zero, it is safe to move
`head.unref_end`.

The reason for using two reference counters is that we need to know
that the reference counter eventually will reach zero. If we used only
one reference counter, it could potentially be held above zero forever
by different unmanaged threads.
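
A sketch of the two-counter scheme, with invented names and C11
atomics; the exact ordering subtleties between the index flip and
concurrent increments are handled by the real implementation and
ignored here:

```c
#include <stdatomic.h>

typedef struct {
    _Atomic long refc[2];
    _Atomic int current;   /* index of counter used by new entries */
} umrefc;

/* Called by an unmanaged thread before it inserts a deallocation job;
 * returns the index it took so the matching leave decrements the same
 * counter even if the consumer flips 'current' in between. */
static int umrefc_enter(umrefc *u)
{
    int ix = atomic_load(&u->current);
    atomic_fetch_add(&u->refc[ix], 1);
    return ix;
}

static void umrefc_leave(umrefc *u, int ix)
{
    atomic_fetch_sub(&u->refc[ix], 1);
}

/* Called by the consumer at the same point where it calls
 * erts_thr_progress_later(): flip the current index and return the
 * old one. Once refc[old] drains to zero, no references from before
 * the flip remain. */
static int umrefc_flip(umrefc *u)
{
    int old = atomic_load(&u->current);
    atomic_store(&u->current, !old);
    return old;
}

static int umrefc_drained(umrefc *u, int old)
{
    return atomic_load(&u->refc[old]) == 0;
}
```

Because new entries only increment the new counter after the flip, the
old counter can only decrease, which is exactly the guarantee the
single-counter scheme lacks.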

### Empty List ###

If no new memory blocks are inserted into the list, it should
eventually be emptied. All pointers to the list, however, expect to
always point to something. This is solved by inserting an empty
"marker" element, whose only purpose is to be there in the absence of
other elements. That is, when the list is empty it contains only this
"marker" element.

### Contention ###

When elements are continuously inserted by threads not owning the
allocator instance, the thread owning the allocator instance is able
to work more or less undisturbed by other threads at the head end of
the list. At the tail end, large amounts of simultaneous inserts may
cause contention, but we reduce such contention by spreading inserts
of new elements near the end instead of requiring all new elements to
be inserted exactly at the end.

### Schedulers and The Locked Allocator Instance ###

The locked allocator instance used by non-scheduler threads also has a
message box for deallocation jobs, just as all the other allocator
instances do. The reason for this is that other threads may allocate
memory and pass it to a scheduler that then needs to deallocate it. We
do not want the scheduler to have to wait for the lock on this locked
instance. Since locked instances also have message boxes for
deallocation jobs, the scheduler can simply insert the job and avoid
the locking.

### A Benchmark Result ###

When running the ehb benchmark, large amounts of messages are passed
around between schedulers. All message passing will in one way or
another cause memory allocation and deallocation. Since messages are
passed between different schedulers, we get contention on the
allocator instances where the messages were allocated. With the
introduction of the delayed dealloc feature, we got a speedup of
25-45%, depending on the configuration of the benchmark, when running
on a relatively new machine with an Intel i7 quad-core processor with
hyper-threading, using 8 schedulers.