Delayed Dealloc
===============

Problem
-------

An easy way to handle memory allocation in a multi-threaded
environment is to protect the memory allocator with a global lock
which threads performing memory allocations or deallocations have to
have locked during the whole operation. This solution of course scales
very poorly, due to heavy lock contention. An improved solution of
this scheme is to use multiple thread-specific instances of such an
allocator. That is, each thread allocates in its own allocator
instance, which is protected by a lock. In the general case,
references to memory need to be passed between threads. When a thread
needs to deallocate memory that originates from another thread's
allocator instance, a lock conflict is possible. In a system such as
the Erlang VM, where memory allocation/deallocation is frequent and
references to memory are also passed around between threads, this
solution will also scale poorly due to lock contention.

Functionality Used to Address This Problem
------------------------------------------

In order to reduce contention due to locking of allocator instances,
we introduced completely lock-free instances tied to each scheduler
thread, and an extra locked instance for other threads. The scheduler
threads in the system are expected to do the major part of the
work. Other threads may still be needed, but should not perform any
major and/or time-critical work. The limited amount of contention that
appears on the locked allocator instance can more or less be
disregarded.

Since we still need to be able to pass references to memory between
scheduler threads, we need some way to manage this. An allocator
instance belonging to one scheduler thread is only allowed to be
manipulated by that scheduler thread. When other threads need to
deallocate memory originating from a foreign allocator instance, they
just pass the memory block to a "message box" containing deallocation
jobs attached to the originating allocator instance. When a scheduler
thread detects such a deallocation job it performs the actual
deallocation.

The "message box" is implemented using a lock-free singly linked list
through the memory blocks to deallocate. The order of the elements in
this list is not important. Insertion of new free blocks will be made
somewhere near the end of this list. Requiring that new blocks always
be inserted at the very end would cause unnecessary contention when
large amounts of memory blocks are inserted simultaneously by multiple
threads.

The data structure referring to this singly linked list covers two
cache lines. One cache line contains information about the head of the
list, and one cache line contains information about the tail of the
list. This is in order to reduce cache line ping-ponging of this data
structure. The head of the list will only be manipulated by the thread
owning the allocator instance, and the tail will be manipulated by
other threads inserting deallocation jobs.

### Tail ###

In the tail part of the data structure we find a pointer to the last
element of the list, or at least something that is near the end of the
list. In the uncontended case it will point to the end of the list,
but when simultaneous insert operations are performed it will point to
something near the end of the list.

When inserting an element, one tries to write a pointer to the new
element into the next pointer of the element pointed to by the last
pointer. This is done using an atomic compare-and-swap that expects
the next pointer to be `NULL`. If this succeeds, the thread performing
this operation moves the last pointer to point to the newly inserted
element.

If the atomic compare-and-swap described above failed, the last
pointer didn't point to the last element. In this case the new element
needs to be inserted somewhere between the element that the last
pointer pointed to and the actual last element. If we do it this way,
the last pointer will eventually end up at the last element when
threads stop adding new elements. When trying to insert somewhere near
the end and failing to do so, the inserting thread sometimes moves on
to the next element and sometimes tries with the same element
again. This is in order to spread the inserted elements out during
heavy contention. That is, we try to spread the modifications of
memory to different locations instead of letting all threads continue
to try to modify the same location in memory.
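
To make the description more concrete, below is a minimal sketch in
C11 of what such a queue data structure could look like. All names,
the assumed cache line size, and the use of C11 atomics are ours for
illustration only; the actual ERTS implementation differs in its
details. The `tail.next` fields and the `um_refc` counters belong to
the scheme described in the Head section below.

    #include <stdatomic.h>
    #include <stddef.h>

    #define CACHE_LINE_SIZE 64      /* assumed cache line size */

    typedef unsigned long thr_prgr_val; /* stand-in for a thread
                                         * progress value */

    /* A deallocation job; the list is linked through the memory
     * blocks themselves. */
    typedef struct ddq_node_ {
        _Atomic(struct ddq_node_ *) next;
    } ddq_node;

    typedef struct {
        /* Head part: only touched by the thread owning the
         * allocator instance. */
        _Alignas(CACHE_LINE_SIZE) struct {
            ddq_node *first;     /* first element of the list */
            ddq_node *unref_end; /* first element that other threads
                                  * may still refer to */
        } head;

        /* Tail part: touched by threads inserting deallocation
         * jobs. */
        _Alignas(CACHE_LINE_SIZE) struct {
            _Atomic(ddq_node *) last; /* last element, or near it */
            /* Pending information for the next move of
             * head.unref_end; see the Head section below. */
            struct {
                ddq_node *unref_end;       /* candidate new unref_end */
                thr_prgr_val thr_progress; /* progress to wait for */
                int um_refc_ix;            /* counter to wait on */
            } next;
            _Atomic(long) um_refc[2]; /* reference counters for
                                       * unmanaged threads */
            _Atomic(int) um_refc_ix;  /* which of the two is current */
        } tail;

        ddq_node marker; /* dummy element kept in the list when it
                          * would otherwise be empty; see the Empty
                          * List section below */
    } ddq;

    /* Initialize the queue containing only the marker element. */
    static void ddq_init(ddq *q)
    {
        atomic_init(&q->marker.next, NULL);
        q->head.first = &q->marker;
        q->head.unref_end = &q->marker;
        atomic_init(&q->tail.last, &q->marker);
        atomic_init(&q->tail.um_refc[0], 0);
        atomic_init(&q->tail.um_refc[1], 0);
        atomic_init(&q->tail.um_refc_ix, 0);
    }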
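
Continuing the sketch above, an insert operation along the lines just
described could look as follows. The real implementation uses ERTS's
own atomic operations with weaker memory barriers and a tunable
heuristic for when to move on to the next element; here sequentially
consistent C11 atomics and a simple alternation stand in for those.

    /* Insert the deallocation job n into queue q; callable from any
     * thread. A sketch under the assumptions above. */
    static void ddq_enqueue(ddq *q, ddq_node *n)
    {
        ddq_node *exp = NULL;
        ddq_node *last = atomic_load(&q->tail.last);
        int i = 0;

        atomic_store(&n->next, NULL);

        /* Expect 'last' to really be the last element, i.e. that its
         * next pointer is NULL, and try to hook n on after it. */
        if (atomic_compare_exchange_strong(&last->next, &exp, n)) {
            /* Succeeded at the true end: move the last pointer. */
            atomic_store(&q->tail.last, n);
            return;
        }

        /* 'last' was not the last element. Splice n in somewhere
         * between 'last' and the actual last element; the order of
         * the elements does not matter. On each CAS failure, exp has
         * been refreshed to the current successor of 'last'. */
        for (;;) {
            atomic_store(&n->next, exp);
            if (atomic_compare_exchange_strong(&last->next, &exp, n))
                return; /* n now sits between 'last' and exp */
            /* Sometimes retry with the same element, sometimes move
             * on to the next one, to spread modifications over
             * different memory locations under heavy contention. */
            if (i++ & 1) {
                last = exp;
                exp = atomic_load(&last->next);
            }
        }
    }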

### Head ###

The head contains pointers to the beginning of the list
(`head.first`), and to the first block which other threads may refer
to (`head.unref_end`). Blocks between these pointers are only referred
to by the head part of the data structure, which is only used by the
thread owning the allocator instance. When these two pointers are not
equal, the thread owning the allocator instance deallocates block
after block until `head.first` reaches `head.unref_end`.

We of course periodically need to move `head.unref_end` closer to the
end in order to be able to continue deallocating memory blocks. Since
all threads inserting new elements into the linked list enter the list
using the last pointer, we can use this knowledge. If we call
`erts_thr_progress_later()` and wait until we have reached that thread
progress, we know that no managed threads can refer to the elements up
to the element pointed to by the last pointer at the time when we
called `erts_thr_progress_later()`. This is because all managed
threads must have left the code implementing this at least once, and
they always enter the list via the last pointer. The `tail.next` field
contains information about the next `head.unref_end` pointer and the
thread progress that needs to be reached before we can move
`head.unref_end`.

Unfortunately it is not only threads managed by the thread progress
functionality that may insert memory blocks. Other threads also need
to be taken care of. Other threads will not be as frequent users of
this functionality as managed threads, so using a less efficient
scheme for them is not that big of a problem. In order to handle
unmanaged threads we use two reference counters. When an unmanaged
thread enters this implementation it increments the reference counter
currently used, and when it leaves this implementation it decrements
the same reference counter. When the consumer thread calls
`erts_thr_progress_later()` in order to determine when it is safe to
move `head.unref_end`, it also swaps reference counters for unmanaged
threads. The previous current counter represents outstanding
references from the time up to this point. The new current counter
represents future references following this point. When the consumer
thread detects that we have both reached the desired thread progress
and that the previous current reference counter has reached zero, it
is safe to move `head.unref_end`.

The reason for using two reference counters is that we need to know
that the reference counter eventually will reach zero. If we only used
one reference counter, it could potentially be held above zero forever
by different unmanaged threads.
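
Continuing the same sketch, the unmanaged-thread entry point and the
consumer side could look roughly as follows. `thr_progress_later()`
and `thr_progress_has_reached()` are hypothetical stand-ins for the
ERTS thread progress functionality mentioned above, and
`dealloc_block()` stands in for the actual deallocation.

    /* Hypothetical stand-ins for ERTS thread progress functions. */
    extern thr_prgr_val thr_progress_later(void);
    extern int thr_progress_has_reached(thr_prgr_val val);
    extern void dealloc_block(ddq_node *n); /* actual deallocation */

    /* Entry point for unmanaged threads: bracket the insert with the
     * currently used reference counter. Note that we decrement the
     * same counter that we incremented, even if the current index is
     * swapped in the meantime. */
    static void ddq_enqueue_unmanaged(ddq *q, ddq_node *n)
    {
        int ix = atomic_load(&q->tail.um_refc_ix);
        atomic_fetch_add(&q->tail.um_refc[ix], 1);
        ddq_enqueue(q, n);
        atomic_fetch_sub(&q->tail.um_refc[ix], 1);
    }

    /* Owner only: deallocate all blocks known not to be referred to
     * by any other thread. */
    static void ddq_dealloc_unreferenced(ddq *q)
    {
        while (q->head.first != q->head.unref_end) {
            ddq_node *n = q->head.first;
            q->head.first = atomic_load(&n->next);
            if (n != &q->marker) /* the marker is reinserted when the
                                  * list runs empty; not shown here */
                dealloc_block(n);
        }
    }

    /* Owner only: begin a new round of moving head.unref_end. */
    static void ddq_prepare_unref_move(ddq *q)
    {
        int ix = atomic_load(&q->tail.um_refc_ix);
        q->tail.next.unref_end = atomic_load(&q->tail.last);
        q->tail.next.thr_progress = thr_progress_later();
        q->tail.next.um_refc_ix = ix;  /* must reach zero */
        /* Swap counters: future unmanaged references go to the other
         * counter from now on. */
        atomic_store(&q->tail.um_refc_ix, !ix);
    }

    /* Owner only: complete the round once the thread progress has
     * been reached and the previous reference counter has dropped to
     * zero; returns nonzero on success. */
    static int ddq_try_unref_move(ddq *q)
    {
        if (!thr_progress_has_reached(q->tail.next.thr_progress))
            return 0;
        if (atomic_load(&q->tail.um_refc[q->tail.next.um_refc_ix]) != 0)
            return 0;
        q->head.unref_end = q->tail.next.unref_end;
        return 1;
    }

In such a scheme the owner would call `ddq_dealloc_unreferenced()` as
part of its normal work loop, and start a new round with
`ddq_prepare_unref_move()` whenever the previous round has completed.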

### Empty List ###

If no new memory blocks are inserted into the list, it should
eventually be emptied. All pointers to the list, however, expect to
always point to something. This is solved by inserting an empty
"marker" element, which only has the purpose of being there in the
absence of other elements. That is, when the list is empty it only
contains this "marker" element.

### Contention ###

When elements are continuously inserted by threads not owning the
allocator instance, the thread owning the allocator instance will be
able to work more or less undisturbed by other threads at the head end
of the list. At the tail end, large amounts of simultaneous inserts
may cause contention, but we reduce such contention by spreading
inserts of new elements near the end instead of requiring all new
elements to be inserted at the end.

### Schedulers and The Locked Allocator Instance ###

The locked allocator instance for use by non-scheduler threads also
has a message box for deallocation jobs, just as all the other
allocator instances. The reason for this is that other threads may
allocate memory and pass it to a scheduler that then needs to
deallocate it. We do not want the scheduler to have to wait for the
lock on this locked instance. Since locked instances also have message
boxes for deallocation jobs, the scheduler can just insert the job and
avoid the locking.

### A Benchmark Result ###

When running the ehb benchmark, large amounts of messages are passed
around between schedulers. All message passing will in one way or
another cause memory allocation and deallocation. Since messages are
passed between different schedulers, we get contention on the
allocator instances where messages were allocated. With the
introduction of the delayed dealloc feature, we got a speedup of
25-45%, depending on the configuration of the benchmark, when running
on a relatively new machine with an Intel i7 quad-core processor with
hyper-threading, using 8 schedulers.