Process Management Optimizations
================================

Problems
--------

Early versions of the SMP support for the runtime system relied
entirely on locking to protect data accesses from multiple
threads. In some cases this is unproblematic, but in other cases it
really is not. It complicates the code, since one must ensure that all
needed locks are actually held, and that all locks are acquired in an
order that cannot deadlock. Acquiring locks in the right order often
also involves releasing locks already held, forcing threads to reread
data they have already read. This is a good recipe for bugs. Trying to
use more fine-grained locking in order to increase the possible
parallelism in the system makes the complexity situation even
worse. Having to acquire a bunch of locks for an operation also often
causes heavy lock contention, which in turn causes poor scalability.

Management of processes internally in the runtime system suffered from
these problems. When changing the state of a process, for example from
`waiting` to `runnable`, a lock on the process had to be held. When
inserting a process into a run queue, the lock protecting the run
queue also had to be held. When migrating a process from one run queue
to another, locks on both run queues and on the process had to be
held.

This last example is quite common during normal operation. For
example, when a scheduler thread runs out of work, it tries to steal
work from another scheduler thread's run queue. Searching for a victim
to steal from involved a lot of juggling of run queue locks, and the
actual theft was finalized by locking both run queues and the
process. When one scheduler runs out of work, others often do too,
causing lots of lock contention.

Solution
--------

### Process ###

In order to avoid these situations we wanted to be able to do most of
the fundamental operations on a process without having to acquire a
lock on the process. Some examples of such fundamental operations are:
moving a process between run queues, detecting whether it needs to be
inserted into a run queue, and detecting whether it is alive.

All of the information in the process structure that was needed by
these operations was protected by the process `status` lock, but the
information was spread across a number of fields. These fields
typically held state values drawn from a small set of possible
states. By reorganizing this information a bit, we could *easily* fit
it into a 32-bit wide field of bit flags (only 12 flags were
needed). By moving this information we could remove five 32-bit wide
fields and one pointer field from the process structure! The move also
enabled us to easily read and change the state using atomic memory
operations.

### Run Queue ###

As with processes, we wanted to be able to do the most fundamental
operations on a run queue without having to acquire a lock on it. The
most important such operation is determining whether a process should
be enqueued in a specific run queue or not. This involves being able
to read the actual load as well as load balancing information.

The load balancing functionality is triggered at repeated fixed
intervals, and more or less strives to even out run queue lengths over
the system. When balancing is triggered, information about every run
queue is gathered, and migration paths and run queue length limits are
set up. Migration paths and limits stay fixed until the next balancing
has been done. The most important piece of information about each run
queue is its maximum length since the last balancing. All of this
information was previously stored in the run queues themselves.
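As a rough illustration of the bit-flag state field from the Process
section above, the following is a minimal C11 sketch, not the
runtime's actual code; the flag names, the `Process` type, and the
`make_active` helper are hypothetical. It shows how a single atomic
read-modify-write can both change the state and tell the caller
whether an enqueue is needed:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical state flags; the real runtime packs 12 such flags
 * into one 32-bit field. */
#define PSFLG_FREE      (1u << 0)  /* process has exited            */
#define PSFLG_SUSPENDED (1u << 1)  /* process is suspended          */
#define PSFLG_IN_RUNQ   (1u << 2)  /* already linked in a run queue */
#define PSFLG_ACTIVE    (1u << 3)  /* has work to do                */

typedef struct {
    _Atomic uint32_t state;  /* replaces several lock-protected fields */
    /* ... rest of the process structure ... */
} Process;

/* Atomically mark the process active and report whether the caller
 * must enqueue it: enqueue only if it was not already active or in a
 * run queue, and has not exited. No process lock is needed. */
int make_active(Process *p)
{
    uint32_t old = atomic_fetch_or(&p->state, PSFLG_ACTIVE);
    return !(old & (PSFLG_ACTIVE | PSFLG_IN_RUNQ | PSFLG_FREE));
}
```

Because `atomic_fetch_or` returns the previous flag word, the caller
learns the old state and sets the new one in one indivisible step,
which is exactly what lets the "do we need to insert it into a run
queue?" question be answered without taking the process lock.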

When a process becomes runnable, for example due to the reception of a
message, we need to determine which run queue to enqueue it
in. Previously this involved at least locking the run queue that the
process was currently assigned to, while holding the status lock on
the process. Depending on the load, we sometimes also had to acquire a
lock on another run queue in order to determine whether the process
should be migrated to that run queue or not.

In order to be able to decide which run queue to use without having to
lock any run queues, we moved all fixed balancing information, that
is, the migration paths and run queue limits, out of the run queues
into a global memory block. Information that needs to be frequently
updated, such as the maximum run queue length, was kept in the run
queues, but instead of operating on it under locks we now access it
using atomic memory operations. This made it possible to first
determine which run queue to use without locking any run queues, and
only then lock the chosen run queue and insert the process.

#### Fixed Balancing Information ####

When determining which run queue to choose, we need to read the fixed
balancing information that we moved out of the run queues. This
information is global and read-only between load balancing operations,
but is changed during a load balancing. We did not want to introduce a
global lock that needs to be acquired when accessing this
information. A reader-optimized rwlock could avoid some of the
overhead, since the data is most frequently read, but since it is
also *very* frequently read, the rwlock would unavoidably cause
disruption during load balancing. The likelihood of a large disruption
due to this also increases as the number of schedulers grows.
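One way to sidestep both the global lock and the rwlock is to publish
each new version of the data as an immutable snapshot behind an
atomically updated pointer. The following is a minimal C11 sketch of
that pattern under stated assumptions: the `BalanceInfo` layout, the
names, and the fixed array sizes are all hypothetical, and
`memory_order_acquire` is used as a portable stand-in for the data
dependency barrier, which on most architectures costs nothing:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical layout of the fixed balancing information: written
 * once per load-balancing pass, then read-only until the next pass. */
typedef struct {
    int run_queue_limit[64];   /* per-run-queue length limits       */
    int migration_path[64];    /* hypothetical migration-path table */
} BalanceInfo;

/* Globally published pointer to the current snapshot. */
static _Atomic(BalanceInfo *) current_info;

/* Writer (the balancing pass): fill in a completely fresh block,
 * then publish it. The release store plays the role of the write
 * memory barrier: readers that observe the new pointer also observe
 * the block's fully written contents. */
void publish_info(BalanceInfo *fresh)
{
    atomic_store_explicit(&current_info, fresh, memory_order_release);
}

/* Reader (a scheduler): one atomic load, then dereference. */
const BalanceInfo *read_info(void)
{
    return atomic_load_explicit(&current_info, memory_order_acquire);
}
```

The key property is that readers never block writers and vice versa:
a balancing pass builds the new block in private, and a single pointer
store makes it visible.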

Instead of using a global lock to protect modifications of this
information, we write a completely new version of it at each load
balancing. The new version is written in a different memory block than
the previous one, and is published by issuing a write memory barrier
and then storing a pointer to the new memory block in a global
variable using an atomic write operation.

When schedulers need to read this information, they read the pointer
to the currently used version using an atomic read operation, and then
issue a data dependency read barrier, which on most architectures is a
no-op. That is, there is very little overhead in getting access to
this information.

Instead of allocating and deallocating memory blocks for the different
versions of the balancing information, we keep old memory blocks and
reuse them when it is safe to do so. To determine when it is safe to
reuse a block, we use the thread progress functionality, which ensures
that no thread still has a reference to the memory block when we reuse
it.

#### Be Less Aggressive ####

We implemented a test version using lock-free run queues. This
implementation did, however, not perform as well as the version using
one lock per run queue. The reason for this was not investigated
thoroughly enough to say why. Since the locked version performed
better we kept it, at least for now. The lock-free version did,
however, force us to find other solutions, some of which we kept.

Previously, when a process in a run queue got suspended, we removed it
from the queue straight away. This involved locking the process,
locking the run queue, and then unlinking the process from the doubly
linked list implementing the queue. Removing a process from a
lock-free queue gets really complicated. Instead of removing the
process from the queue, we just leave it in the queue and mark it as
suspended.
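The suspend-in-place idea can be sketched roughly as follows; this is
an illustration only, with a hypothetical `Proc` type, flag names, and
a simple singly linked queue rather than the runtime's actual data
structure. Suspension just sets a flag, and the filtering happens only
when the queue entry comes up for execution:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define PSFLG_SUSPENDED (1u << 0)  /* hypothetical flag name */

typedef struct Proc {
    _Atomic uint32_t state;
    struct Proc *next;             /* run-queue link */
} Proc;

/* Suspending and resuming touch only the process's own flag word;
 * the run queue is left completely alone. */
void suspend_proc(Proc *p) { atomic_fetch_or(&p->state, PSFLG_SUSPENDED); }
void resume_proc(Proc *p)  { atomic_fetch_and(&p->state, ~PSFLG_SUSPENDED); }

/* Dequeue for execution: pop entries, dropping any that are still
 * suspended, until a runnable process (or an empty queue) is found. */
Proc *dequeue_runnable(Proc **head)
{
    while (*head != NULL) {
        Proc *p = *head;
        *head = p->next;
        if (!(atomic_load(&p->state) & PSFLG_SUSPENDED))
            return p;   /* runnable (possibly resumed meanwhile) */
        /* still suspended: just drop it */
    }
    return NULL;
}
```

The design choice here is to pay a cheap check at dequeue time instead
of an expensive multi-lock unlink at suspend time.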
When the process is later selected for execution, we check whether it
is suspended and, if so, simply drop it. During its time in the queue
it might also get resumed again; in that case we execute it when it
gets selected for execution.

By keeping this part when reverting back to a locked implementation,
we could remove a pointer field from each process structure, and avoid
unnecessary operations on the process and the queue that might cause
contention.

### Combined Modifications ###

By combining the modifications of the process state management and the
run queue management, we can do large parts of the work involved in
managing processes, with regard to scheduling and migration, without
holding any locks at all. In these situations we previously had to
hold multiple locks. This of course caused a lot of rewrites across
large parts of the runtime system, but the rewrite both simplified the
code and eliminated locking in a number of places. The major benefit
is, of course, reduced contention.

### A Benchmark Result ###

When running the chameneosredux benchmark, schedulers frequently run
out of work and try to steal work from each other. That is, they
either succeed in migrating processes or at least try to migrate them,
which is exactly the scenario we wanted to optimize. With these
improvements, we got a speedup of 25-35% when running this benchmark
on a relatively new machine with an Intel i7 quad-core processor with
hyper-threading, using 8 schedulers.