Process Management Optimizations
================================

Problems
--------

Early versions of the SMP support for the runtime system relied
completely on locking in order to protect data accesses from multiple
threads. In some cases this is unproblematic, but in some cases it
really is a problem. It complicates the code: one has to ensure that
all locks needed are actually held, and that all locks are acquired in
such an order that no deadlocks can occur. Acquiring locks in the
right order often also involves releasing locks already held, forcing
threads to reread data they have already read. A good recipe for
creating bugs. Trying to use more fine-grained locking in order to
increase possible parallelism in the system makes the complexity
situation even worse. Having to acquire a bunch of locks when doing
operations also often causes heavy lock contention, which causes poor
scalability.

Management of processes internally in the runtime system suffered from
these problems. When changing the state of a process, for example from
`waiting` to `runnable`, a lock on the process had to be held. When
inserting a process into a run queue, a lock protecting the run queue
also had to be held. When migrating a process from one run queue to
another, locks on both run queues and on the process had to be held.

This last example is quite a common case during normal operation. For
example, when a scheduler thread runs out of work it tries to steal
work from another scheduler thread's run queue. Searching for a victim
to steal from involved a lot of juggling of run queue locks, and the
actual theft was finalized by locking both run queues and the
process. When one scheduler runs out of work, often others do too,
causing lots of lock contention.

Solution
--------

### Process ###

In order to avoid these situations we wanted to be able to do most of
the fundamental operations on a process without having to acquire a
lock on the process. Some examples of such fundamental operations are:
moving a process between run queues, detecting whether we need to
insert it into a run queue or not, and detecting whether it is alive
or not.

All of the information in the process structure that was needed by
these operations was protected by the process `status` lock, but the
information was spread across a number of fields. The fields used were
typically state fields that could each contain a small number of
different states. By reordering this information a bit we could
*easily* fit it into a 32-bit wide field of bit flags (only 12 flags
were needed). By moving this information we could remove five 32-bit
wide fields and one pointer field from the process structure! The move
also enabled us to easily read and change the state using atomic
memory operations.

### Run Queue ###

As with processes, we wanted to be able to do the most fundamental
operations on a run queue without having to acquire a lock on it. The
most important one is being able to determine whether we should
enqueue a process in a specific run queue or not. This involves being
able to read the actual load and the load balancing information.

The load balancing functionality is triggered at repeated fixed
intervals. The load balancing more or less strives to even out run
queue lengths over the system. When balancing is triggered,
information about every run queue is gathered, and migration paths and
run queue length limits are set up. Migration paths and limits are
fixed until the next balancing has been done. The most important
information about each run queue is the maximum run queue length since
the last balancing. All of this information was previously stored in
the run queues themselves.

When a process has become runnable, for example due to the reception
of a message, we need to determine which run queue to enqueue it
in. Previously this at least involved locking the run queue that the
process currently was assigned to while holding the status lock on the
process. Depending on load, we sometimes also had to acquire a lock on
another run queue in order to determine whether the process should be
migrated to that run queue or not.

In order to be able to decide which run queue to use without having to
lock any run queues, we moved all fixed balancing information, that
is, migration paths and run queue limits, out of the run queues into a
global memory block. Information that needs to be frequently updated,
like for example the maximum run queue length, was kept in the run
queues, but instead of operating on this information under locks we
now use atomic memory operations when accessing it. This made it
possible to first determine which run queue to use, without locking
any run queues, and once decided, lock the chosen run queue and insert
the process.

#### Fixed Balancing Information ####

When determining which run queue to choose we need to read the fixed
balancing information that we moved out of the run queues. This
information is global, read only between load balancing operations,
but changed during a load balancing. We do not want to introduce a
global lock that needs to be acquired when accessing this
information. A reader optimized rwlock could avoid some of the
overhead, since the data is read far more often than it is written,
but it would unavoidably cause disruption during load balancing,
precisely because the information is so frequently read. The
likelihood of a large disruption due to this also increases as the
number of schedulers grows.

Instead of using a global lock protecting modifications of this
information, we write a completely new version of it at each load
balancing. The new version is written in a memory block other than the
previous one, and published by issuing a write memory barrier and then
storing a pointer to the new memory block in a global variable using
an atomic write operation.

When schedulers need to read this information, they read the pointer
to the currently used information using an atomic read operation, and
then issue a data dependency read barrier, which on most architectures
is a no-op. That is, there is very little overhead in getting access
to this information.

Instead of allocating and deallocating memory blocks for the different
versions of the balancing information, we keep old memory blocks and
reuse them when it is safe to do so. In order to determine when it is
safe to reuse a block we use the thread progress functionality,
ensuring that no threads have any references to the memory block when
we reuse it.

#### Be Less Aggressive ####

We implemented a test version using lock free run queues. This
implementation did, however, not perform as well as the version using
one lock per run queue. We did not investigate this thoroughly enough
to say why. Since the locked version performed better we kept it, at
least for now. The lock free version, however, forced us to use other
solutions, and some of them we kept.

Previously, when a process that was in a run queue got suspended, we
removed it from the queue straight away. This involved locking the
process, locking the run queue, and then unlinking the process from
the doubly linked list implementing the queue. Removing a process from
a lock free queue gets really complicated. Instead of removing it from
the queue, we just leave it in the queue and mark it as
suspended. When it is later selected for execution we check whether
the process is suspended, and if so simply drop it. During its time in
the queue it might also get resumed again; if so, it is executed when
it gets selected for execution.

By keeping this part when reverting to a locked implementation, we
could remove a pointer field from each process structure, and avoid
unnecessary operations on the process and the queue which might cause
contention.

### Combined Modifications ###

By combining the modifications of the process state management and the
run queue management, we can do large parts of the work involved in
managing processes with regard to scheduling and migration without
holding any locks at all. In these situations we previously had to
hold multiple locks. This of course caused a lot of rewrites across
large parts of the runtime system, but the rewrite both simplified
code and eliminated locking in a number of places. The major benefit
is, of course, reduced contention.

### A Benchmark Result ###

When running the chameneosredux benchmark, schedulers frequently run
out of work and try to steal work from each other. That is, they
either succeed in migrating processes or at least try to, which is
exactly the scenario we wanted to optimize. With the introduction of
these improvements we got a speedup of 25-35% when running this
benchmark on a relatively new machine with an Intel i7 quad core
processor with hyper-threading, using 8 schedulers.