Thread Progress
===============

Problems
--------

### Knowing When Threads Have Completed Accesses to a Data Structure ###

When multiple threads access the same data structure you often need
to know when all threads have completed their accesses, for example,
in order to know when it is safe to deallocate the data structure.
One simple way to accomplish this is to reference count all accesses
to the data structure. The problem with this approach is that the
cache line where the reference counter is located needs to be
communicated between all involved processors. Such communication can
become extremely expensive and will scale poorly if the reference
counter is frequently accessed. That is, we want to keep track of
threads using some approach other than reference counting.

### Knowing That Modifications of Memory Are Consistently Observed ###

Different hardware architectures have different memory models. Some
architectures allow very aggressive reordering of memory accesses,
while other architectures only reorder a few specific cases. Common
to all modern hardware is, however, that some type of reordering will
occur. When locks are used to protect all memory accesses made from
multiple threads, such reorderings will not be visible, since the
locking primitives ensure that the memory accesses are ordered. When
using lock-free algorithms, however, one has to take this reordering
made by the hardware into account.

Hardware memory barriers, or memory fences, are instructions that can
be used to enforce order between memory accesses. Different hardware
architectures provide different memory barriers. Lock-free algorithms
need to use memory barriers in order to ensure that memory accesses
are not reordered in ways that make the algorithm break down. Memory
barriers are also expensive instructions, so you typically want to
minimize their use.

Functionality Used to Address These Problems
--------------------------------------------

The "thread progress" functionality in the Erlang VM is used to
address these problems. The name "thread progress" was chosen since
we want to use it to determine when all threads in a set of threads
have made such progress that two specific events have taken place for
all of them.

The threads in the set that we are interested in are called managed
threads. The managed threads are the only threads that we get any
information about, and these threads *have* to report progress
frequently. Not all threads in the system are able to report progress
frequently. Such threads cannot be allowed in the set of managed
threads and are called unmanaged threads. An example of unmanaged
threads is the threads in the async thread pool. Async threads can be
blocked for very long times and thereby be prevented from reporting
progress frequently. Currently only scheduler threads and a couple of
other threads are managed threads.

### Thread Progress Events ###

Any thread in the system may use the thread progress functionality in
order to determine when the following events have occurred at least
once in all managed threads:

1.  The thread has returned from other code to a known state in the
    thread progress functionality, which is independent of any other
    code.
2.  The thread has executed a full memory barrier.

These events, of course, need to be ordered with respect to other
memory operations. Determining that they have occurred begins with
initiating the thread progress operation. The thread that initiated
the operation then polls for its completion. Both events must occur
at least once *after* the thread progress operation has been
initiated, and at least once *before* the operation has completed, in
each managed thread. This ordering is established using communication
via memory, which makes it possible to draw conclusions about the
memory state after the thread progress operation has completed. Let
us call the progress made from initiation to completion "thread
progress".

Assuming that the thread progress functionality is efficient, a lot
of algorithms can both be simplified and made more efficient compared
to the first approach that comes to mind. A couple of examples
follow.

By being able to determine when the first event above has occurred we
can easily know when all managed threads have completed accesses to a
data structure. This can be determined in the following way. We have
an implementation of some functionality `F` using a data structure
`D`. The reference to `D` is always looked up before `D` is accessed,
and the reference to `D` is always dropped before we leave the code
implementing `F`. If we remove the possibility to look up `D` and
then wait until the first event has occurred in all managed threads,
no managed thread can have any reference to the data structure
`D`. This could, for example, have been achieved by using reference
counting, but in that case the cache line containing the reference
counter would be ping-ponged between all processors accessing `D` at
every access.
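
As a concrete illustration, the pattern can be sketched as follows.
This is a minimal sketch, not actual VM code: `D_t`, `lookup_D()`,
`use()`, `unpublish_D()`, `deallocate_D()` and
`wait_thread_progress()` are all hypothetical helpers, where the last
one initiates a thread progress operation and waits for its
completion using the API described below.

```c
/* Hypothetical declarations for the sketch. */
typedef struct D_struct D_t;
D_t *lookup_D(void);
void use(D_t *d);
void unpublish_D(void);
void wait_thread_progress(void); /* initiate + wait, see below */
void deallocate_D(void);

/* Executed frequently by managed threads implementing F. */
void
F(void)
{
    D_t *d = lookup_D(); /* the reference is always looked up */
    use(d);              /* ... before D is accessed ...      */
    /* ... and always dropped before leaving F(); no
     * reference survives the return from this function. */
}

/* Executed once by the thread removing D. */
void
remove_D(void)
{
    unpublish_D();          /* no new lookups of D can succeed  */
    wait_thread_progress(); /* event 1 has now occurred in all
                             * managed threads; none of them
                             * can hold a reference to D        */
    deallocate_D();         /* safe to deallocate               */
}
```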

By being able to determine when the second event has occurred, it is
quite easy to do complex modifications of memory that need to be seen
consistently by other threads, without having to resort to locking.
By doing the modifications, then issuing a full memory barrier, then
waiting until the second event has occurred in all managed threads,
and then publishing the modifications, we know that all managed
threads reading this memory will get a consistent view of the
modifications. Managed threads reading this memory will not have to
issue any extra memory barriers at all.
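
A sketch of this publish pattern follows. All names are hypothetical;
in particular, the full memory barrier is written as a call to a
placeholder `full_memory_barrier()` rather than any specific VM
macro.

```c
/* Hypothetical declarations for the sketch. */
void make_modifications(void);
void full_memory_barrier(void);
void wait_thread_progress(void);
void publish_modifications(void);

void
modify_and_publish(void)
{
    make_modifications();    /* complex updates to memory that
                              * is not yet reachable by readers */
    full_memory_barrier();   /* order the updates before the
                              * thread progress operation       */
    wait_thread_progress();  /* event 2 (a full memory barrier)
                              * has now occurred in all managed
                              * threads                         */
    publish_modifications(); /* make the memory reachable; all
                              * managed readers now get a
                              * consistent view without issuing
                              * any extra memory barriers       */
}
```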

Implementation of the Thread Progress Functionality
---------------------------------------------------

### Requirement on the Implementation ###

In order to be able to determine when all managed threads have
reached the states that we are interested in, we need to communicate
between all involved threads. We of course want to minimize this
communication.

We also want threads to be able to determine relatively fast when
thread progress has been made. That is, we need some balance between
communication overhead and the time to complete the operation.

### API ###

I will only present the most important functions in the API here.

*   `ErtsThrPrgrVal erts_thr_progress_later(void)` - Initiation of
    the operation. The thread progress value returned can be used
    when testing for completion of the operation.
*   `int erts_thr_progress_has_reached(ErtsThrPrgrVal val)` - Returns
    a non-zero value when we have reached the thread progress value
    passed as argument. That is, when a non-zero value is returned,
    the operation has completed.

When a thread calls `my_val = erts_thr_progress_later()` and waits
for `erts_thr_progress_has_reached(my_val)` to return a non-zero
value, it knows that thread progress has been made.
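
Put together, a minimal polling loop might look like the sketch
below; `handle_other_work()` is a hypothetical placeholder for
whatever useful work the waiting thread can do in the meantime.

```c
#include "erl_thr_progress.h" /* ERTS internal header */

void handle_other_work(void); /* hypothetical */

void
wait_for_thread_progress(void)
{
    /* Initiate the thread progress operation. */
    ErtsThrPrgrVal my_val = erts_thr_progress_later();

    /* Poll for completion, doing other useful work meanwhile. */
    while (!erts_thr_progress_has_reached(my_val))
        handle_other_work();

    /* Thread progress has now been made: both events have
     * occurred at least once in every managed thread since the
     * call to erts_thr_progress_later(). */
}
```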

While waiting for `erts_thr_progress_has_reached()` to return a
non-zero value we typically do not want to block, but instead want to
continue working on other things. If we run out of other things to
work on, we typically do want to block until we have reached the
thread progress value that we are waiting for. In order to be able to
do this, we provide functionality for waking up a thread when a
certain thread progress value has been reached:

*   `void erts_thr_progress_wakeup(ErtsSchedulerData *esdp,
    ErtsThrPrgrVal val)` - Request wake up. The calling thread will
    be woken when thread progress has reached `val`.
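
A sketch of the combined poll-or-sleep pattern follows;
`have_other_work()`, `handle_other_work()` and `block_until_woken()`
are hypothetical stand-ins for the thread's normal work and sleep
mechanisms.

```c
#include "erl_thr_progress.h" /* ERTS internal header */

int  have_other_work(void);   /* hypothetical */
void handle_other_work(void); /* hypothetical */
void block_until_woken(void); /* hypothetical sleep mechanism */

void
wait_or_sleep(ErtsSchedulerData *esdp, ErtsThrPrgrVal val)
{
    while (!erts_thr_progress_has_reached(val)) {
        if (have_other_work())
            handle_other_work();
        else {
            /* Out of other work: request a wakeup when thread
             * progress reaches val, then block. */
            erts_thr_progress_wakeup(esdp, val);
            block_until_woken();
        }
    }
}
```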

Managed threads frequently need to update their thread progress by
calling the following functions:

*   `int erts_thr_progress_update(ErtsSchedulerData *esdp)` - Update
    thread progress. If a non-zero value is returned,
    `erts_thr_progress_leader_update()` has to be called without any
    locks held.
*   `int erts_thr_progress_leader_update(ErtsSchedulerData *esdp)` -
    Leader update thread progress.
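
The expected usage is a frequent call in each managed thread's main
loop, as in this sketch (`do_scheduled_work()` is a hypothetical
placeholder):

```c
#include "erl_thr_progress.h" /* ERTS internal header */

void do_scheduled_work(ErtsSchedulerData *esdp); /* hypothetical */

void
managed_thread_loop(ErtsSchedulerData *esdp)
{
    for (;;) {
        do_scheduled_work(esdp);

        /* Report progress frequently. A non-zero return means
         * that this thread currently has the leader
         * responsibility and must do the leader update without
         * holding any locks. */
        if (erts_thr_progress_update(esdp))
            erts_thr_progress_leader_update(esdp);
    }
}
```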

Unmanaged threads can delay thread progress from being made:

*   `ErtsThrPrgrDelayHandle erts_thr_progress_unmanaged_delay(void)` -
    Delay thread progress.
*   `void erts_thr_progress_unmanaged_continue(ErtsThrPrgrDelayHandle
    handle)` - Let thread progress continue.
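
Usage is bracket-style around the unmanaged thread's access, as in
this sketch (`access_shared_state()` is hypothetical):

```c
#include "erl_thr_progress.h" /* ERTS internal header */

void access_shared_state(void); /* hypothetical */

void
unmanaged_thread_access(void)
{
    /* Delay thread progress while accessing state that is
     * otherwise protected by the thread progress functionality. */
    ErtsThrPrgrDelayHandle handle =
        erts_thr_progress_unmanaged_delay();

    access_shared_state();

    /* Let thread progress continue again. */
    erts_thr_progress_unmanaged_continue(handle);
}
```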

Scheduler threads can schedule an operation to be executed by the
scheduler itself when thread progress has been made:

*   `void erts_schedule_thr_prgr_later_op(void (*funcp)(void *),
    void *argp, ErtsThrPrgrLaterOp *memp)` - Schedule a call to
    `funcp`. The call `(*funcp)(argp)` will be executed when thread
    progress has been made since the call to
    `erts_schedule_thr_prgr_later_op()` was made.
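
A typical use is deferred deallocation, sketched below. Embedding the
`ErtsThrPrgrLaterOp` in the structure itself avoids any extra
allocation; the `my_struct` type and its deallocation function are
hypothetical.

```c
#include "erl_thr_progress.h" /* ERTS internal header */

typedef struct {
    ErtsThrPrgrLaterOp lop; /* storage for the later op */
    /* ... the actual data ... */
} my_struct;                /* hypothetical */

void my_struct_free(my_struct *p); /* hypothetical */

static void
deallocate_my_struct(void *vp)
{
    /* Called by the scheduler once thread progress has been
     * made; no managed thread can still reference *vp. */
    my_struct_free((my_struct *) vp);
}

static void
schedule_deallocation(my_struct *p)
{
    erts_schedule_thr_prgr_later_op(deallocate_my_struct,
                                    (void *) p,
                                    &p->lop);
}
```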

### Implementation ###

In order to determine when the events have happened, we use a global
counter that is incremented when all managed threads have called
`erts_thr_progress_update()` (or `erts_thr_progress_leader_update()`).
This could naively be implemented using a "thread confirmed" counter.
That would, however, cause an explosion of communication where all
involved processors would need to communicate with each other at each
update.

Instead of confirming at a global location, each thread confirms that
it accepts an increment of the global counter in its own cache
line. These confirmation cache lines are located in sequence in an
array, and each confirmation cache line will only ever be written by
one and only one thread. One of the managed threads always has the
leader responsibility. This responsibility may jump between threads,
but as long as there is some activity in the system, one of them will
always have the leader responsibility. The thread with the leader
responsibility will call `erts_thr_progress_leader_update()`, which
will check that all other threads have confirmed an increment of the
global counter before doing the increment of the global counter. The
leader thread is the only thread reading the confirmation cache
lines.
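
The layout can be sketched as follows; the names, the cache line
size, and the bound on the number of managed threads are all
assumptions made for the sake of illustration.

```c
#include "erl_thr_progress.h" /* ERTS internal header */

#define ASSUMED_CACHE_LINE_SIZE 64 /* assumption */
#define MAX_MANAGED_THREADS     64 /* assumption */

/* One confirmation slot per managed thread, padded so that each
 * slot occupies a cache line of its own. Slot i is written only
 * by managed thread i and read only by the leader. */
typedef union {
    ErtsThrPrgrVal confirmed;
    char align[ASSUMED_CACHE_LINE_SIZE];
} ConfirmSlot;

static ConfirmSlot confirm[MAX_MANAGED_THREADS];
```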

Doing it this way, we get a communication pattern where information
goes from the leader thread out to all other managed threads and then
back from the other threads to the leader thread. This is because
only the leader thread writes to the global counter while all other
threads only read it, and because each confirmation cache line is
written by one specific thread and read only by the leader
thread. When the managed threads are distributed over different
processors, the communication between processors will be a reflection
of this communication pattern between threads.

The value returned from `erts_thr_progress_later()` equals the value
most recently confirmed by this thread, plus two. The global counter
may be the latest confirmed value, or the latest confirmed value
minus one. In order to be certain that all other managed threads
actually will call `erts_thr_progress_update()` at least once before
we reach the value returned from `erts_thr_progress_later()`, the
global counter plus one is not enough. This is because all other
threads may already have confirmed the current global value plus one
at the time when we call `erts_thr_progress_later()`. They are,
however, guaranteed not to have confirmed the global value plus two
at this time.
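
As a worked example, suppose that this thread's latest confirmed
value is 10; the global counter is then 10 or 9, and
`erts_thr_progress_later()` returns 12. Another managed thread may
already have confirmed 11 before our call, so the counter reaching 11
would not prove that it updated after our call. No thread can have
confirmed 12 yet, however, so before the counter reaches 12 every
managed thread must have called `erts_thr_progress_update()` at least
once after our call.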

The implementation described above more or less minimizes the
communication needed before we can increment the global counter. The
amount of communication in the system due to the thread progress
functionality does, however, also depend on the frequency with which
managed threads call `erts_thr_progress_update()`. Today each
scheduler thread calls `erts_thr_progress_update()` more or less each
time an Erlang process is scheduled out. One way of further reducing
communication due to the thread progress functionality is to only
call `erts_thr_progress_update()` every second or third time an
Erlang process is scheduled out, or even less frequently than
that. However, by updating thread progress less frequently, all
operations depending on the thread progress functionality will also
take a longer time.

#### Delay of Thread Progress by Unmanaged Threads ####

In order to implement delay of thread progress from unmanaged
threads, we use two reference counters, one being `current` and one
being `waiting`. When an unmanaged thread wants to delay thread
progress, it increments `current` and gets a handle back to the
reference counter it incremented. When it later wants to enable
continuation of thread progress, it uses the handle to decrement the
reference counter it previously incremented.

When the leader thread is about to increment the global thread
progress counter, it verifies that the `waiting` counter is zero
before doing so. If it is not zero, the leader is not allowed to
increment the global counter and needs to wait before it can do
this. When it is zero, it swaps the `waiting` and `current` counters
before incrementing the global counter. From now on the new `waiting`
counter will decrease, so that it eventually reaches zero, making it
possible to increment the global counter the next time. If we only
used one reference counter, it could potentially be held above zero
forever by different unmanaged threads.
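
The two-counter scheme can be sketched as below. This is an
illustrative model using C11 atomics, not the VM's actual internals;
all names are hypothetical and memory ordering details are omitted.

```c
#include <stdatomic.h>

static atomic_int delay_counters[2]; /* "current" and "waiting"  */
static int current_ix = 0;           /* index of "current"; only
                                      * changed by the leader    */

/* Unmanaged thread: delay thread progress. The returned index
 * serves as the delay handle. */
int
unmanaged_delay(void)
{
    int ix = current_ix;
    atomic_fetch_add(&delay_counters[ix], 1);
    return ix;
}

/* Unmanaged thread: let thread progress continue. */
void
unmanaged_continue(int handle)
{
    atomic_fetch_sub(&delay_counters[handle], 1);
}

/* Leader: called before incrementing the global counter. */
int
leader_may_increment(void)
{
    int waiting_ix = current_ix ^ 1;
    if (atomic_load(&delay_counters[waiting_ix]) != 0)
        return 0;            /* delayed; wait and retry          */
    current_ix = waiting_ix; /* swap "current" and "waiting";
                              * the old "current" now only
                              * decreases toward zero            */
    return 1;                /* safe to increment the counter    */
}
```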

When an unmanaged thread increments the `current` counter, it will
not prevent the next increment of the global counter, but instead the
increment after that. This is sufficient since the global counter
needs to be incremented two times before thread progress has been
made. It is also desirable not to prevent the first increment, since
this increases the likelihood that the delay is withdrawn before any
increment of the global counter is actually delayed. That is, the
operation will cause as little disruption as possible.

However, this feature of delaying thread progress from unmanaged
threads should preferably be used as little as possible, since heavy
use of it will cause contention on the reference counter cache
lines. The functionality is, however, very useful in code which
normally only executes in managed threads, but which under some
infrequent circumstances may be executed in other threads.

#### Overhead ####

The overhead caused by the thread progress functionality is more or
less fixed for a given number of schedulers, regardless of how much
the functionality is used. Already today quite a lot of functionality
uses it, and we plan to use it even more. When rewriting old
implementations of ERTS internal functionality to use the thread
progress functionality, the rewrite should remove communication
present in the old implementation; otherwise there is simply no point
in rewriting it. Since the thread progress overhead is more or less
fixed, such a rewrite will cause a reduction of the total
communication in the system.

##### An Example #####

The main structure of an ETS table was originally managed using
reference counting. A long time ago we replaced this strategy, since
the reference counter caused contention on each access of the
table. The solution used was to schedule "confirm deletion" jobs on
each scheduler in order to know when it was safe to deallocate the
table structure of a removed table. These confirm deletion jobs
needed to be allocated. That is, we had to allocate and deallocate as
many blocks as there are schedulers in order to deallocate one
block. This was of course quite an expensive operation, but we only
needed to do it once when removing a table. It was more important to
get rid of the contention on the reference counter, which was present
on every operation on the table.

When the thread progress functionality had been introduced, we could
remove the code implementing the "confirm deletion" jobs and instead
just schedule a thread progress later operation that deallocates the
structure. Besides simplifying the code a lot, we got an increase of
more than 10% in the number of transactions per second handled on a
mnesia tpcb benchmark executing on a quad core machine.