Thread Progress
===============

Problems
--------

### Knowing When Threads Have Completed Accesses to a Data Structure ###

When multiple threads access the same data structure, you often need
to know when all threads have completed their accesses, for example,
in order to know when it is safe to deallocate the data structure. One
simple way to accomplish this is to reference count all accesses to
the data structure. The problem with this approach is that the cache
line where the reference counter is located needs to be communicated
between all involved processors. Such communication can become
extremely expensive and will scale poorly if the reference counter is
frequently accessed. That is, we want to use some approach other than
reference counting for keeping track of threads.

### Knowing That Modifications of Memory Are Consistently Observed ###

Different hardware architectures have different memory models. Some
architectures allow very aggressive reordering of memory accesses,
while other architectures only reorder in a few specific cases. Common
to all modern hardware is, however, that some type of reordering will
occur. When locks are used to protect all memory accesses made from
multiple threads, such reorderings will not be visible; the locking
primitives ensure that the memory accesses are ordered. When using
lock-free algorithms, one does, however, have to take the reordering
done by the hardware into account.

Hardware memory barriers, or memory fences, are instructions that can
be used to enforce order between memory accesses. Different hardware
architectures provide different memory barriers. Lock-free algorithms
need to use memory barriers to ensure that memory accesses are not
reordered in ways that make the algorithm break down. Memory barriers
are also expensive instructions, so you typically want to minimize
their use.

Functionality Used to Address These Problems
--------------------------------------------

The "thread progress" functionality in the Erlang VM is used to
address these problems. The name "thread progress" was chosen since we
want to use it to determine when all threads in a set of threads have
made such progress that two specific events have taken place for all
of them.

The set of threads that we are interested in we call managed
threads. The managed threads are the only threads that we get any
information about, and these threads *have* to report progress
frequently. Not all threads in the system are able to do this. Such
threads cannot be allowed in the set of managed threads and are called
unmanaged threads. An example of unmanaged threads is the threads in
the async thread pool: async threads can be blocked for very long
times and are thereby prevented from reporting progress frequently.
Currently only scheduler threads and a couple of other threads are
managed threads.

### Thread Progress Events ###

Any thread in the system may use the thread progress functionality in
order to determine when the following events have occurred at least
once in all managed threads:

1.  The thread has returned from other code to a known state in the
    thread progress functionality, which is independent of any other
    code.
2.  The thread has executed a full memory barrier.

These events, of course, need to occur ordered with respect to other
memory operations. The operation of determining this begins by
initiating the thread progress operation. The thread that initiated
the operation then polls for its completion. Both of the events above
must occur at least once *after* the thread progress operation has
been initiated, and at least once *before* the operation has
completed, in each managed thread. This is ordered using communication
via memory, which makes it possible to draw conclusions about the
memory state after the thread progress operation has completed. Let us
call the progress made from initiation to completion "thread
progress".

Assuming that the thread progress functionality is efficient, a lot of
algorithms can be both simplified and made more efficient compared to
the first approach that comes to mind. A couple of examples follow.

By being able to determine when the first event above has occurred, we
can easily know when all managed threads have completed their accesses
to a data structure. This can be determined in the following way. We
have an implementation of some functionality `F` using a data
structure `D`. The reference to `D` is always looked up before `D` is
accessed, and the reference to `D` is always dropped before we leave
the code implementing `F`. If we remove the possibility to look up `D`
and then wait until the first event has occurred in all managed
threads, no managed thread can have any reference to the data
structure `D`. The same thing could, for example, have been achieved
using reference counting, but then the cache line containing the
reference counter would be ping-ponged between all processors
accessing `D` at every access.
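As a concrete illustration of this pattern, consider the following
minimal sketch in C. The `erts_thr_progress_*()` calls are part of the
API presented later in this document; `table_`, `f_operation` and
`remove_table` are hypothetical names invented for the illustration,
and the pointer accesses would use atomic operations in real code.

```c
typedef struct table_ table_;     /* the data structure `D` */

static table_ *the_table;         /* the only way to look up `D`; an
                                   * atomic pointer in real code */

void f_operation(void)            /* the functionality `F` */
{
    table_ *tab = the_table;      /* the reference to `D` is looked
                                   * up before `D` is accessed */
    if (tab) {
        /* ... access the table ... */
    }
    /* the reference is dropped before we leave `F` */
}

void remove_table(void)
{
    ErtsThrPrgrVal val;
    the_table = NULL;                  /* remove the possibility to
                                        * look up `D` */
    val = erts_thr_progress_later();   /* initiate the operation */
    while (!erts_thr_progress_has_reached(val))
        ;  /* do other work or block; see the API section below */
    /* the first event has now occurred in all managed threads, so no
     * managed thread can hold a reference to the table; it is now
     * safe to deallocate it */
}
```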
By being able to determine when the second event has occurred, it is
quite easy to do complex modifications of memory that need to be seen
consistently by other threads, without having to resort to locking. By
doing the modifications, then issuing a full memory barrier, then
waiting until the second event has occurred in all managed threads,
and then publishing the modifications, we know that all managed
threads reading this memory will get a consistent view of the
modifications. Managed threads reading it will not have to issue any
extra memory barriers at all.
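The following hedged sketch shows this publication pattern. The C11
fence is a stand-in for whatever full-barrier primitive the VM uses
internally, and all names are invented for the illustration.

```c
#include <stdatomic.h>

typedef struct { int a; int b; } config;  /* hypothetical data */

static config  staging;   /* written only by the publishing thread */
static config *live;      /* readers follow this pointer */

void publish_config(int a, int b)
{
    ErtsThrPrgrVal val;

    /* 1. Do the complex modifications. */
    staging.a = a;
    staging.b = b;

    /* 2. Issue a full memory barrier (C11 fence as a stand-in for
     *    the VM's own barrier primitive). */
    atomic_thread_fence(memory_order_seq_cst);

    /* 3. Wait until the second event (a full memory barrier) has
     *    occurred in all managed threads. */
    val = erts_thr_progress_later();
    while (!erts_thr_progress_has_reached(val))
        ;  /* do other work or block; see the API section below */

    /* 4. Publish; an atomic store in real code. */
    live = &staging;
}

void read_config(void)    /* executes in a managed thread */
{
    config *c = live;     /* no extra memory barrier needed */
    if (c) {
        /* c->a and c->b are observed consistently */
    }
}
```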
Implementation of the Thread Progress Functionality
----------------------------------------------------

### Requirement on the Implementation ###

In order to be able to determine when all managed threads have reached
the states that we are interested in, we need to communicate between
all involved threads. We of course want to minimize this
communication.

We also want threads to be able to determine relatively fast when
thread progress has been made. That is, we need to strike some balance
between communication overhead and the time it takes to complete the
operation.

### API ###

I will only present the most important functions in the API here.

* `ErtsThrPrgrVal erts_thr_progress_later(void)` - Initiation of the
  operation. The thread progress value returned can be used for
  testing completion of the operation.
* `int erts_thr_progress_has_reached(ErtsThrPrgrVal val)` - Returns
  a non-zero value when we have reached the thread progress value
  passed as argument. That is, when a non-zero value is returned, the
  operation has completed.

When a thread calls `my_val = erts_thr_progress_later()` and waits for
`erts_thr_progress_has_reached(my_val)` to return a non-zero value, it
knows that thread progress has been made.

While waiting for `erts_thr_progress_has_reached()` to return a
non-zero value, we typically do not want to block, but instead want to
continue working on other things. If we run out of other things to
work on, we typically do want to block until we have reached the
thread progress value that we are waiting for. In order to be able to
do this, we provide functionality for waking up a thread when a
certain thread progress value has been reached (see the sketch below):

* `void erts_thr_progress_wakeup(ErtsSchedulerData *esdp,
  ErtsThrPrgrVal val)` - Request wake up. The calling thread will be
  woken when thread progress has reached `val`.
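Putting these functions together, the waiting pattern described above
might look like the following sketch, where `have_other_work()`,
`do_other_work()` and `sleep_until_woken()` are hypothetical
placeholders for the thread's normal work and sleep facilities, which
are not part of this API.

```c
int  have_other_work(void);    /* hypothetical placeholder */
void do_other_work(void);      /* hypothetical placeholder */
void sleep_until_woken(void);  /* hypothetical placeholder */

void wait_for_thread_progress(ErtsSchedulerData *esdp)
{
    ErtsThrPrgrVal val = erts_thr_progress_later();  /* initiate */

    while (!erts_thr_progress_has_reached(val)) {
        if (have_other_work()) {
            do_other_work();   /* keep busy while waiting */
        }
        else {
            /* out of other work; request a wakeup when `val` has
             * been reached and block in the thread's normal sleep
             * facility until then */
            erts_thr_progress_wakeup(esdp, val);
            sleep_until_woken();
        }
    }
    /* thread progress has now been made */
}
```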
Managed threads frequently need to update their thread progress by
calling the following functions:

* `int erts_thr_progress_update(ErtsSchedulerData *esdp)` - Update
  thread progress. If a non-zero value is returned,
  `erts_thr_progress_leader_update()` has to be called without any
  locks held.
* `int erts_thr_progress_leader_update(ErtsSchedulerData *esdp)` -
  Leader update of thread progress.

Unmanaged threads can delay thread progress from being made:

* `ErtsThrPrgrDelayHandle erts_thr_progress_unmanaged_delay(void)` -
  Delay thread progress.
* `void erts_thr_progress_unmanaged_continue(ErtsThrPrgrDelayHandle
  handle)` - Let thread progress continue.

Scheduler threads can schedule an operation to be executed by the
scheduler itself when thread progress has been made:

* `void erts_schedule_thr_prgr_later_op(void (*funcp)(void *), void
  *argp, ErtsThrPrgrLaterOp *memp)` - Schedule a call to `funcp`. The
  call `(*funcp)(argp)` will be executed when thread progress has been
  made since the call to `erts_schedule_thr_prgr_later_op()` was
  made. (A usage sketch is given at the end of this document.)

### Implementation ###

In order to determine when the events have happened, we use a global
counter that is incremented when all managed threads have called
`erts_thr_progress_update()` (or `erts_thr_progress_leader_update()`).
This could naively be implemented using a "thread confirmed" counter.
This would, however, cause an explosion of communication where all
involved processors would need to communicate with each other at each
update.

Instead of confirming at a global location, each thread confirms that
it accepts an increment of the global counter in its own cache
line. These confirmation cache lines are located in sequence in an
array, and each confirmation cache line will be written by one and
only one thread. One of the managed threads always has the leader
responsibility. This responsibility may jump between threads, but as
long as there is some activity in the system, one of them will always
have it. The thread with the leader responsibility will call
`erts_thr_progress_leader_update()`, which checks that all other
threads have confirmed an increment of the global counter before
incrementing it. The leader thread is the only thread reading the
confirmation cache lines.

Doing it this way, we get a communication pattern where information
goes from the leader thread out to all other managed threads and then
back from the other threads to the leader thread. This is because only
the leader thread writes to the global counter while all other threads
only read it, and because each confirmation cache line is written by
one specific thread and read only by the leader thread. When the
managed threads are distributed over different processors, the
communication between processors will be a reflection of this
communication pattern between threads.

The value returned from `erts_thr_progress_later()` equals the latest
value confirmed by this thread plus two. The global value may be the
latest confirmed value, or the latest confirmed value minus one. In
order to be certain that all other managed threads actually will call
`erts_thr_progress_update()` at least once before we reach the value
returned from `erts_thr_progress_later()`, the global counter plus one
is not enough. This is because all other threads may already have
confirmed the current global value plus one at the time when we call
`erts_thr_progress_later()`. They are, however, guaranteed not to have
confirmed the global value plus two at this time.

The implementation described above more or less minimizes the
communication needed before we can increment the global counter. The
amount of communication in the system due to the thread progress
functionality does, however, also depend on the frequency with which
managed threads call `erts_thr_progress_update()`. Today each
scheduler thread calls `erts_thr_progress_update()` more or less every
time an Erlang process is scheduled out. One way of further reducing
communication due to the thread progress functionality is to only call
`erts_thr_progress_update()` every second or third time an Erlang
process is scheduled out, or even less frequently than that. However,
by updating thread progress less frequently, all operations depending
on the thread progress functionality will also take longer.

#### Delay of Thread Progress by Unmanaged Threads ####

In order to implement delay of thread progress from unmanaged threads,
we use two reference counters: one being `current` and one being
`waiting`. When an unmanaged thread wants to delay thread progress, it
increments `current` and gets back a handle to the reference counter
it incremented. When it later wants to enable continuation of thread
progress, it uses the handle to decrement the reference counter it
previously incremented.

When the leader thread is about to increment the global thread
progress counter, it verifies that the `waiting` counter is zero
before doing so. If it is not zero, the leader isn't allowed to
increment the global counter and needs to wait before it can do so.
When it is zero, the leader swaps the `waiting` and `current` counters
before increasing the global counter. From then on, the new `waiting`
counter will only decrease, so that it eventually reaches zero, making
it possible to increment the global counter the next time. If we only
used one reference counter, it could potentially be held above zero
forever by different unmanaged threads.

When an unmanaged thread increments the `current` counter, it will not
prevent the next increment of the global counter, but instead the
increment after that. This is sufficient since the global counter
needs to be incremented twice before thread progress has been made. It
is also desirable not to prevent the first increment, since this
increases the likelihood that the delay is withdrawn before any
increment of the global counter has actually been delayed. That is,
the operation will cause as little disruption as possible.
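The following is a deliberately simplified sketch of this two-counter
scheme, using C11 atomics as a stand-in for the atomics used inside
ERTS. All names are invented for the illustration, and the real
implementation additionally has to handle races between an unmanaged
thread reading the counter index and the leader swapping it.

```c
#include <stdatomic.h>

static atomic_int delay_count[2];  /* the two reference counters */
static atomic_int current_ix;      /* which of them is `current` */

/* Unmanaged thread: delay thread progress; the returned index
 * plays the role of the handle. */
int unmanaged_delay(void)
{
    int ix = atomic_load(&current_ix);
    atomic_fetch_add(&delay_count[ix], 1);
    return ix;
}

/* Unmanaged thread: let thread progress continue. */
void unmanaged_continue(int handle)
{
    atomic_fetch_sub(&delay_count[handle], 1);
}

/* Leader: check whether the global counter may be incremented. */
int leader_may_increment_global_counter(void)
{
    int cur = atomic_load(&current_ix);
    if (atomic_load(&delay_count[!cur]) != 0)
        return 0;  /* `waiting` not yet drained; retry later */
    /* swap `waiting` and `current`; the old `current` counter can
     * now only decrease and will eventually reach zero */
    atomic_store(&current_ix, !cur);
    return 1;      /* caller may now increment the global counter */
}
```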
However, this feature of delaying thread progress from unmanaged
threads should preferably be used as little as possible, since heavy
use of it will cause contention on the reference counter cache
lines. The functionality is, however, very useful in code which
normally only executes in managed threads, but which under some
infrequent circumstances may be executed in other threads.

#### Overhead ####

The overhead caused by the thread progress functionality is more or
less fixed for a given number of schedulers, regardless of how many
times the functionality is used. Already today quite a lot of
functionality uses it, and we plan to use it even more. Rewriting old
implementations of ERTS-internal functionality to use the thread
progress functionality implies removing communication from the old
implementation; otherwise there is simply no point in rewriting the
old implementation to use the thread progress functionality. Since the
thread progress overhead is more or less fixed, such a rewrite will
cause a reduction of the total communication in the system.

##### An Example #####

The main structure of an ETS table was originally managed using
reference counting. We replaced this strategy a long time ago, since
the reference counter caused contention on every access of the
table. The solution was to schedule "confirm deletion" jobs on each
scheduler in order to know when it was safe to deallocate the table
structure of a removed table. These confirm deletion jobs needed to be
allocated. That is, we had to allocate and deallocate as many blocks
as there are schedulers in order to deallocate one block. This was of
course quite an expensive operation, but we only needed to do it once
when removing a table. It was more important to get rid of the
contention on the reference counter, which was present on every
operation on the table.

When the thread progress functionality had been introduced, we could
remove the code implementing the "confirm deletion" jobs and just
schedule a thread progress later operation which deallocates the
structure. Besides simplifying the code a lot, we got an increase of
more than 10% in the number of transactions per second handled on a
mnesia tpcb benchmark executing on a quad core machine.
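A hedged sketch of what such a deferred deallocation can look like
follows. `DbTable`, `free_table` and `schedule_free_table` are
illustrative names rather than the actual ETS code, and the exact
deallocation call is an assumption.

```c
typedef struct {
    ErtsThrPrgrLaterOp later_op;  /* storage for the scheduled
                                   * later operation */
    /* ... the rest of the table structure ... */
} DbTable;

static void free_table(void *vtab)
{
    /* Thread progress has been made since the operation was
     * scheduled: no managed thread can still be accessing the
     * table, so it is safe to deallocate it. */
    erts_free(ERTS_ALC_T_DB_TABLE, vtab);  /* deallocation call is
                                            * an assumption */
}

/* Called by a scheduler thread once the table has been removed and
 * can no longer be looked up. */
static void schedule_free_table(DbTable *tab)
{
    erts_schedule_thr_prgr_later_op(free_table, tab, &tab->later_op);
}
```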