Port Signals
============

Problems
--------

Erlang ports are conceptually very similar to Erlang processes. Erlang
processes execute Erlang code in the virtual machine, while an Erlang
port executes native code, typically used for communication with the
outside world. For example, when an Erlang process wants to
communicate using TCP over the network, it communicates via an Erlang
port implementing the TCP socket interface in native code. Both Erlang
processes and ports communicate using asynchronous signaling. The
native code executed by an Erlang port is a collection of callback
functions called a driver. Each callback more or less implements the
code of a signal to, or from, the port.

Even though processes and ports have conceptually always been very
similar, the implementations have been very different. Originally,
more or less all port signals were handled synchronously at the time
they occurred. Very early in the development of the SMP support for
the runtime system, we recognized that this was a huge problem for
signals between ports and the outside world, that is, I/O events to
and from the outside world, or I/O signals. This was one of the first
things that had to be rewritten in order to be able to do I/O in
parallel at all. The solution was to implement scheduling of these
signals. I/O signals corresponding to different ports could then be
executed in parallel on different scheduler threads. Signals from
processes to ports were not as big a problem as the I/O signals, and
the implementation of those was left as it was.

Each port is protected by its own lock against simultaneous execution
on multiple threads. Previously, when a process executing on a
scheduler thread sent a port a signal, it locked the port lock and
synchronously executed the code corresponding to the signal. If the
lock was busy, the scheduler thread blocked, waiting until it could
lock the lock.
If multiple processes executing simultaneously on
different scheduler threads sent signals to the same port, the
schedulers suffered from heavy lock contention. Such contention could
also occur between I/O signals for the port executing on one scheduler
thread and a signal from a process to the port executing on another
scheduler thread. Besides the contention issues, we also lose work
that could potentially execute in parallel on different scheduler
threads. This is because the process sending the *asynchronous* signal
is blocked while the code implementing the signal is executed
synchronously.

Solution
--------

In order to prevent multiple schedulers from trying to execute signals
to/from the same port simultaneously, we need to be able to ensure
that all signals to/from a port are executed in sequence on one
scheduler. More or less, the only way to do this is to schedule all
types of signals. The signals corresponding to a port can then be
executed in sequence by one single scheduler thread. If only one
thread tries to execute the port, no contention will appear on the
port lock. Besides getting rid of the contention, processes sending
signals to the port can also continue executing their own Erlang code
on other schedulers at the same time as the signaling code is
executing on another scheduler.

When implementing this, there are a few important properties that
we either need, or want, to preserve:

* Signal ordering guarantee. Signals from process `X` to port `Y`
  *must* be delivered to `Y` in the same order as they were sent
  from `X`.

* Signal latency. Due to the previous synchronous implementation,
  the latency of signals sent from processes to ports has usually
  been very low. During contention the latency has of course
  increased. Users expect the latency of these signals to be low; a
  sudden increase in latency would not be appreciated by our users.

* Compatible flow control.
Ports have for a very long time had the
  possibility of using the busy port functionality when implementing
  flow control. One may argue that this functionality fits very badly
  with the conceptually completely asynchronous signaling, but the
  functionality has been there for ages and is expected to remain
  there. When a port sets itself into a busy state, `command`
  signals should not be delivered, and senders of such signals
  should suspend until the port sets itself into a not-busy state.

### Scheduling of Port Signals ###

A run queue has four queues for processes of different priorities and
one queue for ports. The scheduler thread associated with the run
queue switches evenly between execution of processes and execution of
ports while both processes and ports exist in the queue. This is not
completely true, but the details are not important for this
discussion. A port that is in a run queue also has a queue of tasks to
execute. Each task corresponds to an in- or outgoing signal. When the
port is selected for execution, each task will be executed in
sequence. The run queue lock not only protected the queue of ports,
but also the queues of port tasks.

Since we go from a state where I/O signals are the only port-related
signals scheduled, to a state where potentially all port-related
signals may be scheduled, we may drastically increase the load on the
run queue lock. The amount of scheduled port tasks depends very much
on the Erlang application executing, which we do not control, and we
do not want increased contention on the run queue locks. We therefore
need another approach to protecting the port task queue.

#### Task Queue ####

We chose a "semi-locked" approach, with one public, locked task queue
and a private, lock-free, queue-like task data structure. This
"semi-locked" approach is similar to how the message boxes of
processes are managed.
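The semi-locked arrangement can be sketched as follows. This is a
minimal, single-threaded illustration with invented names (`pq_enqueue`,
`pq_fetch`, and so on); the real implementation in the runtime system
is considerably more involved.

```c
/* Sketch of a "semi-locked" task queue: producers enqueue into a
   public queue under a port-specific mutex, while the executing port
   consumes from a private list that needs no lock. When the private
   list runs dry, the whole public queue is moved over in one short
   critical section. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct task { struct task *next; int id; };

struct port_queues {
    pthread_mutex_t lock;     /* protects the public queue only       */
    struct task *pub_first, *pub_last;
    struct task *priv_first;  /* touched by the executing port only   */
};

void pq_init(struct port_queues *q) {
    pthread_mutex_init(&q->lock, NULL);
    q->pub_first = q->pub_last = q->priv_first = NULL;
}

/* Producer side: always takes the port-specific lock. */
void pq_enqueue(struct port_queues *q, struct task *t) {
    t->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->pub_last) q->pub_last->next = t; else q->pub_first = t;
    q->pub_last = t;
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: lock-free until the private list is empty; then the
   public queue is moved into the private data structure in one go. */
struct task *pq_fetch(struct port_queues *q) {
    struct task *t;
    if (!q->priv_first) {
        pthread_mutex_lock(&q->lock);
        q->priv_first = q->pub_first;
        q->pub_first = q->pub_last = NULL;
        pthread_mutex_unlock(&q->lock);
    }
    t = q->priv_first;
    if (t) q->priv_first = t->next;
    return t;
}
```

Note how the consumer only contends for the lock once per batch of
tasks, rather than once per task, which is what keeps the executing
port from fighting with the producers.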
The lock is port specific and only used for the protection of
port tasks, so the run queue lock is now needed in more or less the
same way for ports as for processes. This ensures that we won't see
increased lock contention on run queue locks due to this rewrite of
the port functionality.

When an executing port runs out of work to execute in the private task
data structure, it moves the public task queue into the private task
data structure while holding the lock. Once tasks have been moved to
the private data structure, no lock protects them. This way the port
can continue working on tasks in the private data structure without
having to fight for the lock.

I/O signals may, however, be aborted. This could be solved by letting
the port-specific scheduling lock also protect the private task data
structure, but then the port would very frequently have to fight with
others enqueueing new tasks. In order to handle this while keeping the
private task data structure lock free, we use a "non-aggressive"
approach similar to the one we use when handling processes that get
suspended while in the run queue. Instead of removing the aborted port
task, we just mark it as aborted using an atomic memory
operation. When a task is selected for execution, we first verify that
it has not been aborted. If it has been aborted, we just drop the
task.

A task that can be aborted is referenced via another data structure
from other parts of the system, so that a thread that needs to abort
the task can reach it. In order to be sure that we can safely
deallocate a task that is no longer used, we first clear this
reference and then use the thread progress functionality to make sure
that no references to the task can still exist. Unfortunately,
unmanaged threads might also abort tasks. This is very infrequent, but
it might occur.
This could be handled
locally for each port, but it would require extra information in each
port structure that would very infrequently be used. Instead of
implementing this in each port, we implemented general functionality
that can be used from unmanaged threads to delay thread progress.

The private "queue-like" task data structure could have been an
ordinary queue if it wasn't for the busy port functionality. When the
port has flagged itself as busy, `command` signals are not allowed to
be delivered and need to be blocked. Other signals sent from the same
sender following a `command` signal that has been blocked also have to
be blocked; otherwise, we would violate the ordering guarantee. At the
same time, other signals that have no dependencies on blocked
`command` signals are expected to be delivered.

The above requirements make the private task data structure rather
complex. It has a queue of unprocessed tasks and a busy queue. The
busy queue contains blocked tasks corresponding to `command` signals,
and tasks with dependencies on such tasks. The busy queue is
accompanied by a table of blocked tasks, keyed on sender, with a
reference to the last task in the busy queue from each specific
sender. This is because we need to check for dependencies when new
tasks are processed in the queue of unprocessed tasks. When a newly
processed task needs to be blocked, it isn't enqueued at the end of
the busy queue, but instead directly after the last task from the same
sender. This makes it easy to detect when we have tasks that no longer
have any dependencies on tasks corresponding to `command` signals, and
that should therefore be moved out of the busy queue. When the port
executes, it switches between processing tasks from the busy queue and
processing tasks directly from the unprocessed queue, based on its
busy state.
When processing directly from the unprocessed queue, it
might, of course, have to move a task into the busy queue instead of
executing it.

#### Busy Port Queue ####

Since it is the port itself that decides when it is time to enter a
busy state, it needs to be executing in order to enter the busy
state. As a result of `command` signals being scheduled, we may get
into a situation where the port gets flooded by a huge amount of
`command` signals before it even gets a chance to set itself into a
busy state, since it has not been scheduled for execution yet. That
is, under these circumstances the busy port functionality loses the
flow control properties it was intended to provide.

In order to solve this, we introduced a new busy feature, namely the
"busy port queue". The port has a limit on the amount of `command`
data that is allowed to be enqueued in the task queue. When this limit
is reached, the port will automatically enter a busy port queue
state. When in this state, senders of `command` signals will be
suspended, but `command` signals will still be delivered to the port
unless it is also in a busy port state. This limit is known as the
high limit.

There is also a low limit. When the amount of queued `command` data
falls below this limit and the port is in a busy port queue state, the
busy port queue state is automatically disabled. The low limit should
typically be significantly lower than the high limit in order to
prevent frequent oscillation around the busy port queue state.

By introducing this new busy state we can still provide the flow
control. Old drivers do not even have to be changed. The limits can,
however, be configured and even disabled by the port. By default the
high limit is 8 KB and the low limit is 4 KB.
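The high/low watermark hysteresis can be sketched like this. The
structure and function names are invented for illustration; only the
default limits (8 KB high, 4 KB low) come from the text.

```c
/* Sketch of the busy-port-queue hysteresis: enter the state when
   queued command data reaches the high limit, leave it only when the
   data falls below the (lower) low limit. */
#include <assert.h>
#include <stddef.h>

#define HIGH_LIMIT (8 * 1024)   /* default high limit: 8 KB */
#define LOW_LIMIT  (4 * 1024)   /* default low limit:  4 KB */

struct port_q {
    size_t cmd_data;   /* bytes of queued command data */
    int busy_q;        /* busy-port-queue state flag    */
};

void enqueue_cmd(struct port_q *p, size_t bytes) {
    p->cmd_data += bytes;
    if (p->cmd_data >= HIGH_LIMIT)
        p->busy_q = 1;             /* suspend further senders  */
}

void dequeue_cmd(struct port_q *p, size_t bytes) {
    p->cmd_data -= bytes;
    if (p->busy_q && p->cmd_data < LOW_LIMIT)
        p->busy_q = 0;             /* resume suspended senders */
}
```

The gap between the two limits is what prevents the oscillation
mentioned above: a port hovering around a single threshold would
otherwise flip in and out of the busy port queue state on every
signal.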
### Preparation of Signal Send ###

Previously, all operations sending signals to ports began by acquiring
the port lock, then performed preparations for sending the signal, and
then finally sent the signal. The preparations typically included
inspecting the state of the port and preparing the data to pass along
with the signal. The preparation of data is frequently quite time
consuming and does not really depend on the port. That is, we would
like to do this without holding the port lock.

In order to improve this, state information was reorganized in the
port structure so that we can access it using atomic memory
operations. This, together with the new port table implementation,
enabled us to look up the port and inspect its state before acquiring
the port lock, which in turn made it possible to perform preparations
of signal data before acquiring the port lock.

### Preserving Low Latency ###

If we disregard the contended cases, we will inevitably get a higher
latency when scheduling signals for execution at a later time than by
executing the signal immediately. In order to preserve the low
latency, we now first check whether this is a contended case or
not. If it is, we schedule the signal for later execution; otherwise,
we execute the signal immediately. It is a contended case if other
signals are already scheduled on the port, or if we fail to acquire
the port lock. That is, we will not block waiting for the lock.

Doing it this way, we preserve the low latency at the expense of
losing potential parallel execution of the signal and other code in
the process sending the signal. This default behaviour can, however,
be changed on a per-port basis or system wide, forcing scheduling of
all signals from processes to ports that are not part of a synchronous
communication, that is, an unconditional request/response pair of
asynchronous signals.
In this case there is no potential for parallelism,
and thus no point in forcing scheduling of the request signal.

The immediate execution of signals may also cause a scheduler that is
about to execute scheduled tasks to block waiting for the port
lock. This is, however, more or less the only scenario where a
scheduler needs to wait for the port lock. The maximum time it has to
wait is the time it takes to execute one signal, since we always
schedule signals when contention occurs.

### Signal Operations ###

Besides implementing the functionality enabling the scheduling,
preparation of signal data without the port lock, etc., each operation
sending signals to ports had to be quite extensively rewritten. This
was done in order to move all sub-operations that can be performed
without the lock to a place before we have acquired the lock, and also
because signals are now sometimes executed immediately and sometimes
scheduled for execution at a later time, which puts different
requirements on the data to pass along with the signal.

### Some Benchmark Results ###

When running some simple benchmarks where contention only occurs due
to I/O signals contending with signals from one single process, we got
a speedup of 5-15%. When multiple processes send signals to one single
port, the improvements can be much larger, but the scenario with one
process contending with I/O is the most common one.

The benchmarks were run on a relatively new machine with an Intel i7
quad core processor with hyper-threading, using 8 schedulers.
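The contended-case check described under "Preserving Low Latency"
above can be sketched as follows. All names are invented for
illustration, and the real logic is spread across the port signal
implementations; the key ingredients are a trylock (never a blocking
lock) and a check for already-scheduled signals.

```c
/* Sketch of latency-preserving dispatch: execute the signal
   immediately when the port lock can be taken without blocking and no
   other signals are queued on the port; otherwise schedule it for
   later execution. */
#include <assert.h>
#include <pthread.h>

enum how { EXECUTED_NOW, SCHEDULED };

struct port_state {
    pthread_mutex_t lock;
    int queued_signals;     /* signals already scheduled on the port */
};

static int executed = 0;
static void do_exec(struct port_state *p) { (void)p; executed++; }

enum how deliver_signal(struct port_state *p,
                        void (*exec)(struct port_state *)) {
    /* Contended case: tasks already queued, or the trylock fails.
       Note that we never block waiting for the lock. */
    if (p->queued_signals > 0 || pthread_mutex_trylock(&p->lock) != 0) {
        p->queued_signals++;    /* stand-in for real task scheduling */
        return SCHEDULED;
    }
    exec(p);                    /* uncontended: run it right away */
    pthread_mutex_unlock(&p->lock);
    return EXECUTED_NOW;
}
```

In the uncontended case the signal runs at once, preserving the old
low latency; in the contended case the sender is free to continue
executing its own code while the port's scheduler thread drains the
queued tasks.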