Port Signals
============

Problems
--------

Erlang ports are conceptually very similar to Erlang processes. Erlang
processes execute Erlang code in the virtual machine, while an Erlang
port executes native code, typically used for communication with the
outside world. For example, when an Erlang process wants to
communicate using TCP over the network, it communicates via an Erlang
port implementing the TCP socket interface in native code. Both Erlang
processes and ports communicate using asynchronous signaling. The
native code executed by an Erlang port is a collection of callback
functions, called a driver. Each callback more or less implements the
code of a signal to, or from, the port.
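
To make the callback structure concrete, the sketch below shows a
minimal driver against the `erl_driver.h` API. The `echo_drv` name and
its behaviour are made up for illustration; each callback corresponds
to a signal to, or from, the port.

```c
#include "erl_driver.h"

/* Hypothetical minimal driver: each callback implements the code of a
 * signal to, or from, the port. */

static ErlDrvData echo_start(ErlDrvPort port, char *command)
{
    return (ErlDrvData) port;               /* "open" signal */
}

static void echo_stop(ErlDrvData state)     /* "close" signal */
{
}

static void echo_output(ErlDrvData state, char *buf, ErlDrvSizeT len)
{
    /* "command" signal from a process: echo the data back, which in
     * turn is a signal from the port to its connected process. */
    driver_output((ErlDrvPort) state, buf, len);
}

static ErlDrvEntry echo_driver_entry = {
    .start           = echo_start,
    .stop            = echo_stop,
    .output          = echo_output,
    .driver_name     = "echo_drv",
    .extended_marker = ERL_DRV_EXTENDED_MARKER,
    .major_version   = ERL_DRV_EXTENDED_MAJOR_VERSION,
    .minor_version   = ERL_DRV_EXTENDED_MINOR_VERSION,
};

DRIVER_INIT(echo_drv) { return &echo_driver_entry; }
```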

Even though processes and ports have conceptually always been very
similar, the implementations have been very different. Originally,
more or less all port signals were handled synchronously at the time
they occurred. Very early in the development of the SMP support for
the runtime system, we recognized that this was a huge problem for
signals between ports and the outside world, that is, I/O events to
and from the outside world, or I/O signals. This was one of the first
things that had to be rewritten in order to be able to do I/O in
parallel at all. The solution was to implement scheduling of these
signals. I/O signals corresponding to different ports could then be
executed in parallel on different scheduler threads. Signals from
processes to ports were not as big a problem as the I/O signals, and
the implementation of those was left as it was.

Each port is protected by its own lock against simultaneous execution
on multiple threads. Previously, when a process executing on a
scheduler thread sent a port a signal, it locked the port lock and
synchronously executed the code corresponding to the signal. If the
lock was busy, the scheduler thread blocked, waiting until it could
lock the lock. If multiple processes executing simultaneously on
different scheduler threads sent signals to the same port, the
schedulers suffered from heavy lock contention. Such contention could
also occur between I/O signals for the port executing on one scheduler
thread, and a signal from a process to the port executing on another
scheduler thread. Besides the contention issues, we also lose
potential work to execute in parallel on different scheduler threads,
since the process sending the *asynchronous* signal is blocked while
the code implementing the signal is executed synchronously.

Solution
--------

In order to prevent multiple schedulers from trying to execute signals
to/from the same port simultaneously, we need to be able to ensure
that all signals to/from a port are executed in sequence on one
scheduler. More or less, the only way to do this is to schedule all
types of signals. The signals corresponding to a port can then be
executed in sequence by one single scheduler thread. If only one
thread tries to execute the port, no contention will appear on the
port lock. Besides getting rid of the contention, processes sending
signals to the port can also continue executing their own Erlang code
on other schedulers while the signaling code is executing on another
scheduler.

When implementing this, there are a couple of important properties
that we either need, or want, to preserve:

*   Signal ordering guarantee. Signals from process `X` to port `Y`
    *must* be delivered to `Y` in the same order as they were sent
    from `X`.

*   Signal latency. Due to the previous synchronous implementation,
    the latency of signals sent from processes to ports has usually
    been very low. During contention the latency has of course
    increased. Users expect the latency of these signals to be low;
    a sudden increase in latency would not be appreciated by our
    users.

*   Compatible flow control. Ports have for a very long time had the
    possibility to use the busy port functionality when implementing
    flow control. One may argue that this functionality fits very
    badly with the conceptually completely asynchronous signaling,
    but the functionality has been there for ages and is expected to
    remain. When a port sets itself into a busy state, `command`
    signals should not be delivered, and senders of such signals
    should be suspended until the port sets itself into a not busy
    state.

### Scheduling of Port Signals ###

A run queue has four queues for processes of different priority and
one queue for ports. The scheduler thread associated with the run
queue switches evenly between execution of processes and execution of
ports, as long as both processes and ports exist in the queue. This is
not the complete picture, but the details are not important for this
discussion. A port that is in a run queue also has a queue of tasks to
execute. Each task corresponds to an in- or outgoing signal. When the
port is selected for execution, each task will be executed in
sequence. The run queue lock not only protected the queues of ports,
but also the queues of port tasks.
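
A simplified sketch of this layout is shown below. The type and field
names are illustrative only; the actual ERTS structures differ.

```c
#include <pthread.h>

/* Illustrative layout only; not the actual ERTS definitions. */

typedef struct port_task {
    struct port_task *next;     /* one task per in- or outgoing signal */
    /* signal-specific data ... */
} PortTask;

typedef struct port {
    struct port *next;          /* next port in the run queue */
    PortTask *first_task;       /* this port's queue of tasks */
    PortTask *last_task;
} Port;

typedef struct process_queue {
    struct process *first;
    struct process *last;
} ProcessQueue;

typedef struct run_queue {
    pthread_mutex_t lock;       /* originally protected the port task
                                   queues as well as the queues below */
    ProcessQueue procs[4];      /* max, high, normal, and low priority */
    Port *first_port;           /* ports with tasks to execute */
    Port *last_port;
} RunQueue;
```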

Since we go from a state where I/O signals are the only port related
signals scheduled, to a state where potentially all port related
signals may be scheduled, we may drastically increase the load on the
run queue lock. The amount of scheduled port tasks depends very much
on the Erlang application executing, which we do not control, and we
do not want to get increased contention on the run queue locks. We
therefore needed another approach to protecting the port task queue.

#### Task Queue ####

We chose a "semi locked" approach, with one public locked task queue,
and a private, lock-free, queue-like task data structure. This "semi
locked" approach is similar to how the message boxes of processes are
managed. The lock is port specific and only used for protection of
port tasks, so the run queue lock is now needed in more or less the
same way for ports as for processes. This ensures that we won't see
increased lock contention on run queue locks due to this rewrite of
the port functionality.

When an executing port runs out of work in the private task data
structure, it moves the contents of the public task queue into the
private task data structure while holding the lock. Once tasks have
been moved to the private data structure, no lock protects them. This
way the port can continue working on tasks in the private data
structure without having to fight for the lock.
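
The sketch below illustrates the idea using plain pthreads and
illustrative names, not the actual ERTS code. Producers enqueue under
the port specific task lock, while the executing thread splices the
whole public queue into its private data structure in one short
critical section and then consumes it lock free.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct port_task {
    struct port_task *next;
    /* signal-specific data ... */
} PortTask;

typedef struct {
    pthread_mutex_t lock;           /* protects the public queue only */
    PortTask *pub_first, *pub_last; /* public queue: producers enqueue */
    PortTask *priv_first;           /* private: executing thread only */
} PortTaskQueue;

/* Producer side: enqueue under the port specific task lock. */
void port_task_enqueue(PortTaskQueue *q, PortTask *t)
{
    t->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->pub_last)
        q->pub_last->next = t;
    else
        q->pub_first = t;
    q->pub_last = t;
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: when the private part is empty, splice the whole
 * public queue over in one short critical section.  The moved tasks
 * are then reachable only by the executing thread; no lock is needed
 * while working through them. */
PortTask *port_task_fetch(PortTaskQueue *q)
{
    PortTask *t;
    if (!q->priv_first) {
        pthread_mutex_lock(&q->lock);
        q->priv_first = q->pub_first;
        q->pub_first = q->pub_last = NULL;
        pthread_mutex_unlock(&q->lock);
    }
    t = q->priv_first;
    if (t)
        q->priv_first = t->next;
    return t;
}
```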

I/O signals may, however, be aborted. This could be solved by letting
the port specific scheduling lock also protect the private task data
structure, but then the port would very frequently have to fight with
others enqueueing new tasks. In order to handle this while keeping the
private task data structure lock free, we use a similar "non
aggressive" approach to the one we use when handling processes that
get suspended while in the run queue. Instead of removing the aborted
port task, we just mark it as aborted using an atomic memory
operation. When a task is selected for execution, we first verify that
it has not been aborted. If it has been aborted, we just drop the
task.
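
Sketched with C11 atomics and illustrative names (the ERTS
implementation differs), the abort and the execution race via a single
atomic state word:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative task states; not the actual ERTS representation. */
enum { TASK_SCHEDULED, TASK_ABORTED, TASK_EXECUTING };

typedef struct port_task {
    struct port_task *next;
    atomic_int state;
} PortTask;

/* Abort without touching the queue: just flag the task.  Fails
 * (returns false) if the task has already been claimed for
 * execution. */
bool port_task_abort(PortTask *t)
{
    int expected = TASK_SCHEDULED;
    return atomic_compare_exchange_strong(&t->state, &expected,
                                          TASK_ABORTED);
}

/* The executing thread claims each task before running it; an aborted
 * task fails the claim and is simply dropped. */
bool port_task_claim_for_execution(PortTask *t)
{
    int expected = TASK_SCHEDULED;
    return atomic_compare_exchange_strong(&t->state, &expected,
                                          TASK_EXECUTING);
}
```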

A task that can be aborted is referenced from other parts of the
system via another data structure, so that a thread that needs to
abort the task can reach it. In order to safely deallocate a task that
is no longer used, we first clear this reference and then use the
thread progress functionality to make sure no references to the task
can still exist. Unfortunately, unmanaged threads might also abort
tasks. This is very infrequent, but might occur. It could have been
handled locally for each port, but that would have required extra
information in each port structure which would very infrequently be
used. Instead of implementing this in each port, we implemented
general functionality that can be used from unmanaged threads to delay
thread progress.

The private "queue like" task data structure could have been an
ordinary queue if it were not for the busy port functionality. When
the port has flagged itself as busy, `command` signals are not allowed
to be delivered and need to be blocked. Other signals sent from the
same sender following a `command` signal that has been blocked also
have to be blocked; otherwise, we would violate the ordering
guarantee. At the same time, other signals that have no dependencies
on blocked `command` signals are expected to be delivered.

The above requirements make the private task data structure rather
complex. It has a queue of unprocessed tasks, and a busy queue. The
busy queue contains blocked tasks corresponding to `command` signals,
and tasks with dependencies on such tasks. The busy queue is
accompanied by a table of blocked tasks, keyed on sender, with a
reference to the last task in the busy queue from each specific
sender. We need this table in order to check for dependencies when new
tasks are processed in the queue of unprocessed tasks. When a newly
processed task needs to be blocked, it is not enqueued at the end of
the busy queue, but instead directly after the last task with the same
sender. This makes it easy to detect when tasks no longer have any
dependencies on tasks corresponding to `command` signals and should be
moved out of the busy queue. When the port executes, it switches
between processing tasks from the busy queue and processing tasks
directly from the unprocessed queue, based on its busy state. When
processing directly from the unprocessed queue it might, of course,
have to move a task into the busy queue instead of executing it.
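
The sketch below illustrates the blocking rule and the insertion rule.
A linear list stands in for the sender table, all names are
illustrative rather than the actual ERTS code, and error handling is
omitted.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned long Sender;        /* illustrative sender identity */

typedef struct task {
    struct task *next;
    Sender sender;
    bool is_command;
} Task;

typedef struct blocked_entry {       /* sender -> last blocked task */
    struct blocked_entry *next;
    Sender sender;
    Task *last;
} BlockedEntry;

typedef struct {
    Task *busy_first;                /* blocked `command` tasks and
                                        their dependents */
    BlockedEntry *blocked;           /* in ERTS a hash table; here a
                                        simple list for illustration */
} BusyQueue;

static BlockedEntry *lookup_blocked(BusyQueue *q, Sender s)
{
    for (BlockedEntry *e = q->blocked; e; e = e->next)
        if (e->sender == s)
            return e;
    return NULL;
}

/* A task must be blocked if the port is busy and the task corresponds
 * to a `command` signal, or if an earlier task from the same sender is
 * already blocked (otherwise ordering would be violated). */
bool must_block(BusyQueue *q, Task *t, bool port_is_busy)
{
    return (port_is_busy && t->is_command)
        || lookup_blocked(q, t->sender) != NULL;
}

/* Enqueue directly after the last blocked task from the same sender,
 * keeping tasks whose dependencies are gone easy to find up front. */
void busy_enqueue(BusyQueue *q, Task *t)
{
    BlockedEntry *e = lookup_blocked(q, t->sender);
    if (e) {                         /* insert right after e->last */
        t->next = e->last->next;
        e->last->next = t;
        e->last = t;
    }
    else {                           /* append at end of busy queue */
        Task **pp = &q->busy_first;
        while (*pp)
            pp = &(*pp)->next;
        t->next = NULL;
        *pp = t;
        e = malloc(sizeof *e);
        e->sender = t->sender;
        e->last = t;
        e->next = q->blocked;
        q->blocked = e;
    }
}
```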

#### Busy Port Queue ####

Since it is the port itself that decides when it is time to enter a
busy state, it needs to be executing in order to enter the busy
state. As a result of `command` signals being scheduled, we may get
into a situation where the port gets flooded by a huge amount of
`command` signals before it even gets a chance to set itself into a
busy state, simply because it has not yet been scheduled for
execution. That is, under these circumstances the busy port
functionality loses the flow control properties it was intended to
provide.

In order to solve this, we introduced a new busy feature, namely the
"busy port queue". The port has a limit on the amount of `command`
data that is allowed to be enqueued in the task queue. When this limit
is reached, the port will automatically enter a busy port queue
state. When in this state, senders of `command` signals will be
suspended, but `command` signals will still be delivered to the port
unless it is also in a busy port state. This limit is known as the
high limit.

There is also a low limit. When the amount of queued `command` data
falls below this limit and the port is in a busy port queue state, the
busy port queue state is automatically disabled. The low limit should
typically be significantly lower than the high limit in order to
prevent frequent oscillation around the busy port queue state.

By introducing this new busy state we can still provide the flow
control. Old drivers do not even have to be changed. The limits can,
however, be configured, and even disabled, by the port. By default the
high limit is 8 KB and the low limit is 4 KB.
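
From a driver, the limits can be adjusted using the
`erl_drv_busy_msgq_limits()` driver API function that was introduced
together with this feature. A minimal sketch, assuming it is called
from the driver's start callback:

```c
#include "erl_driver.h"

/* Sketch: raise the busy port queue limits from a driver.  On input,
 * *low and *high give the requested limits; on return they hold the
 * limits actually used by the runtime.  A limit can also be disabled
 * by passing ERL_DRV_BUSY_MSGQ_DISABLED. */
static void raise_msgq_limits(ErlDrvPort port)
{
    ErlDrvSizeT low  = 64*1024;    /* leave busy state below 64 KB */
    ErlDrvSizeT high = 256*1024;   /* enter busy state above 256 KB */
    erl_drv_busy_msgq_limits(port, &low, &high);
}
```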

### Preparation of Signal Send ###

Previously, all operations sending signals to ports began by acquiring
the port lock, then performed preparations for sending the signal, and
then finally sent the signal. The preparations typically included
inspecting the state of the port, and preparing the data to pass along
with the signal. The preparation of data is frequently quite time
consuming and does not really depend on the port, so we would like to
do it without holding the port lock.

In order to improve this, state information was reorganized in the
port structure so that we can access it using atomic memory
operations. This, together with the new port table implementation,
enabled us to look up the port and inspect the state before acquiring
the port lock, which in turn made it possible to prepare the signal
data before acquiring the port lock.
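
Schematically, with illustrative flag names rather than the actual
ERTS state bits:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative state bits; not the actual ERTS flag names. */
#define PORT_EXITING  (1u << 0)
#define PORT_BUSY     (1u << 1)

typedef struct port {
    atomic_uint state;
    /* ... */
} Port;

/* Inspect the port state with a single atomic read, without holding
 * the port lock; signal data can then be prepared lock free before
 * the lock is finally taken (or the signal is scheduled). */
static bool port_accepts_signals(Port *p)
{
    return !(atomic_load(&p->state) & PORT_EXITING);
}
```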

### Preserving Low Latency ###

If we disregard the contended cases, we will inevitably get a higher
latency when scheduling signals for execution at a later time than
when executing them immediately. In order to preserve the low latency,
we now first check whether this is a contended case or not. If it is,
we schedule the signal for later execution; otherwise, we execute the
signal immediately. It is a contended case if other signals are
already scheduled on the port, or if we fail to acquire the port
lock. That is, we will not block waiting for the lock.
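
A sketch of this decision, using pthreads and illustrative names
rather than the actual ERTS code:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct port {
    pthread_mutex_t lock;
    atomic_int nscheduled;   /* tasks already in the port's task queue */
} Port;

/* Execute the signal immediately only in the uncontended case; never
 * block waiting for the port lock. */
bool maybe_execute_immediately(Port *p)
{
    if (atomic_load(&p->nscheduled) > 0)
        return false;        /* signals already scheduled: keep order */
    if (pthread_mutex_trylock(&p->lock) != 0)
        return false;        /* lock busy: schedule instead of waiting */
    /* ... execute the signal code here ... */
    pthread_mutex_unlock(&p->lock);
    return true;
}
```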

Doing it this way, we preserve the low latency at the expense of lost
potential parallel execution of the signal and other code in the
process sending the signal. This default behaviour can, however, be
changed on a per port basis (the `parallelism` option of
`erlang:open_port/2`) or system wide (the `+spp` command line
argument), forcing scheduling of all signals from processes to
ports. The only signals excepted are those that are part of a
synchronous communication, that is, an unconditional request/response
pair of asynchronous signals. In that case there is no potential for
parallelism, and hence no point in forcing scheduling of the request
signal.

The immediate execution of signals may also cause a scheduler that is
about to execute scheduled tasks to block waiting for the port
lock. This is, however, more or less the only scenario where a
scheduler needs to wait for the port lock. The maximum time it has to
wait is the time it takes to execute one signal, since we always
schedule signals when contention occurs.

### Signal Operations ###

Besides implementing the functionality enabling the scheduling,
preparation of signal data without the port lock, etc., each operation
sending signals to ports had to be quite extensively rewritten. All
sub-operations that can be done without the lock had to be moved to
before the lock is acquired, and since signals are now sometimes
executed immediately and sometimes scheduled for execution at a later
time, different requirements are put on the data to pass along with
the signal.

### Some Benchmark Results ###

When running some simple benchmarks where contention occurs only
between I/O signals and signals from one single process, we got a
speedup of 5-15%. When multiple processes send signals to one single
port, the improvement can be much larger, but the scenario with one
process contending with I/O is the most common one.

The benchmarks were run on a relatively new machine with an Intel i7
quad core processor with hyper-threading, using 8 schedulers.