1ccc9971eSMauro Carvalho Chehab=================================================
2ccc9971eSMauro Carvalho ChehabA Tour Through TREE_RCU's Expedited Grace Periods
3ccc9971eSMauro Carvalho Chehab=================================================
4ccc9971eSMauro Carvalho Chehab
5ccc9971eSMauro Carvalho ChehabIntroduction
6ccc9971eSMauro Carvalho Chehab============
7ccc9971eSMauro Carvalho Chehab
8ccc9971eSMauro Carvalho ChehabThis document describes RCU's expedited grace periods.
9ccc9971eSMauro Carvalho ChehabUnlike RCU's normal grace periods, which accept long latencies to attain
10ccc9971eSMauro Carvalho Chehabhigh efficiency and minimal disturbance, expedited grace periods accept
11ccc9971eSMauro Carvalho Chehablower efficiency and significant disturbance to attain shorter latencies.
12ccc9971eSMauro Carvalho Chehab
13ccc9971eSMauro Carvalho ChehabThere are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier
14ccc9971eSMauro Carvalho Chehabthird RCU-bh flavor having been implemented in terms of the other two.
15ccc9971eSMauro Carvalho ChehabEach of the two implementations is covered in its own section.
16ccc9971eSMauro Carvalho Chehab
17ccc9971eSMauro Carvalho ChehabExpedited Grace Period Design
18ccc9971eSMauro Carvalho Chehab=============================
19ccc9971eSMauro Carvalho Chehab
20ccc9971eSMauro Carvalho ChehabThe expedited RCU grace periods cannot be accused of being subtle,
21ccc9971eSMauro Carvalho Chehabgiven that they for all intents and purposes hammer every CPU that
22ccc9971eSMauro Carvalho Chehabhas not yet provided a quiescent state for the current expedited
23ccc9971eSMauro Carvalho Chehabgrace period.
24ccc9971eSMauro Carvalho ChehabThe one saving grace is that the hammer has grown a bit smaller
25ccc9971eSMauro Carvalho Chehabover time:  The old call to ``try_stop_cpus()`` has been
26ccc9971eSMauro Carvalho Chehabreplaced with a set of calls to ``smp_call_function_single()``,
27ccc9971eSMauro Carvalho Chehabeach of which results in an IPI to the target CPU.
28ccc9971eSMauro Carvalho ChehabThe corresponding handler function checks the CPU's state, motivating
29ccc9971eSMauro Carvalho Chehaba faster quiescent state where possible, and triggering a report
30ccc9971eSMauro Carvalho Chehabof that quiescent state.
31ccc9971eSMauro Carvalho ChehabAs always for RCU, once everything has spent some time in a quiescent
32ccc9971eSMauro Carvalho Chehabstate, the expedited grace period has completed.
33ccc9971eSMauro Carvalho Chehab
34ccc9971eSMauro Carvalho ChehabThe details of the ``smp_call_function_single()`` handler's
35ccc9971eSMauro Carvalho Chehaboperation depend on the RCU flavor, as described in the following
36ccc9971eSMauro Carvalho Chehabsections.
37ccc9971eSMauro Carvalho Chehab
38ccc9971eSMauro Carvalho ChehabRCU-preempt Expedited Grace Periods
39ccc9971eSMauro Carvalho Chehab===================================
40ccc9971eSMauro Carvalho Chehab
4181ad58beSSebastian Andrzej Siewior``CONFIG_PREEMPTION=y`` kernels implement RCU-preempt.
42ccc9971eSMauro Carvalho ChehabThe overall flow of the handling of a given CPU by an RCU-preempt
43ccc9971eSMauro Carvalho Chehabexpedited grace period is shown in the following diagram:
44ccc9971eSMauro Carvalho Chehab
45ccc9971eSMauro Carvalho Chehab.. kernel-figure:: ExpRCUFlow.svg
46ccc9971eSMauro Carvalho Chehab
47ccc9971eSMauro Carvalho ChehabThe solid arrows denote direct action, for example, a function call.
48ccc9971eSMauro Carvalho ChehabThe dotted arrows denote indirect action, for example, an IPI
49ccc9971eSMauro Carvalho Chehabor a state that is reached after some time.
50ccc9971eSMauro Carvalho Chehab
51ccc9971eSMauro Carvalho ChehabIf a given CPU is offline or idle, ``synchronize_rcu_expedited()``
52ccc9971eSMauro Carvalho Chehabwill ignore it because idle and offline CPUs are already residing
53ccc9971eSMauro Carvalho Chehabin quiescent states.
54ccc9971eSMauro Carvalho ChehabOtherwise, the expedited grace period will use
55ccc9971eSMauro Carvalho Chehab``smp_call_function_single()`` to send the CPU an IPI, which
56ccc9971eSMauro Carvalho Chehabis handled by ``rcu_exp_handler()``.
57ccc9971eSMauro Carvalho Chehab
58ccc9971eSMauro Carvalho ChehabHowever, because this is preemptible RCU, ``rcu_exp_handler()``
59ccc9971eSMauro Carvalho Chehabcan check to see if the CPU is currently running in an RCU read-side
60ccc9971eSMauro Carvalho Chehabcritical section.
61ccc9971eSMauro Carvalho ChehabIf not, the handler can immediately report a quiescent state.
62ccc9971eSMauro Carvalho ChehabOtherwise, it sets flags so that the outermost ``rcu_read_unlock()``
63ccc9971eSMauro Carvalho Chehabinvocation will provide the needed quiescent-state report.
64ccc9971eSMauro Carvalho ChehabThis flag-setting avoids the previous forced preemption of all
65ccc9971eSMauro Carvalho ChehabCPUs that might have RCU read-side critical sections.
66ccc9971eSMauro Carvalho ChehabIn addition, this flag-setting is done so as to avoid increasing
67ccc9971eSMauro Carvalho Chehabthe overhead of the common-case fastpath through the scheduler.
68ccc9971eSMauro Carvalho Chehab
69ccc9971eSMauro Carvalho ChehabAgain because this is preemptible RCU, an RCU read-side critical section
70ccc9971eSMauro Carvalho Chehabcan be preempted.
71ccc9971eSMauro Carvalho ChehabWhen that happens, RCU will enqueue the task, which will the continue to
72ccc9971eSMauro Carvalho Chehabblock the current expedited grace period until it resumes and finds its
73ccc9971eSMauro Carvalho Chehaboutermost ``rcu_read_unlock()``.
74ccc9971eSMauro Carvalho ChehabThe CPU will report a quiescent state just after enqueuing the task because
75ccc9971eSMauro Carvalho Chehabthe CPU is no longer blocking the grace period.
76ccc9971eSMauro Carvalho ChehabIt is instead the preempted task doing the blocking.
77ccc9971eSMauro Carvalho ChehabThe list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``,
78ccc9971eSMauro Carvalho Chehabwhich is called from ``rcu_preempt_note_context_switch()``, which
79ccc9971eSMauro Carvalho Chehabin turn is called from ``rcu_note_context_switch()``, which in
80ccc9971eSMauro Carvalho Chehabturn is called from the scheduler.
81ccc9971eSMauro Carvalho Chehab
82ccc9971eSMauro Carvalho Chehab
83ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
84ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
85ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
86ccc9971eSMauro Carvalho Chehab| Why not just have the expedited grace period check the state of all   |
87ccc9971eSMauro Carvalho Chehab| the CPUs? After all, that would avoid all those real-time-unfriendly  |
88ccc9971eSMauro Carvalho Chehab| IPIs.                                                                 |
89ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
90ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
91ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
92ccc9971eSMauro Carvalho Chehab| Because we want the RCU read-side critical sections to run fast,      |
93ccc9971eSMauro Carvalho Chehab| which means no memory barriers. Therefore, it is not possible to      |
94ccc9971eSMauro Carvalho Chehab| safely check the state from some other CPU. And even if it was        |
95ccc9971eSMauro Carvalho Chehab| possible to safely check the state, it would still be necessary to    |
96ccc9971eSMauro Carvalho Chehab| IPI the CPU to safely interact with the upcoming                      |
97ccc9971eSMauro Carvalho Chehab| ``rcu_read_unlock()`` invocation, which means that the remote state   |
98ccc9971eSMauro Carvalho Chehab| testing would not help the worst-case latency that real-time          |
99ccc9971eSMauro Carvalho Chehab| applications care about.                                              |
100ccc9971eSMauro Carvalho Chehab|                                                                       |
101ccc9971eSMauro Carvalho Chehab| One way to prevent your real-time application from getting hit with   |
102ccc9971eSMauro Carvalho Chehab| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU  |
103ccc9971eSMauro Carvalho Chehab| would then perceive the CPU running your application as being idle,   |
104ccc9971eSMauro Carvalho Chehab| and it would be able to safely detect that state without needing to   |
105ccc9971eSMauro Carvalho Chehab| IPI the CPU.                                                          |
106ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
107ccc9971eSMauro Carvalho Chehab
108ccc9971eSMauro Carvalho ChehabPlease note that this is just the overall flow: Additional complications
109ccc9971eSMauro Carvalho Chehabcan arise due to races with CPUs going idle or offline, among other
110ccc9971eSMauro Carvalho Chehabthings.
111ccc9971eSMauro Carvalho Chehab
112ccc9971eSMauro Carvalho ChehabRCU-sched Expedited Grace Periods
113ccc9971eSMauro Carvalho Chehab---------------------------------
114ccc9971eSMauro Carvalho Chehab
11581ad58beSSebastian Andrzej Siewior``CONFIG_PREEMPTION=n`` kernels implement RCU-sched. The overall flow of
116ccc9971eSMauro Carvalho Chehabthe handling of a given CPU by an RCU-sched expedited grace period is
117ccc9971eSMauro Carvalho Chehabshown in the following diagram:
118ccc9971eSMauro Carvalho Chehab
119ccc9971eSMauro Carvalho Chehab.. kernel-figure:: ExpSchedFlow.svg
120ccc9971eSMauro Carvalho Chehab
121ccc9971eSMauro Carvalho ChehabAs with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores
122ccc9971eSMauro Carvalho Chehaboffline and idle CPUs, again because they are in remotely detectable
123ccc9971eSMauro Carvalho Chehabquiescent states. However, because the ``rcu_read_lock_sched()`` and
124ccc9971eSMauro Carvalho Chehab``rcu_read_unlock_sched()`` leave no trace of their invocation, in
125ccc9971eSMauro Carvalho Chehabgeneral it is not possible to tell whether or not the current CPU is in
126ccc9971eSMauro Carvalho Chehaban RCU read-side critical section. The best that RCU-sched's
127ccc9971eSMauro Carvalho Chehab``rcu_exp_handler()`` can do is to check for idle, on the off-chance
128ccc9971eSMauro Carvalho Chehabthat the CPU went idle while the IPI was in flight. If the CPU is idle,
129ccc9971eSMauro Carvalho Chehabthen ``rcu_exp_handler()`` reports the quiescent state.
130ccc9971eSMauro Carvalho Chehab
131ccc9971eSMauro Carvalho ChehabOtherwise, the handler forces a future context switch by setting the
132ccc9971eSMauro Carvalho ChehabNEED_RESCHED flag of the current task's thread flag and the CPU preempt
133ccc9971eSMauro Carvalho Chehabcounter. At the time of the context switch, the CPU reports the
134ccc9971eSMauro Carvalho Chehabquiescent state. Should the CPU go offline first, it will report the
135ccc9971eSMauro Carvalho Chehabquiescent state at that time.
136ccc9971eSMauro Carvalho Chehab
137ccc9971eSMauro Carvalho ChehabExpedited Grace Period and CPU Hotplug
138ccc9971eSMauro Carvalho Chehab--------------------------------------
139ccc9971eSMauro Carvalho Chehab
140ccc9971eSMauro Carvalho ChehabThe expedited nature of expedited grace periods require a much tighter
141ccc9971eSMauro Carvalho Chehabinteraction with CPU hotplug operations than is required for normal
142ccc9971eSMauro Carvalho Chehabgrace periods. In addition, attempting to IPI offline CPUs will result
143ccc9971eSMauro Carvalho Chehabin splats, but failing to IPI online CPUs can result in too-short grace
144ccc9971eSMauro Carvalho Chehabperiods. Neither option is acceptable in production kernels.
145ccc9971eSMauro Carvalho Chehab
146ccc9971eSMauro Carvalho ChehabThe interaction between expedited grace periods and CPU hotplug
147ccc9971eSMauro Carvalho Chehaboperations is carried out at several levels:
148ccc9971eSMauro Carvalho Chehab
149ccc9971eSMauro Carvalho Chehab#. The number of CPUs that have ever been online is tracked by the
150ccc9971eSMauro Carvalho Chehab   ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state``
151ccc9971eSMauro Carvalho Chehab   structure's ``->ncpus_snap`` field tracks the number of CPUs that
152ccc9971eSMauro Carvalho Chehab   have ever been online at the beginning of an RCU expedited grace
153ccc9971eSMauro Carvalho Chehab   period. Note that this number never decreases, at least in the
154ccc9971eSMauro Carvalho Chehab   absence of a time machine.
155ccc9971eSMauro Carvalho Chehab#. The identities of the CPUs that have ever been online is tracked by
156ccc9971eSMauro Carvalho Chehab   the ``rcu_node`` structure's ``->expmaskinitnext`` field. The
157ccc9971eSMauro Carvalho Chehab   ``rcu_node`` structure's ``->expmaskinit`` field tracks the
158ccc9971eSMauro Carvalho Chehab   identities of the CPUs that were online at least once at the
159ccc9971eSMauro Carvalho Chehab   beginning of the most recent RCU expedited grace period. The
160ccc9971eSMauro Carvalho Chehab   ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are
161ccc9971eSMauro Carvalho Chehab   used to detect when new CPUs have come online for the first time,
162ccc9971eSMauro Carvalho Chehab   that is, when the ``rcu_node`` structure's ``->expmaskinitnext``
163ccc9971eSMauro Carvalho Chehab   field has changed since the beginning of the last RCU expedited grace
164ccc9971eSMauro Carvalho Chehab   period, which triggers an update of each ``rcu_node`` structure's
165ccc9971eSMauro Carvalho Chehab   ``->expmaskinit`` field from its ``->expmaskinitnext`` field.
166ccc9971eSMauro Carvalho Chehab#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to
167ccc9971eSMauro Carvalho Chehab   initialize that structure's ``->expmask`` at the beginning of each
168ccc9971eSMauro Carvalho Chehab   RCU expedited grace period. This means that only those CPUs that have
169ccc9971eSMauro Carvalho Chehab   been online at least once will be considered for a given grace
170ccc9971eSMauro Carvalho Chehab   period.
171ccc9971eSMauro Carvalho Chehab#. Any CPU that goes offline will clear its bit in its leaf ``rcu_node``
172ccc9971eSMauro Carvalho Chehab   structure's ``->qsmaskinitnext`` field, so any CPU with that bit
173ccc9971eSMauro Carvalho Chehab   clear can safely be ignored. However, it is possible for a CPU coming
174ccc9971eSMauro Carvalho Chehab   online or going offline to have this bit set for some time while
175ccc9971eSMauro Carvalho Chehab   ``cpu_online`` returns ``false``.
176ccc9971eSMauro Carvalho Chehab#. For each non-idle CPU that RCU believes is currently online, the
177ccc9971eSMauro Carvalho Chehab   grace period invokes ``smp_call_function_single()``. If this
178ccc9971eSMauro Carvalho Chehab   succeeds, the CPU was fully online. Failure indicates that the CPU is
179ccc9971eSMauro Carvalho Chehab   in the process of coming online or going offline, in which case it is
180ccc9971eSMauro Carvalho Chehab   necessary to wait for a short time period and try again. The purpose
181ccc9971eSMauro Carvalho Chehab   of this wait (or series of waits, as the case may be) is to permit a
182ccc9971eSMauro Carvalho Chehab   concurrent CPU-hotplug operation to complete.
183ccc9971eSMauro Carvalho Chehab#. In the case of RCU-sched, one of the last acts of an outgoing CPU is
184*448e9f34SFrederic Weisbecker   to invoke ``rcutree_report_cpu_dead()``, which reports a quiescent state for
185ccc9971eSMauro Carvalho Chehab   that CPU. However, this is likely paranoia-induced redundancy.
186ccc9971eSMauro Carvalho Chehab
187ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
188ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
189ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
190ccc9971eSMauro Carvalho Chehab| Why all the dancing around with multiple counters and masks tracking  |
191ccc9971eSMauro Carvalho Chehab| CPUs that were once online? Why not just have a single set of masks   |
192ccc9971eSMauro Carvalho Chehab| tracking the currently online CPUs and be done with it?               |
193ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
194ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
195ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
196ccc9971eSMauro Carvalho Chehab| Maintaining single set of masks tracking the online CPUs *sounds*     |
197ccc9971eSMauro Carvalho Chehab| easier, at least until you try working out all the race conditions    |
198ccc9971eSMauro Carvalho Chehab| between grace-period initialization and CPU-hotplug operations. For   |
199ccc9971eSMauro Carvalho Chehab| example, suppose initialization is progressing down the tree while a  |
200ccc9971eSMauro Carvalho Chehab| CPU-offline operation is progressing up the tree. This situation can  |
201ccc9971eSMauro Carvalho Chehab| result in bits set at the top of the tree that have no counterparts   |
202ccc9971eSMauro Carvalho Chehab| at the bottom of the tree. Those bits will never be cleared, which    |
203ccc9971eSMauro Carvalho Chehab| will result in grace-period hangs. In short, that way lies madness,   |
204ccc9971eSMauro Carvalho Chehab| to say nothing of a great many bugs, hangs, and deadlocks.            |
205ccc9971eSMauro Carvalho Chehab| In contrast, the current multi-mask multi-counter scheme ensures that |
206ccc9971eSMauro Carvalho Chehab| grace-period initialization will always see consistent masks up and   |
207ccc9971eSMauro Carvalho Chehab| down the tree, which brings significant simplifications over the      |
208ccc9971eSMauro Carvalho Chehab| single-mask method.                                                   |
209ccc9971eSMauro Carvalho Chehab|                                                                       |
210ccc9971eSMauro Carvalho Chehab| This is an instance of `deferring work in order to avoid              |
211ccc9971eSMauro Carvalho Chehab| synchronization <http://www.cs.columbia.edu/~library/TR-repository/re |
212ccc9971eSMauro Carvalho Chehab| ports/reports-1992/cucs-039-92.ps.gz>`__.                             |
213ccc9971eSMauro Carvalho Chehab| Lazily recording CPU-hotplug events at the beginning of the next      |
214ccc9971eSMauro Carvalho Chehab| grace period greatly simplifies maintenance of the CPU-tracking       |
215ccc9971eSMauro Carvalho Chehab| bitmasks in the ``rcu_node`` tree.                                    |
216ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
217ccc9971eSMauro Carvalho Chehab
218ccc9971eSMauro Carvalho ChehabExpedited Grace Period Refinements
219ccc9971eSMauro Carvalho Chehab----------------------------------
220ccc9971eSMauro Carvalho Chehab
221ccc9971eSMauro Carvalho ChehabIdle-CPU Checks
222ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~
223ccc9971eSMauro Carvalho Chehab
224ccc9971eSMauro Carvalho ChehabEach expedited grace period checks for idle CPUs when initially forming
225ccc9971eSMauro Carvalho Chehabthe mask of CPUs to be IPIed and again just before IPIing a CPU (both
226ccc9971eSMauro Carvalho Chehabchecks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is
227ccc9971eSMauro Carvalho Chehabidle at any time between those two times, the CPU will not be IPIed.
228ccc9971eSMauro Carvalho ChehabInstead, the task pushing the grace period forward will include the idle
229ccc9971eSMauro Carvalho ChehabCPUs in the mask passed to ``rcu_report_exp_cpu_mult()``.
230ccc9971eSMauro Carvalho Chehab
231ccc9971eSMauro Carvalho ChehabFor RCU-sched, there is an additional check: If the IPI has interrupted
232ccc9971eSMauro Carvalho Chehabthe idle loop, then ``rcu_exp_handler()`` invokes
233ccc9971eSMauro Carvalho Chehab``rcu_report_exp_rdp()`` to report the corresponding quiescent state.
234ccc9971eSMauro Carvalho Chehab
235ccc9971eSMauro Carvalho ChehabFor RCU-preempt, there is no specific check for idle in the IPI handler
236ccc9971eSMauro Carvalho Chehab(``rcu_exp_handler()``), but because RCU read-side critical sections are
237ccc9971eSMauro Carvalho Chehabnot permitted within the idle loop, if ``rcu_exp_handler()`` sees that
238ccc9971eSMauro Carvalho Chehabthe CPU is within RCU read-side critical section, the CPU cannot
239ccc9971eSMauro Carvalho Chehabpossibly be idle. Otherwise, ``rcu_exp_handler()`` invokes
240ccc9971eSMauro Carvalho Chehab``rcu_report_exp_rdp()`` to report the corresponding quiescent state,
241ccc9971eSMauro Carvalho Chehabregardless of whether or not that quiescent state was due to the CPU
242ccc9971eSMauro Carvalho Chehabbeing idle.
243ccc9971eSMauro Carvalho Chehab
244ccc9971eSMauro Carvalho ChehabIn summary, RCU expedited grace periods check for idle when building the
245ccc9971eSMauro Carvalho Chehabbitmask of CPUs that must be IPIed, just before sending each IPI, and
246ccc9971eSMauro Carvalho Chehab(either explicitly or implicitly) within the IPI handler.
247ccc9971eSMauro Carvalho Chehab
248ccc9971eSMauro Carvalho ChehabBatching via Sequence Counter
249ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
250ccc9971eSMauro Carvalho Chehab
251ccc9971eSMauro Carvalho ChehabIf each grace-period request was carried out separately, expedited grace
252ccc9971eSMauro Carvalho Chehabperiods would have abysmal scalability and problematic high-load
253ccc9971eSMauro Carvalho Chehabcharacteristics. Because each grace-period operation can serve an
254ccc9971eSMauro Carvalho Chehabunlimited number of updates, it is important to *batch* requests, so
255ccc9971eSMauro Carvalho Chehabthat a single expedited grace-period operation will cover all requests
256ccc9971eSMauro Carvalho Chehabin the corresponding batch.
257ccc9971eSMauro Carvalho Chehab
258ccc9971eSMauro Carvalho ChehabThis batching is controlled by a sequence counter named
259ccc9971eSMauro Carvalho Chehab``->expedited_sequence`` in the ``rcu_state`` structure. This counter
260ccc9971eSMauro Carvalho Chehabhas an odd value when there is an expedited grace period in progress and
261ccc9971eSMauro Carvalho Chehaban even value otherwise, so that dividing the counter value by two gives
262ccc9971eSMauro Carvalho Chehabthe number of completed grace periods. During any given update request,
263ccc9971eSMauro Carvalho Chehabthe counter must transition from even to odd and then back to even, thus
264ccc9971eSMauro Carvalho Chehabindicating that a grace period has elapsed. Therefore, if the initial
265ccc9971eSMauro Carvalho Chehabvalue of the counter is ``s``, the updater must wait until the counter
266ccc9971eSMauro Carvalho Chehabreaches at least the value ``(s+3)&~0x1``. This counter is managed by
267ccc9971eSMauro Carvalho Chehabthe following access functions:
268ccc9971eSMauro Carvalho Chehab
269ccc9971eSMauro Carvalho Chehab#. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited
270ccc9971eSMauro Carvalho Chehab   grace period.
271ccc9971eSMauro Carvalho Chehab#. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace
272ccc9971eSMauro Carvalho Chehab   period.
273ccc9971eSMauro Carvalho Chehab#. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter.
274ccc9971eSMauro Carvalho Chehab#. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited
275ccc9971eSMauro Carvalho Chehab   grace period has elapsed since the corresponding call to
276ccc9971eSMauro Carvalho Chehab   ``rcu_exp_gp_seq_snap()``.
277ccc9971eSMauro Carvalho Chehab
278ccc9971eSMauro Carvalho ChehabAgain, only one request in a given batch need actually carry out a
279ccc9971eSMauro Carvalho Chehabgrace-period operation, which means there must be an efficient way to
280c4af9e00SRandy Dunlapidentify which of many concurrent requests will initiate the grace
281ccc9971eSMauro Carvalho Chehabperiod, and that there be an efficient way for the remaining requests to
282ccc9971eSMauro Carvalho Chehabwait for that grace period to complete. However, that is the topic of
283ccc9971eSMauro Carvalho Chehabthe next section.
284ccc9971eSMauro Carvalho Chehab
285ccc9971eSMauro Carvalho ChehabFunnel Locking and Wait/Wakeup
286ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
287ccc9971eSMauro Carvalho Chehab
288ccc9971eSMauro Carvalho ChehabThe natural way to sort out which of a batch of updaters will initiate
289ccc9971eSMauro Carvalho Chehabthe expedited grace period is to use the ``rcu_node`` combining tree, as
290ccc9971eSMauro Carvalho Chehabimplemented by the ``exp_funnel_lock()`` function. The first updater
291ccc9971eSMauro Carvalho Chehabcorresponding to a given grace period arriving at a given ``rcu_node``
292ccc9971eSMauro Carvalho Chehabstructure records its desired grace-period sequence number in the
293ccc9971eSMauro Carvalho Chehab``->exp_seq_rq`` field and moves up to the next level in the tree.
294ccc9971eSMauro Carvalho ChehabOtherwise, if the ``->exp_seq_rq`` field already contains the sequence
295ccc9971eSMauro Carvalho Chehabnumber for the desired grace period or some later one, the updater
296ccc9971eSMauro Carvalho Chehabblocks on one of four wait queues in the ``->exp_wq[]`` array, using the
297ccc9971eSMauro Carvalho Chehabsecond-from-bottom and third-from bottom bits as an index. An
298ccc9971eSMauro Carvalho Chehab``->exp_lock`` field in the ``rcu_node`` structure synchronizes access
299ccc9971eSMauro Carvalho Chehabto these fields.
300ccc9971eSMauro Carvalho Chehab
301ccc9971eSMauro Carvalho ChehabAn empty ``rcu_node`` tree is shown in the following diagram, with the
302ccc9971eSMauro Carvalho Chehabwhite cells representing the ``->exp_seq_rq`` field and the red cells
303ccc9971eSMauro Carvalho Chehabrepresenting the elements of the ``->exp_wq[]`` array.
304ccc9971eSMauro Carvalho Chehab
305ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel0.svg
306ccc9971eSMauro Carvalho Chehab
307ccc9971eSMauro Carvalho ChehabThe next diagram shows the situation after the arrival of Task A and
308ccc9971eSMauro Carvalho ChehabTask B at the leftmost and rightmost leaf ``rcu_node`` structures,
309ccc9971eSMauro Carvalho Chehabrespectively. The current value of the ``rcu_state`` structure's
310ccc9971eSMauro Carvalho Chehab``->expedited_sequence`` field is zero, so adding three and clearing the
311ccc9971eSMauro Carvalho Chehabbottom bit results in the value two, which both tasks record in the
312ccc9971eSMauro Carvalho Chehab``->exp_seq_rq`` field of their respective ``rcu_node`` structures:
313ccc9971eSMauro Carvalho Chehab
314ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel1.svg
315ccc9971eSMauro Carvalho Chehab
316ccc9971eSMauro Carvalho ChehabEach of Tasks A and B will move up to the root ``rcu_node`` structure.
317ccc9971eSMauro Carvalho ChehabSuppose that Task A wins, recording its desired grace-period sequence
318ccc9971eSMauro Carvalho Chehabnumber and resulting in the state shown below:
319ccc9971eSMauro Carvalho Chehab
320ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel2.svg
321ccc9971eSMauro Carvalho Chehab
322ccc9971eSMauro Carvalho ChehabTask A now advances to initiate a new grace period, while Task B moves
323ccc9971eSMauro Carvalho Chehabup to the root ``rcu_node`` structure, and, seeing that its desired
324ccc9971eSMauro Carvalho Chehabsequence number is already recorded, blocks on ``->exp_wq[1]``.
325ccc9971eSMauro Carvalho Chehab
326ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
327ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
328ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
329ccc9971eSMauro Carvalho Chehab| Why ``->exp_wq[1]``? Given that the value of these tasks' desired     |
330ccc9971eSMauro Carvalho Chehab| sequence number is two, so shouldn't they instead block on            |
331ccc9971eSMauro Carvalho Chehab| ``->exp_wq[2]``?                                                      |
332ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
333ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
334ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
335ccc9971eSMauro Carvalho Chehab| No.                                                                   |
336ccc9971eSMauro Carvalho Chehab| Recall that the bottom bit of the desired sequence number indicates   |
337ccc9971eSMauro Carvalho Chehab| whether or not a grace period is currently in progress. It is         |
338ccc9971eSMauro Carvalho Chehab| therefore necessary to shift the sequence number right one bit        |
339ccc9971eSMauro Carvalho Chehab| position to obtain the number of the grace period. This results in    |
340ccc9971eSMauro Carvalho Chehab| ``->exp_wq[1]``.                                                      |
341ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
342ccc9971eSMauro Carvalho Chehab
343ccc9971eSMauro Carvalho ChehabIf Tasks C and D also arrive at this point, they will compute the same
344ccc9971eSMauro Carvalho Chehabdesired grace-period sequence number, and see that both leaf
345ccc9971eSMauro Carvalho Chehab``rcu_node`` structures already have that value recorded. They will
346ccc9971eSMauro Carvalho Chehabtherefore block on their respective ``rcu_node`` structures'
347ccc9971eSMauro Carvalho Chehab``->exp_wq[1]`` fields, as shown below:
348ccc9971eSMauro Carvalho Chehab
349ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel3.svg
350ccc9971eSMauro Carvalho Chehab
351ccc9971eSMauro Carvalho ChehabTask A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and
352ccc9971eSMauro Carvalho Chehabinitiates the grace period, which increments ``->expedited_sequence``.
353ccc9971eSMauro Carvalho ChehabTherefore, if Tasks E and F arrive, they will compute a desired sequence
354ccc9971eSMauro Carvalho Chehabnumber of 4 and will record this value as shown below:
355ccc9971eSMauro Carvalho Chehab
356ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel4.svg
357ccc9971eSMauro Carvalho Chehab
358ccc9971eSMauro Carvalho ChehabTasks E and F will propagate up the ``rcu_node`` combining tree, with
359ccc9971eSMauro Carvalho ChehabTask F blocking on the root ``rcu_node`` structure and Task E wait for
360ccc9971eSMauro Carvalho ChehabTask A to finish so that it can start the next grace period. The
361ccc9971eSMauro Carvalho Chehabresulting state is as shown below:
362ccc9971eSMauro Carvalho Chehab
363ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel5.svg
364ccc9971eSMauro Carvalho Chehab
365ccc9971eSMauro Carvalho ChehabOnce the grace period completes, Task A starts waking up the tasks
366ccc9971eSMauro Carvalho Chehabwaiting for this grace period to complete, increments the
367ccc9971eSMauro Carvalho Chehab``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then
368ccc9971eSMauro Carvalho Chehabreleases the ``->exp_mutex``. This results in the following state:
369ccc9971eSMauro Carvalho Chehab
370ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel6.svg
371ccc9971eSMauro Carvalho Chehab
372ccc9971eSMauro Carvalho ChehabTask E can then acquire ``->exp_mutex`` and increment
373ccc9971eSMauro Carvalho Chehab``->expedited_sequence`` to the value three. If new tasks G and H arrive
374ccc9971eSMauro Carvalho Chehaband moves up the combining tree at the same time, the state will be as
375ccc9971eSMauro Carvalho Chehabfollows:
376ccc9971eSMauro Carvalho Chehab
377ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel7.svg
378ccc9971eSMauro Carvalho Chehab
379ccc9971eSMauro Carvalho ChehabNote that three of the root ``rcu_node`` structure's waitqueues are now
380ccc9971eSMauro Carvalho Chehaboccupied. However, at some point, Task A will wake up the tasks blocked
381ccc9971eSMauro Carvalho Chehabon the ``->exp_wq`` waitqueues, resulting in the following state:
382ccc9971eSMauro Carvalho Chehab
383ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel8.svg
384ccc9971eSMauro Carvalho Chehab
385ccc9971eSMauro Carvalho ChehabExecution will continue with Tasks E and H completing their grace
386ccc9971eSMauro Carvalho Chehabperiods and carrying out their wakeups.
387ccc9971eSMauro Carvalho Chehab
388ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
389ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
390ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
391ccc9971eSMauro Carvalho Chehab| What happens if Task A takes so long to do its wakeups that Task E's  |
392ccc9971eSMauro Carvalho Chehab| grace period completes?                                               |
393ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
394ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
395ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
396ccc9971eSMauro Carvalho Chehab| Then Task E will block on the ``->exp_wake_mutex``, which will also   |
397ccc9971eSMauro Carvalho Chehab| prevent it from releasing ``->exp_mutex``, which in turn will prevent |
398ccc9971eSMauro Carvalho Chehab| the next grace period from starting. This last is important in        |
399ccc9971eSMauro Carvalho Chehab| preventing overflow of the ``->exp_wq[]`` array.                      |
400ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
401ccc9971eSMauro Carvalho Chehab
402ccc9971eSMauro Carvalho ChehabUse of Workqueues
403ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~
404ccc9971eSMauro Carvalho Chehab
405ccc9971eSMauro Carvalho ChehabIn earlier implementations, the task requesting the expedited grace
406ccc9971eSMauro Carvalho Chehabperiod also drove it to completion. This straightforward approach had
407ccc9971eSMauro Carvalho Chehabthe disadvantage of needing to account for POSIX signals sent to user
408c4af9e00SRandy Dunlaptasks, so more recent implementations use the Linux kernel's
409404147faSAkira Yokosawaworkqueues (see Documentation/core-api/workqueue.rst).
410ccc9971eSMauro Carvalho Chehab
411ccc9971eSMauro Carvalho ChehabThe requesting task still does counter snapshotting and funnel-lock
412ccc9971eSMauro Carvalho Chehabprocessing, but the task reaching the top of the funnel lock does a
413ccc9971eSMauro Carvalho Chehab``schedule_work()`` (from ``_synchronize_rcu_expedited()`` so that a
414ccc9971eSMauro Carvalho Chehabworkqueue kthread does the actual grace-period processing. Because
415ccc9971eSMauro Carvalho Chehabworkqueue kthreads do not accept POSIX signals, grace-period-wait
416ccc9971eSMauro Carvalho Chehabprocessing need not allow for POSIX signals. In addition, this approach
417ccc9971eSMauro Carvalho Chehaballows wakeups for the previous expedited grace period to be overlapped
418ccc9971eSMauro Carvalho Chehabwith processing for the next expedited grace period. Because there are
419ccc9971eSMauro Carvalho Chehabonly four sets of waitqueues, it is necessary to ensure that the
420ccc9971eSMauro Carvalho Chehabprevious grace period's wakeups complete before the next grace period's
421ccc9971eSMauro Carvalho Chehabwakeups start. This is handled by having the ``->exp_mutex`` guard
422ccc9971eSMauro Carvalho Chehabexpedited grace-period processing and the ``->exp_wake_mutex`` guard
423ccc9971eSMauro Carvalho Chehabwakeups. The key point is that the ``->exp_mutex`` is not released until
424ccc9971eSMauro Carvalho Chehabthe first wakeup is complete, which means that the ``->exp_wake_mutex``
425ccc9971eSMauro Carvalho Chehabhas already been acquired at that point. This approach ensures that the
426ccc9971eSMauro Carvalho Chehabprevious grace period's wakeups can be carried out while the current
427ccc9971eSMauro Carvalho Chehabgrace period is in process, but that these wakeups will complete before
428ccc9971eSMauro Carvalho Chehabthe next grace period starts. This means that only three waitqueues are
429ccc9971eSMauro Carvalho Chehabrequired, guaranteeing that the four that are provided are sufficient.
430ccc9971eSMauro Carvalho Chehab
431ccc9971eSMauro Carvalho ChehabStall Warnings
432ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~
433ccc9971eSMauro Carvalho Chehab
434ccc9971eSMauro Carvalho ChehabExpediting grace periods does nothing to speed things up when RCU
435ccc9971eSMauro Carvalho Chehabreaders take too long, and therefore expedited grace periods check for
436ccc9971eSMauro Carvalho Chehabstalls just as normal grace periods do.
437ccc9971eSMauro Carvalho Chehab
438ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
439ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
440ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
441ccc9971eSMauro Carvalho Chehab| But why not just let the normal grace-period machinery detect the     |
442ccc9971eSMauro Carvalho Chehab| stalls, given that a given reader must block both normal and          |
443ccc9971eSMauro Carvalho Chehab| expedited grace periods?                                              |
444ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
445ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
446ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
447ccc9971eSMauro Carvalho Chehab| Because it is quite possible that at a given time there is no normal  |
448ccc9971eSMauro Carvalho Chehab| grace period in progress, in which case the normal grace period       |
449ccc9971eSMauro Carvalho Chehab| cannot emit a stall warning.                                          |
450ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
451ccc9971eSMauro Carvalho Chehab
452ccc9971eSMauro Carvalho ChehabThe ``synchronize_sched_expedited_wait()`` function loops waiting for
453ccc9971eSMauro Carvalho Chehabthe expedited grace period to end, but with a timeout set to the current
454ccc9971eSMauro Carvalho ChehabRCU CPU stall-warning time. If this time is exceeded, any CPUs or
455ccc9971eSMauro Carvalho Chehab``rcu_node`` structures blocking the current grace period are printed.
456ccc9971eSMauro Carvalho ChehabEach stall warning results in another pass through the loop, but the
457ccc9971eSMauro Carvalho Chehabsecond and subsequent passes use longer stall times.
458ccc9971eSMauro Carvalho Chehab
459ccc9971eSMauro Carvalho ChehabMid-boot operation
460ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~
461ccc9971eSMauro Carvalho Chehab
462ccc9971eSMauro Carvalho ChehabThe use of workqueues has the advantage that the expedited grace-period
463ccc9971eSMauro Carvalho Chehabcode need not worry about POSIX signals. Unfortunately, it has the
464ccc9971eSMauro Carvalho Chehabcorresponding disadvantage that workqueues cannot be used until they are
465ccc9971eSMauro Carvalho Chehabinitialized, which does not happen until some time after the scheduler
466ccc9971eSMauro Carvalho Chehabspawns the first task. Given that there are parts of the kernel that
467ccc9971eSMauro Carvalho Chehabreally do want to execute grace periods during this mid-boot “dead
468c4af9e00SRandy Dunlapzone”, expedited grace periods must do something else during this time.
469ccc9971eSMauro Carvalho Chehab
470ccc9971eSMauro Carvalho ChehabWhat they do is to fall back to the old practice of requiring that the
471ccc9971eSMauro Carvalho Chehabrequesting task drive the expedited grace period, as was the case before
472ccc9971eSMauro Carvalho Chehabthe use of workqueues. However, the requesting task is only required to
473ccc9971eSMauro Carvalho Chehabdrive the grace period during the mid-boot dead zone. Before mid-boot, a
474ccc9971eSMauro Carvalho Chehabsynchronous grace period is a no-op. Some time after mid-boot,
475ccc9971eSMauro Carvalho Chehabworkqueues are used.
476ccc9971eSMauro Carvalho Chehab
477ccc9971eSMauro Carvalho ChehabNon-expedited non-SRCU synchronous grace periods must also operate
478ccc9971eSMauro Carvalho Chehabnormally during mid-boot. This is handled by causing non-expedited grace
479ccc9971eSMauro Carvalho Chehabperiods to take the expedited code path during mid-boot.
480ccc9971eSMauro Carvalho Chehab
481ccc9971eSMauro Carvalho ChehabThe current code assumes that there are no POSIX signals during the
482ccc9971eSMauro Carvalho Chehabmid-boot dead zone. However, if an overwhelming need for POSIX signals
483ccc9971eSMauro Carvalho Chehabsomehow arises, appropriate adjustments can be made to the expedited
484ccc9971eSMauro Carvalho Chehabstall-warning code. One such adjustment would reinstate the
485ccc9971eSMauro Carvalho Chehabpre-workqueue stall-warning checks, but only during the mid-boot dead
486ccc9971eSMauro Carvalho Chehabzone.
487ccc9971eSMauro Carvalho Chehab
488ccc9971eSMauro Carvalho ChehabWith this refinement, synchronous grace periods can now be used from
489ccc9971eSMauro Carvalho Chehabtask context pretty much any time during the life of the kernel. That
490ccc9971eSMauro Carvalho Chehabis, aside from some points in the suspend, hibernate, or shutdown code
491ccc9971eSMauro Carvalho Chehabpath.
492ccc9971eSMauro Carvalho Chehab
493ccc9971eSMauro Carvalho ChehabSummary
494ccc9971eSMauro Carvalho Chehab~~~~~~~
495ccc9971eSMauro Carvalho Chehab
496ccc9971eSMauro Carvalho ChehabExpedited grace periods use a sequence-number approach to promote
497ccc9971eSMauro Carvalho Chehabbatching, so that a single grace-period operation can serve numerous
498ccc9971eSMauro Carvalho Chehabrequests. A funnel lock is used to efficiently identify the one task out
499ccc9971eSMauro Carvalho Chehabof a concurrent group that will request the grace period. All members of
500ccc9971eSMauro Carvalho Chehabthe group will block on waitqueues provided in the ``rcu_node``
501ccc9971eSMauro Carvalho Chehabstructure. The actual grace-period processing is carried out by a
502ccc9971eSMauro Carvalho Chehabworkqueue.
503ccc9971eSMauro Carvalho Chehab
504ccc9971eSMauro Carvalho ChehabCPU-hotplug operations are noted lazily in order to prevent the need for
505ccc9971eSMauro Carvalho Chehabtight synchronization between expedited grace periods and CPU-hotplug
506ccc9971eSMauro Carvalho Chehaboperations. The dyntick-idle counters are used to avoid sending IPIs to
507ccc9971eSMauro Carvalho Chehabidle CPUs, at least in the common case. RCU-preempt and RCU-sched use
508ccc9971eSMauro Carvalho Chehabdifferent IPI handlers and different code to respond to the state
509ccc9971eSMauro Carvalho Chehabchanges carried out by those handlers, but otherwise use common code.
510ccc9971eSMauro Carvalho Chehab
511ccc9971eSMauro Carvalho ChehabQuiescent states are tracked using the ``rcu_node`` tree, and once all
512ccc9971eSMauro Carvalho Chehabnecessary quiescent states have been reported, all tasks waiting on this
513ccc9971eSMauro Carvalho Chehabexpedited grace period are awakened. A pair of mutexes are used to allow
514ccc9971eSMauro Carvalho Chehabone grace period's wakeups to proceed concurrently with the next grace
515ccc9971eSMauro Carvalho Chehabperiod's processing.
516ccc9971eSMauro Carvalho Chehab
517ccc9971eSMauro Carvalho ChehabThis combination of mechanisms allows expedited grace periods to run
518ccc9971eSMauro Carvalho Chehabreasonably efficiently. However, for non-time-critical tasks, normal
519ccc9971eSMauro Carvalho Chehabgrace periods should be used instead because their longer duration
520ccc9971eSMauro Carvalho Chehabpermits much higher degrees of batching, and thus much lower per-request
521ccc9971eSMauro Carvalho Chehaboverheads.
522