=================================================
A Tour Through TREE_RCU's Expedited Grace Periods
=================================================

Introduction
============

This document describes RCU's expedited grace periods.
Unlike RCU's normal grace periods, which accept long latencies to attain
high efficiency and minimal disturbance, expedited grace periods accept
lower efficiency and significant disturbance to attain shorter latencies.

There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier
third RCU-bh flavor having been implemented in terms of the other two.
Each of the two implementations is covered in its own section.

Expedited Grace Period Design
=============================

The expedited RCU grace periods cannot be accused of being subtle,
given that they for all intents and purposes hammer every CPU that
has not yet provided a quiescent state for the current expedited
grace period.
The one saving grace is that the hammer has grown a bit smaller
over time: The old call to ``try_stop_cpus()`` has been
replaced with a set of calls to ``smp_call_function_single()``,
each of which results in an IPI to the target CPU.
The corresponding handler function checks the CPU's state, motivating
a faster quiescent state where possible, and triggering a report
of that quiescent state.
As always for RCU, once everything has spent some time in a quiescent
state, the expedited grace period has completed.

The details of the ``smp_call_function_single()`` handler's
operation depend on the RCU flavor, as described in the following
sections.

RCU-preempt Expedited Grace Periods
===================================

``CONFIG_PREEMPTION=y`` kernels implement RCU-preempt.
The overall flow of the handling of a given CPU by an RCU-preempt
expedited grace period is shown in the following diagram:

.. kernel-figure:: ExpRCUFlow.svg

The solid arrows denote direct action, for example, a function call.
The dotted arrows denote indirect action, for example, an IPI
or a state that is reached after some time.

If a given CPU is offline or idle, ``synchronize_rcu_expedited()``
will ignore it because idle and offline CPUs are already residing
in quiescent states.
Otherwise, the expedited grace period will use
``smp_call_function_single()`` to send the CPU an IPI, which
is handled by ``rcu_exp_handler()``.

However, because this is preemptible RCU, ``rcu_exp_handler()``
can check to see if the CPU is currently running in an RCU read-side
critical section.
If not, the handler can immediately report a quiescent state.
Otherwise, it sets flags so that the outermost ``rcu_read_unlock()``
invocation will provide the needed quiescent-state report.
This flag-setting avoids the previous forced preemption of all
CPUs that might have RCU read-side critical sections.
In addition, this flag-setting is done so as to avoid increasing
the overhead of the common-case fastpath through the scheduler.

Again because this is preemptible RCU, an RCU read-side critical section
can be preempted.
When that happens, RCU will enqueue the task, which will then continue to
block the current expedited grace period until it resumes and finds its
outermost ``rcu_read_unlock()``.
The CPU will report a quiescent state just after enqueuing the task because
the CPU is no longer blocking the grace period.
It is instead the preempted task doing the blocking.
The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``,
which is called from ``rcu_preempt_note_context_switch()``, which
in turn is called from ``rcu_note_context_switch()``, which in
turn is called from the scheduler.


+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why not just have the expedited grace period check the state of all   |
| the CPUs? After all, that would avoid all those real-time-unfriendly  |
| IPIs.                                                                 |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because we want the RCU read-side critical sections to run fast,      |
| which means no memory barriers. Therefore, it is not possible to      |
| safely check the state from some other CPU. And even if it were       |
| possible to safely check the state, it would still be necessary to    |
| IPI the CPU to safely interact with the upcoming                      |
| ``rcu_read_unlock()`` invocation, which means that the remote state   |
| testing would not help the worst-case latency that real-time          |
| applications care about.                                              |
|                                                                       |
| One way to prevent your real-time application from getting hit with   |
| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU  |
| would then perceive the CPU running your application as being idle,   |
| and it would be able to safely detect that state without needing to   |
| IPI the CPU.                                                          |
+-----------------------------------------------------------------------+

Please note that this is just the overall flow: Additional complications
can arise due to races with CPUs going idle or offline, among other
things.

RCU-sched Expedited Grace Periods
---------------------------------

``CONFIG_PREEMPTION=n`` kernels implement RCU-sched. The overall flow of
the handling of a given CPU by an RCU-sched expedited grace period is
shown in the following diagram:

.. kernel-figure:: ExpSchedFlow.svg

As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores
offline and idle CPUs, again because they are in remotely detectable
quiescent states. However, because the ``rcu_read_lock_sched()`` and
``rcu_read_unlock_sched()`` leave no trace of their invocation, in
general it is not possible to tell whether or not the current CPU is in
an RCU read-side critical section. The best that RCU-sched's
``rcu_exp_handler()`` can do is to check for idle, on the off-chance
that the CPU went idle while the IPI was in flight. If the CPU is idle,
then ``rcu_exp_handler()`` reports the quiescent state.

Otherwise, the handler forces a future context switch by setting the
current task's NEED_RESCHED thread flag and the NEED_RESCHED bit in the
CPU's preempt counter. At the time of the context switch, the CPU
reports the quiescent state. Should the CPU go offline first, it will
report the quiescent state at that time.

Expedited Grace Period and CPU Hotplug
--------------------------------------

The expedited nature of expedited grace periods requires a much tighter
interaction with CPU hotplug operations than is required for normal
grace periods.
In addition, attempting to IPI offline CPUs will result
in splats, but failing to IPI online CPUs can result in too-short grace
periods. Neither option is acceptable in production kernels.

The interaction between expedited grace periods and CPU hotplug
operations is carried out at several levels:

#. The number of CPUs that have ever been online is tracked by the
   ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state``
   structure's ``->ncpus_snap`` field tracks the number of CPUs that
   have ever been online at the beginning of an RCU expedited grace
   period. Note that this number never decreases, at least in the
   absence of a time machine.
#. The identities of the CPUs that have ever been online are tracked by
   the ``rcu_node`` structure's ``->expmaskinitnext`` field. The
   ``rcu_node`` structure's ``->expmaskinit`` field tracks the
   identities of the CPUs that were online at least once at the
   beginning of the most recent RCU expedited grace period. The
   ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are
   used to detect when new CPUs have come online for the first time,
   that is, when the ``rcu_node`` structure's ``->expmaskinitnext``
   field has changed since the beginning of the last RCU expedited grace
   period, which triggers an update of each ``rcu_node`` structure's
   ``->expmaskinit`` field from its ``->expmaskinitnext`` field.
#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to
   initialize that structure's ``->expmask`` at the beginning of each
   RCU expedited grace period. This means that only those CPUs that have
   been online at least once will be considered for a given grace
   period.
#. Any CPU that goes offline will clear its bit in its leaf ``rcu_node``
   structure's ``->qsmaskinitnext`` field, so any CPU with that bit
   clear can safely be ignored. However, it is possible for a CPU coming
   online or going offline to have this bit set for some time while
   ``cpu_online`` returns ``false``.
#. For each non-idle CPU that RCU believes is currently online, the
   grace period invokes ``smp_call_function_single()``. If this
   succeeds, the CPU was fully online.
   Failure indicates that the CPU is
   in the process of coming online or going offline, in which case it is
   necessary to wait for a short time period and try again. The purpose
   of this wait (or series of waits, as the case may be) is to permit a
   concurrent CPU-hotplug operation to complete.
#. In the case of RCU-sched, one of the last acts of an outgoing CPU is
   to invoke ``rcutree_report_cpu_dead()``, which reports a quiescent
   state for that CPU. However, this is likely paranoia-induced
   redundancy.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why all the dancing around with multiple counters and masks tracking  |
| CPUs that were once online? Why not just have a single set of masks   |
| tracking the currently online CPUs and be done with it?               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Maintaining a single set of masks tracking the online CPUs *sounds*   |
| easier, at least until you try working out all the race conditions    |
| between grace-period initialization and CPU-hotplug operations. For   |
| example, suppose initialization is progressing down the tree while a  |
| CPU-offline operation is progressing up the tree. This situation can  |
| result in bits set at the top of the tree that have no counterparts   |
| at the bottom of the tree. Those bits will never be cleared, which    |
| will result in grace-period hangs. In short, that way lies madness,   |
| to say nothing of a great many bugs, hangs, and deadlocks.            |
| In contrast, the current multi-mask multi-counter scheme ensures that |
| grace-period initialization will always see consistent masks up and   |
| down the tree, which brings significant simplifications over the      |
| single-mask method.                                                   |
|                                                                       |
| This is an instance of `deferring work in order to avoid              |
| synchronization <http://www.cs.columbia.edu/~library/TR-repository/re |
| ports/reports-1992/cucs-039-92.ps.gz>`__.                             |
| Lazily recording CPU-hotplug events at the beginning of the next      |
| grace period greatly simplifies maintenance of the CPU-tracking       |
| bitmasks in the ``rcu_node`` tree.                                    |
+-----------------------------------------------------------------------+

Expedited Grace Period Refinements
----------------------------------

Idle-CPU Checks
~~~~~~~~~~~~~~~

Each expedited grace period checks for idle CPUs when initially forming
the mask of CPUs to be IPIed and again just before IPIing a CPU (both
checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is
idle at any time between those two times, the CPU will not be IPIed.
Instead, the task pushing the grace period forward will include the idle
CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``.

For RCU-sched, there is an additional check: If the IPI has interrupted
the idle loop, then ``rcu_exp_handler()`` invokes
``rcu_report_exp_rdp()`` to report the corresponding quiescent state.

For RCU-preempt, there is no specific check for idle in the IPI handler
(``rcu_exp_handler()``), but because RCU read-side critical sections are
not permitted within the idle loop, if ``rcu_exp_handler()`` sees that
the CPU is within an RCU read-side critical section, the CPU cannot
possibly be idle.
Otherwise, ``rcu_exp_handler()`` invokes
``rcu_report_exp_rdp()`` to report the corresponding quiescent state,
regardless of whether or not that quiescent state was due to the CPU
being idle.

In summary, RCU expedited grace periods check for idle when building the
bitmask of CPUs that must be IPIed, just before sending each IPI, and
(either explicitly or implicitly) within the IPI handler.

Batching via Sequence Counter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If each grace-period request were carried out separately, expedited
grace periods would have abysmal scalability and problematic high-load
characteristics. Because each grace-period operation can serve an
unlimited number of updates, it is important to *batch* requests, so
that a single expedited grace-period operation will cover all requests
in the corresponding batch.

This batching is controlled by a sequence counter named
``->expedited_sequence`` in the ``rcu_state`` structure. This counter
has an odd value when there is an expedited grace period in progress and
an even value otherwise, so that dividing the counter value by two gives
the number of completed grace periods. During any given update request,
the counter must transition from even to odd and then back to even, thus
indicating that a grace period has elapsed. Therefore, if the initial
value of the counter is ``s``, the updater must wait until the counter
reaches at least the value ``(s+3)&~0x1``. This counter is managed by
the following access functions:

#. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited
   grace period.
#. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace
   period.
#. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter.
#. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited
   grace period has elapsed since the corresponding call to
   ``rcu_exp_gp_seq_snap()``.

Again, only one request in a given batch need actually carry out a
grace-period operation, which means there must be an efficient way to
identify which of many concurrent requests will initiate the grace
period, and that there be an efficient way for the remaining requests to
wait for that grace period to complete. However, that is the topic of
the next section.

Funnel Locking and Wait/Wakeup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The natural way to sort out which of a batch of updaters will initiate
the expedited grace period is to use the ``rcu_node`` combining tree, as
implemented by the ``exp_funnel_lock()`` function. The first updater
corresponding to a given grace period arriving at a given ``rcu_node``
structure records its desired grace-period sequence number in the
``->exp_seq_rq`` field and moves up to the next level in the tree.
Otherwise, if the ``->exp_seq_rq`` field already contains the sequence
number for the desired grace period or some later one, the updater
blocks on one of four wait queues in the ``->exp_wq[]`` array, using the
second-from-bottom and third-from-bottom bits as an index. An
``->exp_lock`` field in the ``rcu_node`` structure synchronizes access
to these fields.

An empty ``rcu_node`` tree is shown in the following diagram, with the
white cells representing the ``->exp_seq_rq`` field and the red cells
representing the elements of the ``->exp_wq[]`` array.

.. kernel-figure:: Funnel0.svg

The next diagram shows the situation after the arrival of Task A and
Task B at the leftmost and rightmost leaf ``rcu_node`` structures,
respectively. The current value of the ``rcu_state`` structure's
``->expedited_sequence`` field is zero, so adding three and clearing the
bottom bit results in the value two, which both tasks record in the
``->exp_seq_rq`` field of their respective ``rcu_node`` structures:

.. kernel-figure:: Funnel1.svg

Each of Tasks A and B will move up to the root ``rcu_node`` structure.
Suppose that Task A wins, recording its desired grace-period sequence
number and resulting in the state shown below:

.. kernel-figure:: Funnel2.svg

Task A now advances to initiate a new grace period, while Task B moves
up to the root ``rcu_node`` structure, and, seeing that its desired
sequence number is already recorded, blocks on ``->exp_wq[1]``.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why ``->exp_wq[1]``? Given that the value of these tasks' desired     |
| sequence number is two, shouldn't they instead block on               |
| ``->exp_wq[2]``?                                                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No.                                                                   |
| Recall that the bottom bit of the desired sequence number indicates   |
| whether or not a grace period is currently in progress. It is         |
| therefore necessary to shift the sequence number right one bit        |
| position to obtain the number of the grace period. This results in    |
| ``->exp_wq[1]``.                                                      |
+-----------------------------------------------------------------------+

If Tasks C and D also arrive at this point, they will compute the same
desired grace-period sequence number, and see that both leaf
``rcu_node`` structures already have that value recorded. They will
therefore block on their respective ``rcu_node`` structures'
``->exp_wq[1]`` fields, as shown below:

.. kernel-figure:: Funnel3.svg

Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and
initiates the grace period, which increments ``->expedited_sequence``.
Therefore, if Tasks E and F arrive, they will compute a desired sequence
number of 4 and will record this value as shown below:

.. kernel-figure:: Funnel4.svg

Tasks E and F will propagate up the ``rcu_node`` combining tree, with
Task F blocking on the root ``rcu_node`` structure and Task E waiting
for Task A to finish so that it can start the next grace period. The
resulting state is as shown below:

.. kernel-figure:: Funnel5.svg

Once the grace period completes, Task A starts waking up the tasks
waiting for this grace period to complete, increments the
``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then
releases the ``->exp_mutex``. This results in the following state:

.. kernel-figure:: Funnel6.svg

Task E can then acquire ``->exp_mutex`` and increment
``->expedited_sequence`` to the value three. If new tasks G and H arrive
and move up the combining tree at the same time, the state will be as
follows:

.. kernel-figure:: Funnel7.svg

Note that three of the root ``rcu_node`` structure's waitqueues are now
occupied.
However, at some point, Task A will wake up the tasks blocked 381ccc9971eSMauro Carvalho Chehabon the ``->exp_wq`` waitqueues, resulting in the following state: 382ccc9971eSMauro Carvalho Chehab 383ccc9971eSMauro Carvalho Chehab.. kernel-figure:: Funnel8.svg 384ccc9971eSMauro Carvalho Chehab 385ccc9971eSMauro Carvalho ChehabExecution will continue with Tasks E and H completing their grace 386ccc9971eSMauro Carvalho Chehabperiods and carrying out their wakeups. 387ccc9971eSMauro Carvalho Chehab 388ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 389ccc9971eSMauro Carvalho Chehab| **Quick Quiz**: | 390ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 391ccc9971eSMauro Carvalho Chehab| What happens if Task A takes so long to do its wakeups that Task E's | 392ccc9971eSMauro Carvalho Chehab| grace period completes? | 393ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 394ccc9971eSMauro Carvalho Chehab| **Answer**: | 395ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 396ccc9971eSMauro Carvalho Chehab| Then Task E will block on the ``->exp_wake_mutex``, which will also | 397ccc9971eSMauro Carvalho Chehab| prevent it from releasing ``->exp_mutex``, which in turn will prevent | 398ccc9971eSMauro Carvalho Chehab| the next grace period from starting. This last is important in | 399ccc9971eSMauro Carvalho Chehab| preventing overflow of the ``->exp_wq[]`` array. 
+-----------------------------------------------------------------------+

Use of Workqueues
~~~~~~~~~~~~~~~~~

In earlier implementations, the task requesting the expedited grace
period also drove it to completion. This straightforward approach had
the disadvantage of needing to account for POSIX signals sent to user
tasks, so more recent implementations use the Linux kernel's
workqueues (see Documentation/core-api/workqueue.rst).

The requesting task still does counter snapshotting and funnel-lock
processing, but the task reaching the top of the funnel lock does a
``schedule_work()`` (from ``_synchronize_rcu_expedited()``) so that a
workqueue kthread does the actual grace-period processing. Because
workqueue kthreads do not accept POSIX signals, grace-period-wait
processing need not allow for POSIX signals. In addition, this approach
allows wakeups for the previous expedited grace period to be overlapped
with processing for the next expedited grace period. Because there are
only four sets of waitqueues, it is necessary to ensure that the
previous grace period's wakeups complete before the next grace period's
wakeups start.
This is handled by having the ``->exp_mutex`` guard
expedited grace-period processing and the ``->exp_wake_mutex`` guard
wakeups. The key point is that the ``->exp_mutex`` is not released until
the first wakeup is complete, which means that the ``->exp_wake_mutex``
has already been acquired at that point. This approach ensures that the
previous grace period's wakeups can be carried out while the current
grace period is in progress, but that these wakeups will complete before
the next grace period starts. This means that only three waitqueues are
required, guaranteeing that the four that are provided are sufficient.

Stall Warnings
~~~~~~~~~~~~~~

Expediting grace periods does nothing to speed things up when RCU
readers take too long, and therefore expedited grace periods check for
stalls just as normal grace periods do.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But why not just let the normal grace-period machinery detect the     |
| stalls, given that a given reader must block both normal and          |
| expedited grace periods?                                              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because it is quite possible that at a given time there is no normal  |
| grace period in progress, in which case the normal grace period       |
| cannot emit a stall warning.                                          |
+-----------------------------------------------------------------------+

The ``synchronize_sched_expedited_wait()`` function loops waiting for
the expedited grace period to end, but with a timeout set to the current
RCU CPU stall-warning time. If this time is exceeded, any CPUs or
``rcu_node`` structures blocking the current grace period are printed.
Each stall warning results in another pass through the loop, but the
second and subsequent passes use longer stall times.

Mid-boot operation
~~~~~~~~~~~~~~~~~~

The use of workqueues has the advantage that the expedited grace-period
code need not worry about POSIX signals. Unfortunately, it has the
corresponding disadvantage that workqueues cannot be used until they are
initialized, which does not happen until some time after the scheduler
spawns the first task.
Given that there are parts of the kernel that
really do want to execute grace periods during this mid-boot “dead
zone”, expedited grace periods must do something else during this time.

What they do is fall back to the old practice of requiring that the
requesting task drive the expedited grace period, as was the case before
the use of workqueues. However, the requesting task is only required to
drive the grace period during the mid-boot dead zone. Before mid-boot, a
synchronous grace period is a no-op. Some time after mid-boot,
workqueues are used.

Non-expedited non-SRCU synchronous grace periods must also operate
normally during mid-boot. This is handled by causing non-expedited grace
periods to take the expedited code path during mid-boot.

The current code assumes that there are no POSIX signals during the
mid-boot dead zone. However, if an overwhelming need for POSIX signals
somehow arises, appropriate adjustments can be made to the expedited
stall-warning code. One such adjustment would reinstate the
pre-workqueue stall-warning checks, but only during the mid-boot dead
zone.

With this refinement, synchronous grace periods can now be used from
task context pretty much any time during the life of the kernel.
That
is, aside from some points in the suspend, hibernate, or shutdown code
paths.

Summary
~~~~~~~

Expedited grace periods use a sequence-number approach to promote
batching, so that a single grace-period operation can serve numerous
requests. A funnel lock is used to efficiently identify the one task out
of a concurrent group that will request the grace period. All members of
the group will block on waitqueues provided in the ``rcu_node``
structure. The actual grace-period processing is carried out by a
workqueue.

CPU-hotplug operations are noted lazily in order to prevent the need for
tight synchronization between expedited grace periods and CPU-hotplug
operations. The dyntick-idle counters are used to avoid sending IPIs to
idle CPUs, at least in the common case. RCU-preempt and RCU-sched use
different IPI handlers and different code to respond to the state
changes carried out by those handlers, but otherwise use common code.

Quiescent states are tracked using the ``rcu_node`` tree, and once all
necessary quiescent states have been reported, all tasks waiting on this
expedited grace period are awakened.
A pair of mutexes is used to allow
one grace period's wakeups to proceed concurrently with the next grace
period's processing.

This combination of mechanisms allows expedited grace periods to run
reasonably efficiently. However, for non-time-critical tasks, normal
grace periods should be used instead because their longer duration
permits much higher degrees of batching, and thus much lower per-request
overheads.