1ccc9971eSMauro Carvalho Chehab=================================================== 2ccc9971eSMauro Carvalho ChehabA Tour Through TREE_RCU's Data Structures [LWN.net] 3ccc9971eSMauro Carvalho Chehab=================================================== 4ccc9971eSMauro Carvalho Chehab 5ccc9971eSMauro Carvalho ChehabDecember 18, 2016 6ccc9971eSMauro Carvalho Chehab 7ccc9971eSMauro Carvalho ChehabThis article was contributed by Paul E. McKenney 8ccc9971eSMauro Carvalho Chehab 9ccc9971eSMauro Carvalho ChehabIntroduction 10ccc9971eSMauro Carvalho Chehab============ 11ccc9971eSMauro Carvalho Chehab 12ccc9971eSMauro Carvalho ChehabThis document describes RCU's major data structures and their relationship 13ccc9971eSMauro Carvalho Chehabto each other. 14ccc9971eSMauro Carvalho Chehab 15ccc9971eSMauro Carvalho ChehabData-Structure Relationships 16ccc9971eSMauro Carvalho Chehab============================ 17ccc9971eSMauro Carvalho Chehab 18ccc9971eSMauro Carvalho ChehabRCU is for all intents and purposes a large state machine, and its 19ccc9971eSMauro Carvalho Chehabdata structures maintain the state in such a way as to allow RCU readers 20ccc9971eSMauro Carvalho Chehabto execute extremely quickly, while also processing the RCU grace periods 21ccc9971eSMauro Carvalho Chehabrequested by updaters in an efficient and extremely scalable fashion. 22ccc9971eSMauro Carvalho ChehabThe efficiency and scalability of RCU updaters is provided primarily 23ccc9971eSMauro Carvalho Chehabby a combining tree, as shown below: 24ccc9971eSMauro Carvalho Chehab 25ccc9971eSMauro Carvalho Chehab.. kernel-figure:: BigTreeClassicRCU.svg 26ccc9971eSMauro Carvalho Chehab 27ccc9971eSMauro Carvalho ChehabThis diagram shows an enclosing ``rcu_state`` structure containing a tree 28ccc9971eSMauro Carvalho Chehabof ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up 29ccc9971eSMauro Carvalho Chehabto 16 ``rcu_data`` structures associated with it, so that there are 30ccc9971eSMauro Carvalho Chehab``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU. 31ccc9971eSMauro Carvalho ChehabThis structure is adjusted at boot time, if needed, to handle the common 32ccc9971eSMauro Carvalho Chehabcase where ``nr_cpu_ids`` is much less than ``NR_CPUs``. 33ccc9971eSMauro Carvalho ChehabFor example, a number of Linux distributions set ``NR_CPUs=4096``, 34ccc9971eSMauro Carvalho Chehabwhich results in a three-level ``rcu_node`` tree. 35ccc9971eSMauro Carvalho ChehabIf the actual hardware has only 16 CPUs, RCU will adjust itself 36ccc9971eSMauro Carvalho Chehabat boot time, resulting in an ``rcu_node`` tree with only a single node. 37ccc9971eSMauro Carvalho Chehab 38ccc9971eSMauro Carvalho ChehabThe purpose of this combining tree is to allow per-CPU events 39ccc9971eSMauro Carvalho Chehabsuch as quiescent states, dyntick-idle transitions, 40ccc9971eSMauro Carvalho Chehaband CPU hotplug operations to be processed efficiently 41ccc9971eSMauro Carvalho Chehaband scalably. 42ccc9971eSMauro Carvalho ChehabQuiescent states are recorded by the per-CPU ``rcu_data`` structures, 43ccc9971eSMauro Carvalho Chehaband other events are recorded by the leaf-level ``rcu_node`` 44ccc9971eSMauro Carvalho Chehabstructures. 45ccc9971eSMauro Carvalho ChehabAll of these events are combined at each level of the tree until finally 46ccc9971eSMauro Carvalho Chehabgrace periods are completed at the tree's root ``rcu_node`` 47ccc9971eSMauro Carvalho Chehabstructure. 48ccc9971eSMauro Carvalho ChehabA grace period can be completed at the root once every CPU 49ccc9971eSMauro Carvalho Chehab(or, in the case of ``CONFIG_PREEMPT_RCU``, task) 50ccc9971eSMauro Carvalho Chehabhas passed through a quiescent state. 51ccc9971eSMauro Carvalho ChehabOnce a grace period has completed, record of that fact is propagated 52ccc9971eSMauro Carvalho Chehabback down the tree. 53ccc9971eSMauro Carvalho Chehab 54ccc9971eSMauro Carvalho ChehabAs can be seen from the diagram, on a 64-bit system 55ccc9971eSMauro Carvalho Chehaba two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout 56ccc9971eSMauro Carvalho Chehabof 64 at the root and a fanout of 16 at the leaves. 57ccc9971eSMauro Carvalho Chehab 58ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 59ccc9971eSMauro Carvalho Chehab| **Quick Quiz**: | 60ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 61ccc9971eSMauro Carvalho Chehab| Why isn't the fanout at the leaves also 64? | 62ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 63ccc9971eSMauro Carvalho Chehab| **Answer**: | 64ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 65ccc9971eSMauro Carvalho Chehab| Because there are more types of events that affect the leaf-level | 66ccc9971eSMauro Carvalho Chehab| ``rcu_node`` structures than further up the tree. Therefore, if the | 67ccc9971eSMauro Carvalho Chehab| leaf ``rcu_node`` structures have fanout of 64, the contention on | 68ccc9971eSMauro Carvalho Chehab| these structures' ``->structures`` becomes excessive. Experimentation | 69ccc9971eSMauro Carvalho Chehab| on a wide variety of systems has shown that a fanout of 16 works well | 70ccc9971eSMauro Carvalho Chehab| for the leaves of the ``rcu_node`` tree. | 71ccc9971eSMauro Carvalho Chehab| | 72ccc9971eSMauro Carvalho Chehab| Of course, further experience with systems having hundreds or | 73ccc9971eSMauro Carvalho Chehab| thousands of CPUs may demonstrate that the fanout for the non-leaf | 74ccc9971eSMauro Carvalho Chehab| ``rcu_node`` structures must also be reduced. Such reduction can be | 75ccc9971eSMauro Carvalho Chehab| easily carried out when and if it proves necessary. In the meantime, | 76ccc9971eSMauro Carvalho Chehab| if you are using such a system and running into contention problems | 77ccc9971eSMauro Carvalho Chehab| on the non-leaf ``rcu_node`` structures, you may use the | 78ccc9971eSMauro Carvalho Chehab| ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the | 79ccc9971eSMauro Carvalho Chehab| non-leaf fanout as needed. | 80ccc9971eSMauro Carvalho Chehab| | 81ccc9971eSMauro Carvalho Chehab| Kernels built for systems with strong NUMA characteristics might | 82ccc9971eSMauro Carvalho Chehab| also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of | 83ccc9971eSMauro Carvalho Chehab| the ``rcu_node`` structures align with hardware boundaries. | 84ccc9971eSMauro Carvalho Chehab| However, there has thus far been no need for this. | 85ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 86ccc9971eSMauro Carvalho Chehab 87ccc9971eSMauro Carvalho ChehabIf your system has more than 1,024 CPUs (or more than 512 CPUs on a 88ccc9971eSMauro Carvalho Chehab32-bit system), then RCU will automatically add more levels to the tree. 89ccc9971eSMauro Carvalho ChehabFor example, if you are crazy enough to build a 64-bit system with 90ccc9971eSMauro Carvalho Chehab65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows: 91ccc9971eSMauro Carvalho Chehab 92ccc9971eSMauro Carvalho Chehab.. kernel-figure:: HugeTreeClassicRCU.svg 93ccc9971eSMauro Carvalho Chehab 94ccc9971eSMauro Carvalho ChehabRCU currently permits up to a four-level tree, which on a 64-bit system 95ccc9971eSMauro Carvalho Chehabaccommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for 96ccc9971eSMauro Carvalho Chehab32-bit systems. On the other hand, you can set both 97ccc9971eSMauro Carvalho Chehab``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as 98ccc9971eSMauro Carvalho Chehab2, which would result in a 16-CPU test using a 4-level tree. This can be 99ccc9971eSMauro Carvalho Chehabuseful for testing large-system capabilities on small test machines. 100ccc9971eSMauro Carvalho Chehab 101ccc9971eSMauro Carvalho ChehabThis multi-level combining tree allows us to get most of the performance 102ccc9971eSMauro Carvalho Chehaband scalability benefits of partitioning, even though RCU grace-period 103ccc9971eSMauro Carvalho Chehabdetection is inherently a global operation. The trick here is that only 104ccc9971eSMauro Carvalho Chehabthe last CPU to report a quiescent state into a given ``rcu_node`` 105ccc9971eSMauro Carvalho Chehabstructure need advance to the ``rcu_node`` structure at the next level 106ccc9971eSMauro Carvalho Chehabup the tree. This means that at the leaf-level ``rcu_node`` structure, 107ccc9971eSMauro Carvalho Chehabonly one access out of sixteen will progress up the tree. For the 108ccc9971eSMauro Carvalho Chehabinternal ``rcu_node`` structures, the situation is even more extreme: 109ccc9971eSMauro Carvalho ChehabOnly one access out of sixty-four will progress up the tree. Because the 110ccc9971eSMauro Carvalho Chehabvast majority of the CPUs do not progress up the tree, the lock 111ccc9971eSMauro Carvalho Chehabcontention remains roughly constant up the tree. No matter how many CPUs 112ccc9971eSMauro Carvalho Chehabthere are in the system, at most 64 quiescent-state reports per grace 113ccc9971eSMauro Carvalho Chehabperiod will progress all the way to the root ``rcu_node`` structure, 114ccc9971eSMauro Carvalho Chehabthus ensuring that the lock contention on that root ``rcu_node`` 115ccc9971eSMauro Carvalho Chehabstructure remains acceptably low. 116ccc9971eSMauro Carvalho Chehab 117ccc9971eSMauro Carvalho ChehabIn effect, the combining tree acts like a big shock absorber, keeping 118ccc9971eSMauro Carvalho Chehablock contention under control at all tree levels regardless of the level 119ccc9971eSMauro Carvalho Chehabof loading on the system. 120ccc9971eSMauro Carvalho Chehab 121ccc9971eSMauro Carvalho ChehabRCU updaters wait for normal grace periods by registering RCU callbacks, 122ccc9971eSMauro Carvalho Chehabeither directly via ``call_rcu()`` or indirectly via 123ccc9971eSMauro Carvalho Chehab``synchronize_rcu()`` and friends. RCU callbacks are represented by 124ccc9971eSMauro Carvalho Chehab``rcu_head`` structures, which are queued on ``rcu_data`` structures 125ccc9971eSMauro Carvalho Chehabwhile they are waiting for a grace period to elapse, as shown in the 126ccc9971eSMauro Carvalho Chehabfollowing figure: 127ccc9971eSMauro Carvalho Chehab 128ccc9971eSMauro Carvalho Chehab.. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg 129ccc9971eSMauro Carvalho Chehab 130ccc9971eSMauro Carvalho ChehabThis figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data 131ccc9971eSMauro Carvalho Chehabstructures are related. Lesser data structures will be introduced with 132ccc9971eSMauro Carvalho Chehabthe algorithms that make use of them. 133ccc9971eSMauro Carvalho Chehab 134ccc9971eSMauro Carvalho ChehabNote that each of the data structures in the above figure has its own 135ccc9971eSMauro Carvalho Chehabsynchronization: 136ccc9971eSMauro Carvalho Chehab 137ccc9971eSMauro Carvalho Chehab#. Each ``rcu_state`` structures has a lock and a mutex, and some fields 138ccc9971eSMauro Carvalho Chehab are protected by the corresponding root ``rcu_node`` structure's lock. 139ccc9971eSMauro Carvalho Chehab#. Each ``rcu_node`` structure has a spinlock. 140ccc9971eSMauro Carvalho Chehab#. The fields in ``rcu_data`` are private to the corresponding CPU, 141ccc9971eSMauro Carvalho Chehab although a few can be read and written by other CPUs. 142ccc9971eSMauro Carvalho Chehab 143ccc9971eSMauro Carvalho ChehabIt is important to note that different data structures can have very 144ccc9971eSMauro Carvalho Chehabdifferent ideas about the state of RCU at any given time. For but one 145ccc9971eSMauro Carvalho Chehabexample, awareness of the start or end of a given RCU grace period 146ccc9971eSMauro Carvalho Chehabpropagates slowly through the data structures. This slow propagation is 147ccc9971eSMauro Carvalho Chehababsolutely necessary for RCU to have good read-side performance. If this 148ccc9971eSMauro Carvalho Chehabbalkanized implementation seems foreign to you, one useful trick is to 149ccc9971eSMauro Carvalho Chehabconsider each instance of these data structures to be a different 150ccc9971eSMauro Carvalho Chehabperson, each having the usual slightly different view of reality. 151ccc9971eSMauro Carvalho Chehab 152ccc9971eSMauro Carvalho ChehabThe general role of each of these data structures is as follows: 153ccc9971eSMauro Carvalho Chehab 154ccc9971eSMauro Carvalho Chehab#. ``rcu_state``: This structure forms the interconnection between the 155ccc9971eSMauro Carvalho Chehab ``rcu_node`` and ``rcu_data`` structures, tracks grace periods, 156ccc9971eSMauro Carvalho Chehab serves as short-term repository for callbacks orphaned by CPU-hotplug 157ccc9971eSMauro Carvalho Chehab events, maintains ``rcu_barrier()`` state, tracks expedited 158ccc9971eSMauro Carvalho Chehab grace-period state, and maintains state used to force quiescent 159ccc9971eSMauro Carvalho Chehab states when grace periods extend too long, 160ccc9971eSMauro Carvalho Chehab#. ``rcu_node``: This structure forms the combining tree that propagates 161ccc9971eSMauro Carvalho Chehab quiescent-state information from the leaves to the root, and also 162ccc9971eSMauro Carvalho Chehab propagates grace-period information from the root to the leaves. It 163ccc9971eSMauro Carvalho Chehab provides local copies of the grace-period state in order to allow 164ccc9971eSMauro Carvalho Chehab this information to be accessed in a synchronized manner without 165ccc9971eSMauro Carvalho Chehab suffering the scalability limitations that would otherwise be imposed 166ccc9971eSMauro Carvalho Chehab by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the 167ccc9971eSMauro Carvalho Chehab lists of tasks that have blocked while in their current RCU read-side 168ccc9971eSMauro Carvalho Chehab critical section. In ``CONFIG_PREEMPT_RCU`` with 169ccc9971eSMauro Carvalho Chehab ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node`` 170ccc9971eSMauro Carvalho Chehab priority-boosting kernel threads (kthreads) and state. Finally, it 171ccc9971eSMauro Carvalho Chehab records CPU-hotplug state in order to determine which CPUs should be 172ccc9971eSMauro Carvalho Chehab ignored during a given grace period. 173ccc9971eSMauro Carvalho Chehab#. ``rcu_data``: This per-CPU structure is the focus of quiescent-state 174ccc9971eSMauro Carvalho Chehab detection and RCU callback queuing. It also tracks its relationship 175ccc9971eSMauro Carvalho Chehab to the corresponding leaf ``rcu_node`` structure to allow 176ccc9971eSMauro Carvalho Chehab more-efficient propagation of quiescent states up the ``rcu_node`` 177ccc9971eSMauro Carvalho Chehab combining tree. Like the ``rcu_node`` structure, it provides a local 178ccc9971eSMauro Carvalho Chehab copy of the grace-period information to allow for-free synchronized 179ccc9971eSMauro Carvalho Chehab access to this information from the corresponding CPU. Finally, this 180ccc9971eSMauro Carvalho Chehab structure records past dyntick-idle state for the corresponding CPU 181ccc9971eSMauro Carvalho Chehab and also tracks statistics. 182ccc9971eSMauro Carvalho Chehab#. ``rcu_head``: This structure represents RCU callbacks, and is the 183ccc9971eSMauro Carvalho Chehab only structure allocated and managed by RCU users. The ``rcu_head`` 184ccc9971eSMauro Carvalho Chehab structure is normally embedded within the RCU-protected data 185ccc9971eSMauro Carvalho Chehab structure. 186ccc9971eSMauro Carvalho Chehab 187ccc9971eSMauro Carvalho ChehabIf all you wanted from this article was a general notion of how RCU's 188ccc9971eSMauro Carvalho Chehabdata structures are related, you are done. Otherwise, each of the 189ccc9971eSMauro Carvalho Chehabfollowing sections give more details on the ``rcu_state``, ``rcu_node`` 190ccc9971eSMauro Carvalho Chehaband ``rcu_data`` data structures. 191ccc9971eSMauro Carvalho Chehab 192ccc9971eSMauro Carvalho ChehabThe ``rcu_state`` Structure 193ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~~ 194ccc9971eSMauro Carvalho Chehab 195ccc9971eSMauro Carvalho ChehabThe ``rcu_state`` structure is the base structure that represents the 196ccc9971eSMauro Carvalho Chehabstate of RCU in the system. This structure forms the interconnection 197ccc9971eSMauro Carvalho Chehabbetween the ``rcu_node`` and ``rcu_data`` structures, tracks grace 198ccc9971eSMauro Carvalho Chehabperiods, contains the lock used to synchronize with CPU-hotplug events, 199ccc9971eSMauro Carvalho Chehaband maintains state used to force quiescent states when grace periods 200ccc9971eSMauro Carvalho Chehabextend too long, 201ccc9971eSMauro Carvalho Chehab 202ccc9971eSMauro Carvalho ChehabA few of the ``rcu_state`` structure's fields are discussed, singly and 203ccc9971eSMauro Carvalho Chehabin groups, in the following sections. The more specialized fields are 204ccc9971eSMauro Carvalho Chehabcovered in the discussion of their use. 205ccc9971eSMauro Carvalho Chehab 206ccc9971eSMauro Carvalho ChehabRelationship to rcu_node and rcu_data Structures 207ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''''''''''''''''''''''''''''' 208ccc9971eSMauro Carvalho Chehab 209ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_state`` structure is declared as follows: 210ccc9971eSMauro Carvalho Chehab 211ccc9971eSMauro Carvalho Chehab:: 212ccc9971eSMauro Carvalho Chehab 213ccc9971eSMauro Carvalho Chehab 1 struct rcu_node node[NUM_RCU_NODES]; 214ccc9971eSMauro Carvalho Chehab 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; 215ccc9971eSMauro Carvalho Chehab 3 struct rcu_data __percpu *rda; 216ccc9971eSMauro Carvalho Chehab 217ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 218ccc9971eSMauro Carvalho Chehab| **Quick Quiz**: | 219ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 220ccc9971eSMauro Carvalho Chehab| Wait a minute! You said that the ``rcu_node`` structures formed a | 221ccc9971eSMauro Carvalho Chehab| tree, but they are declared as a flat array! What gives? | 222ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 223ccc9971eSMauro Carvalho Chehab| **Answer**: | 224ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 225ccc9971eSMauro Carvalho Chehab| The tree is laid out in the array. The first node In the array is the | 226ccc9971eSMauro Carvalho Chehab| head, the next set of nodes in the array are children of the head | 227ccc9971eSMauro Carvalho Chehab| node, and so on until the last set of nodes in the array are the | 228ccc9971eSMauro Carvalho Chehab| leaves. | 229ccc9971eSMauro Carvalho Chehab| See the following diagrams to see how this works. | 230ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 231ccc9971eSMauro Carvalho Chehab 232ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` tree is embedded into the ``->node[]`` array as shown 233ccc9971eSMauro Carvalho Chehabin the following figure: 234ccc9971eSMauro Carvalho Chehab 235ccc9971eSMauro Carvalho Chehab.. kernel-figure:: TreeMapping.svg 236ccc9971eSMauro Carvalho Chehab 237ccc9971eSMauro Carvalho ChehabOne interesting consequence of this mapping is that a breadth-first 238ccc9971eSMauro Carvalho Chehabtraversal of the tree is implemented as a simple linear scan of the 239ccc9971eSMauro Carvalho Chehabarray, which is in fact what the ``rcu_for_each_node_breadth_first()`` 240ccc9971eSMauro Carvalho Chehabmacro does. This macro is used at the beginning and ends of grace 241ccc9971eSMauro Carvalho Chehabperiods. 242ccc9971eSMauro Carvalho Chehab 243ccc9971eSMauro Carvalho ChehabEach entry of the ``->level`` array references the first ``rcu_node`` 244ccc9971eSMauro Carvalho Chehabstructure on the corresponding level of the tree, for example, as shown 245ccc9971eSMauro Carvalho Chehabbelow: 246ccc9971eSMauro Carvalho Chehab 247ccc9971eSMauro Carvalho Chehab.. kernel-figure:: TreeMappingLevel.svg 248ccc9971eSMauro Carvalho Chehab 249ccc9971eSMauro Carvalho ChehabThe zero\ :sup:`th` element of the array references the root 250ccc9971eSMauro Carvalho Chehab``rcu_node`` structure, the first element references the first child of 251ccc9971eSMauro Carvalho Chehabthe root ``rcu_node``, and finally the second element references the 252ccc9971eSMauro Carvalho Chehabfirst leaf ``rcu_node`` structure. 253ccc9971eSMauro Carvalho Chehab 254ccc9971eSMauro Carvalho ChehabFor whatever it is worth, if you draw the tree to be tree-shaped rather 255ccc9971eSMauro Carvalho Chehabthan array-shaped, it is easy to draw a planar representation: 256ccc9971eSMauro Carvalho Chehab 257ccc9971eSMauro Carvalho Chehab.. kernel-figure:: TreeLevel.svg 258ccc9971eSMauro Carvalho Chehab 259ccc9971eSMauro Carvalho ChehabFinally, the ``->rda`` field references a per-CPU pointer to the 260ccc9971eSMauro Carvalho Chehabcorresponding CPU's ``rcu_data`` structure. 261ccc9971eSMauro Carvalho Chehab 262ccc9971eSMauro Carvalho ChehabAll of these fields are constant once initialization is complete, and 263ccc9971eSMauro Carvalho Chehabtherefore need no protection. 264ccc9971eSMauro Carvalho Chehab 265ccc9971eSMauro Carvalho ChehabGrace-Period Tracking 266ccc9971eSMauro Carvalho Chehab''''''''''''''''''''' 267ccc9971eSMauro Carvalho Chehab 268ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_state`` structure is declared as follows: 269ccc9971eSMauro Carvalho Chehab 270ccc9971eSMauro Carvalho Chehab:: 271ccc9971eSMauro Carvalho Chehab 272ccc9971eSMauro Carvalho Chehab 1 unsigned long gp_seq; 273ccc9971eSMauro Carvalho Chehab 274ccc9971eSMauro Carvalho ChehabRCU grace periods are numbered, and the ``->gp_seq`` field contains the 275ccc9971eSMauro Carvalho Chehabcurrent grace-period sequence number. The bottom two bits are the state 276ccc9971eSMauro Carvalho Chehabof the current grace period, which can be zero for not yet started or 277ccc9971eSMauro Carvalho Chehabone for in progress. In other words, if the bottom two bits of 278ccc9971eSMauro Carvalho Chehab``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom 279ccc9971eSMauro Carvalho Chehabtwo bits indicates that something is broken. This field is protected by 280ccc9971eSMauro Carvalho Chehabthe root ``rcu_node`` structure's ``->lock`` field. 281ccc9971eSMauro Carvalho Chehab 282ccc9971eSMauro Carvalho ChehabThere are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data`` 283ccc9971eSMauro Carvalho Chehabstructures as well. The fields in the ``rcu_state`` structure represent 284ccc9971eSMauro Carvalho Chehabthe most current value, and those of the other structures are compared 285ccc9971eSMauro Carvalho Chehabin order to detect the beginnings and ends of grace periods in a 286ccc9971eSMauro Carvalho Chehabdistributed fashion. The values flow from ``rcu_state`` to ``rcu_node`` 287ccc9971eSMauro Carvalho Chehab(down the tree from the root to the leaves) to ``rcu_data``. 288ccc9971eSMauro Carvalho Chehab 289ccc9971eSMauro Carvalho ChehabMiscellaneous 290ccc9971eSMauro Carvalho Chehab''''''''''''' 291ccc9971eSMauro Carvalho Chehab 292ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_state`` structure is declared as follows: 293ccc9971eSMauro Carvalho Chehab 294ccc9971eSMauro Carvalho Chehab:: 295ccc9971eSMauro Carvalho Chehab 296ccc9971eSMauro Carvalho Chehab 1 unsigned long gp_max; 297ccc9971eSMauro Carvalho Chehab 2 char abbr; 298ccc9971eSMauro Carvalho Chehab 3 char *name; 299ccc9971eSMauro Carvalho Chehab 300ccc9971eSMauro Carvalho ChehabThe ``->gp_max`` field tracks the duration of the longest grace period 301ccc9971eSMauro Carvalho Chehabin jiffies. It is protected by the root ``rcu_node``'s ``->lock``. 302ccc9971eSMauro Carvalho Chehab 303ccc9971eSMauro Carvalho ChehabThe ``->name`` and ``->abbr`` fields distinguish between preemptible RCU 304ccc9971eSMauro Carvalho Chehab(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”). 305ccc9971eSMauro Carvalho ChehabThese fields are used for diagnostic and tracing purposes. 306ccc9971eSMauro Carvalho Chehab 307ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` Structure 308ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~ 309ccc9971eSMauro Carvalho Chehab 310ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` structures form the combining tree that propagates 311ccc9971eSMauro Carvalho Chehabquiescent-state information from the leaves to the root and also that 312ccc9971eSMauro Carvalho Chehabpropagates grace-period information from the root down to the leaves. 313ccc9971eSMauro Carvalho ChehabThey provides local copies of the grace-period state in order to allow 314ccc9971eSMauro Carvalho Chehabthis information to be accessed in a synchronized manner without 315ccc9971eSMauro Carvalho Chehabsuffering the scalability limitations that would otherwise be imposed by 316ccc9971eSMauro Carvalho Chehabglobal locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists 317ccc9971eSMauro Carvalho Chehabof tasks that have blocked while in their current RCU read-side critical 318ccc9971eSMauro Carvalho Chehabsection. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they 319ccc9971eSMauro Carvalho Chehabmanage the per-\ ``rcu_node`` priority-boosting kernel threads 320ccc9971eSMauro Carvalho Chehab(kthreads) and state. Finally, they record CPU-hotplug state in order to 321ccc9971eSMauro Carvalho Chehabdetermine which CPUs should be ignored during a given grace period. 322ccc9971eSMauro Carvalho Chehab 323ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` structure's fields are discussed, singly and in groups, 324ccc9971eSMauro Carvalho Chehabin the following sections. 325ccc9971eSMauro Carvalho Chehab 326ccc9971eSMauro Carvalho ChehabConnection to Combining Tree 327ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''''''''' 328ccc9971eSMauro Carvalho Chehab 329ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_node`` structure is declared as follows: 330ccc9971eSMauro Carvalho Chehab 331ccc9971eSMauro Carvalho Chehab:: 332ccc9971eSMauro Carvalho Chehab 333ccc9971eSMauro Carvalho Chehab 1 struct rcu_node *parent; 334ccc9971eSMauro Carvalho Chehab 2 u8 level; 335ccc9971eSMauro Carvalho Chehab 3 u8 grpnum; 336ccc9971eSMauro Carvalho Chehab 4 unsigned long grpmask; 337ccc9971eSMauro Carvalho Chehab 5 int grplo; 338ccc9971eSMauro Carvalho Chehab 6 int grphi; 339ccc9971eSMauro Carvalho Chehab 340ccc9971eSMauro Carvalho ChehabThe ``->parent`` pointer references the ``rcu_node`` one level up in the 341ccc9971eSMauro Carvalho Chehabtree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation 342ccc9971eSMauro Carvalho Chehabmakes heavy use of this field to push quiescent states up the tree. The 343ccc9971eSMauro Carvalho Chehab``->level`` field gives the level in the tree, with the root being at 344ccc9971eSMauro Carvalho Chehablevel zero, its children at level one, and so on. The ``->grpnum`` field 345ccc9971eSMauro Carvalho Chehabgives this node's position within the children of its parent, so this 346ccc9971eSMauro Carvalho Chehabnumber can range between 0 and 31 on 32-bit systems and between 0 and 63 347ccc9971eSMauro Carvalho Chehabon 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only 348ccc9971eSMauro Carvalho Chehabduring initialization and for tracing. The ``->grpmask`` field is the 349ccc9971eSMauro Carvalho Chehabbitmask counterpart of ``->grpnum``, and therefore always has exactly 350ccc9971eSMauro Carvalho Chehabone bit set. This mask is used to clear the bit corresponding to this 351ccc9971eSMauro Carvalho Chehab``rcu_node`` structure in its parent's bitmasks, which are described 352ccc9971eSMauro Carvalho Chehablater. Finally, the ``->grplo`` and ``->grphi`` fields contain the 353ccc9971eSMauro Carvalho Chehablowest and highest numbered CPU served by this ``rcu_node`` structure, 354ccc9971eSMauro Carvalho Chehabrespectively. 355ccc9971eSMauro Carvalho Chehab 356ccc9971eSMauro Carvalho ChehabAll of these fields are constant, and thus do not require any 357ccc9971eSMauro Carvalho Chehabsynchronization. 358ccc9971eSMauro Carvalho Chehab 359ccc9971eSMauro Carvalho ChehabSynchronization 360ccc9971eSMauro Carvalho Chehab''''''''''''''' 361ccc9971eSMauro Carvalho Chehab 362ccc9971eSMauro Carvalho ChehabThis field of the ``rcu_node`` structure is declared as follows: 363ccc9971eSMauro Carvalho Chehab 364ccc9971eSMauro Carvalho Chehab:: 365ccc9971eSMauro Carvalho Chehab 366ccc9971eSMauro Carvalho Chehab 1 raw_spinlock_t lock; 367ccc9971eSMauro Carvalho Chehab 368ccc9971eSMauro Carvalho ChehabThis field is used to protect the remaining fields in this structure, 369ccc9971eSMauro Carvalho Chehabunless otherwise stated. That said, all of the fields in this structure 370ccc9971eSMauro Carvalho Chehabcan be accessed without locking for tracing purposes. Yes, this can 371ccc9971eSMauro Carvalho Chehabresult in confusing traces, but better some tracing confusion than to be 372ccc9971eSMauro Carvalho Chehabheisenbugged out of existence. 373ccc9971eSMauro Carvalho Chehab 374ccc9971eSMauro Carvalho Chehab.. _grace-period-tracking-1: 375ccc9971eSMauro Carvalho Chehab 376ccc9971eSMauro Carvalho ChehabGrace-Period Tracking 377ccc9971eSMauro Carvalho Chehab''''''''''''''''''''' 378ccc9971eSMauro Carvalho Chehab 379ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_node`` structure is declared as follows: 380ccc9971eSMauro Carvalho Chehab 381ccc9971eSMauro Carvalho Chehab:: 382ccc9971eSMauro Carvalho Chehab 383ccc9971eSMauro Carvalho Chehab 1 unsigned long gp_seq; 384ccc9971eSMauro Carvalho Chehab 2 unsigned long gp_seq_needed; 385ccc9971eSMauro Carvalho Chehab 386ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of 387ccc9971eSMauro Carvalho Chehabthe field of the same name in the ``rcu_state`` structure. They each may 388ccc9971eSMauro Carvalho Chehablag up to one step behind their ``rcu_state`` counterpart. If the bottom 389ccc9971eSMauro Carvalho Chehabtwo bits of a given ``rcu_node`` structure's ``->gp_seq`` field is zero, 390ccc9971eSMauro Carvalho Chehabthen this ``rcu_node`` structure believes that RCU is idle. 391ccc9971eSMauro Carvalho Chehab 392ccc9971eSMauro Carvalho ChehabThe ``>gp_seq`` field of each ``rcu_node`` structure is updated at the 393ccc9971eSMauro Carvalho Chehabbeginning and the end of each grace period. 394ccc9971eSMauro Carvalho Chehab 395ccc9971eSMauro Carvalho ChehabThe ``->gp_seq_needed`` fields record the furthest-in-the-future grace 396ccc9971eSMauro Carvalho Chehabperiod request seen by the corresponding ``rcu_node`` structure. The 397ccc9971eSMauro Carvalho Chehabrequest is considered fulfilled when the value of the ``->gp_seq`` field 398ccc9971eSMauro Carvalho Chehabequals or exceeds that of the ``->gp_seq_needed`` field. 399ccc9971eSMauro Carvalho Chehab 400ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 401ccc9971eSMauro Carvalho Chehab| **Quick Quiz**: | 402ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 403ccc9971eSMauro Carvalho Chehab| Suppose that this ``rcu_node`` structure doesn't see a request for a | 404ccc9971eSMauro Carvalho Chehab| very long time. Won't wrapping of the ``->gp_seq`` field cause | 405ccc9971eSMauro Carvalho Chehab| problems? | 406ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 407ccc9971eSMauro Carvalho Chehab| **Answer**: | 408ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 409ccc9971eSMauro Carvalho Chehab| No, because if the ``->gp_seq_needed`` field lags behind the | 410ccc9971eSMauro Carvalho Chehab| ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at | 411ccc9971eSMauro Carvalho Chehab| the end of the grace period. Modulo-arithmetic comparisons therefore | 412ccc9971eSMauro Carvalho Chehab| will always get the correct answer, even with wrapping. | 413ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 414ccc9971eSMauro Carvalho Chehab 415ccc9971eSMauro Carvalho ChehabQuiescent-State Tracking 416ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''''' 417ccc9971eSMauro Carvalho Chehab 418ccc9971eSMauro Carvalho ChehabThese fields manage the propagation of quiescent states up the combining 419ccc9971eSMauro Carvalho Chehabtree. 420ccc9971eSMauro Carvalho Chehab 421ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_node`` structure has fields as follows: 422ccc9971eSMauro Carvalho Chehab 423ccc9971eSMauro Carvalho Chehab:: 424ccc9971eSMauro Carvalho Chehab 425ccc9971eSMauro Carvalho Chehab 1 unsigned long qsmask; 426ccc9971eSMauro Carvalho Chehab 2 unsigned long expmask; 427ccc9971eSMauro Carvalho Chehab 3 unsigned long qsmaskinit; 428ccc9971eSMauro Carvalho Chehab 4 unsigned long expmaskinit; 429ccc9971eSMauro Carvalho Chehab 430ccc9971eSMauro Carvalho ChehabThe ``->qsmask`` field tracks which of this ``rcu_node`` structure's 431ccc9971eSMauro Carvalho Chehabchildren still need to report quiescent states for the current normal 432ccc9971eSMauro Carvalho Chehabgrace period. Such children will have a value of 1 in their 433ccc9971eSMauro Carvalho Chehabcorresponding bit. Note that the leaf ``rcu_node`` structures should be 434ccc9971eSMauro Carvalho Chehabthought of as having ``rcu_data`` structures as their children. 435ccc9971eSMauro Carvalho ChehabSimilarly, the ``->expmask`` field tracks which of this ``rcu_node`` 436ccc9971eSMauro Carvalho Chehabstructure's children still need to report quiescent states for the 437ccc9971eSMauro Carvalho Chehabcurrent expedited grace period. An expedited grace period has the same 438ccc9971eSMauro Carvalho Chehabconceptual properties as a normal grace period, but the expedited 439ccc9971eSMauro Carvalho Chehabimplementation accepts extreme CPU overhead to obtain much lower 440ccc9971eSMauro Carvalho Chehabgrace-period latency, for example, consuming a few tens of microseconds 441ccc9971eSMauro Carvalho Chehabworth of CPU time to reduce grace-period duration from milliseconds to 442ccc9971eSMauro Carvalho Chehabtens of microseconds. The ``->qsmaskinit`` field tracks which of this 443ccc9971eSMauro Carvalho Chehab``rcu_node`` structure's children cover for at least one online CPU. 444ccc9971eSMauro Carvalho ChehabThis mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is 445ccc9971eSMauro Carvalho Chehabused to initialize ``->expmask`` and the beginning of the normal and 446ccc9971eSMauro Carvalho Chehabexpedited grace periods, respectively. 447ccc9971eSMauro Carvalho Chehab 448ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 449ccc9971eSMauro Carvalho Chehab| **Quick Quiz**: | 450ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 451ccc9971eSMauro Carvalho Chehab| Why are these bitmasks protected by locking? Come on, haven't you | 452ccc9971eSMauro Carvalho Chehab| heard of atomic instructions??? | 453ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 454ccc9971eSMauro Carvalho Chehab| **Answer**: | 455ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 456ccc9971eSMauro Carvalho Chehab| Lockless grace-period computation! Such a tantalizing possibility! | 457ccc9971eSMauro Carvalho Chehab| But consider the following sequence of events: | 458ccc9971eSMauro Carvalho Chehab| | 459ccc9971eSMauro Carvalho Chehab| #. CPU 0 has been in dyntick-idle mode for quite some time. When it | 460ccc9971eSMauro Carvalho Chehab| wakes up, it notices that the current RCU grace period needs it to | 461ccc9971eSMauro Carvalho Chehab| report in, so it sets a flag where the scheduling clock interrupt | 462ccc9971eSMauro Carvalho Chehab| will find it. | 463ccc9971eSMauro Carvalho Chehab| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and | 464ccc9971eSMauro Carvalho Chehab| notices that CPU 0 has been in dyntick idle mode, which qualifies | 465ccc9971eSMauro Carvalho Chehab| as an extended quiescent state. | 466ccc9971eSMauro Carvalho Chehab| #. CPU 0's scheduling clock interrupt fires in the middle of an RCU | 467ccc9971eSMauro Carvalho Chehab| read-side critical section, and notices that the RCU core needs | 468ccc9971eSMauro Carvalho Chehab| something, so commences RCU softirq processing. | 469ccc9971eSMauro Carvalho Chehab| #. CPU 0's softirq handler executes and is just about ready to report | 470ccc9971eSMauro Carvalho Chehab| its quiescent state up the ``rcu_node`` tree. | 471ccc9971eSMauro Carvalho Chehab| #. But CPU 1 beats it to the punch, completing the current grace | 472ccc9971eSMauro Carvalho Chehab| period and starting a new one. | 473ccc9971eSMauro Carvalho Chehab| #. CPU 0 now reports its quiescent state for the wrong grace period. | 474ccc9971eSMauro Carvalho Chehab| That grace period might now end before the RCU read-side critical | 475ccc9971eSMauro Carvalho Chehab| section. If that happens, disaster will ensue. | 476ccc9971eSMauro Carvalho Chehab| | 477ccc9971eSMauro Carvalho Chehab| So the locking is absolutely required in order to coordinate clearing | 478ccc9971eSMauro Carvalho Chehab| of the bits with updating of the grace-period sequence number in | 479ccc9971eSMauro Carvalho Chehab| ``->gp_seq``. | 480ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 481ccc9971eSMauro Carvalho Chehab 482ccc9971eSMauro Carvalho ChehabBlocked-Task Management 483ccc9971eSMauro Carvalho Chehab''''''''''''''''''''''' 484ccc9971eSMauro Carvalho Chehab 485ccc9971eSMauro Carvalho Chehab``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU 486ccc9971eSMauro Carvalho Chehabread-side critical sections, and these tasks must be tracked explicitly. 487ccc9971eSMauro Carvalho ChehabThe details of exactly why and how they are tracked will be covered in a 488ccc9971eSMauro Carvalho Chehabseparate article on RCU read-side processing. For now, it is enough to 489ccc9971eSMauro Carvalho Chehabknow that the ``rcu_node`` structure tracks them. 490ccc9971eSMauro Carvalho Chehab 491ccc9971eSMauro Carvalho Chehab:: 492ccc9971eSMauro Carvalho Chehab 493ccc9971eSMauro Carvalho Chehab 1 struct list_head blkd_tasks; 494ccc9971eSMauro Carvalho Chehab 2 struct list_head *gp_tasks; 495ccc9971eSMauro Carvalho Chehab 3 struct list_head *exp_tasks; 496ccc9971eSMauro Carvalho Chehab 4 bool wait_blkd_tasks; 497ccc9971eSMauro Carvalho Chehab 498ccc9971eSMauro Carvalho ChehabThe ``->blkd_tasks`` field is a list header for the list of blocked and 499ccc9971eSMauro Carvalho Chehabpreempted tasks. As tasks undergo context switches within RCU read-side 500ccc9971eSMauro Carvalho Chehabcritical sections, their ``task_struct`` structures are enqueued (via 501ccc9971eSMauro Carvalho Chehabthe ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the 502ccc9971eSMauro Carvalho Chehab``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding 503ccc9971eSMauro Carvalho Chehabto the CPU on which the outgoing context switch executed. As these tasks 504ccc9971eSMauro Carvalho Chehablater exit their RCU read-side critical sections, they remove themselves 505ccc9971eSMauro Carvalho Chehabfrom the list. This list is therefore in reverse time order, so that if 506ccc9971eSMauro Carvalho Chehabone of the tasks is blocking the current grace period, all subsequent 507ccc9971eSMauro Carvalho Chehabtasks must also be blocking that same grace period. Therefore, a single 508ccc9971eSMauro Carvalho Chehabpointer into this list suffices to track all tasks blocking a given 509ccc9971eSMauro Carvalho Chehabgrace period. That pointer is stored in ``->gp_tasks`` for normal grace 510ccc9971eSMauro Carvalho Chehabperiods and in ``->exp_tasks`` for expedited grace periods. These last 511ccc9971eSMauro Carvalho Chehabtwo fields are ``NULL`` if either there is no grace period in flight or 512ccc9971eSMauro Carvalho Chehabif there are no blocked tasks preventing that grace period from 513ccc9971eSMauro Carvalho Chehabcompleting. If either of these two pointers is referencing a task that 514ccc9971eSMauro Carvalho Chehabremoves itself from the ``->blkd_tasks`` list, then that task must 515ccc9971eSMauro Carvalho Chehabadvance the pointer to the next task on the list, or set the pointer to 516ccc9971eSMauro Carvalho Chehab``NULL`` if there are no subsequent tasks on the list. 517ccc9971eSMauro Carvalho Chehab 518ccc9971eSMauro Carvalho ChehabFor example, suppose that tasks T1, T2, and T3 are all hard-affinitied 519ccc9971eSMauro Carvalho Chehabto the largest-numbered CPU in the system. Then if task T1 blocked in an 520ccc9971eSMauro Carvalho ChehabRCU read-side critical section, then an expedited grace period started, 521ccc9971eSMauro Carvalho Chehabthen task T2 blocked in an RCU read-side critical section, then a normal 522ccc9971eSMauro Carvalho Chehabgrace period started, and finally task 3 blocked in an RCU read-side 523ccc9971eSMauro Carvalho Chehabcritical section, then the state of the last leaf ``rcu_node`` 524ccc9971eSMauro Carvalho Chehabstructure's blocked-task list would be as shown below: 525ccc9971eSMauro Carvalho Chehab 526ccc9971eSMauro Carvalho Chehab.. kernel-figure:: blkd_task.svg 527ccc9971eSMauro Carvalho Chehab 528ccc9971eSMauro Carvalho ChehabTask T1 is blocking both grace periods, task T2 is blocking only the 529ccc9971eSMauro Carvalho Chehabnormal grace period, and task T3 is blocking neither grace period. Note 530ccc9971eSMauro Carvalho Chehabthat these tasks will not remove themselves from this list immediately 531ccc9971eSMauro Carvalho Chehabupon resuming execution. They will instead remain on the list until they 532ccc9971eSMauro Carvalho Chehabexecute the outermost ``rcu_read_unlock()`` that ends their RCU 533ccc9971eSMauro Carvalho Chehabread-side critical section. 534ccc9971eSMauro Carvalho Chehab 535ccc9971eSMauro Carvalho ChehabThe ``->wait_blkd_tasks`` field indicates whether or not the current 536ccc9971eSMauro Carvalho Chehabgrace period is waiting on a blocked task. 537ccc9971eSMauro Carvalho Chehab 538ccc9971eSMauro Carvalho ChehabSizing the ``rcu_node`` Array 539ccc9971eSMauro Carvalho Chehab''''''''''''''''''''''''''''' 540ccc9971eSMauro Carvalho Chehab 541ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` array is sized via a series of C-preprocessor 542ccc9971eSMauro Carvalho Chehabexpressions as follows: 543ccc9971eSMauro Carvalho Chehab 544ccc9971eSMauro Carvalho Chehab:: 545ccc9971eSMauro Carvalho Chehab 546ccc9971eSMauro Carvalho Chehab 1 #ifdef CONFIG_RCU_FANOUT 547ccc9971eSMauro Carvalho Chehab 2 #define RCU_FANOUT CONFIG_RCU_FANOUT 548ccc9971eSMauro Carvalho Chehab 3 #else 549ccc9971eSMauro Carvalho Chehab 4 # ifdef CONFIG_64BIT 550ccc9971eSMauro Carvalho Chehab 5 # define RCU_FANOUT 64 551ccc9971eSMauro Carvalho Chehab 6 # else 552ccc9971eSMauro Carvalho Chehab 7 # define RCU_FANOUT 32 553ccc9971eSMauro Carvalho Chehab 8 # endif 554ccc9971eSMauro Carvalho Chehab 9 #endif 555ccc9971eSMauro Carvalho Chehab 10 556ccc9971eSMauro Carvalho Chehab 11 #ifdef CONFIG_RCU_FANOUT_LEAF 557ccc9971eSMauro Carvalho Chehab 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF 558ccc9971eSMauro Carvalho Chehab 13 #else 559ccc9971eSMauro Carvalho Chehab 14 # ifdef CONFIG_64BIT 560ccc9971eSMauro Carvalho Chehab 15 # define RCU_FANOUT_LEAF 64 561ccc9971eSMauro Carvalho Chehab 16 # else 562ccc9971eSMauro Carvalho Chehab 17 # define RCU_FANOUT_LEAF 32 563ccc9971eSMauro Carvalho Chehab 18 # endif 564ccc9971eSMauro Carvalho Chehab 19 #endif 565ccc9971eSMauro Carvalho Chehab 20 566ccc9971eSMauro Carvalho Chehab 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) 567ccc9971eSMauro Carvalho Chehab 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) 568ccc9971eSMauro Carvalho Chehab 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) 569ccc9971eSMauro Carvalho Chehab 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) 570ccc9971eSMauro Carvalho Chehab 25 571ccc9971eSMauro Carvalho Chehab 26 #if NR_CPUS <= RCU_FANOUT_1 572ccc9971eSMauro Carvalho Chehab 27 # define RCU_NUM_LVLS 1 573ccc9971eSMauro Carvalho Chehab 28 # define NUM_RCU_LVL_0 1 574ccc9971eSMauro Carvalho Chehab 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 575ccc9971eSMauro Carvalho Chehab 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } 576ccc9971eSMauro Carvalho Chehab 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } 577ccc9971eSMauro Carvalho Chehab 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } 578ccc9971eSMauro Carvalho Chehab 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } 579ccc9971eSMauro Carvalho Chehab 34 #elif NR_CPUS <= RCU_FANOUT_2 580ccc9971eSMauro Carvalho Chehab 35 # define RCU_NUM_LVLS 2 581ccc9971eSMauro Carvalho Chehab 36 # define NUM_RCU_LVL_0 1 582ccc9971eSMauro Carvalho Chehab 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) 583ccc9971eSMauro Carvalho Chehab 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) 584ccc9971eSMauro Carvalho Chehab 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } 585ccc9971eSMauro Carvalho Chehab 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } 586ccc9971eSMauro Carvalho Chehab 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } 587ccc9971eSMauro Carvalho Chehab 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } 588ccc9971eSMauro Carvalho Chehab 43 #elif NR_CPUS <= RCU_FANOUT_3 589ccc9971eSMauro Carvalho Chehab 44 # define RCU_NUM_LVLS 3 590ccc9971eSMauro Carvalho Chehab 45 # define NUM_RCU_LVL_0 1 591ccc9971eSMauro Carvalho Chehab 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) 592ccc9971eSMauro Carvalho Chehab 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) 593ccc9971eSMauro Carvalho Chehab 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) 594ccc9971eSMauro Carvalho Chehab 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } 595ccc9971eSMauro Carvalho Chehab 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } 596ccc9971eSMauro Carvalho Chehab 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } 597ccc9971eSMauro Carvalho Chehab 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } 598ccc9971eSMauro Carvalho Chehab 53 #elif NR_CPUS <= RCU_FANOUT_4 599ccc9971eSMauro Carvalho Chehab 54 # define RCU_NUM_LVLS 4 600ccc9971eSMauro Carvalho Chehab 55 # define NUM_RCU_LVL_0 1 601ccc9971eSMauro Carvalho Chehab 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) 602ccc9971eSMauro Carvalho Chehab 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) 603ccc9971eSMauro Carvalho Chehab 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) 604ccc9971eSMauro Carvalho Chehab 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) 605ccc9971eSMauro Carvalho Chehab 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } 606ccc9971eSMauro Carvalho Chehab 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } 607ccc9971eSMauro Carvalho Chehab 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } 608ccc9971eSMauro Carvalho Chehab 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } 609ccc9971eSMauro Carvalho Chehab 64 #else 610ccc9971eSMauro Carvalho Chehab 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" 611ccc9971eSMauro Carvalho Chehab 66 #endif 612ccc9971eSMauro Carvalho Chehab 613ccc9971eSMauro Carvalho ChehabThe maximum number of levels in the ``rcu_node`` structure is currently 614ccc9971eSMauro Carvalho Chehablimited to four, as specified by lines 21-24 and the structure of the 615ccc9971eSMauro Carvalho Chehabsubsequent “if” statement. For 32-bit systems, this allows 616ccc9971eSMauro Carvalho Chehab16*32*32*32=524,288 CPUs, which should be sufficient for the next few 617ccc9971eSMauro Carvalho Chehabyears at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is 618ccc9971eSMauro Carvalho Chehaballowed, which should see us through the next decade or so. This 619ccc9971eSMauro Carvalho Chehabfour-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8`` 620ccc9971eSMauro Carvalho Chehabto support up to 4096 CPUs, which might be useful in very large systems 621ccc9971eSMauro Carvalho Chehabhaving eight CPUs per socket (but please note that no one has yet shown 622ccc9971eSMauro Carvalho Chehabany measurable performance degradation due to misaligned socket and 623ccc9971eSMauro Carvalho Chehab``rcu_node`` boundaries). In addition, building kernels with a full four 624ccc9971eSMauro Carvalho Chehablevels of ``rcu_node`` tree permits better testing of RCU's 625ccc9971eSMauro Carvalho Chehabcombining-tree code. 626ccc9971eSMauro Carvalho Chehab 627ccc9971eSMauro Carvalho ChehabThe ``RCU_FANOUT`` symbol controls how many children are permitted at 628ccc9971eSMauro Carvalho Chehabeach non-leaf level of the ``rcu_node`` tree. If the 629ccc9971eSMauro Carvalho Chehab``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based 630ccc9971eSMauro Carvalho Chehabon the word size of the system, which is also the Kconfig default. 631ccc9971eSMauro Carvalho Chehab 632ccc9971eSMauro Carvalho ChehabThe ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by 633ccc9971eSMauro Carvalho Chehabeach leaf ``rcu_node`` structure. Experience has shown that allowing a 634ccc9971eSMauro Carvalho Chehabgiven leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the 635ccc9971eSMauro Carvalho Chehabnumber of bits in the ``->qsmask`` field on a 64-bit system, results in 636ccc9971eSMauro Carvalho Chehabexcessive contention for the leaf ``rcu_node`` structures' ``->lock`` 637ccc9971eSMauro Carvalho Chehabfields. The number of CPUs per leaf ``rcu_node`` structure is therefore 638ccc9971eSMauro Carvalho Chehablimited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If 639ccc9971eSMauro Carvalho Chehab``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based 640ccc9971eSMauro Carvalho Chehabon the word size of the system, just as for ``CONFIG_RCU_FANOUT``. 641ccc9971eSMauro Carvalho ChehabLines 11-19 perform this computation. 642ccc9971eSMauro Carvalho Chehab 643ccc9971eSMauro Carvalho ChehabLines 21-24 compute the maximum number of CPUs supported by a 644ccc9971eSMauro Carvalho Chehabsingle-level (which contains a single ``rcu_node`` structure), 645ccc9971eSMauro Carvalho Chehabtwo-level, three-level, and four-level ``rcu_node`` tree, respectively, 646ccc9971eSMauro Carvalho Chehabgiven the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``. 647ccc9971eSMauro Carvalho ChehabThese numbers of CPUs are retained in the ``RCU_FANOUT_1``, 648ccc9971eSMauro Carvalho Chehab``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor 649ccc9971eSMauro Carvalho Chehabvariables, respectively. 650ccc9971eSMauro Carvalho Chehab 651ccc9971eSMauro Carvalho ChehabThese variables are used to control the C-preprocessor ``#if`` statement 652ccc9971eSMauro Carvalho Chehabspanning lines 26-66 that computes the number of ``rcu_node`` structures 653ccc9971eSMauro Carvalho Chehabrequired for each level of the tree, as well as the number of levels 654ccc9971eSMauro Carvalho Chehabrequired. The number of levels is placed in the ``NUM_RCU_LVLS`` 655ccc9971eSMauro Carvalho ChehabC-preprocessor variable by lines 27, 35, 44, and 54. The number of 656ccc9971eSMauro Carvalho Chehab``rcu_node`` structures for the topmost level of the tree is always 657ccc9971eSMauro Carvalho Chehabexactly one, and this value is unconditionally placed into 658ccc9971eSMauro Carvalho Chehab``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels 659ccc9971eSMauro Carvalho Chehab(if any) of the ``rcu_node`` tree are computed by dividing the maximum 660ccc9971eSMauro Carvalho Chehabnumber of CPUs by the fanout supported by the number of levels from the 661ccc9971eSMauro Carvalho Chehabcurrent level down, rounding up. This computation is performed by 662ccc9971eSMauro Carvalho Chehablines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create 663ccc9971eSMauro Carvalho Chehabinitializers for lockdep lock-class names. Finally, lines 64-66 produce 664ccc9971eSMauro Carvalho Chehaban error if the maximum number of CPUs is too large for the specified 665ccc9971eSMauro Carvalho Chehabfanout. 666ccc9971eSMauro Carvalho Chehab 667ccc9971eSMauro Carvalho ChehabThe ``rcu_segcblist`` Structure 668ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 669ccc9971eSMauro Carvalho Chehab 670ccc9971eSMauro Carvalho ChehabThe ``rcu_segcblist`` structure maintains a segmented list of callbacks 671ccc9971eSMauro Carvalho Chehabas follows: 672ccc9971eSMauro Carvalho Chehab 673ccc9971eSMauro Carvalho Chehab:: 674ccc9971eSMauro Carvalho Chehab 675ccc9971eSMauro Carvalho Chehab 1 #define RCU_DONE_TAIL 0 676ccc9971eSMauro Carvalho Chehab 2 #define RCU_WAIT_TAIL 1 677ccc9971eSMauro Carvalho Chehab 3 #define RCU_NEXT_READY_TAIL 2 678ccc9971eSMauro Carvalho Chehab 4 #define RCU_NEXT_TAIL 3 679ccc9971eSMauro Carvalho Chehab 5 #define RCU_CBLIST_NSEGS 4 680ccc9971eSMauro Carvalho Chehab 6 681ccc9971eSMauro Carvalho Chehab 7 struct rcu_segcblist { 682ccc9971eSMauro Carvalho Chehab 8 struct rcu_head *head; 683ccc9971eSMauro Carvalho Chehab 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; 684ccc9971eSMauro Carvalho Chehab 10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; 685ccc9971eSMauro Carvalho Chehab 11 long len; 686ccc9971eSMauro Carvalho Chehab 12 long len_lazy; 687ccc9971eSMauro Carvalho Chehab 13 }; 688ccc9971eSMauro Carvalho Chehab 689ccc9971eSMauro Carvalho ChehabThe segments are as follows: 690ccc9971eSMauro Carvalho Chehab 691ccc9971eSMauro Carvalho Chehab#. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These 692ccc9971eSMauro Carvalho Chehab callbacks are ready to be invoked. 693ccc9971eSMauro Carvalho Chehab#. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace 694ccc9971eSMauro Carvalho Chehab period. Note that different CPUs can have different ideas about which 695ccc9971eSMauro Carvalho Chehab grace period is current, hence the ``->gp_seq`` field. 696ccc9971eSMauro Carvalho Chehab#. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period 697ccc9971eSMauro Carvalho Chehab to start. 698ccc9971eSMauro Carvalho Chehab#. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a 699ccc9971eSMauro Carvalho Chehab grace period. 700ccc9971eSMauro Carvalho Chehab 701ccc9971eSMauro Carvalho ChehabThe ``->head`` pointer references the first callback or is ``NULL`` if 702ccc9971eSMauro Carvalho Chehabthe list contains no callbacks (which is *not* the same as being empty). 703ccc9971eSMauro Carvalho ChehabEach element of the ``->tails[]`` array references the ``->next`` 704ccc9971eSMauro Carvalho Chehabpointer of the last callback in the corresponding segment of the list, 705ccc9971eSMauro Carvalho Chehabor the list's ``->head`` pointer if that segment and all previous 706ccc9971eSMauro Carvalho Chehabsegments are empty. If the corresponding segment is empty but some 707ccc9971eSMauro Carvalho Chehabprevious segment is not empty, then the array element is identical to 708ccc9971eSMauro Carvalho Chehabits predecessor. Older callbacks are closer to the head of the list, and 709ccc9971eSMauro Carvalho Chehabnew callbacks are added at the tail. This relationship between the 710ccc9971eSMauro Carvalho Chehab``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown 711ccc9971eSMauro Carvalho Chehabin this diagram: 712ccc9971eSMauro Carvalho Chehab 713ccc9971eSMauro Carvalho Chehab.. kernel-figure:: nxtlist.svg 714ccc9971eSMauro Carvalho Chehab 715ccc9971eSMauro Carvalho ChehabIn this figure, the ``->head`` pointer references the first RCU callback 716ccc9971eSMauro Carvalho Chehabin the list. The ``->tails[RCU_DONE_TAIL]`` array element references the 717ccc9971eSMauro Carvalho Chehab``->head`` pointer itself, indicating that none of the callbacks is 718ccc9971eSMauro Carvalho Chehabready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references 719ccc9971eSMauro Carvalho Chehabcallback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2 720ccc9971eSMauro Carvalho Chehabare both waiting on the current grace period, give or take possible 721ccc9971eSMauro Carvalho Chehabdisagreements about exactly which grace period is the current one. The 722ccc9971eSMauro Carvalho Chehab``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU 723ccc9971eSMauro Carvalho Chehabcallback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that 724ccc9971eSMauro Carvalho Chehabthere are no callbacks waiting on the next RCU grace period. The 725ccc9971eSMauro Carvalho Chehab``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next`` 726ccc9971eSMauro Carvalho Chehabpointer, indicating that all the remaining RCU callbacks have not yet 727ccc9971eSMauro Carvalho Chehabbeen assigned to an RCU grace period. Note that the 728ccc9971eSMauro Carvalho Chehab``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU 729ccc9971eSMauro Carvalho Chehabcallback's ``->next`` pointer unless the callback list is empty, in 730ccc9971eSMauro Carvalho Chehabwhich case it references the ``->head`` pointer. 731ccc9971eSMauro Carvalho Chehab 732ccc9971eSMauro Carvalho ChehabThere is one additional important special case for the 733ccc9971eSMauro Carvalho Chehab``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this 734ccc9971eSMauro Carvalho Chehablist is *disabled*. Lists are disabled when the corresponding CPU is 735ccc9971eSMauro Carvalho Chehaboffline or when the corresponding CPU's callbacks are offloaded to a 736ccc9971eSMauro Carvalho Chehabkthread, both of which are described elsewhere. 737ccc9971eSMauro Carvalho Chehab 738ccc9971eSMauro Carvalho ChehabCPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the 739ccc9971eSMauro Carvalho Chehab``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the 740ccc9971eSMauro Carvalho Chehab``RCU_DONE_TAIL`` list segments as grace periods advance. 741ccc9971eSMauro Carvalho Chehab 742ccc9971eSMauro Carvalho ChehabThe ``->gp_seq[]`` array records grace-period numbers corresponding to 743ccc9971eSMauro Carvalho Chehabthe list segments. This is what allows different CPUs to have different 744ccc9971eSMauro Carvalho Chehabideas as to which is the current grace period while still avoiding 745ccc9971eSMauro Carvalho Chehabpremature invocation of their callbacks. In particular, this allows CPUs 746ccc9971eSMauro Carvalho Chehabthat go idle for extended periods to determine which of their callbacks 747ccc9971eSMauro Carvalho Chehabare ready to be invoked after reawakening. 748ccc9971eSMauro Carvalho Chehab 749ccc9971eSMauro Carvalho ChehabThe ``->len`` counter contains the number of callbacks in ``->head``, 750ccc9971eSMauro Carvalho Chehaband the ``->len_lazy`` contains the number of those callbacks that are 751ccc9971eSMauro Carvalho Chehabknown to only free memory, and whose invocation can therefore be safely 752ccc9971eSMauro Carvalho Chehabdeferred. 753ccc9971eSMauro Carvalho Chehab 754ccc9971eSMauro Carvalho Chehab.. important:: 755ccc9971eSMauro Carvalho Chehab 756ccc9971eSMauro Carvalho Chehab It is the ``->len`` field that determines whether or 757ccc9971eSMauro Carvalho Chehab not there are callbacks associated with this ``rcu_segcblist`` 758ccc9971eSMauro Carvalho Chehab structure, *not* the ``->head`` pointer. The reason for this is that all 759ccc9971eSMauro Carvalho Chehab the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL`` 760ccc9971eSMauro Carvalho Chehab segment) are extracted all at once at callback-invocation time 761ccc9971eSMauro Carvalho Chehab (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there 762ccc9971eSMauro Carvalho Chehab are no not-done callbacks remaining in the ``rcu_segcblist``. If 763ccc9971eSMauro Carvalho Chehab callback invocation must be postponed, for example, because a 764ccc9971eSMauro Carvalho Chehab high-priority process just woke up on this CPU, then the remaining 765ccc9971eSMauro Carvalho Chehab callbacks are placed back on the ``RCU_DONE_TAIL`` segment and 766ccc9971eSMauro Carvalho Chehab ``->head`` once again points to the start of the segment. In short, the 767ccc9971eSMauro Carvalho Chehab head field can briefly be ``NULL`` even though the CPU has callbacks 768ccc9971eSMauro Carvalho Chehab present the entire time. Therefore, it is not appropriate to test the 769ccc9971eSMauro Carvalho Chehab ``->head`` pointer for ``NULL``. 770ccc9971eSMauro Carvalho Chehab 771ccc9971eSMauro Carvalho ChehabIn contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only 772ccc9971eSMauro Carvalho Chehabafter the corresponding callbacks have been invoked. This means that the 773ccc9971eSMauro Carvalho Chehab``->len`` count is zero only if the ``rcu_segcblist`` structure really 774ccc9971eSMauro Carvalho Chehabis devoid of callbacks. Of course, off-CPU sampling of the ``->len`` 775ccc9971eSMauro Carvalho Chehabcount requires careful use of appropriate synchronization, for example, 776ccc9971eSMauro Carvalho Chehabmemory barriers. This synchronization can be a bit subtle, particularly 777ccc9971eSMauro Carvalho Chehabin the case of ``rcu_barrier()``. 778ccc9971eSMauro Carvalho Chehab 779ccc9971eSMauro Carvalho ChehabThe ``rcu_data`` Structure 780ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~ 781ccc9971eSMauro Carvalho Chehab 782ccc9971eSMauro Carvalho ChehabThe ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The 783ccc9971eSMauro Carvalho Chehabfields in this structure may be accessed only from the corresponding CPU 784ccc9971eSMauro Carvalho Chehab(and from tracing) unless otherwise stated. This structure is the focus 785ccc9971eSMauro Carvalho Chehabof quiescent-state detection and RCU callback queuing. It also tracks 786ccc9971eSMauro Carvalho Chehabits relationship to the corresponding leaf ``rcu_node`` structure to 787ccc9971eSMauro Carvalho Chehaballow more-efficient propagation of quiescent states up the ``rcu_node`` 788ccc9971eSMauro Carvalho Chehabcombining tree. Like the ``rcu_node`` structure, it provides a local 789ccc9971eSMauro Carvalho Chehabcopy of the grace-period information to allow for-free synchronized 790ccc9971eSMauro Carvalho Chehabaccess to this information from the corresponding CPU. Finally, this 791ccc9971eSMauro Carvalho Chehabstructure records past dyntick-idle state for the corresponding CPU and 792ccc9971eSMauro Carvalho Chehabalso tracks statistics. 793ccc9971eSMauro Carvalho Chehab 794ccc9971eSMauro Carvalho ChehabThe ``rcu_data`` structure's fields are discussed, singly and in groups, 795ccc9971eSMauro Carvalho Chehabin the following sections. 796ccc9971eSMauro Carvalho Chehab 797ccc9971eSMauro Carvalho ChehabConnection to Other Data Structures 798ccc9971eSMauro Carvalho Chehab''''''''''''''''''''''''''''''''''' 799ccc9971eSMauro Carvalho Chehab 800ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_data`` structure is declared as follows: 801ccc9971eSMauro Carvalho Chehab 802ccc9971eSMauro Carvalho Chehab:: 803ccc9971eSMauro Carvalho Chehab 804ccc9971eSMauro Carvalho Chehab 1 int cpu; 805ccc9971eSMauro Carvalho Chehab 2 struct rcu_node *mynode; 806ccc9971eSMauro Carvalho Chehab 3 unsigned long grpmask; 807ccc9971eSMauro Carvalho Chehab 4 bool beenonline; 808ccc9971eSMauro Carvalho Chehab 809ccc9971eSMauro Carvalho ChehabThe ``->cpu`` field contains the number of the corresponding CPU and the 810ccc9971eSMauro Carvalho Chehab``->mynode`` field references the corresponding ``rcu_node`` structure. 811ccc9971eSMauro Carvalho ChehabThe ``->mynode`` is used to propagate quiescent states up the combining 812ccc9971eSMauro Carvalho Chehabtree. These two fields are constant and therefore do not require 813ccc9971eSMauro Carvalho Chehabsynchronization. 814ccc9971eSMauro Carvalho Chehab 815ccc9971eSMauro Carvalho ChehabThe ``->grpmask`` field indicates the bit in the ``->mynode->qsmask`` 816ccc9971eSMauro Carvalho Chehabcorresponding to this ``rcu_data`` structure, and is also used when 817ccc9971eSMauro Carvalho Chehabpropagating quiescent states. The ``->beenonline`` flag is set whenever 818ccc9971eSMauro Carvalho Chehabthe corresponding CPU comes online, which means that the debugfs tracing 819ccc9971eSMauro Carvalho Chehabneed not dump out any ``rcu_data`` structure for which this flag is not 820ccc9971eSMauro Carvalho Chehabset. 821ccc9971eSMauro Carvalho Chehab 822ccc9971eSMauro Carvalho ChehabQuiescent-State and Grace-Period Tracking 823ccc9971eSMauro Carvalho Chehab''''''''''''''''''''''''''''''''''''''''' 824ccc9971eSMauro Carvalho Chehab 825ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_data`` structure is declared as follows: 826ccc9971eSMauro Carvalho Chehab 827ccc9971eSMauro Carvalho Chehab:: 828ccc9971eSMauro Carvalho Chehab 829ccc9971eSMauro Carvalho Chehab 1 unsigned long gp_seq; 830ccc9971eSMauro Carvalho Chehab 2 unsigned long gp_seq_needed; 831ccc9971eSMauro Carvalho Chehab 3 bool cpu_no_qs; 832ccc9971eSMauro Carvalho Chehab 4 bool core_needs_qs; 833ccc9971eSMauro Carvalho Chehab 5 bool gpwrap; 834ccc9971eSMauro Carvalho Chehab 835ccc9971eSMauro Carvalho ChehabThe ``->gp_seq`` field is the counterpart of the field of the same name 836ccc9971eSMauro Carvalho Chehabin the ``rcu_state`` and ``rcu_node`` structures. The 837ccc9971eSMauro Carvalho Chehab``->gp_seq_needed`` field is the counterpart of the field of the same 838ccc9971eSMauro Carvalho Chehabname in the rcu_node structure. They may each lag up to one behind their 839ccc9971eSMauro Carvalho Chehab``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and 840ccc9971eSMauro Carvalho Chehab``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in 841ccc9971eSMauro Carvalho Chehabdyntick-idle mode (but these counters will catch up upon exit from 842ccc9971eSMauro Carvalho Chehabdyntick-idle mode). If the lower two bits of a given ``rcu_data`` 843ccc9971eSMauro Carvalho Chehabstructure's ``->gp_seq`` are zero, then this ``rcu_data`` structure 844ccc9971eSMauro Carvalho Chehabbelieves that RCU is idle. 845ccc9971eSMauro Carvalho Chehab 846ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 847ccc9971eSMauro Carvalho Chehab| **Quick Quiz**: | 848ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 849ccc9971eSMauro Carvalho Chehab| All this replication of the grace period numbers can only cause | 850ccc9971eSMauro Carvalho Chehab| massive confusion. Why not just keep a global sequence number and be | 851ccc9971eSMauro Carvalho Chehab| done with it??? | 852ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 853ccc9971eSMauro Carvalho Chehab| **Answer**: | 854ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 855ccc9971eSMauro Carvalho Chehab| Because if there was only a single global sequence numbers, there | 856ccc9971eSMauro Carvalho Chehab| would need to be a single global lock to allow safely accessing and | 857ccc9971eSMauro Carvalho Chehab| updating it. And if we are not going to have a single global lock, we | 858ccc9971eSMauro Carvalho Chehab| need to carefully manage the numbers on a per-node basis. Recall from | 859ccc9971eSMauro Carvalho Chehab| the answer to a previous Quick Quiz that the consequences of applying | 860ccc9971eSMauro Carvalho Chehab| a previously sampled quiescent state to the wrong grace period are | 861ccc9971eSMauro Carvalho Chehab| quite severe. | 862ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 863ccc9971eSMauro Carvalho Chehab 864ccc9971eSMauro Carvalho ChehabThe ``->cpu_no_qs`` flag indicates that the CPU has not yet passed 865ccc9971eSMauro Carvalho Chehabthrough a quiescent state, while the ``->core_needs_qs`` flag indicates 866ccc9971eSMauro Carvalho Chehabthat the RCU core needs a quiescent state from the corresponding CPU. 867ccc9971eSMauro Carvalho ChehabThe ``->gpwrap`` field indicates that the corresponding CPU has remained 868ccc9971eSMauro Carvalho Chehabidle for so long that the ``gp_seq`` counter is in danger of overflow, 869ccc9971eSMauro Carvalho Chehabwhich will cause the CPU to disregard the values of its counters on its 870ccc9971eSMauro Carvalho Chehabnext exit from idle. 871ccc9971eSMauro Carvalho Chehab 872ccc9971eSMauro Carvalho ChehabRCU Callback Handling 873ccc9971eSMauro Carvalho Chehab''''''''''''''''''''' 874ccc9971eSMauro Carvalho Chehab 875ccc9971eSMauro Carvalho ChehabIn the absence of CPU-hotplug events, RCU callbacks are invoked by the 876ccc9971eSMauro Carvalho Chehabsame CPU that registered them. This is strictly a cache-locality 877ccc9971eSMauro Carvalho Chehaboptimization: callbacks can and do get invoked on CPUs other than the 878ccc9971eSMauro Carvalho Chehabone that registered them. After all, if the CPU that registered a given 879ccc9971eSMauro Carvalho Chehabcallback has gone offline before the callback can be invoked, there 880ccc9971eSMauro Carvalho Chehabreally is no other choice. 881ccc9971eSMauro Carvalho Chehab 882ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_data`` structure is declared as follows: 883ccc9971eSMauro Carvalho Chehab 884ccc9971eSMauro Carvalho Chehab:: 885ccc9971eSMauro Carvalho Chehab 886ccc9971eSMauro Carvalho Chehab 1 struct rcu_segcblist cblist; 887ccc9971eSMauro Carvalho Chehab 2 long qlen_last_fqs_check; 888ccc9971eSMauro Carvalho Chehab 3 unsigned long n_cbs_invoked; 889ccc9971eSMauro Carvalho Chehab 4 unsigned long n_nocbs_invoked; 890ccc9971eSMauro Carvalho Chehab 5 unsigned long n_cbs_orphaned; 891ccc9971eSMauro Carvalho Chehab 6 unsigned long n_cbs_adopted; 892ccc9971eSMauro Carvalho Chehab 7 unsigned long n_force_qs_snap; 893ccc9971eSMauro Carvalho Chehab 8 long blimit; 894ccc9971eSMauro Carvalho Chehab 895ccc9971eSMauro Carvalho ChehabThe ``->cblist`` structure is the segmented callback list described 896ccc9971eSMauro Carvalho Chehabearlier. The CPU advances the callbacks in its ``rcu_data`` structure 897ccc9971eSMauro Carvalho Chehabwhenever it notices that another RCU grace period has completed. The CPU 898ccc9971eSMauro Carvalho Chehabdetects the completion of an RCU grace period by noticing that the value 899ccc9971eSMauro Carvalho Chehabof its ``rcu_data`` structure's ``->gp_seq`` field differs from that of 900ccc9971eSMauro Carvalho Chehabits leaf ``rcu_node`` structure. Recall that each ``rcu_node`` 901ccc9971eSMauro Carvalho Chehabstructure's ``->gp_seq`` field is updated at the beginnings and ends of 902ccc9971eSMauro Carvalho Chehabeach grace period. 903ccc9971eSMauro Carvalho Chehab 904ccc9971eSMauro Carvalho ChehabThe ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the 905ccc9971eSMauro Carvalho Chehabforcing of quiescent states from ``call_rcu()`` and friends when 906ccc9971eSMauro Carvalho Chehabcallback lists grow excessively long. 907ccc9971eSMauro Carvalho Chehab 908ccc9971eSMauro Carvalho ChehabThe ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted`` 909ccc9971eSMauro Carvalho Chehabfields count the number of callbacks invoked, sent to other CPUs when 910ccc9971eSMauro Carvalho Chehabthis CPU goes offline, and received from other CPUs when those other 911ccc9971eSMauro Carvalho ChehabCPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's 912ccc9971eSMauro Carvalho Chehabcallbacks are offloaded to a kthread. 913ccc9971eSMauro Carvalho Chehab 914ccc9971eSMauro Carvalho ChehabFinally, the ``->blimit`` counter is the maximum number of RCU callbacks 915ccc9971eSMauro Carvalho Chehabthat may be invoked at a given time. 916ccc9971eSMauro Carvalho Chehab 917ccc9971eSMauro Carvalho ChehabDyntick-Idle Handling 918ccc9971eSMauro Carvalho Chehab''''''''''''''''''''' 919ccc9971eSMauro Carvalho Chehab 920ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_data`` structure is declared as follows: 921ccc9971eSMauro Carvalho Chehab 922ccc9971eSMauro Carvalho Chehab:: 923ccc9971eSMauro Carvalho Chehab 924ccc9971eSMauro Carvalho Chehab 1 int dynticks_snap; 925ccc9971eSMauro Carvalho Chehab 2 unsigned long dynticks_fqs; 926ccc9971eSMauro Carvalho Chehab 927ccc9971eSMauro Carvalho ChehabThe ``->dynticks_snap`` field is used to take a snapshot of the 928ccc9971eSMauro Carvalho Chehabcorresponding CPU's dyntick-idle state when forcing quiescent states, 929ccc9971eSMauro Carvalho Chehaband is therefore accessed from other CPUs. Finally, the 930ccc9971eSMauro Carvalho Chehab``->dynticks_fqs`` field is used to count the number of times this CPU 931ccc9971eSMauro Carvalho Chehabis determined to be in dyntick-idle state, and is used for tracing and 932ccc9971eSMauro Carvalho Chehabdebugging purposes. 933ccc9971eSMauro Carvalho Chehab 934ccc9971eSMauro Carvalho ChehabThis portion of the rcu_data structure is declared as follows: 935ccc9971eSMauro Carvalho Chehab 936ccc9971eSMauro Carvalho Chehab:: 937ccc9971eSMauro Carvalho Chehab 938ccc9971eSMauro Carvalho Chehab 1 long dynticks_nesting; 939ccc9971eSMauro Carvalho Chehab 2 long dynticks_nmi_nesting; 940ccc9971eSMauro Carvalho Chehab 3 atomic_t dynticks; 941ccc9971eSMauro Carvalho Chehab 4 bool rcu_need_heavy_qs; 942ccc9971eSMauro Carvalho Chehab 5 bool rcu_urgent_qs; 943ccc9971eSMauro Carvalho Chehab 944ccc9971eSMauro Carvalho ChehabThese fields in the rcu_data structure maintain the per-CPU dyntick-idle 945ccc9971eSMauro Carvalho Chehabstate for the corresponding CPU. The fields may be accessed only from 946ccc9971eSMauro Carvalho Chehabthe corresponding CPU (and from tracing) unless otherwise stated. 947ccc9971eSMauro Carvalho Chehab 948ccc9971eSMauro Carvalho ChehabThe ``->dynticks_nesting`` field counts the nesting depth of process 949ccc9971eSMauro Carvalho Chehabexecution, so that in normal circumstances this counter has value zero 950ccc9971eSMauro Carvalho Chehabor one. NMIs, irqs, and tracers are counted by the 951ccc9971eSMauro Carvalho Chehab``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes 952ccc9971eSMauro Carvalho Chehabto this variable have to be undertaken carefully using an algorithm 953ccc9971eSMauro Carvalho Chehabprovided by Andy Lutomirski. The initial transition from idle adds one, 954ccc9971eSMauro Carvalho Chehaband nested transitions add two, so that a nesting level of five is 955ccc9971eSMauro Carvalho Chehabrepresented by a ``->dynticks_nmi_nesting`` value of nine. This counter 956ccc9971eSMauro Carvalho Chehabcan therefore be thought of as counting the number of reasons why this 957ccc9971eSMauro Carvalho ChehabCPU cannot be permitted to enter dyntick-idle mode, aside from 958ccc9971eSMauro Carvalho Chehabprocess-level transitions. 959ccc9971eSMauro Carvalho Chehab 960ccc9971eSMauro Carvalho ChehabHowever, it turns out that when running in non-idle kernel context, the 961ccc9971eSMauro Carvalho ChehabLinux kernel is fully capable of entering interrupt handlers that never 962ccc9971eSMauro Carvalho Chehabexit and perhaps also vice versa. Therefore, whenever the 963ccc9971eSMauro Carvalho Chehab``->dynticks_nesting`` field is incremented up from zero, the 964ccc9971eSMauro Carvalho Chehab``->dynticks_nmi_nesting`` field is set to a large positive number, and 965ccc9971eSMauro Carvalho Chehabwhenever the ``->dynticks_nesting`` field is decremented down to zero, 9661b98b7c5SRandy Dunlapthe ``->dynticks_nmi_nesting`` field is set to zero. Assuming that 967ccc9971eSMauro Carvalho Chehabthe number of misnested interrupts is not sufficient to overflow the 968ccc9971eSMauro Carvalho Chehabcounter, this approach corrects the ``->dynticks_nmi_nesting`` field 969ccc9971eSMauro Carvalho Chehabevery time the corresponding CPU enters the idle loop from process 970ccc9971eSMauro Carvalho Chehabcontext. 971ccc9971eSMauro Carvalho Chehab 972ccc9971eSMauro Carvalho ChehabThe ``->dynticks`` field counts the corresponding CPU's transitions to 973ccc9971eSMauro Carvalho Chehaband from either dyntick-idle or user mode, so that this counter has an 974ccc9971eSMauro Carvalho Chehabeven value when the CPU is in dyntick-idle mode or user mode and an odd 975ccc9971eSMauro Carvalho Chehabvalue otherwise. The transitions to/from user mode need to be counted 976404147faSAkira Yokosawafor user mode adaptive-ticks support (see Documentation/timers/no_hz.rst). 977ccc9971eSMauro Carvalho Chehab 978ccc9971eSMauro Carvalho ChehabThe ``->rcu_need_heavy_qs`` field is used to record the fact that the 979ccc9971eSMauro Carvalho ChehabRCU core code would really like to see a quiescent state from the 980ccc9971eSMauro Carvalho Chehabcorresponding CPU, so much so that it is willing to call for 981ccc9971eSMauro Carvalho Chehabheavy-weight dyntick-counter operations. This flag is checked by RCU's 982ccc9971eSMauro Carvalho Chehabcontext-switch and ``cond_resched()`` code, which provide a momentary 983ccc9971eSMauro Carvalho Chehabidle sojourn in response. 984ccc9971eSMauro Carvalho Chehab 985ccc9971eSMauro Carvalho ChehabFinally, the ``->rcu_urgent_qs`` field is used to record the fact that 986ccc9971eSMauro Carvalho Chehabthe RCU core code would really like to see a quiescent state from the 987ccc9971eSMauro Carvalho Chehabcorresponding CPU, with the various other fields indicating just how 988ccc9971eSMauro Carvalho Chehabbadly RCU wants this quiescent state. This flag is checked by RCU's 989ccc9971eSMauro Carvalho Chehabcontext-switch path (``rcu_note_context_switch``) and the cond_resched 990ccc9971eSMauro Carvalho Chehabcode. 991ccc9971eSMauro Carvalho Chehab 992ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 993ccc9971eSMauro Carvalho Chehab| **Quick Quiz**: | 994ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 995ccc9971eSMauro Carvalho Chehab| Why not simply combine the ``->dynticks_nesting`` and | 996ccc9971eSMauro Carvalho Chehab| ``->dynticks_nmi_nesting`` counters into a single counter that just | 997ccc9971eSMauro Carvalho Chehab| counts the number of reasons that the corresponding CPU is non-idle? | 998ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 999ccc9971eSMauro Carvalho Chehab| **Answer**: | 1000ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1001ccc9971eSMauro Carvalho Chehab| Because this would fail in the presence of interrupts whose handlers | 1002ccc9971eSMauro Carvalho Chehab| never return and of handlers that manage to return from a made-up | 1003ccc9971eSMauro Carvalho Chehab| interrupt. | 1004ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1005ccc9971eSMauro Carvalho Chehab 1006ccc9971eSMauro Carvalho ChehabAdditional fields are present for some special-purpose builds, and are 1007ccc9971eSMauro Carvalho Chehabdiscussed separately. 1008ccc9971eSMauro Carvalho Chehab 1009ccc9971eSMauro Carvalho ChehabThe ``rcu_head`` Structure 1010ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~ 1011ccc9971eSMauro Carvalho Chehab 1012ccc9971eSMauro Carvalho ChehabEach ``rcu_head`` structure represents an RCU callback. These structures 1013ccc9971eSMauro Carvalho Chehabare normally embedded within RCU-protected data structures whose 1014ccc9971eSMauro Carvalho Chehabalgorithms use asynchronous grace periods. In contrast, when using 1015ccc9971eSMauro Carvalho Chehabalgorithms that block waiting for RCU grace periods, RCU users need not 1016ccc9971eSMauro Carvalho Chehabprovide ``rcu_head`` structures. 1017ccc9971eSMauro Carvalho Chehab 1018ccc9971eSMauro Carvalho ChehabThe ``rcu_head`` structure has fields as follows: 1019ccc9971eSMauro Carvalho Chehab 1020ccc9971eSMauro Carvalho Chehab:: 1021ccc9971eSMauro Carvalho Chehab 1022ccc9971eSMauro Carvalho Chehab 1 struct rcu_head *next; 1023ccc9971eSMauro Carvalho Chehab 2 void (*func)(struct rcu_head *head); 1024ccc9971eSMauro Carvalho Chehab 1025ccc9971eSMauro Carvalho ChehabThe ``->next`` field is used to link the ``rcu_head`` structures 1026ccc9971eSMauro Carvalho Chehabtogether in the lists within the ``rcu_data`` structures. The ``->func`` 1027ccc9971eSMauro Carvalho Chehabfield is a pointer to the function to be called when the callback is 1028ccc9971eSMauro Carvalho Chehabready to be invoked, and this function is passed a pointer to the 1029ccc9971eSMauro Carvalho Chehab``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func`` 1030ccc9971eSMauro Carvalho Chehabfield to record the offset of the ``rcu_head`` structure within the 1031ccc9971eSMauro Carvalho Chehabenclosing RCU-protected data structure. 1032ccc9971eSMauro Carvalho Chehab 1033ccc9971eSMauro Carvalho ChehabBoth of these fields are used internally by RCU. From the viewpoint of 1034ccc9971eSMauro Carvalho ChehabRCU users, this structure is an opaque “cookie”. 1035ccc9971eSMauro Carvalho Chehab 1036ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1037ccc9971eSMauro Carvalho Chehab| **Quick Quiz**: | 1038ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1039ccc9971eSMauro Carvalho Chehab| Given that the callback function ``->func`` is passed a pointer to | 1040ccc9971eSMauro Carvalho Chehab| the ``rcu_head`` structure, how is that function supposed to find the | 1041ccc9971eSMauro Carvalho Chehab| beginning of the enclosing RCU-protected data structure? | 1042ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1043ccc9971eSMauro Carvalho Chehab| **Answer**: | 1044ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1045ccc9971eSMauro Carvalho Chehab| In actual practice, there is a separate callback function per type of | 1046ccc9971eSMauro Carvalho Chehab| RCU-protected data structure. The callback function can therefore use | 1047ccc9971eSMauro Carvalho Chehab| the ``container_of()`` macro in the Linux kernel (or other | 1048ccc9971eSMauro Carvalho Chehab| pointer-manipulation facilities in other software environments) to | 1049ccc9971eSMauro Carvalho Chehab| find the beginning of the enclosing structure. | 1050ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1051ccc9971eSMauro Carvalho Chehab 1052ccc9971eSMauro Carvalho ChehabRCU-Specific Fields in the ``task_struct`` Structure 1053ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1054ccc9971eSMauro Carvalho Chehab 1055ccc9971eSMauro Carvalho ChehabThe ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in 1056ccc9971eSMauro Carvalho Chehabthe ``task_struct`` structure: 1057ccc9971eSMauro Carvalho Chehab 1058ccc9971eSMauro Carvalho Chehab:: 1059ccc9971eSMauro Carvalho Chehab 1060ccc9971eSMauro Carvalho Chehab 1 #ifdef CONFIG_PREEMPT_RCU 1061ccc9971eSMauro Carvalho Chehab 2 int rcu_read_lock_nesting; 1062ccc9971eSMauro Carvalho Chehab 3 union rcu_special rcu_read_unlock_special; 1063ccc9971eSMauro Carvalho Chehab 4 struct list_head rcu_node_entry; 1064ccc9971eSMauro Carvalho Chehab 5 struct rcu_node *rcu_blocked_node; 1065ccc9971eSMauro Carvalho Chehab 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ 1066ccc9971eSMauro Carvalho Chehab 7 #ifdef CONFIG_TASKS_RCU 1067ccc9971eSMauro Carvalho Chehab 8 unsigned long rcu_tasks_nvcsw; 1068ccc9971eSMauro Carvalho Chehab 9 bool rcu_tasks_holdout; 1069ccc9971eSMauro Carvalho Chehab 10 struct list_head rcu_tasks_holdout_list; 1070ccc9971eSMauro Carvalho Chehab 11 int rcu_tasks_idle_cpu; 1071ccc9971eSMauro Carvalho Chehab 12 #endif /* #ifdef CONFIG_TASKS_RCU */ 1072ccc9971eSMauro Carvalho Chehab 1073ccc9971eSMauro Carvalho ChehabThe ``->rcu_read_lock_nesting`` field records the nesting level for RCU 1074ccc9971eSMauro Carvalho Chehabread-side critical sections, and the ``->rcu_read_unlock_special`` field 1075ccc9971eSMauro Carvalho Chehabis a bitmask that records special conditions that require 1076ccc9971eSMauro Carvalho Chehab``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry`` 1077ccc9971eSMauro Carvalho Chehabfield is used to form lists of tasks that have blocked within 1078ccc9971eSMauro Carvalho Chehabpreemptible-RCU read-side critical sections and the 1079ccc9971eSMauro Carvalho Chehab``->rcu_blocked_node`` field references the ``rcu_node`` structure whose 1080ccc9971eSMauro Carvalho Chehablist this task is a member of, or ``NULL`` if it is not blocked within a 1081ccc9971eSMauro Carvalho Chehabpreemptible-RCU read-side critical section. 1082ccc9971eSMauro Carvalho Chehab 1083ccc9971eSMauro Carvalho ChehabThe ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context 1084ccc9971eSMauro Carvalho Chehabswitches that this task had undergone at the beginning of the current 1085ccc9971eSMauro Carvalho Chehabtasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current 1086ccc9971eSMauro Carvalho Chehabtasks-RCU grace period is waiting on this task, 1087ccc9971eSMauro Carvalho Chehab``->rcu_tasks_holdout_list`` is a list element enqueuing this task on 1088ccc9971eSMauro Carvalho Chehabthe holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this 1089ccc9971eSMauro Carvalho Chehabidle task is running, but only if the task is currently running, that 1090ccc9971eSMauro Carvalho Chehabis, if the CPU is currently idle. 1091ccc9971eSMauro Carvalho Chehab 1092ccc9971eSMauro Carvalho ChehabAccessor Functions 1093ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~ 1094ccc9971eSMauro Carvalho Chehab 1095ccc9971eSMauro Carvalho ChehabThe following listing shows the ``rcu_get_root()``, 1096ccc9971eSMauro Carvalho Chehab``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()`` 1097ccc9971eSMauro Carvalho Chehabfunction and macros: 1098ccc9971eSMauro Carvalho Chehab 1099ccc9971eSMauro Carvalho Chehab:: 1100ccc9971eSMauro Carvalho Chehab 1101ccc9971eSMauro Carvalho Chehab 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) 1102ccc9971eSMauro Carvalho Chehab 2 { 1103ccc9971eSMauro Carvalho Chehab 3 return &rsp->node[0]; 1104ccc9971eSMauro Carvalho Chehab 4 } 1105ccc9971eSMauro Carvalho Chehab 5 1106ccc9971eSMauro Carvalho Chehab 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ 1107ccc9971eSMauro Carvalho Chehab 7 for ((rnp) = &(rsp)->node[0]; \ 1108ccc9971eSMauro Carvalho Chehab 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) 1109ccc9971eSMauro Carvalho Chehab 9 1110ccc9971eSMauro Carvalho Chehab 10 #define rcu_for_each_leaf_node(rsp, rnp) \ 1111ccc9971eSMauro Carvalho Chehab 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ 1112ccc9971eSMauro Carvalho Chehab 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) 1113ccc9971eSMauro Carvalho Chehab 1114ccc9971eSMauro Carvalho ChehabThe ``rcu_get_root()`` simply returns a pointer to the first element of 1115ccc9971eSMauro Carvalho Chehabthe specified ``rcu_state`` structure's ``->node[]`` array, which is the 1116ccc9971eSMauro Carvalho Chehabroot ``rcu_node`` structure. 1117ccc9971eSMauro Carvalho Chehab 1118ccc9971eSMauro Carvalho ChehabAs noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes 1119ccc9971eSMauro Carvalho Chehabadvantage of the layout of the ``rcu_node`` structures in the 1120ccc9971eSMauro Carvalho Chehab``rcu_state`` structure's ``->node[]`` array, performing a breadth-first 1121ccc9971eSMauro Carvalho Chehabtraversal by simply traversing the array in order. Similarly, the 1122ccc9971eSMauro Carvalho Chehab``rcu_for_each_leaf_node()`` macro traverses only the last part of the 1123ccc9971eSMauro Carvalho Chehabarray, thus traversing only the leaf ``rcu_node`` structures. 1124ccc9971eSMauro Carvalho Chehab 1125ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1126ccc9971eSMauro Carvalho Chehab| **Quick Quiz**: | 1127ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1128ccc9971eSMauro Carvalho Chehab| What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree | 1129ccc9971eSMauro Carvalho Chehab| contains only a single node? | 1130ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1131ccc9971eSMauro Carvalho Chehab| **Answer**: | 1132ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1133ccc9971eSMauro Carvalho Chehab| In the single-node case, ``rcu_for_each_leaf_node()`` traverses the | 1134ccc9971eSMauro Carvalho Chehab| single node. | 1135ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+ 1136ccc9971eSMauro Carvalho Chehab 1137ccc9971eSMauro Carvalho ChehabSummary 1138ccc9971eSMauro Carvalho Chehab~~~~~~~ 1139ccc9971eSMauro Carvalho Chehab 1140ccc9971eSMauro Carvalho ChehabSo the state of RCU is represented by an ``rcu_state`` structure, which 1141ccc9971eSMauro Carvalho Chehabcontains a combining tree of ``rcu_node`` and ``rcu_data`` structures. 1142ccc9971eSMauro Carvalho ChehabFinally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state 1143ccc9971eSMauro Carvalho Chehabis tracked by dynticks-related fields in the ``rcu_data`` structure. If 1144ccc9971eSMauro Carvalho Chehabyou made it this far, you are well prepared to read the code 1145ccc9971eSMauro Carvalho Chehabwalkthroughs in the other articles in this series. 1146ccc9971eSMauro Carvalho Chehab 1147ccc9971eSMauro Carvalho ChehabAcknowledgments 1148ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~ 1149ccc9971eSMauro Carvalho Chehab 1150ccc9971eSMauro Carvalho ChehabI owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul 1151ccc9971eSMauro Carvalho ChehabTurner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for 1152ccc9971eSMauro Carvalho Chehabhelping me get this document into a more human-readable state. 1153ccc9971eSMauro Carvalho Chehab 1154ccc9971eSMauro Carvalho ChehabLegal Statement 1155ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~ 1156ccc9971eSMauro Carvalho Chehab 1157ccc9971eSMauro Carvalho ChehabThis work represents the view of the author and does not necessarily 1158ccc9971eSMauro Carvalho Chehabrepresent the view of IBM. 1159ccc9971eSMauro Carvalho Chehab 1160ccc9971eSMauro Carvalho ChehabLinux is a registered trademark of Linus Torvalds. 1161ccc9971eSMauro Carvalho Chehab 1162ccc9971eSMauro Carvalho ChehabOther company, product, and service names may be trademarks or service 1163ccc9971eSMauro Carvalho Chehabmarks of others. 1164