1ccc9971eSMauro Carvalho Chehab===================================================
2ccc9971eSMauro Carvalho ChehabA Tour Through TREE_RCU's Data Structures [LWN.net]
3ccc9971eSMauro Carvalho Chehab===================================================
4ccc9971eSMauro Carvalho Chehab
5ccc9971eSMauro Carvalho ChehabDecember 18, 2016
6ccc9971eSMauro Carvalho Chehab
7ccc9971eSMauro Carvalho ChehabThis article was contributed by Paul E. McKenney
8ccc9971eSMauro Carvalho Chehab
9ccc9971eSMauro Carvalho ChehabIntroduction
10ccc9971eSMauro Carvalho Chehab============
11ccc9971eSMauro Carvalho Chehab
12ccc9971eSMauro Carvalho ChehabThis document describes RCU's major data structures and their relationship
13ccc9971eSMauro Carvalho Chehabto each other.
14ccc9971eSMauro Carvalho Chehab
15ccc9971eSMauro Carvalho ChehabData-Structure Relationships
16ccc9971eSMauro Carvalho Chehab============================
17ccc9971eSMauro Carvalho Chehab
18ccc9971eSMauro Carvalho ChehabRCU is for all intents and purposes a large state machine, and its
19ccc9971eSMauro Carvalho Chehabdata structures maintain the state in such a way as to allow RCU readers
20ccc9971eSMauro Carvalho Chehabto execute extremely quickly, while also processing the RCU grace periods
21ccc9971eSMauro Carvalho Chehabrequested by updaters in an efficient and extremely scalable fashion.
22ccc9971eSMauro Carvalho ChehabThe efficiency and scalability of RCU updaters is provided primarily
23ccc9971eSMauro Carvalho Chehabby a combining tree, as shown below:
24ccc9971eSMauro Carvalho Chehab
25ccc9971eSMauro Carvalho Chehab.. kernel-figure:: BigTreeClassicRCU.svg
26ccc9971eSMauro Carvalho Chehab
27ccc9971eSMauro Carvalho ChehabThis diagram shows an enclosing ``rcu_state`` structure containing a tree
28ccc9971eSMauro Carvalho Chehabof ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up
29ccc9971eSMauro Carvalho Chehabto 16 ``rcu_data`` structures associated with it, so that there are
30ccc9971eSMauro Carvalho Chehab``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU.
31ccc9971eSMauro Carvalho ChehabThis structure is adjusted at boot time, if needed, to handle the common
32ccc9971eSMauro Carvalho Chehabcase where ``nr_cpu_ids`` is much less than ``NR_CPUs``.
33ccc9971eSMauro Carvalho ChehabFor example, a number of Linux distributions set ``NR_CPUs=4096``,
34ccc9971eSMauro Carvalho Chehabwhich results in a three-level ``rcu_node`` tree.
35ccc9971eSMauro Carvalho ChehabIf the actual hardware has only 16 CPUs, RCU will adjust itself
36ccc9971eSMauro Carvalho Chehabat boot time, resulting in an ``rcu_node`` tree with only a single node.
37ccc9971eSMauro Carvalho Chehab
38ccc9971eSMauro Carvalho ChehabThe purpose of this combining tree is to allow per-CPU events
39ccc9971eSMauro Carvalho Chehabsuch as quiescent states, dyntick-idle transitions,
40ccc9971eSMauro Carvalho Chehaband CPU hotplug operations to be processed efficiently
41ccc9971eSMauro Carvalho Chehaband scalably.
42ccc9971eSMauro Carvalho ChehabQuiescent states are recorded by the per-CPU ``rcu_data`` structures,
43ccc9971eSMauro Carvalho Chehaband other events are recorded by the leaf-level ``rcu_node``
44ccc9971eSMauro Carvalho Chehabstructures.
45ccc9971eSMauro Carvalho ChehabAll of these events are combined at each level of the tree until finally
46ccc9971eSMauro Carvalho Chehabgrace periods are completed at the tree's root ``rcu_node``
47ccc9971eSMauro Carvalho Chehabstructure.
48ccc9971eSMauro Carvalho ChehabA grace period can be completed at the root once every CPU
49ccc9971eSMauro Carvalho Chehab(or, in the case of ``CONFIG_PREEMPT_RCU``, task)
50ccc9971eSMauro Carvalho Chehabhas passed through a quiescent state.
51ccc9971eSMauro Carvalho ChehabOnce a grace period has completed, record of that fact is propagated
52ccc9971eSMauro Carvalho Chehabback down the tree.
53ccc9971eSMauro Carvalho Chehab
54ccc9971eSMauro Carvalho ChehabAs can be seen from the diagram, on a 64-bit system
55ccc9971eSMauro Carvalho Chehaba two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
56ccc9971eSMauro Carvalho Chehabof 64 at the root and a fanout of 16 at the leaves.
57ccc9971eSMauro Carvalho Chehab
58ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
59ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
60ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
61ccc9971eSMauro Carvalho Chehab| Why isn't the fanout at the leaves also 64?                           |
62ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
63ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
64ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
65ccc9971eSMauro Carvalho Chehab| Because there are more types of events that affect the leaf-level     |
66ccc9971eSMauro Carvalho Chehab| ``rcu_node`` structures than further up the tree. Therefore, if the   |
67ccc9971eSMauro Carvalho Chehab| leaf ``rcu_node`` structures have fanout of 64, the contention on     |
68ccc9971eSMauro Carvalho Chehab| these structures' ``->structures`` becomes excessive. Experimentation |
69ccc9971eSMauro Carvalho Chehab| on a wide variety of systems has shown that a fanout of 16 works well |
70ccc9971eSMauro Carvalho Chehab| for the leaves of the ``rcu_node`` tree.                              |
71ccc9971eSMauro Carvalho Chehab|                                                                       |
72ccc9971eSMauro Carvalho Chehab| Of course, further experience with systems having hundreds or         |
73ccc9971eSMauro Carvalho Chehab| thousands of CPUs may demonstrate that the fanout for the non-leaf    |
74ccc9971eSMauro Carvalho Chehab| ``rcu_node`` structures must also be reduced. Such reduction can be   |
75ccc9971eSMauro Carvalho Chehab| easily carried out when and if it proves necessary. In the meantime,  |
76ccc9971eSMauro Carvalho Chehab| if you are using such a system and running into contention problems   |
77ccc9971eSMauro Carvalho Chehab| on the non-leaf ``rcu_node`` structures, you may use the              |
78ccc9971eSMauro Carvalho Chehab| ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the    |
79ccc9971eSMauro Carvalho Chehab| non-leaf fanout as needed.                                            |
80ccc9971eSMauro Carvalho Chehab|                                                                       |
81ccc9971eSMauro Carvalho Chehab| Kernels built for systems with strong NUMA characteristics might      |
82ccc9971eSMauro Carvalho Chehab| also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of      |
83ccc9971eSMauro Carvalho Chehab| the ``rcu_node`` structures align with hardware boundaries.           |
84ccc9971eSMauro Carvalho Chehab| However, there has thus far been no need for this.                    |
85ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
86ccc9971eSMauro Carvalho Chehab
87ccc9971eSMauro Carvalho ChehabIf your system has more than 1,024 CPUs (or more than 512 CPUs on a
88ccc9971eSMauro Carvalho Chehab32-bit system), then RCU will automatically add more levels to the tree.
89ccc9971eSMauro Carvalho ChehabFor example, if you are crazy enough to build a 64-bit system with
90ccc9971eSMauro Carvalho Chehab65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows:
91ccc9971eSMauro Carvalho Chehab
92ccc9971eSMauro Carvalho Chehab.. kernel-figure:: HugeTreeClassicRCU.svg
93ccc9971eSMauro Carvalho Chehab
94ccc9971eSMauro Carvalho ChehabRCU currently permits up to a four-level tree, which on a 64-bit system
95ccc9971eSMauro Carvalho Chehabaccommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
96ccc9971eSMauro Carvalho Chehab32-bit systems. On the other hand, you can set both
97ccc9971eSMauro Carvalho Chehab``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as
98ccc9971eSMauro Carvalho Chehab2, which would result in a 16-CPU test using a 4-level tree. This can be
99ccc9971eSMauro Carvalho Chehabuseful for testing large-system capabilities on small test machines.
100ccc9971eSMauro Carvalho Chehab
101ccc9971eSMauro Carvalho ChehabThis multi-level combining tree allows us to get most of the performance
102ccc9971eSMauro Carvalho Chehaband scalability benefits of partitioning, even though RCU grace-period
103ccc9971eSMauro Carvalho Chehabdetection is inherently a global operation. The trick here is that only
104ccc9971eSMauro Carvalho Chehabthe last CPU to report a quiescent state into a given ``rcu_node``
105ccc9971eSMauro Carvalho Chehabstructure need advance to the ``rcu_node`` structure at the next level
106ccc9971eSMauro Carvalho Chehabup the tree. This means that at the leaf-level ``rcu_node`` structure,
107ccc9971eSMauro Carvalho Chehabonly one access out of sixteen will progress up the tree. For the
108ccc9971eSMauro Carvalho Chehabinternal ``rcu_node`` structures, the situation is even more extreme:
109ccc9971eSMauro Carvalho ChehabOnly one access out of sixty-four will progress up the tree. Because the
110ccc9971eSMauro Carvalho Chehabvast majority of the CPUs do not progress up the tree, the lock
111ccc9971eSMauro Carvalho Chehabcontention remains roughly constant up the tree. No matter how many CPUs
112ccc9971eSMauro Carvalho Chehabthere are in the system, at most 64 quiescent-state reports per grace
113ccc9971eSMauro Carvalho Chehabperiod will progress all the way to the root ``rcu_node`` structure,
114ccc9971eSMauro Carvalho Chehabthus ensuring that the lock contention on that root ``rcu_node``
115ccc9971eSMauro Carvalho Chehabstructure remains acceptably low.
116ccc9971eSMauro Carvalho Chehab
117ccc9971eSMauro Carvalho ChehabIn effect, the combining tree acts like a big shock absorber, keeping
118ccc9971eSMauro Carvalho Chehablock contention under control at all tree levels regardless of the level
119ccc9971eSMauro Carvalho Chehabof loading on the system.
120ccc9971eSMauro Carvalho Chehab
121ccc9971eSMauro Carvalho ChehabRCU updaters wait for normal grace periods by registering RCU callbacks,
122ccc9971eSMauro Carvalho Chehabeither directly via ``call_rcu()`` or indirectly via
123ccc9971eSMauro Carvalho Chehab``synchronize_rcu()`` and friends. RCU callbacks are represented by
124ccc9971eSMauro Carvalho Chehab``rcu_head`` structures, which are queued on ``rcu_data`` structures
125ccc9971eSMauro Carvalho Chehabwhile they are waiting for a grace period to elapse, as shown in the
126ccc9971eSMauro Carvalho Chehabfollowing figure:
127ccc9971eSMauro Carvalho Chehab
128ccc9971eSMauro Carvalho Chehab.. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg
129ccc9971eSMauro Carvalho Chehab
130ccc9971eSMauro Carvalho ChehabThis figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data
131ccc9971eSMauro Carvalho Chehabstructures are related. Lesser data structures will be introduced with
132ccc9971eSMauro Carvalho Chehabthe algorithms that make use of them.
133ccc9971eSMauro Carvalho Chehab
134ccc9971eSMauro Carvalho ChehabNote that each of the data structures in the above figure has its own
135ccc9971eSMauro Carvalho Chehabsynchronization:
136ccc9971eSMauro Carvalho Chehab
137ccc9971eSMauro Carvalho Chehab#. Each ``rcu_state`` structures has a lock and a mutex, and some fields
138ccc9971eSMauro Carvalho Chehab   are protected by the corresponding root ``rcu_node`` structure's lock.
139ccc9971eSMauro Carvalho Chehab#. Each ``rcu_node`` structure has a spinlock.
140ccc9971eSMauro Carvalho Chehab#. The fields in ``rcu_data`` are private to the corresponding CPU,
141ccc9971eSMauro Carvalho Chehab   although a few can be read and written by other CPUs.
142ccc9971eSMauro Carvalho Chehab
143ccc9971eSMauro Carvalho ChehabIt is important to note that different data structures can have very
144ccc9971eSMauro Carvalho Chehabdifferent ideas about the state of RCU at any given time. For but one
145ccc9971eSMauro Carvalho Chehabexample, awareness of the start or end of a given RCU grace period
146ccc9971eSMauro Carvalho Chehabpropagates slowly through the data structures. This slow propagation is
147ccc9971eSMauro Carvalho Chehababsolutely necessary for RCU to have good read-side performance. If this
148ccc9971eSMauro Carvalho Chehabbalkanized implementation seems foreign to you, one useful trick is to
149ccc9971eSMauro Carvalho Chehabconsider each instance of these data structures to be a different
150ccc9971eSMauro Carvalho Chehabperson, each having the usual slightly different view of reality.
151ccc9971eSMauro Carvalho Chehab
152ccc9971eSMauro Carvalho ChehabThe general role of each of these data structures is as follows:
153ccc9971eSMauro Carvalho Chehab
154ccc9971eSMauro Carvalho Chehab#. ``rcu_state``: This structure forms the interconnection between the
155ccc9971eSMauro Carvalho Chehab   ``rcu_node`` and ``rcu_data`` structures, tracks grace periods,
156ccc9971eSMauro Carvalho Chehab   serves as short-term repository for callbacks orphaned by CPU-hotplug
157ccc9971eSMauro Carvalho Chehab   events, maintains ``rcu_barrier()`` state, tracks expedited
158ccc9971eSMauro Carvalho Chehab   grace-period state, and maintains state used to force quiescent
159ccc9971eSMauro Carvalho Chehab   states when grace periods extend too long,
160ccc9971eSMauro Carvalho Chehab#. ``rcu_node``: This structure forms the combining tree that propagates
161ccc9971eSMauro Carvalho Chehab   quiescent-state information from the leaves to the root, and also
162ccc9971eSMauro Carvalho Chehab   propagates grace-period information from the root to the leaves. It
163ccc9971eSMauro Carvalho Chehab   provides local copies of the grace-period state in order to allow
164ccc9971eSMauro Carvalho Chehab   this information to be accessed in a synchronized manner without
165ccc9971eSMauro Carvalho Chehab   suffering the scalability limitations that would otherwise be imposed
166ccc9971eSMauro Carvalho Chehab   by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the
167ccc9971eSMauro Carvalho Chehab   lists of tasks that have blocked while in their current RCU read-side
168ccc9971eSMauro Carvalho Chehab   critical section. In ``CONFIG_PREEMPT_RCU`` with
169ccc9971eSMauro Carvalho Chehab   ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node``
170ccc9971eSMauro Carvalho Chehab   priority-boosting kernel threads (kthreads) and state. Finally, it
171ccc9971eSMauro Carvalho Chehab   records CPU-hotplug state in order to determine which CPUs should be
172ccc9971eSMauro Carvalho Chehab   ignored during a given grace period.
173ccc9971eSMauro Carvalho Chehab#. ``rcu_data``: This per-CPU structure is the focus of quiescent-state
174ccc9971eSMauro Carvalho Chehab   detection and RCU callback queuing. It also tracks its relationship
175ccc9971eSMauro Carvalho Chehab   to the corresponding leaf ``rcu_node`` structure to allow
176ccc9971eSMauro Carvalho Chehab   more-efficient propagation of quiescent states up the ``rcu_node``
177ccc9971eSMauro Carvalho Chehab   combining tree. Like the ``rcu_node`` structure, it provides a local
178ccc9971eSMauro Carvalho Chehab   copy of the grace-period information to allow for-free synchronized
179ccc9971eSMauro Carvalho Chehab   access to this information from the corresponding CPU. Finally, this
180ccc9971eSMauro Carvalho Chehab   structure records past dyntick-idle state for the corresponding CPU
181ccc9971eSMauro Carvalho Chehab   and also tracks statistics.
182ccc9971eSMauro Carvalho Chehab#. ``rcu_head``: This structure represents RCU callbacks, and is the
183ccc9971eSMauro Carvalho Chehab   only structure allocated and managed by RCU users. The ``rcu_head``
184ccc9971eSMauro Carvalho Chehab   structure is normally embedded within the RCU-protected data
185ccc9971eSMauro Carvalho Chehab   structure.
186ccc9971eSMauro Carvalho Chehab
187ccc9971eSMauro Carvalho ChehabIf all you wanted from this article was a general notion of how RCU's
188ccc9971eSMauro Carvalho Chehabdata structures are related, you are done. Otherwise, each of the
189ccc9971eSMauro Carvalho Chehabfollowing sections give more details on the ``rcu_state``, ``rcu_node``
190ccc9971eSMauro Carvalho Chehaband ``rcu_data`` data structures.
191ccc9971eSMauro Carvalho Chehab
192ccc9971eSMauro Carvalho ChehabThe ``rcu_state`` Structure
193ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~~
194ccc9971eSMauro Carvalho Chehab
195ccc9971eSMauro Carvalho ChehabThe ``rcu_state`` structure is the base structure that represents the
196ccc9971eSMauro Carvalho Chehabstate of RCU in the system. This structure forms the interconnection
197ccc9971eSMauro Carvalho Chehabbetween the ``rcu_node`` and ``rcu_data`` structures, tracks grace
198ccc9971eSMauro Carvalho Chehabperiods, contains the lock used to synchronize with CPU-hotplug events,
199ccc9971eSMauro Carvalho Chehaband maintains state used to force quiescent states when grace periods
200ccc9971eSMauro Carvalho Chehabextend too long,
201ccc9971eSMauro Carvalho Chehab
202ccc9971eSMauro Carvalho ChehabA few of the ``rcu_state`` structure's fields are discussed, singly and
203ccc9971eSMauro Carvalho Chehabin groups, in the following sections. The more specialized fields are
204ccc9971eSMauro Carvalho Chehabcovered in the discussion of their use.
205ccc9971eSMauro Carvalho Chehab
206ccc9971eSMauro Carvalho ChehabRelationship to rcu_node and rcu_data Structures
207ccc9971eSMauro Carvalho Chehab''''''''''''''''''''''''''''''''''''''''''''''''
208ccc9971eSMauro Carvalho Chehab
209ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_state`` structure is declared as follows:
210ccc9971eSMauro Carvalho Chehab
211ccc9971eSMauro Carvalho Chehab::
212ccc9971eSMauro Carvalho Chehab
213ccc9971eSMauro Carvalho Chehab     1   struct rcu_node node[NUM_RCU_NODES];
214ccc9971eSMauro Carvalho Chehab     2   struct rcu_node *level[NUM_RCU_LVLS + 1];
215ccc9971eSMauro Carvalho Chehab     3   struct rcu_data __percpu *rda;
216ccc9971eSMauro Carvalho Chehab
217ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
218ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
219ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
220ccc9971eSMauro Carvalho Chehab| Wait a minute! You said that the ``rcu_node`` structures formed a     |
221ccc9971eSMauro Carvalho Chehab| tree, but they are declared as a flat array! What gives?              |
222ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
223ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
224ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
225ccc9971eSMauro Carvalho Chehab| The tree is laid out in the array. The first node In the array is the |
226ccc9971eSMauro Carvalho Chehab| head, the next set of nodes in the array are children of the head     |
227ccc9971eSMauro Carvalho Chehab| node, and so on until the last set of nodes in the array are the      |
228ccc9971eSMauro Carvalho Chehab| leaves.                                                               |
229ccc9971eSMauro Carvalho Chehab| See the following diagrams to see how this works.                     |
230ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
231ccc9971eSMauro Carvalho Chehab
232ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` tree is embedded into the ``->node[]`` array as shown
233ccc9971eSMauro Carvalho Chehabin the following figure:
234ccc9971eSMauro Carvalho Chehab
235ccc9971eSMauro Carvalho Chehab.. kernel-figure:: TreeMapping.svg
236ccc9971eSMauro Carvalho Chehab
237ccc9971eSMauro Carvalho ChehabOne interesting consequence of this mapping is that a breadth-first
238ccc9971eSMauro Carvalho Chehabtraversal of the tree is implemented as a simple linear scan of the
239ccc9971eSMauro Carvalho Chehabarray, which is in fact what the ``rcu_for_each_node_breadth_first()``
240ccc9971eSMauro Carvalho Chehabmacro does. This macro is used at the beginning and ends of grace
241ccc9971eSMauro Carvalho Chehabperiods.
242ccc9971eSMauro Carvalho Chehab
243ccc9971eSMauro Carvalho ChehabEach entry of the ``->level`` array references the first ``rcu_node``
244ccc9971eSMauro Carvalho Chehabstructure on the corresponding level of the tree, for example, as shown
245ccc9971eSMauro Carvalho Chehabbelow:
246ccc9971eSMauro Carvalho Chehab
247ccc9971eSMauro Carvalho Chehab.. kernel-figure:: TreeMappingLevel.svg
248ccc9971eSMauro Carvalho Chehab
249ccc9971eSMauro Carvalho ChehabThe zero\ :sup:`th` element of the array references the root
250ccc9971eSMauro Carvalho Chehab``rcu_node`` structure, the first element references the first child of
251ccc9971eSMauro Carvalho Chehabthe root ``rcu_node``, and finally the second element references the
252ccc9971eSMauro Carvalho Chehabfirst leaf ``rcu_node`` structure.
253ccc9971eSMauro Carvalho Chehab
254ccc9971eSMauro Carvalho ChehabFor whatever it is worth, if you draw the tree to be tree-shaped rather
255ccc9971eSMauro Carvalho Chehabthan array-shaped, it is easy to draw a planar representation:
256ccc9971eSMauro Carvalho Chehab
257ccc9971eSMauro Carvalho Chehab.. kernel-figure:: TreeLevel.svg
258ccc9971eSMauro Carvalho Chehab
259ccc9971eSMauro Carvalho ChehabFinally, the ``->rda`` field references a per-CPU pointer to the
260ccc9971eSMauro Carvalho Chehabcorresponding CPU's ``rcu_data`` structure.
261ccc9971eSMauro Carvalho Chehab
262ccc9971eSMauro Carvalho ChehabAll of these fields are constant once initialization is complete, and
263ccc9971eSMauro Carvalho Chehabtherefore need no protection.
264ccc9971eSMauro Carvalho Chehab
265ccc9971eSMauro Carvalho ChehabGrace-Period Tracking
266ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''
267ccc9971eSMauro Carvalho Chehab
268ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_state`` structure is declared as follows:
269ccc9971eSMauro Carvalho Chehab
270ccc9971eSMauro Carvalho Chehab::
271ccc9971eSMauro Carvalho Chehab
272ccc9971eSMauro Carvalho Chehab     1   unsigned long gp_seq;
273ccc9971eSMauro Carvalho Chehab
274ccc9971eSMauro Carvalho ChehabRCU grace periods are numbered, and the ``->gp_seq`` field contains the
275ccc9971eSMauro Carvalho Chehabcurrent grace-period sequence number. The bottom two bits are the state
276ccc9971eSMauro Carvalho Chehabof the current grace period, which can be zero for not yet started or
277ccc9971eSMauro Carvalho Chehabone for in progress. In other words, if the bottom two bits of
278ccc9971eSMauro Carvalho Chehab``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom
279ccc9971eSMauro Carvalho Chehabtwo bits indicates that something is broken. This field is protected by
280ccc9971eSMauro Carvalho Chehabthe root ``rcu_node`` structure's ``->lock`` field.
281ccc9971eSMauro Carvalho Chehab
282ccc9971eSMauro Carvalho ChehabThere are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data``
283ccc9971eSMauro Carvalho Chehabstructures as well. The fields in the ``rcu_state`` structure represent
284ccc9971eSMauro Carvalho Chehabthe most current value, and those of the other structures are compared
285ccc9971eSMauro Carvalho Chehabin order to detect the beginnings and ends of grace periods in a
286ccc9971eSMauro Carvalho Chehabdistributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
287ccc9971eSMauro Carvalho Chehab(down the tree from the root to the leaves) to ``rcu_data``.
288ccc9971eSMauro Carvalho Chehab
289ccc9971eSMauro Carvalho ChehabMiscellaneous
290ccc9971eSMauro Carvalho Chehab'''''''''''''
291ccc9971eSMauro Carvalho Chehab
292ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_state`` structure is declared as follows:
293ccc9971eSMauro Carvalho Chehab
294ccc9971eSMauro Carvalho Chehab::
295ccc9971eSMauro Carvalho Chehab
296ccc9971eSMauro Carvalho Chehab     1   unsigned long gp_max;
297ccc9971eSMauro Carvalho Chehab     2   char abbr;
298ccc9971eSMauro Carvalho Chehab     3   char *name;
299ccc9971eSMauro Carvalho Chehab
300ccc9971eSMauro Carvalho ChehabThe ``->gp_max`` field tracks the duration of the longest grace period
301ccc9971eSMauro Carvalho Chehabin jiffies. It is protected by the root ``rcu_node``'s ``->lock``.
302ccc9971eSMauro Carvalho Chehab
303ccc9971eSMauro Carvalho ChehabThe ``->name`` and ``->abbr`` fields distinguish between preemptible RCU
304ccc9971eSMauro Carvalho Chehab(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”).
305ccc9971eSMauro Carvalho ChehabThese fields are used for diagnostic and tracing purposes.
306ccc9971eSMauro Carvalho Chehab
307ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` Structure
308ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~
309ccc9971eSMauro Carvalho Chehab
310ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` structures form the combining tree that propagates
311ccc9971eSMauro Carvalho Chehabquiescent-state information from the leaves to the root and also that
312ccc9971eSMauro Carvalho Chehabpropagates grace-period information from the root down to the leaves.
313ccc9971eSMauro Carvalho ChehabThey provides local copies of the grace-period state in order to allow
314ccc9971eSMauro Carvalho Chehabthis information to be accessed in a synchronized manner without
315ccc9971eSMauro Carvalho Chehabsuffering the scalability limitations that would otherwise be imposed by
316ccc9971eSMauro Carvalho Chehabglobal locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists
317ccc9971eSMauro Carvalho Chehabof tasks that have blocked while in their current RCU read-side critical
318ccc9971eSMauro Carvalho Chehabsection. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they
319ccc9971eSMauro Carvalho Chehabmanage the per-\ ``rcu_node`` priority-boosting kernel threads
320ccc9971eSMauro Carvalho Chehab(kthreads) and state. Finally, they record CPU-hotplug state in order to
321ccc9971eSMauro Carvalho Chehabdetermine which CPUs should be ignored during a given grace period.
322ccc9971eSMauro Carvalho Chehab
323ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` structure's fields are discussed, singly and in groups,
324ccc9971eSMauro Carvalho Chehabin the following sections.
325ccc9971eSMauro Carvalho Chehab
326ccc9971eSMauro Carvalho ChehabConnection to Combining Tree
327ccc9971eSMauro Carvalho Chehab''''''''''''''''''''''''''''
328ccc9971eSMauro Carvalho Chehab
329ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_node`` structure is declared as follows:
330ccc9971eSMauro Carvalho Chehab
331ccc9971eSMauro Carvalho Chehab::
332ccc9971eSMauro Carvalho Chehab
333ccc9971eSMauro Carvalho Chehab     1   struct rcu_node *parent;
334ccc9971eSMauro Carvalho Chehab     2   u8 level;
335ccc9971eSMauro Carvalho Chehab     3   u8 grpnum;
336ccc9971eSMauro Carvalho Chehab     4   unsigned long grpmask;
337ccc9971eSMauro Carvalho Chehab     5   int grplo;
338ccc9971eSMauro Carvalho Chehab     6   int grphi;
339ccc9971eSMauro Carvalho Chehab
340ccc9971eSMauro Carvalho ChehabThe ``->parent`` pointer references the ``rcu_node`` one level up in the
341ccc9971eSMauro Carvalho Chehabtree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation
342ccc9971eSMauro Carvalho Chehabmakes heavy use of this field to push quiescent states up the tree. The
343ccc9971eSMauro Carvalho Chehab``->level`` field gives the level in the tree, with the root being at
344ccc9971eSMauro Carvalho Chehablevel zero, its children at level one, and so on. The ``->grpnum`` field
345ccc9971eSMauro Carvalho Chehabgives this node's position within the children of its parent, so this
346ccc9971eSMauro Carvalho Chehabnumber can range between 0 and 31 on 32-bit systems and between 0 and 63
347ccc9971eSMauro Carvalho Chehabon 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only
348ccc9971eSMauro Carvalho Chehabduring initialization and for tracing. The ``->grpmask`` field is the
349ccc9971eSMauro Carvalho Chehabbitmask counterpart of ``->grpnum``, and therefore always has exactly
350ccc9971eSMauro Carvalho Chehabone bit set. This mask is used to clear the bit corresponding to this
351ccc9971eSMauro Carvalho Chehab``rcu_node`` structure in its parent's bitmasks, which are described
352ccc9971eSMauro Carvalho Chehablater. Finally, the ``->grplo`` and ``->grphi`` fields contain the
353ccc9971eSMauro Carvalho Chehablowest and highest numbered CPU served by this ``rcu_node`` structure,
354ccc9971eSMauro Carvalho Chehabrespectively.
355ccc9971eSMauro Carvalho Chehab
356ccc9971eSMauro Carvalho ChehabAll of these fields are constant, and thus do not require any
357ccc9971eSMauro Carvalho Chehabsynchronization.
358ccc9971eSMauro Carvalho Chehab
359ccc9971eSMauro Carvalho ChehabSynchronization
360ccc9971eSMauro Carvalho Chehab'''''''''''''''
361ccc9971eSMauro Carvalho Chehab
362ccc9971eSMauro Carvalho ChehabThis field of the ``rcu_node`` structure is declared as follows:
363ccc9971eSMauro Carvalho Chehab
364ccc9971eSMauro Carvalho Chehab::
365ccc9971eSMauro Carvalho Chehab
366ccc9971eSMauro Carvalho Chehab     1   raw_spinlock_t lock;
367ccc9971eSMauro Carvalho Chehab
368ccc9971eSMauro Carvalho ChehabThis field is used to protect the remaining fields in this structure,
369ccc9971eSMauro Carvalho Chehabunless otherwise stated. That said, all of the fields in this structure
370ccc9971eSMauro Carvalho Chehabcan be accessed without locking for tracing purposes. Yes, this can
371ccc9971eSMauro Carvalho Chehabresult in confusing traces, but better some tracing confusion than to be
372ccc9971eSMauro Carvalho Chehabheisenbugged out of existence.
373ccc9971eSMauro Carvalho Chehab
374ccc9971eSMauro Carvalho Chehab.. _grace-period-tracking-1:
375ccc9971eSMauro Carvalho Chehab
376ccc9971eSMauro Carvalho ChehabGrace-Period Tracking
377ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''
378ccc9971eSMauro Carvalho Chehab
379ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_node`` structure is declared as follows:
380ccc9971eSMauro Carvalho Chehab
381ccc9971eSMauro Carvalho Chehab::
382ccc9971eSMauro Carvalho Chehab
383ccc9971eSMauro Carvalho Chehab     1   unsigned long gp_seq;
384ccc9971eSMauro Carvalho Chehab     2   unsigned long gp_seq_needed;
385ccc9971eSMauro Carvalho Chehab
386ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of
387ccc9971eSMauro Carvalho Chehabthe field of the same name in the ``rcu_state`` structure. They each may
388ccc9971eSMauro Carvalho Chehablag up to one step behind their ``rcu_state`` counterpart. If the bottom
389ccc9971eSMauro Carvalho Chehabtwo bits of a given ``rcu_node`` structure's ``->gp_seq`` field is zero,
390ccc9971eSMauro Carvalho Chehabthen this ``rcu_node`` structure believes that RCU is idle.
391ccc9971eSMauro Carvalho Chehab
392ccc9971eSMauro Carvalho ChehabThe ``>gp_seq`` field of each ``rcu_node`` structure is updated at the
393ccc9971eSMauro Carvalho Chehabbeginning and the end of each grace period.
394ccc9971eSMauro Carvalho Chehab
395ccc9971eSMauro Carvalho ChehabThe ``->gp_seq_needed`` fields record the furthest-in-the-future grace
396ccc9971eSMauro Carvalho Chehabperiod request seen by the corresponding ``rcu_node`` structure. The
397ccc9971eSMauro Carvalho Chehabrequest is considered fulfilled when the value of the ``->gp_seq`` field
398ccc9971eSMauro Carvalho Chehabequals or exceeds that of the ``->gp_seq_needed`` field.
399ccc9971eSMauro Carvalho Chehab
400ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
401ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
402ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
403ccc9971eSMauro Carvalho Chehab| Suppose that this ``rcu_node`` structure doesn't see a request for a  |
404ccc9971eSMauro Carvalho Chehab| very long time. Won't wrapping of the ``->gp_seq`` field cause        |
405ccc9971eSMauro Carvalho Chehab| problems?                                                             |
406ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
407ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
408ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
409ccc9971eSMauro Carvalho Chehab| No, because if the ``->gp_seq_needed`` field lags behind the          |
410ccc9971eSMauro Carvalho Chehab| ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at  |
411ccc9971eSMauro Carvalho Chehab| the end of the grace period. Modulo-arithmetic comparisons therefore  |
412ccc9971eSMauro Carvalho Chehab| will always get the correct answer, even with wrapping.               |
413ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
414ccc9971eSMauro Carvalho Chehab
415ccc9971eSMauro Carvalho ChehabQuiescent-State Tracking
416ccc9971eSMauro Carvalho Chehab''''''''''''''''''''''''
417ccc9971eSMauro Carvalho Chehab
418ccc9971eSMauro Carvalho ChehabThese fields manage the propagation of quiescent states up the combining
419ccc9971eSMauro Carvalho Chehabtree.
420ccc9971eSMauro Carvalho Chehab
421ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_node`` structure has fields as follows:
422ccc9971eSMauro Carvalho Chehab
423ccc9971eSMauro Carvalho Chehab::
424ccc9971eSMauro Carvalho Chehab
425ccc9971eSMauro Carvalho Chehab     1   unsigned long qsmask;
426ccc9971eSMauro Carvalho Chehab     2   unsigned long expmask;
427ccc9971eSMauro Carvalho Chehab     3   unsigned long qsmaskinit;
428ccc9971eSMauro Carvalho Chehab     4   unsigned long expmaskinit;
429ccc9971eSMauro Carvalho Chehab
430ccc9971eSMauro Carvalho ChehabThe ``->qsmask`` field tracks which of this ``rcu_node`` structure's
431ccc9971eSMauro Carvalho Chehabchildren still need to report quiescent states for the current normal
432ccc9971eSMauro Carvalho Chehabgrace period. Such children will have a value of 1 in their
433ccc9971eSMauro Carvalho Chehabcorresponding bit. Note that the leaf ``rcu_node`` structures should be
434ccc9971eSMauro Carvalho Chehabthought of as having ``rcu_data`` structures as their children.
435ccc9971eSMauro Carvalho ChehabSimilarly, the ``->expmask`` field tracks which of this ``rcu_node``
436ccc9971eSMauro Carvalho Chehabstructure's children still need to report quiescent states for the
437ccc9971eSMauro Carvalho Chehabcurrent expedited grace period. An expedited grace period has the same
438ccc9971eSMauro Carvalho Chehabconceptual properties as a normal grace period, but the expedited
439ccc9971eSMauro Carvalho Chehabimplementation accepts extreme CPU overhead to obtain much lower
440ccc9971eSMauro Carvalho Chehabgrace-period latency, for example, consuming a few tens of microseconds
441ccc9971eSMauro Carvalho Chehabworth of CPU time to reduce grace-period duration from milliseconds to
442ccc9971eSMauro Carvalho Chehabtens of microseconds. The ``->qsmaskinit`` field tracks which of this
443ccc9971eSMauro Carvalho Chehab``rcu_node`` structure's children cover for at least one online CPU.
444ccc9971eSMauro Carvalho ChehabThis mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is
445ccc9971eSMauro Carvalho Chehabused to initialize ``->expmask`` and the beginning of the normal and
446ccc9971eSMauro Carvalho Chehabexpedited grace periods, respectively.
447ccc9971eSMauro Carvalho Chehab
448ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
449ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
450ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
451ccc9971eSMauro Carvalho Chehab| Why are these bitmasks protected by locking? Come on, haven't you     |
452ccc9971eSMauro Carvalho Chehab| heard of atomic instructions???                                       |
453ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
454ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
455ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
456ccc9971eSMauro Carvalho Chehab| Lockless grace-period computation! Such a tantalizing possibility!    |
457ccc9971eSMauro Carvalho Chehab| But consider the following sequence of events:                        |
458ccc9971eSMauro Carvalho Chehab|                                                                       |
459ccc9971eSMauro Carvalho Chehab| #. CPU 0 has been in dyntick-idle mode for quite some time. When it   |
460ccc9971eSMauro Carvalho Chehab|    wakes up, it notices that the current RCU grace period needs it to |
461ccc9971eSMauro Carvalho Chehab|    report in, so it sets a flag where the scheduling clock interrupt  |
462ccc9971eSMauro Carvalho Chehab|    will find it.                                                      |
463ccc9971eSMauro Carvalho Chehab| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and       |
464ccc9971eSMauro Carvalho Chehab|    notices that CPU 0 has been in dyntick idle mode, which qualifies  |
465ccc9971eSMauro Carvalho Chehab|    as an extended quiescent state.                                    |
466ccc9971eSMauro Carvalho Chehab| #. CPU 0's scheduling clock interrupt fires in the middle of an RCU   |
467ccc9971eSMauro Carvalho Chehab|    read-side critical section, and notices that the RCU core needs    |
468ccc9971eSMauro Carvalho Chehab|    something, so commences RCU softirq processing.                    |
469ccc9971eSMauro Carvalho Chehab| #. CPU 0's softirq handler executes and is just about ready to report |
470ccc9971eSMauro Carvalho Chehab|    its quiescent state up the ``rcu_node`` tree.                      |
471ccc9971eSMauro Carvalho Chehab| #. But CPU 1 beats it to the punch, completing the current grace      |
472ccc9971eSMauro Carvalho Chehab|    period and starting a new one.                                     |
473ccc9971eSMauro Carvalho Chehab| #. CPU 0 now reports its quiescent state for the wrong grace period.  |
474ccc9971eSMauro Carvalho Chehab|    That grace period might now end before the RCU read-side critical  |
475ccc9971eSMauro Carvalho Chehab|    section. If that happens, disaster will ensue.                     |
476ccc9971eSMauro Carvalho Chehab|                                                                       |
477ccc9971eSMauro Carvalho Chehab| So the locking is absolutely required in order to coordinate clearing |
478ccc9971eSMauro Carvalho Chehab| of the bits with updating of the grace-period sequence number in      |
479ccc9971eSMauro Carvalho Chehab| ``->gp_seq``.                                                         |
480ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
481ccc9971eSMauro Carvalho Chehab
482ccc9971eSMauro Carvalho ChehabBlocked-Task Management
483ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''''
484ccc9971eSMauro Carvalho Chehab
485ccc9971eSMauro Carvalho Chehab``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU
486ccc9971eSMauro Carvalho Chehabread-side critical sections, and these tasks must be tracked explicitly.
487ccc9971eSMauro Carvalho ChehabThe details of exactly why and how they are tracked will be covered in a
488ccc9971eSMauro Carvalho Chehabseparate article on RCU read-side processing. For now, it is enough to
489ccc9971eSMauro Carvalho Chehabknow that the ``rcu_node`` structure tracks them.
490ccc9971eSMauro Carvalho Chehab
491ccc9971eSMauro Carvalho Chehab::
492ccc9971eSMauro Carvalho Chehab
493ccc9971eSMauro Carvalho Chehab     1   struct list_head blkd_tasks;
494ccc9971eSMauro Carvalho Chehab     2   struct list_head *gp_tasks;
495ccc9971eSMauro Carvalho Chehab     3   struct list_head *exp_tasks;
496ccc9971eSMauro Carvalho Chehab     4   bool wait_blkd_tasks;
497ccc9971eSMauro Carvalho Chehab
498ccc9971eSMauro Carvalho ChehabThe ``->blkd_tasks`` field is a list header for the list of blocked and
499ccc9971eSMauro Carvalho Chehabpreempted tasks. As tasks undergo context switches within RCU read-side
500ccc9971eSMauro Carvalho Chehabcritical sections, their ``task_struct`` structures are enqueued (via
501ccc9971eSMauro Carvalho Chehabthe ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the
502ccc9971eSMauro Carvalho Chehab``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding
503ccc9971eSMauro Carvalho Chehabto the CPU on which the outgoing context switch executed. As these tasks
504ccc9971eSMauro Carvalho Chehablater exit their RCU read-side critical sections, they remove themselves
505ccc9971eSMauro Carvalho Chehabfrom the list. This list is therefore in reverse time order, so that if
506ccc9971eSMauro Carvalho Chehabone of the tasks is blocking the current grace period, all subsequent
507ccc9971eSMauro Carvalho Chehabtasks must also be blocking that same grace period. Therefore, a single
508ccc9971eSMauro Carvalho Chehabpointer into this list suffices to track all tasks blocking a given
509ccc9971eSMauro Carvalho Chehabgrace period. That pointer is stored in ``->gp_tasks`` for normal grace
510ccc9971eSMauro Carvalho Chehabperiods and in ``->exp_tasks`` for expedited grace periods. These last
511ccc9971eSMauro Carvalho Chehabtwo fields are ``NULL`` if either there is no grace period in flight or
512ccc9971eSMauro Carvalho Chehabif there are no blocked tasks preventing that grace period from
513ccc9971eSMauro Carvalho Chehabcompleting. If either of these two pointers is referencing a task that
514ccc9971eSMauro Carvalho Chehabremoves itself from the ``->blkd_tasks`` list, then that task must
515ccc9971eSMauro Carvalho Chehabadvance the pointer to the next task on the list, or set the pointer to
516ccc9971eSMauro Carvalho Chehab``NULL`` if there are no subsequent tasks on the list.
517ccc9971eSMauro Carvalho Chehab
518ccc9971eSMauro Carvalho ChehabFor example, suppose that tasks T1, T2, and T3 are all hard-affinitied
519ccc9971eSMauro Carvalho Chehabto the largest-numbered CPU in the system. Then if task T1 blocked in an
520ccc9971eSMauro Carvalho ChehabRCU read-side critical section, then an expedited grace period started,
521ccc9971eSMauro Carvalho Chehabthen task T2 blocked in an RCU read-side critical section, then a normal
522ccc9971eSMauro Carvalho Chehabgrace period started, and finally task 3 blocked in an RCU read-side
523ccc9971eSMauro Carvalho Chehabcritical section, then the state of the last leaf ``rcu_node``
524ccc9971eSMauro Carvalho Chehabstructure's blocked-task list would be as shown below:
525ccc9971eSMauro Carvalho Chehab
526ccc9971eSMauro Carvalho Chehab.. kernel-figure:: blkd_task.svg
527ccc9971eSMauro Carvalho Chehab
528ccc9971eSMauro Carvalho ChehabTask T1 is blocking both grace periods, task T2 is blocking only the
529ccc9971eSMauro Carvalho Chehabnormal grace period, and task T3 is blocking neither grace period. Note
530ccc9971eSMauro Carvalho Chehabthat these tasks will not remove themselves from this list immediately
531ccc9971eSMauro Carvalho Chehabupon resuming execution. They will instead remain on the list until they
532ccc9971eSMauro Carvalho Chehabexecute the outermost ``rcu_read_unlock()`` that ends their RCU
533ccc9971eSMauro Carvalho Chehabread-side critical section.
534ccc9971eSMauro Carvalho Chehab
535ccc9971eSMauro Carvalho ChehabThe ``->wait_blkd_tasks`` field indicates whether or not the current
536ccc9971eSMauro Carvalho Chehabgrace period is waiting on a blocked task.
537ccc9971eSMauro Carvalho Chehab
538ccc9971eSMauro Carvalho ChehabSizing the ``rcu_node`` Array
539ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''''''''''
540ccc9971eSMauro Carvalho Chehab
541ccc9971eSMauro Carvalho ChehabThe ``rcu_node`` array is sized via a series of C-preprocessor
542ccc9971eSMauro Carvalho Chehabexpressions as follows:
543ccc9971eSMauro Carvalho Chehab
544ccc9971eSMauro Carvalho Chehab::
545ccc9971eSMauro Carvalho Chehab
546ccc9971eSMauro Carvalho Chehab    1 #ifdef CONFIG_RCU_FANOUT
547ccc9971eSMauro Carvalho Chehab    2 #define RCU_FANOUT CONFIG_RCU_FANOUT
548ccc9971eSMauro Carvalho Chehab    3 #else
549ccc9971eSMauro Carvalho Chehab    4 # ifdef CONFIG_64BIT
550ccc9971eSMauro Carvalho Chehab    5 # define RCU_FANOUT 64
551ccc9971eSMauro Carvalho Chehab    6 # else
552ccc9971eSMauro Carvalho Chehab    7 # define RCU_FANOUT 32
553ccc9971eSMauro Carvalho Chehab    8 # endif
554ccc9971eSMauro Carvalho Chehab    9 #endif
555ccc9971eSMauro Carvalho Chehab   10
556ccc9971eSMauro Carvalho Chehab   11 #ifdef CONFIG_RCU_FANOUT_LEAF
557ccc9971eSMauro Carvalho Chehab   12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
558ccc9971eSMauro Carvalho Chehab   13 #else
559ccc9971eSMauro Carvalho Chehab   14 # ifdef CONFIG_64BIT
560ccc9971eSMauro Carvalho Chehab   15 # define RCU_FANOUT_LEAF 64
561ccc9971eSMauro Carvalho Chehab   16 # else
562ccc9971eSMauro Carvalho Chehab   17 # define RCU_FANOUT_LEAF 32
563ccc9971eSMauro Carvalho Chehab   18 # endif
564ccc9971eSMauro Carvalho Chehab   19 #endif
565ccc9971eSMauro Carvalho Chehab   20
566ccc9971eSMauro Carvalho Chehab   21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF)
567ccc9971eSMauro Carvalho Chehab   22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT)
568ccc9971eSMauro Carvalho Chehab   23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT)
569ccc9971eSMauro Carvalho Chehab   24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT)
570ccc9971eSMauro Carvalho Chehab   25
571ccc9971eSMauro Carvalho Chehab   26 #if NR_CPUS <= RCU_FANOUT_1
572ccc9971eSMauro Carvalho Chehab   27 #  define RCU_NUM_LVLS        1
573ccc9971eSMauro Carvalho Chehab   28 #  define NUM_RCU_LVL_0        1
574ccc9971eSMauro Carvalho Chehab   29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0
575ccc9971eSMauro Carvalho Chehab   30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 }
576ccc9971eSMauro Carvalho Chehab   31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" }
577ccc9971eSMauro Carvalho Chehab   32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" }
578ccc9971eSMauro Carvalho Chehab   33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" }
579ccc9971eSMauro Carvalho Chehab   34 #elif NR_CPUS <= RCU_FANOUT_2
580ccc9971eSMauro Carvalho Chehab   35 #  define RCU_NUM_LVLS        2
581ccc9971eSMauro Carvalho Chehab   36 #  define NUM_RCU_LVL_0        1
582ccc9971eSMauro Carvalho Chehab   37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
583ccc9971eSMauro Carvalho Chehab   38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
584ccc9971eSMauro Carvalho Chehab   39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
585ccc9971eSMauro Carvalho Chehab   40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" }
586ccc9971eSMauro Carvalho Chehab   41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" }
587ccc9971eSMauro Carvalho Chehab   42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" }
588ccc9971eSMauro Carvalho Chehab   43 #elif NR_CPUS <= RCU_FANOUT_3
589ccc9971eSMauro Carvalho Chehab   44 #  define RCU_NUM_LVLS        3
590ccc9971eSMauro Carvalho Chehab   45 #  define NUM_RCU_LVL_0        1
591ccc9971eSMauro Carvalho Chehab   46 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
592ccc9971eSMauro Carvalho Chehab   47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
593ccc9971eSMauro Carvalho Chehab   48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
594ccc9971eSMauro Carvalho Chehab   49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
595ccc9971eSMauro Carvalho Chehab   50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
596ccc9971eSMauro Carvalho Chehab   51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
597ccc9971eSMauro Carvalho Chehab   52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
598ccc9971eSMauro Carvalho Chehab   53 #elif NR_CPUS <= RCU_FANOUT_4
599ccc9971eSMauro Carvalho Chehab   54 #  define RCU_NUM_LVLS        4
600ccc9971eSMauro Carvalho Chehab   55 #  define NUM_RCU_LVL_0        1
601ccc9971eSMauro Carvalho Chehab   56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
602ccc9971eSMauro Carvalho Chehab   57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
603ccc9971eSMauro Carvalho Chehab   58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
604ccc9971eSMauro Carvalho Chehab   59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
605ccc9971eSMauro Carvalho Chehab   60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
606ccc9971eSMauro Carvalho Chehab   61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
607ccc9971eSMauro Carvalho Chehab   62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
608ccc9971eSMauro Carvalho Chehab   63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
609ccc9971eSMauro Carvalho Chehab   64 #else
610ccc9971eSMauro Carvalho Chehab   65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
611ccc9971eSMauro Carvalho Chehab   66 #endif
612ccc9971eSMauro Carvalho Chehab
613ccc9971eSMauro Carvalho ChehabThe maximum number of levels in the ``rcu_node`` structure is currently
614ccc9971eSMauro Carvalho Chehablimited to four, as specified by lines 21-24 and the structure of the
615ccc9971eSMauro Carvalho Chehabsubsequent “if” statement. For 32-bit systems, this allows
616ccc9971eSMauro Carvalho Chehab16*32*32*32=524,288 CPUs, which should be sufficient for the next few
617ccc9971eSMauro Carvalho Chehabyears at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is
618ccc9971eSMauro Carvalho Chehaballowed, which should see us through the next decade or so. This
619ccc9971eSMauro Carvalho Chehabfour-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8``
620ccc9971eSMauro Carvalho Chehabto support up to 4096 CPUs, which might be useful in very large systems
621ccc9971eSMauro Carvalho Chehabhaving eight CPUs per socket (but please note that no one has yet shown
622ccc9971eSMauro Carvalho Chehabany measurable performance degradation due to misaligned socket and
623ccc9971eSMauro Carvalho Chehab``rcu_node`` boundaries). In addition, building kernels with a full four
624ccc9971eSMauro Carvalho Chehablevels of ``rcu_node`` tree permits better testing of RCU's
625ccc9971eSMauro Carvalho Chehabcombining-tree code.
626ccc9971eSMauro Carvalho Chehab
627ccc9971eSMauro Carvalho ChehabThe ``RCU_FANOUT`` symbol controls how many children are permitted at
628ccc9971eSMauro Carvalho Chehabeach non-leaf level of the ``rcu_node`` tree. If the
629ccc9971eSMauro Carvalho Chehab``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based
630ccc9971eSMauro Carvalho Chehabon the word size of the system, which is also the Kconfig default.
631ccc9971eSMauro Carvalho Chehab
632ccc9971eSMauro Carvalho ChehabThe ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by
633ccc9971eSMauro Carvalho Chehabeach leaf ``rcu_node`` structure. Experience has shown that allowing a
634ccc9971eSMauro Carvalho Chehabgiven leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the
635ccc9971eSMauro Carvalho Chehabnumber of bits in the ``->qsmask`` field on a 64-bit system, results in
636ccc9971eSMauro Carvalho Chehabexcessive contention for the leaf ``rcu_node`` structures' ``->lock``
637ccc9971eSMauro Carvalho Chehabfields. The number of CPUs per leaf ``rcu_node`` structure is therefore
638ccc9971eSMauro Carvalho Chehablimited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If
639ccc9971eSMauro Carvalho Chehab``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based
640ccc9971eSMauro Carvalho Chehabon the word size of the system, just as for ``CONFIG_RCU_FANOUT``.
641ccc9971eSMauro Carvalho ChehabLines 11-19 perform this computation.
642ccc9971eSMauro Carvalho Chehab
643ccc9971eSMauro Carvalho ChehabLines 21-24 compute the maximum number of CPUs supported by a
644ccc9971eSMauro Carvalho Chehabsingle-level (which contains a single ``rcu_node`` structure),
645ccc9971eSMauro Carvalho Chehabtwo-level, three-level, and four-level ``rcu_node`` tree, respectively,
646ccc9971eSMauro Carvalho Chehabgiven the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``.
647ccc9971eSMauro Carvalho ChehabThese numbers of CPUs are retained in the ``RCU_FANOUT_1``,
648ccc9971eSMauro Carvalho Chehab``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor
649ccc9971eSMauro Carvalho Chehabvariables, respectively.
650ccc9971eSMauro Carvalho Chehab
651ccc9971eSMauro Carvalho ChehabThese variables are used to control the C-preprocessor ``#if`` statement
652ccc9971eSMauro Carvalho Chehabspanning lines 26-66 that computes the number of ``rcu_node`` structures
653ccc9971eSMauro Carvalho Chehabrequired for each level of the tree, as well as the number of levels
654ccc9971eSMauro Carvalho Chehabrequired. The number of levels is placed in the ``NUM_RCU_LVLS``
655ccc9971eSMauro Carvalho ChehabC-preprocessor variable by lines 27, 35, 44, and 54. The number of
656ccc9971eSMauro Carvalho Chehab``rcu_node`` structures for the topmost level of the tree is always
657ccc9971eSMauro Carvalho Chehabexactly one, and this value is unconditionally placed into
658ccc9971eSMauro Carvalho Chehab``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels
659ccc9971eSMauro Carvalho Chehab(if any) of the ``rcu_node`` tree are computed by dividing the maximum
660ccc9971eSMauro Carvalho Chehabnumber of CPUs by the fanout supported by the number of levels from the
661ccc9971eSMauro Carvalho Chehabcurrent level down, rounding up. This computation is performed by
662ccc9971eSMauro Carvalho Chehablines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create
663ccc9971eSMauro Carvalho Chehabinitializers for lockdep lock-class names. Finally, lines 64-66 produce
664ccc9971eSMauro Carvalho Chehaban error if the maximum number of CPUs is too large for the specified
665ccc9971eSMauro Carvalho Chehabfanout.
666ccc9971eSMauro Carvalho Chehab
667ccc9971eSMauro Carvalho ChehabThe ``rcu_segcblist`` Structure
668ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
669ccc9971eSMauro Carvalho Chehab
670ccc9971eSMauro Carvalho ChehabThe ``rcu_segcblist`` structure maintains a segmented list of callbacks
671ccc9971eSMauro Carvalho Chehabas follows:
672ccc9971eSMauro Carvalho Chehab
673ccc9971eSMauro Carvalho Chehab::
674ccc9971eSMauro Carvalho Chehab
675ccc9971eSMauro Carvalho Chehab    1 #define RCU_DONE_TAIL        0
676ccc9971eSMauro Carvalho Chehab    2 #define RCU_WAIT_TAIL        1
677ccc9971eSMauro Carvalho Chehab    3 #define RCU_NEXT_READY_TAIL  2
678ccc9971eSMauro Carvalho Chehab    4 #define RCU_NEXT_TAIL        3
679ccc9971eSMauro Carvalho Chehab    5 #define RCU_CBLIST_NSEGS     4
680ccc9971eSMauro Carvalho Chehab    6
681ccc9971eSMauro Carvalho Chehab    7 struct rcu_segcblist {
682ccc9971eSMauro Carvalho Chehab    8   struct rcu_head *head;
683ccc9971eSMauro Carvalho Chehab    9   struct rcu_head **tails[RCU_CBLIST_NSEGS];
684ccc9971eSMauro Carvalho Chehab   10   unsigned long gp_seq[RCU_CBLIST_NSEGS];
685ccc9971eSMauro Carvalho Chehab   11   long len;
686ccc9971eSMauro Carvalho Chehab   12   long len_lazy;
687ccc9971eSMauro Carvalho Chehab   13 };
688ccc9971eSMauro Carvalho Chehab
689ccc9971eSMauro Carvalho ChehabThe segments are as follows:
690ccc9971eSMauro Carvalho Chehab
691ccc9971eSMauro Carvalho Chehab#. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These
692ccc9971eSMauro Carvalho Chehab   callbacks are ready to be invoked.
693ccc9971eSMauro Carvalho Chehab#. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace
694ccc9971eSMauro Carvalho Chehab   period. Note that different CPUs can have different ideas about which
695ccc9971eSMauro Carvalho Chehab   grace period is current, hence the ``->gp_seq`` field.
696ccc9971eSMauro Carvalho Chehab#. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period
697ccc9971eSMauro Carvalho Chehab   to start.
698ccc9971eSMauro Carvalho Chehab#. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a
699ccc9971eSMauro Carvalho Chehab   grace period.
700ccc9971eSMauro Carvalho Chehab
701ccc9971eSMauro Carvalho ChehabThe ``->head`` pointer references the first callback or is ``NULL`` if
702ccc9971eSMauro Carvalho Chehabthe list contains no callbacks (which is *not* the same as being empty).
703ccc9971eSMauro Carvalho ChehabEach element of the ``->tails[]`` array references the ``->next``
704ccc9971eSMauro Carvalho Chehabpointer of the last callback in the corresponding segment of the list,
705ccc9971eSMauro Carvalho Chehabor the list's ``->head`` pointer if that segment and all previous
706ccc9971eSMauro Carvalho Chehabsegments are empty. If the corresponding segment is empty but some
707ccc9971eSMauro Carvalho Chehabprevious segment is not empty, then the array element is identical to
708ccc9971eSMauro Carvalho Chehabits predecessor. Older callbacks are closer to the head of the list, and
709ccc9971eSMauro Carvalho Chehabnew callbacks are added at the tail. This relationship between the
710ccc9971eSMauro Carvalho Chehab``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown
711ccc9971eSMauro Carvalho Chehabin this diagram:
712ccc9971eSMauro Carvalho Chehab
713ccc9971eSMauro Carvalho Chehab.. kernel-figure:: nxtlist.svg
714ccc9971eSMauro Carvalho Chehab
715ccc9971eSMauro Carvalho ChehabIn this figure, the ``->head`` pointer references the first RCU callback
716ccc9971eSMauro Carvalho Chehabin the list. The ``->tails[RCU_DONE_TAIL]`` array element references the
717ccc9971eSMauro Carvalho Chehab``->head`` pointer itself, indicating that none of the callbacks is
718ccc9971eSMauro Carvalho Chehabready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references
719ccc9971eSMauro Carvalho Chehabcallback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2
720ccc9971eSMauro Carvalho Chehabare both waiting on the current grace period, give or take possible
721ccc9971eSMauro Carvalho Chehabdisagreements about exactly which grace period is the current one. The
722ccc9971eSMauro Carvalho Chehab``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU
723ccc9971eSMauro Carvalho Chehabcallback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that
724ccc9971eSMauro Carvalho Chehabthere are no callbacks waiting on the next RCU grace period. The
725ccc9971eSMauro Carvalho Chehab``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next``
726ccc9971eSMauro Carvalho Chehabpointer, indicating that all the remaining RCU callbacks have not yet
727ccc9971eSMauro Carvalho Chehabbeen assigned to an RCU grace period. Note that the
728ccc9971eSMauro Carvalho Chehab``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU
729ccc9971eSMauro Carvalho Chehabcallback's ``->next`` pointer unless the callback list is empty, in
730ccc9971eSMauro Carvalho Chehabwhich case it references the ``->head`` pointer.
731ccc9971eSMauro Carvalho Chehab
732ccc9971eSMauro Carvalho ChehabThere is one additional important special case for the
733ccc9971eSMauro Carvalho Chehab``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this
734ccc9971eSMauro Carvalho Chehablist is *disabled*. Lists are disabled when the corresponding CPU is
735ccc9971eSMauro Carvalho Chehaboffline or when the corresponding CPU's callbacks are offloaded to a
736ccc9971eSMauro Carvalho Chehabkthread, both of which are described elsewhere.
737ccc9971eSMauro Carvalho Chehab
738ccc9971eSMauro Carvalho ChehabCPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the
739ccc9971eSMauro Carvalho Chehab``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the
740ccc9971eSMauro Carvalho Chehab``RCU_DONE_TAIL`` list segments as grace periods advance.
741ccc9971eSMauro Carvalho Chehab
742ccc9971eSMauro Carvalho ChehabThe ``->gp_seq[]`` array records grace-period numbers corresponding to
743ccc9971eSMauro Carvalho Chehabthe list segments. This is what allows different CPUs to have different
744ccc9971eSMauro Carvalho Chehabideas as to which is the current grace period while still avoiding
745ccc9971eSMauro Carvalho Chehabpremature invocation of their callbacks. In particular, this allows CPUs
746ccc9971eSMauro Carvalho Chehabthat go idle for extended periods to determine which of their callbacks
747ccc9971eSMauro Carvalho Chehabare ready to be invoked after reawakening.
748ccc9971eSMauro Carvalho Chehab
749ccc9971eSMauro Carvalho ChehabThe ``->len`` counter contains the number of callbacks in ``->head``,
750ccc9971eSMauro Carvalho Chehaband the ``->len_lazy`` contains the number of those callbacks that are
751ccc9971eSMauro Carvalho Chehabknown to only free memory, and whose invocation can therefore be safely
752ccc9971eSMauro Carvalho Chehabdeferred.
753ccc9971eSMauro Carvalho Chehab
754ccc9971eSMauro Carvalho Chehab.. important::
755ccc9971eSMauro Carvalho Chehab
756ccc9971eSMauro Carvalho Chehab   It is the ``->len`` field that determines whether or
757ccc9971eSMauro Carvalho Chehab   not there are callbacks associated with this ``rcu_segcblist``
758ccc9971eSMauro Carvalho Chehab   structure, *not* the ``->head`` pointer. The reason for this is that all
759ccc9971eSMauro Carvalho Chehab   the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL``
760ccc9971eSMauro Carvalho Chehab   segment) are extracted all at once at callback-invocation time
761ccc9971eSMauro Carvalho Chehab   (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there
762ccc9971eSMauro Carvalho Chehab   are no not-done callbacks remaining in the ``rcu_segcblist``. If
763ccc9971eSMauro Carvalho Chehab   callback invocation must be postponed, for example, because a
764ccc9971eSMauro Carvalho Chehab   high-priority process just woke up on this CPU, then the remaining
765ccc9971eSMauro Carvalho Chehab   callbacks are placed back on the ``RCU_DONE_TAIL`` segment and
766ccc9971eSMauro Carvalho Chehab   ``->head`` once again points to the start of the segment. In short, the
767ccc9971eSMauro Carvalho Chehab   head field can briefly be ``NULL`` even though the CPU has callbacks
768ccc9971eSMauro Carvalho Chehab   present the entire time. Therefore, it is not appropriate to test the
769ccc9971eSMauro Carvalho Chehab   ``->head`` pointer for ``NULL``.
770ccc9971eSMauro Carvalho Chehab
771ccc9971eSMauro Carvalho ChehabIn contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only
772ccc9971eSMauro Carvalho Chehabafter the corresponding callbacks have been invoked. This means that the
773ccc9971eSMauro Carvalho Chehab``->len`` count is zero only if the ``rcu_segcblist`` structure really
774ccc9971eSMauro Carvalho Chehabis devoid of callbacks. Of course, off-CPU sampling of the ``->len``
775ccc9971eSMauro Carvalho Chehabcount requires careful use of appropriate synchronization, for example,
776ccc9971eSMauro Carvalho Chehabmemory barriers. This synchronization can be a bit subtle, particularly
777ccc9971eSMauro Carvalho Chehabin the case of ``rcu_barrier()``.
778ccc9971eSMauro Carvalho Chehab
779ccc9971eSMauro Carvalho ChehabThe ``rcu_data`` Structure
780ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~
781ccc9971eSMauro Carvalho Chehab
782ccc9971eSMauro Carvalho ChehabThe ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The
783ccc9971eSMauro Carvalho Chehabfields in this structure may be accessed only from the corresponding CPU
784ccc9971eSMauro Carvalho Chehab(and from tracing) unless otherwise stated. This structure is the focus
785ccc9971eSMauro Carvalho Chehabof quiescent-state detection and RCU callback queuing. It also tracks
786ccc9971eSMauro Carvalho Chehabits relationship to the corresponding leaf ``rcu_node`` structure to
787ccc9971eSMauro Carvalho Chehaballow more-efficient propagation of quiescent states up the ``rcu_node``
788ccc9971eSMauro Carvalho Chehabcombining tree. Like the ``rcu_node`` structure, it provides a local
789ccc9971eSMauro Carvalho Chehabcopy of the grace-period information to allow for-free synchronized
790ccc9971eSMauro Carvalho Chehabaccess to this information from the corresponding CPU. Finally, this
791ccc9971eSMauro Carvalho Chehabstructure records past dyntick-idle state for the corresponding CPU and
792ccc9971eSMauro Carvalho Chehabalso tracks statistics.
793ccc9971eSMauro Carvalho Chehab
794ccc9971eSMauro Carvalho ChehabThe ``rcu_data`` structure's fields are discussed, singly and in groups,
795ccc9971eSMauro Carvalho Chehabin the following sections.
796ccc9971eSMauro Carvalho Chehab
797ccc9971eSMauro Carvalho ChehabConnection to Other Data Structures
798ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''''''''''''''''
799ccc9971eSMauro Carvalho Chehab
800ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_data`` structure is declared as follows:
801ccc9971eSMauro Carvalho Chehab
802ccc9971eSMauro Carvalho Chehab::
803ccc9971eSMauro Carvalho Chehab
804ccc9971eSMauro Carvalho Chehab     1   int cpu;
805ccc9971eSMauro Carvalho Chehab     2   struct rcu_node *mynode;
806ccc9971eSMauro Carvalho Chehab     3   unsigned long grpmask;
807ccc9971eSMauro Carvalho Chehab     4   bool beenonline;
808ccc9971eSMauro Carvalho Chehab
809ccc9971eSMauro Carvalho ChehabThe ``->cpu`` field contains the number of the corresponding CPU and the
810ccc9971eSMauro Carvalho Chehab``->mynode`` field references the corresponding ``rcu_node`` structure.
811ccc9971eSMauro Carvalho ChehabThe ``->mynode`` is used to propagate quiescent states up the combining
812ccc9971eSMauro Carvalho Chehabtree. These two fields are constant and therefore do not require
813ccc9971eSMauro Carvalho Chehabsynchronization.
814ccc9971eSMauro Carvalho Chehab
815ccc9971eSMauro Carvalho ChehabThe ``->grpmask`` field indicates the bit in the ``->mynode->qsmask``
816ccc9971eSMauro Carvalho Chehabcorresponding to this ``rcu_data`` structure, and is also used when
817ccc9971eSMauro Carvalho Chehabpropagating quiescent states. The ``->beenonline`` flag is set whenever
818ccc9971eSMauro Carvalho Chehabthe corresponding CPU comes online, which means that the debugfs tracing
819ccc9971eSMauro Carvalho Chehabneed not dump out any ``rcu_data`` structure for which this flag is not
820ccc9971eSMauro Carvalho Chehabset.
821ccc9971eSMauro Carvalho Chehab
822ccc9971eSMauro Carvalho ChehabQuiescent-State and Grace-Period Tracking
823ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''''''''''''''''''''''
824ccc9971eSMauro Carvalho Chehab
825ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_data`` structure is declared as follows:
826ccc9971eSMauro Carvalho Chehab
827ccc9971eSMauro Carvalho Chehab::
828ccc9971eSMauro Carvalho Chehab
829ccc9971eSMauro Carvalho Chehab     1   unsigned long gp_seq;
830ccc9971eSMauro Carvalho Chehab     2   unsigned long gp_seq_needed;
831ccc9971eSMauro Carvalho Chehab     3   bool cpu_no_qs;
832ccc9971eSMauro Carvalho Chehab     4   bool core_needs_qs;
833ccc9971eSMauro Carvalho Chehab     5   bool gpwrap;
834ccc9971eSMauro Carvalho Chehab
835ccc9971eSMauro Carvalho ChehabThe ``->gp_seq`` field is the counterpart of the field of the same name
836ccc9971eSMauro Carvalho Chehabin the ``rcu_state`` and ``rcu_node`` structures. The
837ccc9971eSMauro Carvalho Chehab``->gp_seq_needed`` field is the counterpart of the field of the same
838ccc9971eSMauro Carvalho Chehabname in the rcu_node structure. They may each lag up to one behind their
839ccc9971eSMauro Carvalho Chehab``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and
840ccc9971eSMauro Carvalho Chehab``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in
841ccc9971eSMauro Carvalho Chehabdyntick-idle mode (but these counters will catch up upon exit from
842ccc9971eSMauro Carvalho Chehabdyntick-idle mode). If the lower two bits of a given ``rcu_data``
843ccc9971eSMauro Carvalho Chehabstructure's ``->gp_seq`` are zero, then this ``rcu_data`` structure
844ccc9971eSMauro Carvalho Chehabbelieves that RCU is idle.
845ccc9971eSMauro Carvalho Chehab
846ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
847ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
848ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
849ccc9971eSMauro Carvalho Chehab| All this replication of the grace period numbers can only cause       |
850ccc9971eSMauro Carvalho Chehab| massive confusion. Why not just keep a global sequence number and be  |
851ccc9971eSMauro Carvalho Chehab| done with it???                                                       |
852ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
853ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
854ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
855ccc9971eSMauro Carvalho Chehab| Because if there was only a single global sequence numbers, there     |
856ccc9971eSMauro Carvalho Chehab| would need to be a single global lock to allow safely accessing and   |
857ccc9971eSMauro Carvalho Chehab| updating it. And if we are not going to have a single global lock, we |
858ccc9971eSMauro Carvalho Chehab| need to carefully manage the numbers on a per-node basis. Recall from |
859ccc9971eSMauro Carvalho Chehab| the answer to a previous Quick Quiz that the consequences of applying |
860ccc9971eSMauro Carvalho Chehab| a previously sampled quiescent state to the wrong grace period are    |
861ccc9971eSMauro Carvalho Chehab| quite severe.                                                         |
862ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
863ccc9971eSMauro Carvalho Chehab
864ccc9971eSMauro Carvalho ChehabThe ``->cpu_no_qs`` flag indicates that the CPU has not yet passed
865ccc9971eSMauro Carvalho Chehabthrough a quiescent state, while the ``->core_needs_qs`` flag indicates
866ccc9971eSMauro Carvalho Chehabthat the RCU core needs a quiescent state from the corresponding CPU.
867ccc9971eSMauro Carvalho ChehabThe ``->gpwrap`` field indicates that the corresponding CPU has remained
868ccc9971eSMauro Carvalho Chehabidle for so long that the ``gp_seq`` counter is in danger of overflow,
869ccc9971eSMauro Carvalho Chehabwhich will cause the CPU to disregard the values of its counters on its
870ccc9971eSMauro Carvalho Chehabnext exit from idle.
871ccc9971eSMauro Carvalho Chehab
872ccc9971eSMauro Carvalho ChehabRCU Callback Handling
873ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''
874ccc9971eSMauro Carvalho Chehab
875ccc9971eSMauro Carvalho ChehabIn the absence of CPU-hotplug events, RCU callbacks are invoked by the
876ccc9971eSMauro Carvalho Chehabsame CPU that registered them. This is strictly a cache-locality
877ccc9971eSMauro Carvalho Chehaboptimization: callbacks can and do get invoked on CPUs other than the
878ccc9971eSMauro Carvalho Chehabone that registered them. After all, if the CPU that registered a given
879ccc9971eSMauro Carvalho Chehabcallback has gone offline before the callback can be invoked, there
880ccc9971eSMauro Carvalho Chehabreally is no other choice.
881ccc9971eSMauro Carvalho Chehab
882ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_data`` structure is declared as follows:
883ccc9971eSMauro Carvalho Chehab
884ccc9971eSMauro Carvalho Chehab::
885ccc9971eSMauro Carvalho Chehab
886ccc9971eSMauro Carvalho Chehab    1 struct rcu_segcblist cblist;
887ccc9971eSMauro Carvalho Chehab    2 long qlen_last_fqs_check;
888ccc9971eSMauro Carvalho Chehab    3 unsigned long n_cbs_invoked;
889ccc9971eSMauro Carvalho Chehab    4 unsigned long n_nocbs_invoked;
890ccc9971eSMauro Carvalho Chehab    5 unsigned long n_cbs_orphaned;
891ccc9971eSMauro Carvalho Chehab    6 unsigned long n_cbs_adopted;
892ccc9971eSMauro Carvalho Chehab    7 unsigned long n_force_qs_snap;
893ccc9971eSMauro Carvalho Chehab    8 long blimit;
894ccc9971eSMauro Carvalho Chehab
895ccc9971eSMauro Carvalho ChehabThe ``->cblist`` structure is the segmented callback list described
896ccc9971eSMauro Carvalho Chehabearlier. The CPU advances the callbacks in its ``rcu_data`` structure
897ccc9971eSMauro Carvalho Chehabwhenever it notices that another RCU grace period has completed. The CPU
898ccc9971eSMauro Carvalho Chehabdetects the completion of an RCU grace period by noticing that the value
899ccc9971eSMauro Carvalho Chehabof its ``rcu_data`` structure's ``->gp_seq`` field differs from that of
900ccc9971eSMauro Carvalho Chehabits leaf ``rcu_node`` structure. Recall that each ``rcu_node``
901ccc9971eSMauro Carvalho Chehabstructure's ``->gp_seq`` field is updated at the beginnings and ends of
902ccc9971eSMauro Carvalho Chehabeach grace period.
903ccc9971eSMauro Carvalho Chehab
904ccc9971eSMauro Carvalho ChehabThe ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the
905ccc9971eSMauro Carvalho Chehabforcing of quiescent states from ``call_rcu()`` and friends when
906ccc9971eSMauro Carvalho Chehabcallback lists grow excessively long.
907ccc9971eSMauro Carvalho Chehab
908ccc9971eSMauro Carvalho ChehabThe ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted``
909ccc9971eSMauro Carvalho Chehabfields count the number of callbacks invoked, sent to other CPUs when
910ccc9971eSMauro Carvalho Chehabthis CPU goes offline, and received from other CPUs when those other
911ccc9971eSMauro Carvalho ChehabCPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's
912ccc9971eSMauro Carvalho Chehabcallbacks are offloaded to a kthread.
913ccc9971eSMauro Carvalho Chehab
914ccc9971eSMauro Carvalho ChehabFinally, the ``->blimit`` counter is the maximum number of RCU callbacks
915ccc9971eSMauro Carvalho Chehabthat may be invoked at a given time.
916ccc9971eSMauro Carvalho Chehab
917ccc9971eSMauro Carvalho ChehabDyntick-Idle Handling
918ccc9971eSMauro Carvalho Chehab'''''''''''''''''''''
919ccc9971eSMauro Carvalho Chehab
920ccc9971eSMauro Carvalho ChehabThis portion of the ``rcu_data`` structure is declared as follows:
921ccc9971eSMauro Carvalho Chehab
922ccc9971eSMauro Carvalho Chehab::
923ccc9971eSMauro Carvalho Chehab
924ccc9971eSMauro Carvalho Chehab     1   int dynticks_snap;
925ccc9971eSMauro Carvalho Chehab     2   unsigned long dynticks_fqs;
926ccc9971eSMauro Carvalho Chehab
927ccc9971eSMauro Carvalho ChehabThe ``->dynticks_snap`` field is used to take a snapshot of the
928ccc9971eSMauro Carvalho Chehabcorresponding CPU's dyntick-idle state when forcing quiescent states,
929ccc9971eSMauro Carvalho Chehaband is therefore accessed from other CPUs. Finally, the
930ccc9971eSMauro Carvalho Chehab``->dynticks_fqs`` field is used to count the number of times this CPU
931ccc9971eSMauro Carvalho Chehabis determined to be in dyntick-idle state, and is used for tracing and
932ccc9971eSMauro Carvalho Chehabdebugging purposes.
933ccc9971eSMauro Carvalho Chehab
934ccc9971eSMauro Carvalho ChehabThis portion of the rcu_data structure is declared as follows:
935ccc9971eSMauro Carvalho Chehab
936ccc9971eSMauro Carvalho Chehab::
937ccc9971eSMauro Carvalho Chehab
938ccc9971eSMauro Carvalho Chehab     1   long dynticks_nesting;
939ccc9971eSMauro Carvalho Chehab     2   long dynticks_nmi_nesting;
940ccc9971eSMauro Carvalho Chehab     3   atomic_t dynticks;
941ccc9971eSMauro Carvalho Chehab     4   bool rcu_need_heavy_qs;
942ccc9971eSMauro Carvalho Chehab     5   bool rcu_urgent_qs;
943ccc9971eSMauro Carvalho Chehab
944ccc9971eSMauro Carvalho ChehabThese fields in the rcu_data structure maintain the per-CPU dyntick-idle
945ccc9971eSMauro Carvalho Chehabstate for the corresponding CPU. The fields may be accessed only from
946ccc9971eSMauro Carvalho Chehabthe corresponding CPU (and from tracing) unless otherwise stated.
947ccc9971eSMauro Carvalho Chehab
948ccc9971eSMauro Carvalho ChehabThe ``->dynticks_nesting`` field counts the nesting depth of process
949ccc9971eSMauro Carvalho Chehabexecution, so that in normal circumstances this counter has value zero
950ccc9971eSMauro Carvalho Chehabor one. NMIs, irqs, and tracers are counted by the
951ccc9971eSMauro Carvalho Chehab``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes
952ccc9971eSMauro Carvalho Chehabto this variable have to be undertaken carefully using an algorithm
953ccc9971eSMauro Carvalho Chehabprovided by Andy Lutomirski. The initial transition from idle adds one,
954ccc9971eSMauro Carvalho Chehaband nested transitions add two, so that a nesting level of five is
955ccc9971eSMauro Carvalho Chehabrepresented by a ``->dynticks_nmi_nesting`` value of nine. This counter
956ccc9971eSMauro Carvalho Chehabcan therefore be thought of as counting the number of reasons why this
957ccc9971eSMauro Carvalho ChehabCPU cannot be permitted to enter dyntick-idle mode, aside from
958ccc9971eSMauro Carvalho Chehabprocess-level transitions.
959ccc9971eSMauro Carvalho Chehab
960ccc9971eSMauro Carvalho ChehabHowever, it turns out that when running in non-idle kernel context, the
961ccc9971eSMauro Carvalho ChehabLinux kernel is fully capable of entering interrupt handlers that never
962ccc9971eSMauro Carvalho Chehabexit and perhaps also vice versa. Therefore, whenever the
963ccc9971eSMauro Carvalho Chehab``->dynticks_nesting`` field is incremented up from zero, the
964ccc9971eSMauro Carvalho Chehab``->dynticks_nmi_nesting`` field is set to a large positive number, and
965ccc9971eSMauro Carvalho Chehabwhenever the ``->dynticks_nesting`` field is decremented down to zero,
9661b98b7c5SRandy Dunlapthe ``->dynticks_nmi_nesting`` field is set to zero. Assuming that
967ccc9971eSMauro Carvalho Chehabthe number of misnested interrupts is not sufficient to overflow the
968ccc9971eSMauro Carvalho Chehabcounter, this approach corrects the ``->dynticks_nmi_nesting`` field
969ccc9971eSMauro Carvalho Chehabevery time the corresponding CPU enters the idle loop from process
970ccc9971eSMauro Carvalho Chehabcontext.
971ccc9971eSMauro Carvalho Chehab
972ccc9971eSMauro Carvalho ChehabThe ``->dynticks`` field counts the corresponding CPU's transitions to
973ccc9971eSMauro Carvalho Chehaband from either dyntick-idle or user mode, so that this counter has an
974ccc9971eSMauro Carvalho Chehabeven value when the CPU is in dyntick-idle mode or user mode and an odd
975ccc9971eSMauro Carvalho Chehabvalue otherwise. The transitions to/from user mode need to be counted
976404147faSAkira Yokosawafor user mode adaptive-ticks support (see Documentation/timers/no_hz.rst).
977ccc9971eSMauro Carvalho Chehab
978ccc9971eSMauro Carvalho ChehabThe ``->rcu_need_heavy_qs`` field is used to record the fact that the
979ccc9971eSMauro Carvalho ChehabRCU core code would really like to see a quiescent state from the
980ccc9971eSMauro Carvalho Chehabcorresponding CPU, so much so that it is willing to call for
981ccc9971eSMauro Carvalho Chehabheavy-weight dyntick-counter operations. This flag is checked by RCU's
982ccc9971eSMauro Carvalho Chehabcontext-switch and ``cond_resched()`` code, which provide a momentary
983ccc9971eSMauro Carvalho Chehabidle sojourn in response.
984ccc9971eSMauro Carvalho Chehab
985ccc9971eSMauro Carvalho ChehabFinally, the ``->rcu_urgent_qs`` field is used to record the fact that
986ccc9971eSMauro Carvalho Chehabthe RCU core code would really like to see a quiescent state from the
987ccc9971eSMauro Carvalho Chehabcorresponding CPU, with the various other fields indicating just how
988ccc9971eSMauro Carvalho Chehabbadly RCU wants this quiescent state. This flag is checked by RCU's
989ccc9971eSMauro Carvalho Chehabcontext-switch path (``rcu_note_context_switch``) and the cond_resched
990ccc9971eSMauro Carvalho Chehabcode.
991ccc9971eSMauro Carvalho Chehab
992ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
993ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
994ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
995ccc9971eSMauro Carvalho Chehab| Why not simply combine the ``->dynticks_nesting`` and                 |
996ccc9971eSMauro Carvalho Chehab| ``->dynticks_nmi_nesting`` counters into a single counter that just   |
997ccc9971eSMauro Carvalho Chehab| counts the number of reasons that the corresponding CPU is non-idle?  |
998ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
999ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
1000ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1001ccc9971eSMauro Carvalho Chehab| Because this would fail in the presence of interrupts whose handlers  |
1002ccc9971eSMauro Carvalho Chehab| never return and of handlers that manage to return from a made-up     |
1003ccc9971eSMauro Carvalho Chehab| interrupt.                                                            |
1004ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1005ccc9971eSMauro Carvalho Chehab
1006ccc9971eSMauro Carvalho ChehabAdditional fields are present for some special-purpose builds, and are
1007ccc9971eSMauro Carvalho Chehabdiscussed separately.
1008ccc9971eSMauro Carvalho Chehab
1009ccc9971eSMauro Carvalho ChehabThe ``rcu_head`` Structure
1010ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~
1011ccc9971eSMauro Carvalho Chehab
1012ccc9971eSMauro Carvalho ChehabEach ``rcu_head`` structure represents an RCU callback. These structures
1013ccc9971eSMauro Carvalho Chehabare normally embedded within RCU-protected data structures whose
1014ccc9971eSMauro Carvalho Chehabalgorithms use asynchronous grace periods. In contrast, when using
1015ccc9971eSMauro Carvalho Chehabalgorithms that block waiting for RCU grace periods, RCU users need not
1016ccc9971eSMauro Carvalho Chehabprovide ``rcu_head`` structures.
1017ccc9971eSMauro Carvalho Chehab
1018ccc9971eSMauro Carvalho ChehabThe ``rcu_head`` structure has fields as follows:
1019ccc9971eSMauro Carvalho Chehab
1020ccc9971eSMauro Carvalho Chehab::
1021ccc9971eSMauro Carvalho Chehab
1022ccc9971eSMauro Carvalho Chehab     1   struct rcu_head *next;
1023ccc9971eSMauro Carvalho Chehab     2   void (*func)(struct rcu_head *head);
1024ccc9971eSMauro Carvalho Chehab
1025ccc9971eSMauro Carvalho ChehabThe ``->next`` field is used to link the ``rcu_head`` structures
1026ccc9971eSMauro Carvalho Chehabtogether in the lists within the ``rcu_data`` structures. The ``->func``
1027ccc9971eSMauro Carvalho Chehabfield is a pointer to the function to be called when the callback is
1028ccc9971eSMauro Carvalho Chehabready to be invoked, and this function is passed a pointer to the
1029ccc9971eSMauro Carvalho Chehab``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func``
1030ccc9971eSMauro Carvalho Chehabfield to record the offset of the ``rcu_head`` structure within the
1031ccc9971eSMauro Carvalho Chehabenclosing RCU-protected data structure.
1032ccc9971eSMauro Carvalho Chehab
1033ccc9971eSMauro Carvalho ChehabBoth of these fields are used internally by RCU. From the viewpoint of
1034ccc9971eSMauro Carvalho ChehabRCU users, this structure is an opaque “cookie”.
1035ccc9971eSMauro Carvalho Chehab
1036ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1037ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
1038ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1039ccc9971eSMauro Carvalho Chehab| Given that the callback function ``->func`` is passed a pointer to    |
1040ccc9971eSMauro Carvalho Chehab| the ``rcu_head`` structure, how is that function supposed to find the |
1041ccc9971eSMauro Carvalho Chehab| beginning of the enclosing RCU-protected data structure?              |
1042ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1043ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
1044ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1045ccc9971eSMauro Carvalho Chehab| In actual practice, there is a separate callback function per type of |
1046ccc9971eSMauro Carvalho Chehab| RCU-protected data structure. The callback function can therefore use |
1047ccc9971eSMauro Carvalho Chehab| the ``container_of()`` macro in the Linux kernel (or other            |
1048ccc9971eSMauro Carvalho Chehab| pointer-manipulation facilities in other software environments) to    |
1049ccc9971eSMauro Carvalho Chehab| find the beginning of the enclosing structure.                        |
1050ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1051ccc9971eSMauro Carvalho Chehab
1052ccc9971eSMauro Carvalho ChehabRCU-Specific Fields in the ``task_struct`` Structure
1053ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1054ccc9971eSMauro Carvalho Chehab
1055ccc9971eSMauro Carvalho ChehabThe ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in
1056ccc9971eSMauro Carvalho Chehabthe ``task_struct`` structure:
1057ccc9971eSMauro Carvalho Chehab
1058ccc9971eSMauro Carvalho Chehab::
1059ccc9971eSMauro Carvalho Chehab
1060ccc9971eSMauro Carvalho Chehab    1 #ifdef CONFIG_PREEMPT_RCU
1061ccc9971eSMauro Carvalho Chehab    2   int rcu_read_lock_nesting;
1062ccc9971eSMauro Carvalho Chehab    3   union rcu_special rcu_read_unlock_special;
1063ccc9971eSMauro Carvalho Chehab    4   struct list_head rcu_node_entry;
1064ccc9971eSMauro Carvalho Chehab    5   struct rcu_node *rcu_blocked_node;
1065ccc9971eSMauro Carvalho Chehab    6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
1066ccc9971eSMauro Carvalho Chehab    7 #ifdef CONFIG_TASKS_RCU
1067ccc9971eSMauro Carvalho Chehab    8   unsigned long rcu_tasks_nvcsw;
1068ccc9971eSMauro Carvalho Chehab    9   bool rcu_tasks_holdout;
1069ccc9971eSMauro Carvalho Chehab   10   struct list_head rcu_tasks_holdout_list;
1070ccc9971eSMauro Carvalho Chehab   11   int rcu_tasks_idle_cpu;
1071ccc9971eSMauro Carvalho Chehab   12 #endif /* #ifdef CONFIG_TASKS_RCU */
1072ccc9971eSMauro Carvalho Chehab
1073ccc9971eSMauro Carvalho ChehabThe ``->rcu_read_lock_nesting`` field records the nesting level for RCU
1074ccc9971eSMauro Carvalho Chehabread-side critical sections, and the ``->rcu_read_unlock_special`` field
1075ccc9971eSMauro Carvalho Chehabis a bitmask that records special conditions that require
1076ccc9971eSMauro Carvalho Chehab``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry``
1077ccc9971eSMauro Carvalho Chehabfield is used to form lists of tasks that have blocked within
1078ccc9971eSMauro Carvalho Chehabpreemptible-RCU read-side critical sections and the
1079ccc9971eSMauro Carvalho Chehab``->rcu_blocked_node`` field references the ``rcu_node`` structure whose
1080ccc9971eSMauro Carvalho Chehablist this task is a member of, or ``NULL`` if it is not blocked within a
1081ccc9971eSMauro Carvalho Chehabpreemptible-RCU read-side critical section.
1082ccc9971eSMauro Carvalho Chehab
1083ccc9971eSMauro Carvalho ChehabThe ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context
1084ccc9971eSMauro Carvalho Chehabswitches that this task had undergone at the beginning of the current
1085ccc9971eSMauro Carvalho Chehabtasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current
1086ccc9971eSMauro Carvalho Chehabtasks-RCU grace period is waiting on this task,
1087ccc9971eSMauro Carvalho Chehab``->rcu_tasks_holdout_list`` is a list element enqueuing this task on
1088ccc9971eSMauro Carvalho Chehabthe holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this
1089ccc9971eSMauro Carvalho Chehabidle task is running, but only if the task is currently running, that
1090ccc9971eSMauro Carvalho Chehabis, if the CPU is currently idle.
1091ccc9971eSMauro Carvalho Chehab
1092ccc9971eSMauro Carvalho ChehabAccessor Functions
1093ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~~~~
1094ccc9971eSMauro Carvalho Chehab
1095ccc9971eSMauro Carvalho ChehabThe following listing shows the ``rcu_get_root()``,
1096ccc9971eSMauro Carvalho Chehab``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()``
1097ccc9971eSMauro Carvalho Chehabfunction and macros:
1098ccc9971eSMauro Carvalho Chehab
1099ccc9971eSMauro Carvalho Chehab::
1100ccc9971eSMauro Carvalho Chehab
1101ccc9971eSMauro Carvalho Chehab     1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
1102ccc9971eSMauro Carvalho Chehab     2 {
1103ccc9971eSMauro Carvalho Chehab     3   return &rsp->node[0];
1104ccc9971eSMauro Carvalho Chehab     4 }
1105ccc9971eSMauro Carvalho Chehab     5
1106ccc9971eSMauro Carvalho Chehab     6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
1107ccc9971eSMauro Carvalho Chehab     7   for ((rnp) = &(rsp)->node[0]; \
1108ccc9971eSMauro Carvalho Chehab     8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
1109ccc9971eSMauro Carvalho Chehab     9
1110ccc9971eSMauro Carvalho Chehab    10 #define rcu_for_each_leaf_node(rsp, rnp) \
1111ccc9971eSMauro Carvalho Chehab    11   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
1112ccc9971eSMauro Carvalho Chehab    12        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
1113ccc9971eSMauro Carvalho Chehab
1114ccc9971eSMauro Carvalho ChehabThe ``rcu_get_root()`` simply returns a pointer to the first element of
1115ccc9971eSMauro Carvalho Chehabthe specified ``rcu_state`` structure's ``->node[]`` array, which is the
1116ccc9971eSMauro Carvalho Chehabroot ``rcu_node`` structure.
1117ccc9971eSMauro Carvalho Chehab
1118ccc9971eSMauro Carvalho ChehabAs noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes
1119ccc9971eSMauro Carvalho Chehabadvantage of the layout of the ``rcu_node`` structures in the
1120ccc9971eSMauro Carvalho Chehab``rcu_state`` structure's ``->node[]`` array, performing a breadth-first
1121ccc9971eSMauro Carvalho Chehabtraversal by simply traversing the array in order. Similarly, the
1122ccc9971eSMauro Carvalho Chehab``rcu_for_each_leaf_node()`` macro traverses only the last part of the
1123ccc9971eSMauro Carvalho Chehabarray, thus traversing only the leaf ``rcu_node`` structures.
1124ccc9971eSMauro Carvalho Chehab
1125ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1126ccc9971eSMauro Carvalho Chehab| **Quick Quiz**:                                                       |
1127ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1128ccc9971eSMauro Carvalho Chehab| What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree    |
1129ccc9971eSMauro Carvalho Chehab| contains only a single node?                                          |
1130ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1131ccc9971eSMauro Carvalho Chehab| **Answer**:                                                           |
1132ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1133ccc9971eSMauro Carvalho Chehab| In the single-node case, ``rcu_for_each_leaf_node()`` traverses the   |
1134ccc9971eSMauro Carvalho Chehab| single node.                                                          |
1135ccc9971eSMauro Carvalho Chehab+-----------------------------------------------------------------------+
1136ccc9971eSMauro Carvalho Chehab
1137ccc9971eSMauro Carvalho ChehabSummary
1138ccc9971eSMauro Carvalho Chehab~~~~~~~
1139ccc9971eSMauro Carvalho Chehab
1140ccc9971eSMauro Carvalho ChehabSo the state of RCU is represented by an ``rcu_state`` structure, which
1141ccc9971eSMauro Carvalho Chehabcontains a combining tree of ``rcu_node`` and ``rcu_data`` structures.
1142ccc9971eSMauro Carvalho ChehabFinally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state
1143ccc9971eSMauro Carvalho Chehabis tracked by dynticks-related fields in the ``rcu_data`` structure. If
1144ccc9971eSMauro Carvalho Chehabyou made it this far, you are well prepared to read the code
1145ccc9971eSMauro Carvalho Chehabwalkthroughs in the other articles in this series.
1146ccc9971eSMauro Carvalho Chehab
1147ccc9971eSMauro Carvalho ChehabAcknowledgments
1148ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~
1149ccc9971eSMauro Carvalho Chehab
1150ccc9971eSMauro Carvalho ChehabI owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
1151ccc9971eSMauro Carvalho ChehabTurner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for
1152ccc9971eSMauro Carvalho Chehabhelping me get this document into a more human-readable state.
1153ccc9971eSMauro Carvalho Chehab
1154ccc9971eSMauro Carvalho ChehabLegal Statement
1155ccc9971eSMauro Carvalho Chehab~~~~~~~~~~~~~~~
1156ccc9971eSMauro Carvalho Chehab
1157ccc9971eSMauro Carvalho ChehabThis work represents the view of the author and does not necessarily
1158ccc9971eSMauro Carvalho Chehabrepresent the view of IBM.
1159ccc9971eSMauro Carvalho Chehab
1160ccc9971eSMauro Carvalho ChehabLinux is a registered trademark of Linus Torvalds.
1161ccc9971eSMauro Carvalho Chehab
1162ccc9971eSMauro Carvalho ChehabOther company, product, and service names may be trademarks or service
1163ccc9971eSMauro Carvalho Chehabmarks of others.
1164