Coding Particular Pacemaker Components
--------------------------------------

The Pacemaker code can be intricate and difficult to follow. This chapter has
some high-level descriptions of how individual components work.


.. index::
   single: fencer
   single: pacemaker-fenced

Fencer
######

``pacemaker-fenced`` is the Pacemaker daemon that handles fencing requests. In
the broadest terms, fencing works like this:

#. The initiator (an external program such as ``stonith_admin``, or the cluster
   itself via the controller) asks the local fencer, "Hey, could you please
   fence this node?"
#. The local fencer asks all the fencers in the cluster (including itself),
   "Hey, what fencing devices do you have access to that can fence this node?"
#. Each fencer in the cluster replies with a list of available devices that
   it knows about.
#. Once the original fencer gets all the replies, it asks the most
   appropriate fencer peer to actually carry out the fencing. It may send
   out more than one such request if the target node must be fenced with
   multiple devices.
#. The chosen fencer(s) call the appropriate fencing resource agent(s) to
   do the fencing, then reply to the original fencer with the result.
#. The original fencer broadcasts the result to all fencers.
#. Each fencer sends the result to each of its local clients (including, at
   some point, the initiator).

A more detailed description follows.

.. index::
   single: libstonithd

Initiating a fencing request
____________________________

A fencing request can be initiated by the cluster or externally, using the
libstonithd API.

* The cluster always initiates fencing via
  ``daemons/controld/controld_te_actions.c:te_fence_node()`` (which calls the
  ``fence()`` API method). This occurs when a transition graph synapse contains
  a ``CRM_OP_FENCE`` XML operation.
* The main external clients are ``stonith_admin`` and ``cts-fence-helper``.
  The ``DLM`` project also uses Pacemaker for fencing.
Highlights of the fencing API (a usage sketch follows the list):

* ``stonith_api_new()`` creates and returns a new ``stonith_t`` object, whose
  ``cmds`` member has methods for connect, disconnect, fence, etc.
* The ``fence()`` method creates and sends a ``STONITH_OP_FENCE`` XML request
  with the desired action and target node. Callers do not have to choose or
  even have any knowledge about particular fencing devices.
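
For example, an external client might initiate fencing through this API along
these lines (a minimal sketch; the client name, target node, and timeout are
illustrative, and error handling is abbreviated):

.. code-block:: c

   #include <stdio.h>
   #include <crm/stonith-ng.h>

   int main(void)
   {
       stonith_t *st = stonith_api_new();
       int rc = -1;

       if (st != NULL) {
           /* Connect to the local fencer (pacemaker-fenced) */
           rc = st->cmds->connect(st, "fence-example", NULL);
           if (rc == 0) {
               /* Ask the fencer to reboot "node1", waiting up to 120s
                * (synchronous call); which device and which peer carry
                * this out is entirely up to the fencers. */
               rc = st->cmds->fence(st, st_opt_sync_call, "node1",
                                    "reboot", 120, 0);
               printf("fence request returned %d\n", rc);
               st->cmds->disconnect(st);
           }
           stonith_api_delete(st);
       }
       return (rc == 0)? 0 : 1;
   }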

Fencing queries
_______________

The function calls for a fencing request go something like this:

The local fencer receives the client's request via an IPC or messaging
layer callback, which calls

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a client) calls

    * ``initiate_remote_stonith_op()``, which creates a ``STONITH_OP_QUERY`` XML
      request with the target, desired action, timeout, etc. then broadcasts
      the operation to the cluster group (i.e. all fencer instances) and
      starts a timer. The query is broadcast because (1) location constraints
      might prevent the local node from accessing the stonith device directly,
      and (2) even if the local node does have direct access, another node
      might be preferred to carry out the fencing.

Each fencer receives the original fencer's ``STONITH_OP_QUERY`` broadcast
request via IPC or messaging layer callback, which calls:

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_QUERY`` from a peer) calls

    * ``stonith_query()``, which calls

      * ``get_capable_devices()`` with ``stonith_query_capable_device_db()`` to add
        device information to an XML reply and send it. (A message is
        considered a reply if it contains ``T_STONITH_REPLY``, which is only
        set by fencer peers, not clients.)

The original fencer receives all peers' ``STONITH_OP_QUERY`` replies via IPC
or messaging layer callback, which calls:

* ``stonith_command()``, which (for replies) calls

  * ``handle_reply()``, which (for ``STONITH_OP_QUERY``) calls

    * ``process_remote_stonith_query()``, which allocates a new query result
      structure, parses device information into it, and adds it to the
      operation object. It increments the number of replies received for this
      operation and compares it against the expected number of replies (i.e.
      the number of active peers); if this is the last expected reply, it calls

      * ``call_remote_stonith()``, which calculates the timeout and sends
        ``STONITH_OP_FENCE`` request(s) to carry out the fencing. If the target
        node has a fencing "topology" (which allows specifications such as
        "this node can be fenced either with device A, or devices B and C in
        combination"), it will choose the device(s), and send out as many
        requests as needed. If it chooses a device, it will choose the peer; a
        peer is preferred if it has "verified" access to the desired device,
        meaning that it has the device "running" on it and thus has a monitor
        operation ensuring reachability.

Fencing operations
__________________

Each ``STONITH_OP_FENCE`` request goes something like this:

The chosen peer fencer receives the ``STONITH_OP_FENCE`` request via IPC or
messaging layer callback, which calls:

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a peer) calls

    * ``stonith_fence()``, which calls

      * ``schedule_stonith_command()`` (using supplied device if
        ``F_STONITH_DEVICE`` was set, otherwise the highest-priority capable
        device obtained via ``get_capable_devices()`` with
        ``stonith_fence_get_devices_cb()``), which adds the operation to the
        device's pending operations list and triggers processing.

The chosen peer fencer's mainloop is triggered and calls

* ``stonith_device_dispatch()``, which calls

  * ``stonith_device_execute()``, which pops off the next item from the device's
    pending operations list. If acting as the (internally implemented) watchdog
    agent, it panics the node, otherwise it calls

    * ``stonith_action_create()`` and ``stonith_action_execute_async()`` to
      call the fencing agent.

The chosen peer fencer's mainloop is triggered again once the fencing agent
returns, and calls

* ``stonith_action_async_done()``, which adds the results to an action object,
  then calls its

  * done callback (``st_child_done()``), which calls ``schedule_stonith_command()``
    for a new device if there are further required actions to execute or if the
    original action failed, then builds and sends an XML reply to the original
    fencer (via ``stonith_send_async_reply()``), then checks whether any
    pending actions are the same as the one just executed and merges them if so.

Fencing replies
_______________

The original fencer receives the ``STONITH_OP_FENCE`` reply via IPC or
messaging layer callback, which calls:

* ``stonith_command()``, which (for replies) calls

  * ``handle_reply()``, which calls

    * ``process_remote_stonith_exec()``, which calls either
      ``call_remote_stonith()`` (to retry a failed operation, or to try the
      next device in a topology if appropriate, issuing a new
      ``STONITH_OP_FENCE`` request and proceeding as before) or
      ``remote_op_done()`` (if the operation has definitively failed or
      succeeded).

      * ``remote_op_done()`` broadcasts the result to all peers.

Finally, all peers receive the broadcast result and call

* ``remote_op_done()``, which sends the result to all local clients.
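
On the client side, this final delivery can be observed by registering for
fencer notifications via the libstonithd API. A minimal sketch (the client
name is illustrative and mainloop setup is elided):

.. code-block:: c

   #include <stdio.h>
   #include <crm/stonith-ng.h>

   /* Called whenever the local fencer relays a fencing result to its clients */
   static void
   fence_notify(stonith_t *st, stonith_event_t *event)
   {
       printf("received a fencing result notification\n");
   }

   int main(void)
   {
       stonith_t *st = stonith_api_new();

       if ((st != NULL) && (st->cmds->connect(st, "notify-example", NULL) == 0)) {
           /* Ask to be told about fencing results */
           st->cmds->register_notification(st, T_STONITH_NOTIFY_FENCE,
                                           fence_notify);

           /* ... run a GMainLoop here so notifications get dispatched ... */

           st->cmds->remove_notification(st, T_STONITH_NOTIFY_FENCE);
           st->cmds->disconnect(st);
       }
       if (st != NULL) {
           stonith_api_delete(st);
       }
       return 0;
   }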


.. index::
   single: scheduler
   single: pacemaker-schedulerd
   single: libpe_status
   single: libpe_rules
   single: libpacemaker

Scheduler
#########

``pacemaker-schedulerd`` is the Pacemaker daemon that runs the Pacemaker
scheduler for the controller, but "the scheduler" in general refers to related
library code in ``libpe_status`` and ``libpe_rules`` (``lib/pengine/*.c``), and
some of ``libpacemaker`` (``lib/pacemaker/pcmk_sched_*.c``).

The purpose of the scheduler is to take a CIB as input and generate a
transition graph (list of actions that need to be taken) as output.

The controller invokes the scheduler by contacting the scheduler daemon via
local IPC. Tools such as ``crm_simulate``, ``crm_mon``, and ``crm_resource``
can also invoke the scheduler, but do so by calling the library functions
directly. This allows them to run using a ``CIB_file`` without the cluster
needing to be active.

The main entry point for the scheduler code is
``lib/pacemaker/pcmk_sched_messages.c:pcmk__schedule_actions()``. It sets
defaults and calls a series of "stage *N*" functions. Yes, there is a stage 0
and no stage 1. :) The code has evolved over time to the point where splitting
the stages up differently and renumbering them would make sense. A sketch of
driving this entry point directly is shown after the list.

* ``stage0()`` "unpacks" most of the CIB XML into data structures, and
  determines the current cluster status. It also creates implicit location
  constraints for the node health feature.
* ``stage2()`` applies factors that make resources prefer certain nodes (such
  as shutdown locks, location constraints, and stickiness).
* ``stage3()`` creates internal constraints (such as the implicit ordering for
  group members, or start actions being implicitly ordered before promote
  actions).
* ``stage4()`` "checks actions", which means processing resource history
  entries in the CIB status section. This is used to decide whether certain
  actions need to be done, such as deleting orphan resources, forcing a restart
  when a resource definition changes, etc.
* ``stage5()`` allocates resources to nodes and creates actions (which might or
  might not end up in the final graph).
* ``stage6()`` creates implicit ordering constraints for resources running
  across remote connections, and schedules fencing actions and shutdowns.
* ``stage7()`` "updates actions", which means applying ordering constraints in
  order to modify action attributes such as optional or required.
* ``stage8()`` creates the transition graph.
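
For instance, a tool running against a saved CIB (as ``crm_simulate`` does with
``CIB_file``) can drive this entry point roughly as follows. This is a rough
sketch only: ``pcmk__schedule_actions()`` is internal API whose exact signature
may differ between releases, and error handling is omitted.

.. code-block:: c

   #include <crm/common/xml.h>
   #include <crm/pengine/status.h>
   #include <pacemaker-internal.h>   /* pcmk__schedule_actions() (internal) */

   int main(int argc, char **argv)
   {
       xmlNode *cib_xml = NULL;
       pe_working_set_t *data_set = NULL;

       if (argc < 2) {
           return 1;
       }

       /* Read a saved CIB, as tools do when CIB_file is in use */
       cib_xml = filename2xml(argv[1]);
       data_set = pe_new_working_set();
       if ((cib_xml == NULL) || (data_set == NULL)) {
           return 1;
       }

       /* Run all the scheduling stages on the CIB; the resulting transition
        * graph XML ends up in data_set->graph (signature assumed) */
       pcmk__schedule_actions(data_set, cib_xml, NULL);

       /* ... inspect data_set or save data_set->graph here ... */

       pe_free_working_set(data_set);
       return 0;
   }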

Challenges
__________

Working with the scheduler is difficult. Challenges include:

* It is far too much code to keep more than a small portion in your head at one
  time.
* Small changes can have large (and unexpected) effects. This is why we have a
  large number of regression tests (``cts/cts-scheduler``), which should be run
  after making code changes.
* It produces an insane number of log messages at debug and trace levels. You
  can put resource ID(s) in the ``PCMK_trace_tags`` environment variable to
  enable trace-level messages only when related to specific resources.
* Different parts of the main ``pe_working_set_t`` structure are finalized at
  different points in the scheduling process, so you have to keep in mind
  whether information you're using at one point of the code can possibly change
  later. For example, data unpacked from the CIB can safely be used anytime
  after ``stage0()``, but actions may become optional or required anytime
  before ``stage8()``. There's no easy way to deal with this.
* Many names of struct members, functions, etc., are suboptimal, but are part
  of the public API and cannot be changed until an API backward compatibility
  break.


.. index::
   single: pe_working_set_t

Cluster Working Set
___________________

The main data object for the scheduler is ``pe_working_set_t``, which contains
all information needed about nodes, resources, constraints, etc., both as the
raw CIB XML and parsed into more usable data structures, plus the resulting
transition graph XML. The variable name is usually ``data_set``.
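
For example, a tool using only the ``libpe_status`` public API could populate a
working set from a saved CIB like this (a minimal sketch; the helper name is
illustrative and error handling is omitted):

.. code-block:: c

   #include <crm/common/xml.h>
   #include <crm/pengine/status.h>

   /* Unpack a saved CIB into a newly allocated working set */
   static pe_working_set_t *
   unpack_cib_file(const char *filename)
   {
       pe_working_set_t *data_set = pe_new_working_set();

       if (data_set != NULL) {
           data_set->input = filename2xml(filename);  /* raw CIB XML */
           cluster_status(data_set);                  /* parse into data structures */

           /* data_set->nodes and data_set->resources are now populated */
       }
       return data_set;
   }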

.. index::
   single: pe_resource_t

Resources
_________

``pe_resource_t`` is the data object representing cluster resources. A resource
has a variant: primitive (a.k.a. native), group, clone, or bundle.

The resource object has members for two sets of methods,
``resource_object_functions_t`` from the ``libpe_status`` public API, and
``resource_alloc_functions_t`` whose implementation is internal to
``libpacemaker``. The actual functions vary by variant.

The object functions have basic capabilities such as unpacking the resource
XML, and determining the current or planned location of the resource.

The allocation functions have more obscure capabilities needed for scheduling,
such as processing location and ordering constraints. For example,
``stage3()``, which creates internal constraints, simply calls the
``internal_constraints()`` method for each top-level resource in the working
set.
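
For example, given a populated working set, the object functions can be used
along these lines (a minimal sketch; the helper name is illustrative and the
``state()`` method signature is assumed):

.. code-block:: c

   #include <stdio.h>
   #include <crm/pengine/status.h>
   #include <crm/pengine/common.h>   /* role2text() */

   /* Walk the top-level resources in a populated working set */
   static void
   list_resources(pe_working_set_t *data_set)
   {
       for (GList *iter = data_set->resources; iter != NULL; iter = iter->next) {
           pe_resource_t *rsc = iter->data;

           /* fns points to the variant-specific resource_object_functions_t */
           enum rsc_role_e role = rsc->fns->state(rsc, TRUE);

           printf("%s (variant %d): current role %s\n",
                  rsc->id, rsc->variant, role2text(role));
       }
   }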

.. index::
   single: pe_node_t

Nodes
_____

Allocation of resources to nodes is done by choosing the node with the highest
score for a given resource. The scheduler does a bunch of processing to
generate the scores, then the actual allocation is straightforward.

Node lists are frequently used. For example, ``pe_working_set_t`` has a
``nodes`` member which is a list of all nodes in the cluster, and
``pe_resource_t`` has a ``running_on`` member which is a list of all nodes on
which the resource is (or might be) active. These are lists of ``pe_node_t``
objects.

The ``pe_node_t`` object contains a ``struct pe_node_shared_s *details`` member
with all node information that is independent of resource allocation (the node
name, etc.).

The working set's ``nodes`` member contains the original of this information.
All other node lists contain copies of ``pe_node_t`` where only the ``details``
member points to the originals in the working set's ``nodes`` list. In this
way, the other members of ``pe_node_t`` (such as ``weight``, which is the node
score) may vary by node list, while the common details are shared.
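
A minimal sketch of walking a node list (the helper name is illustrative):

.. code-block:: c

   #include <stdio.h>
   #include <crm/pengine/status.h>

   /* Print each cluster node's shared details and its per-list score */
   static void
   list_nodes(pe_working_set_t *data_set)
   {
       for (GList *iter = data_set->nodes; iter != NULL; iter = iter->next) {
           pe_node_t *node = iter->data;

           /* details is shared with every copy of this node in other lists;
            * weight (the score) belongs to this particular list entry */
           printf("%s online=%d weight=%d\n",
                  node->details->uname, node->details->online, node->weight);
       }
   }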

.. index::
   single: pe_action_t
   single: pe_action_flags

Actions
_______

``pe_action_t`` is the data object representing actions that might need to be
taken. These could be resource actions, cluster-wide actions such as fencing a
node, or "pseudo-actions" which are abstractions used as convenient points for
ordering other actions against.

It has a ``flags`` member which is a bitmask of ``enum pe_action_flags``. The
most important of these are ``pe_action_runnable`` (if not set, the action is
"blocked" and cannot be added to the transition graph) and
``pe_action_optional`` (actions with this set will not be added to the
transition graph; actions often start out as optional, and may become required
later).
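
For instance, whether an action would currently make it into the graph can be
checked from these flags (a minimal sketch; the helper name is illustrative):

.. code-block:: c

   #include <stdbool.h>
   #include <crm/common/util.h>      /* pcmk_is_set() */
   #include <crm/pengine/status.h>

   /* Would this action currently be added to the transition graph? */
   static bool
   action_will_be_graphed(const pe_action_t *action)
   {
       if (!pcmk_is_set(action->flags, pe_action_runnable)) {
           return false;   /* blocked */
       }
       if (pcmk_is_set(action->flags, pe_action_optional)) {
           return false;   /* optional actions are left out of the graph */
       }
       return true;
   }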


.. index::
   single: pe__ordering_t
   single: pe_ordering

Orderings
_________

Ordering constraints are simple in concept, but they are one of the most
important, powerful, and difficult to follow aspects of the scheduler code.

``pe__ordering_t`` is the data object representing an ordering, better thought
of as a relationship between two actions, since the relation can be more
complex than just "this one runs after that one".

For an ordering "A then B", the code generally refers to A as "first" or
"before", and B as "then" or "after".

Much of the power comes from ``enum pe_ordering``, a set of flags that
determine how an ordering behaves. There are many obscure flags with big
effects. A few examples (illustrated in the sketch after this list):

* ``pe_order_none`` means the ordering is disabled and will be ignored. It's 0,
  meaning no flags set, so it must be compared with equality rather than
  ``pcmk_is_set()``.
* ``pe_order_optional`` means the ordering does not make either action
  required, so it only applies if they both become required for other reasons.
* ``pe_order_implies_first`` means that if action B becomes required for any
  reason, then action A will become required as well.

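A minimal sketch of these distinctions (the helper names are illustrative, and
they operate on a plain flag value rather than any particular struct member):

.. code-block:: c

   #include <stdbool.h>
   #include <crm/common/util.h>      /* pcmk_is_set() */
   #include <crm/pengine/pe_types.h> /* enum pe_ordering */

   /* pe_order_none is 0 (no flags set), so it has to be tested with
    * equality, not pcmk_is_set() */
   static bool
   ordering_is_enabled(enum pe_ordering flags)
   {
       return flags != pe_order_none;
   }

   /* Does this ordering make the "first" action required whenever the
    * "then" action becomes required? */
   static bool
   ordering_implies_first(enum pe_ordering flags)
   {
       return pcmk_is_set(flags, pe_order_implies_first);
   }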