Coding Particular Pacemaker Components
--------------------------------------

The Pacemaker code can be intricate and difficult to follow. This chapter has
some high-level descriptions of how individual components work.


.. index::
   single: fencer
   single: pacemaker-fenced

Fencer
######

``pacemaker-fenced`` is the Pacemaker daemon that handles fencing requests. In
the broadest terms, fencing works like this:

#. The initiator (an external program such as ``stonith_admin``, or the cluster
   itself via the controller) asks the local fencer, "Hey, could you please
   fence this node?"
#. The local fencer asks all the fencers in the cluster (including itself),
   "Hey, what fencing devices do you have access to that can fence this node?"
#. Each fencer in the cluster replies with a list of available devices that
   it knows about.
#. Once the original fencer gets all the replies, it asks the most
   appropriate fencer peer to actually carry out the fencing. It may send
   out more than one such request if the target node must be fenced with
   multiple devices.
#. The chosen fencer(s) call the appropriate fencing resource agent(s) to
   do the fencing, then reply to the original fencer with the result.
#. The original fencer broadcasts the result to all fencers.
#. Each fencer sends the result to each of its local clients (including, at
   some point, the initiator).

A more detailed description follows.

.. index::
   single: libstonithd

Initiating a fencing request
____________________________

A fencing request can be initiated by the cluster or externally, using the
libstonithd API.

* The cluster always initiates fencing via
  ``daemons/controld/controld_te_actions.c:te_fence_node()`` (which calls the
  ``fence()`` API method). This occurs when a transition graph synapse contains
  a ``CRM_OP_FENCE`` XML operation.
* The main external clients are ``stonith_admin`` and ``cts-fence-helper``.
  The ``DLM`` project also uses Pacemaker for fencing.

Highlights of the fencing API (see the client sketch below):

* ``stonith_api_new()`` creates and returns a new ``stonith_t`` object, whose
  ``cmds`` member has methods for connect, disconnect, fence, etc.
* The ``fence()`` method creates and sends a ``STONITH_OP_FENCE`` XML request
  with the desired action and target node. Callers do not have to choose, or
  even have any knowledge about, particular fencing devices.
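To make the API concrete, a minimal external client might look like the
following sketch. This is not code from the Pacemaker tree; it simply strings
together the public API methods described above, with error checking omitted,
and exact signatures may vary between Pacemaker versions.

.. code-block:: c

   /* Minimal sketch of an external fencing client (error checking omitted) */
   #include <crm/stonith-ng.h>

   int
   main(void)
   {
       stonith_t *st = stonith_api_new();

       /* Connect to the local fencer under a client name of our choosing */
       st->cmds->connect(st, "example-client", NULL);

       /* Ask the fencer to reboot node1, waiting synchronously up to 120s.
        * Note that no fencing device is specified; the fencer chooses. */
       st->cmds->fence(st, st_opt_sync_call, "node1", "reboot", 120, 0);

       st->cmds->disconnect(st);
       stonith_api_delete(st);
       return 0;
   }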
Fencing queries
_______________

The function calls for a fencing request go something like this:

The local fencer receives the client's request via an IPC or messaging
layer callback, which calls

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a client) calls

    * ``initiate_remote_stonith_op()``, which creates a ``STONITH_OP_QUERY`` XML
      request with the target, desired action, timeout, etc., then broadcasts
      the operation to the cluster group (i.e. all fencer instances) and
      starts a timer. The query is broadcast because (1) location constraints
      might prevent the local node from accessing the stonith device directly,
      and (2) even if the local node does have direct access, another node
      might be preferred to carry out the fencing.

Each fencer receives the original fencer's ``STONITH_OP_QUERY`` broadcast
request via IPC or messaging layer callback, which calls:

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_QUERY`` from a peer) calls

    * ``stonith_query()``, which calls

      * ``get_capable_devices()`` with ``stonith_query_capable_device_cb()`` to
        add device information to an XML reply and send it. (A message is
        considered a reply if it contains ``T_STONITH_REPLY``, which is only
        set by fencer peers, not clients.)

The original fencer receives all peers' ``STONITH_OP_QUERY`` replies via IPC
or messaging layer callback, which calls:

* ``stonith_command()``, which (for replies) calls

  * ``handle_reply()``, which (for ``STONITH_OP_QUERY``) calls

    * ``process_remote_stonith_query()``, which allocates a new query result
      structure, parses device information into it, and adds it to the
      operation object. It increments the number of replies received for this
      operation, and compares it against the expected number of replies (i.e.
      the number of active peers), and if this is the last expected reply,
      calls

      * ``call_remote_stonith()``, which calculates the timeout and sends
        ``STONITH_OP_FENCE`` request(s) to carry out the fencing. If the target
        node has a fencing "topology" (which allows specifications such as
        "this node can be fenced either with device A, or devices B and C in
        combination"), it will choose the device(s), and send out as many
        requests as needed. If it chooses a device, it will choose the peer; a
        peer is preferred if it has "verified" access to the desired device,
        meaning that it has the device "running" on it and thus has a monitor
        operation ensuring reachability.

Fencing operations
__________________

Each ``STONITH_OP_FENCE`` request goes something like this:

The chosen peer fencer receives the ``STONITH_OP_FENCE`` request via IPC or
messaging layer callback, which calls:

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a peer) calls

    * ``stonith_fence()``, which calls

      * ``schedule_stonith_command()`` (using the supplied device if
        ``F_STONITH_DEVICE`` was set, otherwise the highest-priority capable
        device obtained via ``get_capable_devices()`` with
        ``stonith_fence_get_devices_cb()``), which adds the operation to the
        device's pending operations list and triggers processing.

The chosen peer fencer's mainloop is triggered and calls

* ``stonith_device_dispatch()``, which calls

  * ``stonith_device_execute()``, which pops off the next item from the device's
    pending operations list. If acting as the (internally implemented) watchdog
    agent, it panics the node; otherwise it calls

    * ``stonith_action_create()`` and ``stonith_action_execute_async()`` to
      call the fencing agent.
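The asynchronous execution pattern used here can be sketched generically with
GLib (which Pacemaker itself uses): spawn the agent process, return to the
mainloop, and finish the work in a child-watch callback once the agent exits.
This is an illustration of the pattern only, not the fencer's actual code; the
"agent" here is a stand-in command.

.. code-block:: c

   #include <glib.h>

   static GMainLoop *loop = NULL;

   /* Called from the mainloop once the spawned "agent" exits */
   static void
   agent_done(GPid pid, gint status, gpointer user_data)
   {
       const char *action = user_data;

       g_print("action '%s' finished with status %d\n", action, status);
       g_spawn_close_pid(pid);
       g_main_loop_quit(loop);  /* the real fencer would send a reply instead */
   }

   int
   main(void)
   {
       gchar *argv[] = { "/bin/true", NULL };  /* stand-in for a fencing agent */
       GPid pid;

       loop = g_main_loop_new(NULL, FALSE);
       g_spawn_async(NULL, argv, NULL, G_SPAWN_DO_NOT_REAP_CHILD,
                     NULL, NULL, &pid, NULL);
       g_child_watch_add(pid, agent_done, (gpointer) "reboot");
       g_main_loop_run(loop);  /* agent_done() runs from here */
       g_main_loop_unref(loop);
       return 0;
   }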
The chosen peer fencer's mainloop is triggered again once the fencing agent
returns, and calls

* ``stonith_action_async_done()``, which adds the results to an action object,
  then calls its

  * done callback (``st_child_done()``), which calls ``schedule_stonith_command()``
    for a new device if there are further required actions to execute or if the
    original action failed, then builds and sends an XML reply to the original
    fencer (via ``stonith_send_async_reply()``), then checks whether any
    pending actions are the same as the one just executed and merges them if so.

Fencing replies
_______________

The original fencer receives the ``STONITH_OP_FENCE`` reply via IPC or
messaging layer callback, which calls:

* ``stonith_command()``, which (for replies) calls

  * ``handle_reply()``, which calls

    * ``process_remote_stonith_exec()``, which calls either
      ``call_remote_stonith()`` (to retry a failed operation, or to try the
      next device in a topology if appropriate, issuing a new
      ``STONITH_OP_FENCE`` request and proceeding as before) or
      ``remote_op_done()`` (if the operation definitively failed or
      succeeded).

      * ``remote_op_done()`` broadcasts the result to all peers.

Finally, all peers receive the broadcast result and call

* ``remote_op_done()``, which sends the result to all local clients.


.. index::
   single: scheduler
   single: pacemaker-schedulerd
   single: libpe_status
   single: libpe_rules
   single: libpacemaker

Scheduler
#########

``pacemaker-schedulerd`` is the Pacemaker daemon that runs the Pacemaker
scheduler for the controller, but "the scheduler" in general refers to related
library code in ``libpe_status`` and ``libpe_rules`` (``lib/pengine/*.c``), and
some of ``libpacemaker`` (``lib/pacemaker/pcmk_sched_*.c``).

The purpose of the scheduler is to take a CIB as input and generate a
transition graph (the list of actions that need to be taken) as output.

The controller invokes the scheduler by contacting the scheduler daemon via
local IPC. Tools such as ``crm_simulate``, ``crm_mon``, and ``crm_resource``
can also invoke the scheduler, but do so by calling the library functions
directly. This allows them to run using a ``CIB_file`` without the cluster
needing to be active. (A sketch of this library usage follows the stage list
below.)

The main entry point for the scheduler code is
``lib/pacemaker/pcmk_sched_messages.c:pcmk__schedule_actions()``. It sets
defaults and calls a series of "stage *N*" functions. Yes, there is a stage 0
and no stage 1. :) The code has evolved over time to the point where splitting
the stages up differently and renumbering them would make sense.

* ``stage0()`` "unpacks" most of the CIB XML into data structures, and
  determines the current cluster status. It also creates implicit location
  constraints for the node health feature.
* ``stage2()`` applies factors that make resources prefer certain nodes (such
  as shutdown locks, location constraints, and stickiness).
* ``stage3()`` creates internal constraints (such as the implicit ordering for
  group members, or start actions being implicitly ordered before promote
  actions).
* ``stage4()`` "checks actions", which means processing resource history
  entries in the CIB status section. This is used to decide whether certain
  actions need to be done, such as deleting orphan resources, forcing a restart
  when a resource definition changes, etc.
* ``stage5()`` allocates resources to nodes and creates actions (which might or
  might not end up in the final graph).
* ``stage6()`` creates implicit ordering constraints for resources running
  across remote connections, and schedules fencing actions and shutdowns.
* ``stage7()`` "updates actions", which means applying ordering constraints in
  order to modify action attributes such as optional or required.
* ``stage8()`` creates the transition graph.
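As mentioned above, tools can drive this code directly on a saved CIB. A rough
sketch of that library usage follows; ``filename2xml()``,
``pe_new_working_set()``, ``cluster_status()``, and ``pe_free_working_set()``
are public API names, but the ``show_status_from_file()`` wrapper is
hypothetical, signatures can change between versions, and error handling is
omitted.

.. code-block:: c

   #include <crm/common/xml.h>
   #include <crm/pengine/status.h>

   /* Unpack a saved CIB into scheduler data structures without a live
    * cluster (roughly the unpacking work described for stage0()) */
   void
   show_status_from_file(const char *cib_file)
   {
       pe_working_set_t *data_set = pe_new_working_set();

       data_set->input = filename2xml(cib_file);
       cluster_status(data_set);

       /* ... examine data_set->nodes, data_set->resources, etc. ... */

       pe_free_working_set(data_set);
   }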
Challenges
__________

Working with the scheduler is difficult. Challenges include:

* It is far too much code to keep more than a small portion in your head at one
  time.
* Small changes can have large (and unexpected) effects. This is why we have a
  large number of regression tests (``cts/cts-scheduler``), which should be run
  after making code changes.
* It produces an insane number of log messages at debug and trace levels.
  You can put resource ID(s) in the ``PCMK_trace_tags`` environment variable to
  enable trace-level messages only when related to specific resources.
* Different parts of the main ``pe_working_set_t`` structure are finalized at
  different points in the scheduling process, so you have to keep in mind
  whether information you're using at one point in the code can possibly change
  later. For example, data unpacked from the CIB can safely be used anytime
  after ``stage0()``, but actions may become optional or required anytime
  before ``stage8()``. There's no easy way to deal with this.
* Many names of struct members, functions, etc., are suboptimal, but are part
  of the public API and cannot be changed until an API backward compatibility
  break.


.. index::
   single: pe_working_set_t

Cluster Working Set
___________________

The main data object for the scheduler is ``pe_working_set_t``, which contains
all information needed about nodes, resources, constraints, etc., both as the
raw CIB XML and parsed into more usable data structures, plus the resulting
transition graph XML. The variable name is usually ``data_set``.

.. index::
   single: pe_resource_t

Resources
_________

``pe_resource_t`` is the data object representing cluster resources. A resource
has a variant: primitive (a.k.a. native), group, clone, or bundle.

The resource object has members for two sets of methods,
``resource_object_functions_t`` from the ``libpe_status`` public API, and
``resource_alloc_functions_t`` whose implementation is internal to
``libpacemaker``. The actual functions vary by variant.

The object functions have basic capabilities such as unpacking the resource
XML, and determining the current or planned location of the resource.

The allocation functions have more obscure capabilities needed for scheduling,
such as processing location and ordering constraints. For example,
``stage3()``, which creates internal constraints, simply calls the
``internal_constraints()`` method for each top-level resource in the working
set.

.. index::
   single: pe_node_t

Nodes
_____

Allocation of resources to nodes is done by choosing the node with the highest
score for a given resource. The scheduler does a bunch of processing to
generate the scores, then the actual allocation is straightforward (see the
sketch below).
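A simplified sketch of that final selection step follows, using the
``pe_node_t`` members described in this section (``weight`` is the node
score). The ``best_node()`` helper is hypothetical, and the real allocation
code does considerably more, so treat this as an illustration only.

.. code-block:: c

   #include <glib.h>
   #include <crm/pengine/status.h>

   /* Illustration only: pick the highest-scoring node from a list of
    * pe_node_t objects (real allocation also honors constraints,
    * colocations, utilization, etc.) */
   static pe_node_t *
   best_node(GList *candidates)
   {
       pe_node_t *best = NULL;

       for (GList *iter = candidates; iter != NULL; iter = iter->next) {
           pe_node_t *node = iter->data;

           if ((best == NULL) || (node->weight > best->weight)) {
               best = node;
           }
       }
       return best;
   }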
Node lists are frequently used. For example, ``pe_working_set_t`` has a
``nodes`` member which is a list of all nodes in the cluster, and
``pe_resource_t`` has a ``running_on`` member which is a list of all nodes on
which the resource is (or might be) active. These are lists of ``pe_node_t``
objects.

The ``pe_node_t`` object contains a ``struct pe_node_shared_s *details`` member
with all node information that is independent of resource allocation (the node
name, etc.).

The working set's ``nodes`` member contains the original of this information.
All other node lists contain copies of ``pe_node_t`` where only the ``details``
member points to the originals in the working set's ``nodes`` list. In this
way, the other members of ``pe_node_t`` (such as ``weight``, which is the node
score) may vary by node list, while the common details are shared.

.. index::
   single: pe_action_t
   single: pe_action_flags

Actions
_______

``pe_action_t`` is the data object representing actions that might need to be
taken. These could be resource actions, cluster-wide actions such as fencing a
node, or "pseudo-actions", which are abstractions used as convenient points for
ordering other actions against.

It has a ``flags`` member which is a bitmask of ``enum pe_action_flags``. The
most important of these are ``pe_action_runnable`` (if not set, the action is
"blocked" and cannot be added to the transition graph) and
``pe_action_optional`` (actions with this set will not be added to the
transition graph; actions often start out as optional, and may become required
later).


.. index::
   single: pe__ordering_t
   single: pe_ordering

Orderings
_________

Ordering constraints are simple in concept, but they are one of the most
important, powerful, and difficult-to-follow aspects of the scheduler code.

``pe__ordering_t`` is the data object representing an ordering, better thought
of as a relationship between two actions, since the relation can be more
complex than just "this one runs after that one".

For an ordering "A then B", the code generally refers to A as "first" or
"before", and B as "then" or "after".

Much of the power comes from ``enum pe_ordering``, a set of flags that
determine how an ordering behaves. There are many obscure flags with big
effects. A few examples:

* ``pe_order_none`` means the ordering is disabled and will be ignored. It's 0,
  meaning no flags set, so it must be compared with equality rather than
  ``pcmk_is_set()``.
* ``pe_order_optional`` means the ordering does not make either action
  required, so it only applies if they both become required for other reasons.
* ``pe_order_implies_first`` means that if action B becomes required for any
  reason, then action A will become required as well.
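To make these semantics concrete, here is a sketch of the typical flag tests.
The ``apply_ordering()`` helper and its ``first``/``then`` parameters are
hypothetical, and exact struct member names may differ between Pacemaker
versions; only the flag names and ``pcmk_is_set()`` are taken from the
descriptions above.

.. code-block:: c

   #include <stdint.h>
   #include <crm/pengine/status.h>  /* pe_action_t and flag enums */
   #include <crm/common/util.h>     /* pcmk_is_set() */

   /* Hypothetical helper, for illustration only: apply one ordering's
    * effect to its "first" and "then" actions */
   static void
   apply_ordering(uint32_t order_flags, pe_action_t *first, pe_action_t *then)
   {
       /* pe_order_none is 0 (no flags set), so a disabled ordering must be
        * detected with an equality test, not pcmk_is_set() */
       if (order_flags == pe_order_none) {
           return;  /* ordering is disabled; ignore it */
       }

       /* A blocked "then" action cannot reach the transition graph anyway */
       if (!pcmk_is_set(then->flags, pe_action_runnable)) {
           return;
       }

       /* pe_order_implies_first: if the "then" action is required, the
        * "first" action becomes required as well */
       if (pcmk_is_set(order_flags, pe_order_implies_first)
           && !pcmk_is_set(then->flags, pe_action_optional)) {

           first->flags &= ~pe_action_optional;  /* now required */
       }
   }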