1= The OCF Resource Agent Developer's Guide
2
3== Introduction
4
5This document is to serve as a guide and reference for all developers,
6maintainers, and contributors working on OCF (Open Cluster Framework)
7compliant cluster resource agents. It explains the anatomy and general
8functionality of a resource agent, illustrates the resource agent API,
9and provides valuable hints and tips to resource agent authors.
10
11=== What is a resource agent?
12
13A resource agent is an executable that manages a cluster resource. No
14formal definition of a cluster resource exists, other than "anything a
15cluster manages is a resource." Cluster resources can be as diverse as
16IP addresses, file systems, database services, and entire virtual
17machines -- to name just a few examples.
18
19=== Who or what uses a resource agent?
20
21Any Open Cluster Framework (OCF) compliant cluster management
22application is capable of managing resources using the resource agents
23described in this document. At the time of writing, two OCF compliant
24cluster management applications exist for the Linux platform:
25
26* _Pacemaker_, a cluster manager supporting both the Corosync and
27  Heartbeat cluster messaging frameworks. Pacemaker evolved out of the
28  Linux-HA project.
29* _RGmanager_, the cluster manager bundled in Red Hat Cluster
30  Suite. It supports the Corosync cluster messaging framework
31  exclusively.
32
33=== Which language is a resource agent written in?
34
35An OCF compliant resource agent can be implemented in _any_
36programming language. The API is not language specific. However, most
37resource agents are implemented as shell scripts, which is why this
38guide primarily uses example code written in shell language.
39
40=== Is there a naming convention?
41
42Yes! We have agreed to the following convention for resource agent
43names: Please name resource agents using lower case letters, with
44words separated by dashes (+example-agent-name+).
45
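The convention can also be checked mechanically. The following is a
minimal sketch only: the helper name +is_valid_agent_name+ is
hypothetical and not part of any OCF library, and it reads the
convention strictly, accepting lower case letters and single dashes
and nothing else.

[source,bash]
--------------------------------------------------------------------------
# Hypothetical helper (not part of the OCF shell function library):
# succeeds if the name consists of lower case words separated by
# single dashes.
is_valid_agent_name() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z]+(-[a-z]+)*$'
}

for name in example-agent-name Example_Agent; do
    if is_valid_agent_name "$name"; then
        echo "$name: follows the convention"
    else
        echo "$name: does not follow the convention"
    fi
done
--------------------------------------------------------------------------
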
46Existing agents may or may not follow this convention, but it is the
47intention to make sure future agents follow this rule.
48
49== API definitions
50
51=== Environment variables
52
53A resource agent receives all configuration information about the
54resource it manages via environment variables. The names of these
55environment variables are always the name of the resource parameter,
56prefixed with +OCF_RESKEY_+. For example, if the resource has an +ip+
57parameter set to +192.168.1.1+, then the resource agent will have
58access to an environment variable +OCF_RESKEY_ip+ holding that value.
59
For any resource parameter that is not required to be set by the user
-- that is, its parameter definition in the resource agent metadata
does not specify +required="true"+ -- the resource agent must either:
63
64* Provide a reasonable default. This should be advertised in the
65  metadata. By convention, the resource agent uses a variable named
66  +OCF_RESKEY_<parametername>_default+ that holds this default.
67* Alternatively, cater correctly for the value being empty.
68
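As a minimal sketch of the first option, the following shows one way
to apply such a default for the +superfrobnicate+ parameter from the
example metadata. Using +:=+ is a choice made for this sketch: it
applies the default when the parameter is unset _or_ empty, whereas
+=+ would apply it only when the parameter is unset.

[source,bash]
--------------------------------------------------------------------------
# Default for the optional "superfrobnicate" parameter. This should
# match the default advertised in the metadata.
OCF_RESKEY_superfrobnicate_default="false"

# Assign the default only if the user left the parameter unset or empty
: "${OCF_RESKEY_superfrobnicate:=${OCF_RESKEY_superfrobnicate_default}}"

echo "superfrobnicate=${OCF_RESKEY_superfrobnicate}"
--------------------------------------------------------------------------
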
69In addition, the cluster manager may also support _meta_ resource
70parameters. These do not apply directly to the resource configuration,
71but rather specify _how_ the cluster resource manager is expected to manage
72the resource. For example, the Pacemaker cluster manager uses the
73+target-role+ meta parameter to specify whether the resource should be
74started or stopped.
75
76Meta parameters are passed into the resource agent in the
+OCF_RESKEY_CRM_meta_+ namespace, with any hyphens converted to
78underscores. Thus, the +target-role+ attribute maps to an environment
79variable named +OCF_RESKEY_CRM_meta_target_role+.
80
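The name mapping can be sketched in a few lines of shell. The helper
+meta_param_var+ is hypothetical and exists only to illustrate the
rule. An agent would normally just read the resulting variable
directly, as the last two lines do.

[source,bash]
--------------------------------------------------------------------------
# Hypothetical helper: print the name of the environment variable that
# corresponds to a meta parameter (hyphens become underscores).
meta_param_var() {
    printf 'OCF_RESKEY_CRM_meta_%s\n' "$(printf '%s' "$1" | tr '-' '_')"
}

meta_param_var target-role

# Reading the value, falling back to an empty string when it is unset
target_role="${OCF_RESKEY_CRM_meta_target_role:-}"
echo "target-role is '${target_role}'"
--------------------------------------------------------------------------
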
81The <<_script_variables>> section contains other system environment
82variables.
83
84=== Actions
85
86Any resource agent must support one command-line argument which
87specifies the action the resource agent is about to execute. The
88following actions must be supported by any resource agent:
89
90* +start+ -- starts the resource.
91* +stop+ -- shuts down the resource.
92* +monitor+ -- queries the resource for its state.
93* +meta-data+ -- dumps the resource agent metadata.
94
95In addition, resource agents may optionally support the following
96actions:
97
98* +promote+ -- turns a resource into the +Master+ role (Master/Slave
99  resources only).
100* +demote+ -- turns a resource into the +Slave+ role (Master/Slave
101  resources only).
102* +migrate_to+ and +migrate_from+ -- implement live migration of
103  resources.
104* +validate-all+ -- validates a resource's configuration.
105* +usage+ or +help+ -- displays a usage message when the resource
106  agent is invoked from the command line, rather than by the cluster
107  manager.
* +notify+ -- informs the resource about state changes of other clones.
109* +status+ -- historical (deprecated) synonym for +monitor+.
110
111=== Timeouts
112
113Action timeouts are enforced outside the resource agent proper. It is
114the cluster manager's responsibility to monitor how long a resource
115agent action has been running, and terminate it if it does not meet
116its completion deadline. Thus, resource agents need not themselves
117check for any timeout expiry.
118
119Resource agents can, however, _advise_ the user of sensible timeout
120values (which, when correctly set, will be duly enforced by the
121cluster manager). See <<_metadata,the following section>> for details
122on how a resource agent advertises its suggested timeouts.
123
124=== Metadata
125
126Every resource agent must describe its own purpose and supported
127parameters in a set of XML metadata. This metadata is used by cluster
128management applications for on-line help, and resource agent man pages
129are generated from it as well. The following is a fictitious set of
130metadata from an imaginary resource agent:
131
132[source,xml]
133--------------------------------------------------------------------------
134<?xml version="1.0"?>
135<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
136<resource-agent name="foobar">
137  <version>0.1</version>
138  <longdesc lang="en">
139This is a fictitious example resource agent written for the
OCF Resource Agent Developer's Guide.
141  </longdesc>
142  <shortdesc lang="en">Example resource agent
143  for budding OCF RA developers</shortdesc>
144  <parameters>
145    <parameter name="eggs" unique="0" required="1">
146      <longdesc lang="en">
147      Number of eggs, an example numeric parameter
148      </longdesc>
149      <shortdesc lang="en">Number of eggs</shortdesc>
150      <content type="integer"/>
151    </parameter>
152    <parameter name="superfrobnicate" unique="0" required="0">
153      <longdesc lang="en">
154      Enable superfrobnication, an example boolean parameter
155      </longdesc>
156      <shortdesc lang="en">Enable superfrobnication</shortdesc>
157      <content type="boolean" default="false"/>
158    </parameter>
159    <parameter name="datadir" unique="0" required="1">
160      <longdesc lang="en">
161      Data directory, an example string parameter
162      </longdesc>
163      <shortdesc lang="en">Data directory</shortdesc>
164      <content type="string"/>
165    </parameter>
166  </parameters>
167  <actions>
168    <action name="start"        timeout="20" />
169    <action name="stop"         timeout="20" />
170    <action name="monitor"      timeout="20"
171                                interval="10" depth="0" />
172    <action name="notify"       timeout="20" />
173    <action name="reload"       timeout="20" />
174    <action name="migrate_to"   timeout="20" />
175    <action name="migrate_from" timeout="20" />
176    <action name="meta-data"    timeout="5" />
    <action name="validate-all" timeout="20" />
178  </actions>
179</resource-agent>
180--------------------------------------------------------------------------
181
The +resource-agent+ element, of which there must be exactly one per
resource agent, defines the resource agent +name+ and +version+.
184
185The +longdesc+ and +shortdesc+ elements in +resource-agent+ provide a
186long and short description of the resource agent's
187functionality. While +shortdesc+ is a one-line description of what
188the resource agent does and is usually used in terse listings,
189+longdesc+ should give a full-blown description of the resource agent
190in as much detail as possible.
191
192The +parameters+ element describes the resource agent parameters, and
193should hold any number of +parameter+ children -- one for each
194parameter that the resource agent supports.
195
196Every +parameter+ should, like the +resource-agent+ as a whole, come
197with a +shortdesc+ and a +longdesc+, and also a +content+ child that
198describes the parameter's expected content.
199
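Four attributes further describe a parameter. In the example above,
+type+ and +default+ are set on the +content+ element, while
+required+ and +unique+ are set on the +parameter+ element itself: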
201
202* +type+ describes the parameter type (+string+, +integer+, or
203  +boolean+). If unset, +type+ defaults to +string+.
204
205* +required+ indicates whether setting the parameter is mandatory
206  (+required="true"+) or optional (+required="false"+).
207
208* For optional parameters, it is customary to provide a sensible
209  default via the +default+ attribute.
210
211* Finally, the +unique+ attribute (allowed values: +true+ or +false+)
212  indicates that a specific value must be unique across the cluster,
213  for this parameter of this particular resource type. For example, a
214  highly available floating IP address is declared +unique+ -- as that
215  one IP address should run only once throughout the cluster, avoiding
216  duplicates.
217
218The +actions+ list defines the actions that the resource agent
219advertises as supported.
220
Every +action+ should list its own +timeout+ value. This is a hint to
the user about the _minimum_ timeout that should be configured for the
action. It caters for the fact that some resources are quick to start
and stop (IP addresses or filesystems, for example), while others may
take several minutes to do so (such as databases).
226
227In addition, recurring actions (such as +monitor+) should also specify
228a recommended minimum +interval+, which is the time between two
229consecutive invocations of the same action. Like +timeout+, this value
does not constitute a default -- it is merely a hint to the user about
the shortest action interval that should be configured.
232
233== Return codes
234
235For any invocation, resource agents must exit with a defined return
236code that informs the caller of the outcome of the invoked
237action. The return codes are explained in detail in the following
238subsections.
239
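The agent code examples in this guide refer to the return codes by
name, as shell variables such as +$OCF_SUCCESS+. As a quick reference,
the sketch below spells out the numeric values given in the following
subsection headings. It defines the variables itself only to remain
self-contained, and the +rc_name+ helper is hypothetical.

[source,bash]
--------------------------------------------------------------------------
# Numeric values as given in the subsection headings of this section
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_ARGS=2
OCF_ERR_UNIMPLEMENTED=3
OCF_ERR_PERM=4
OCF_ERR_INSTALLED=5
OCF_ERR_CONFIGURED=6
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8
OCF_FAILED_MASTER=9

# Hypothetical helper: translate a numeric exit code into its name
rc_name() {
    case "$1" in
        "$OCF_SUCCESS")           echo OCF_SUCCESS ;;
        "$OCF_ERR_GENERIC")       echo OCF_ERR_GENERIC ;;
        "$OCF_ERR_ARGS")          echo OCF_ERR_ARGS ;;
        "$OCF_ERR_UNIMPLEMENTED") echo OCF_ERR_UNIMPLEMENTED ;;
        "$OCF_ERR_PERM")          echo OCF_ERR_PERM ;;
        "$OCF_ERR_INSTALLED")     echo OCF_ERR_INSTALLED ;;
        "$OCF_ERR_CONFIGURED")    echo OCF_ERR_CONFIGURED ;;
        "$OCF_NOT_RUNNING")       echo OCF_NOT_RUNNING ;;
        "$OCF_RUNNING_MASTER")    echo OCF_RUNNING_MASTER ;;
        "$OCF_FAILED_MASTER")     echo OCF_FAILED_MASTER ;;
        *)                        echo "unknown ($1)" ;;
    esac
}

rc_name 7
--------------------------------------------------------------------------
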
240=== +OCF_SUCCESS+ (0)
241
242The action completed successfully. This is the expected return code
243for any successful +start+, +stop+, +promote+, +demote+,
+migrate_from+, +migrate_to+, +meta-data+, +help+, and +usage+ action.
245
246For +monitor+ (and its deprecated alias, +status+), however, a
247modified convention applies:
248
249* For primitive (stateless) resources, +OCF_SUCCESS+ from +monitor+
250  means that the resource is running. Non-running and gracefully
251  shut-down resources must instead return +OCF_NOT_RUNNING+.
252
253* For master/slave (stateful) resources, +OCF_SUCCESS+ from +monitor+
254  means that the resource is running _in Slave mode_. Resources
255  running in Master mode must instead return +OCF_RUNNING_MASTER+, and
256  gracefully shut-down resources must instead return
257  +OCF_NOT_RUNNING+.
258
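The convention for stateful resources can be sketched as a simple
mapping. The state keywords (+slave+, +master+, +stopped+) and the
helper name +foobar_state_to_rc+ are made up for this illustration. A
real agent would derive the state from the managed service itself.

[source,bash]
--------------------------------------------------------------------------
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

# Hypothetical helper: map an observed state keyword to the exit code
# that the monitor action of a stateful resource should return.
foobar_state_to_rc() {
    case "$1" in
        slave)   return $OCF_SUCCESS ;;
        master)  return $OCF_RUNNING_MASTER ;;
        stopped) return $OCF_NOT_RUNNING ;;
        *)       return $OCF_ERR_GENERIC ;;
    esac
}

for state in slave master stopped; do
    rc=0
    foobar_state_to_rc "$state" || rc=$?
    echo "$state -> $rc"
done
--------------------------------------------------------------------------
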
259=== +OCF_ERR_GENERIC+ (1)
260
261The action returned a generic error. A resource agent should use this
262exit code only when none of the more specific error codes, defined
263below, accurately describes the problem.
264
265The cluster resource manager interprets this exit code as a _soft_
266error. This means that unless specifically configured otherwise, the
267resource manager will attempt to recover a resource which failed with
268+OCF_ERR_GENERIC+ in-place -- usually by restarting the resource on
269the same node.
270
271=== +OCF_ERR_ARGS+ (2)
272
The resource's configuration is not valid on this machine. For
example, it may refer to a location not found on the node.
275
276NOTE: The resource agent should not return this error when instructed
277to perform an action that it does not support. Instead, under those
278circumstances, it should return +OCF_ERR_UNIMPLEMENTED+.
279
280=== +OCF_ERR_UNIMPLEMENTED+ (3)
281
282The resource agent was instructed to execute an action that the agent
283does not implement.
284
285Not all resource agent actions are mandatory. +promote+, +demote+,
+migrate_to+, +migrate_from+, and +notify+ are all optional actions
287which the resource agent may or may not implement. When a non-stateful
288resource agent is misconfigured as a master/slave resource, for
289example, then the resource agent should alert the user about this
290misconfiguration by returning +OCF_ERR_UNIMPLEMENTED+ on the +promote+
291and +demote+ actions.
292
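A minimal sketch of this behavior for a stateless agent follows. The
+foobar_dispatch+ function is hypothetical, and the +start+, +stop+,
and +monitor+ branches are placeholders standing in for the real
action functions.

[source,bash]
--------------------------------------------------------------------------
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3

# Action dispatcher of a hypothetical stateless agent
foobar_dispatch() {
    case "$1" in
        start|stop|monitor)
            # placeholder for the real action functions
            return $OCF_SUCCESS
            ;;
        promote|demote)
            echo "foobar is not a master/slave resource: $1 unsupported" >&2
            return $OCF_ERR_UNIMPLEMENTED
            ;;
        *)
            return $OCF_ERR_UNIMPLEMENTED
            ;;
    esac
}

rc=0
foobar_dispatch promote || rc=$?
echo "promote returned $rc"
--------------------------------------------------------------------------
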
293=== +OCF_ERR_PERM+ (4)
294
295The action failed due to insufficient permissions. This may be due to
296the agent not being able to open a certain file, to listen on a
297specific socket, to write to a directory, or similar.
298
299The cluster resource manager interprets this exit code as a _hard_
300error. This means that unless specifically configured otherwise, the
301resource manager will attempt to recover a resource which failed with
302this error by restarting the resource on a different node (where the
303permission problem may not exist).
304
305=== +OCF_ERR_INSTALLED+ (5)
306
307The action failed because a required component is missing on the node
308where the action was executed. This may be due to a required binary
309not being executable, or a vital configuration file being unreadable.
310
311The cluster resource manager interprets this exit code as a _hard_
312error. This means that unless specifically configured otherwise, the
313resource manager will attempt to recover a resource which failed with
314this error by restarting the resource on a different node (where the
315required files or binaries may be present).
316
317=== +OCF_ERR_CONFIGURED+ (6)
318
319The action failed because the user misconfigured the resource. For
320example, the user may have configured an alphanumeric string for a
321parameter that really should be an integer.
322
323The cluster resource manager interprets this exit code as a _fatal_
324error. Since this is a configuration error that is present
325cluster-wide, it would make no sense to recover such a resource on a
326different node, let alone in-place. When a resource fails with this
327error, the cluster manager will attempt to shut down the resource, and
328wait for administrator intervention.
329
330=== +OCF_NOT_RUNNING+ (7)
331
332The resource was found not to be running. This is an exit code that
333may be returned by the +monitor+ action exclusively. Note that this
334implies that the resource has either _gracefully_ shut down, or has
335never been started.
336
337If the resource is not running due to an error condition, the
338+monitor+ action should instead return one of the +OCF_ERR_+ exit
339codes or +OCF_FAILED_MASTER+.
340
341=== +OCF_RUNNING_MASTER+ (8)
342
343The resource was found to be running in the +Master+ role. This
344applies only to stateful (Master/Slave) resources, and only to
345their +monitor+ action.
346
347Note that there is no specific exit code for "running in slave
mode". This is because there is no functional distinction between a
349primitive resource running normally, and a stateful resource running
350as a slave. The +monitor+ action of a stateful resource running
351normally in the +Slave+ role should simply return +OCF_SUCCESS+.
352
353=== +OCF_FAILED_MASTER+ (9)
354
355The resource was found to have failed in the +Master+ role. This
356applies only to stateful (Master/Slave) resources, and only to their
357+monitor+ action.
358
359The cluster resource manager interprets this exit code as a _soft_
360error. This means that unless specifically configured otherwise, the
361resource manager will attempt to recover a resource which failed with
362+$OCF_FAILED_MASTER+ in-place -- usually by demoting, stopping,
363starting and then promoting the resource on the same node.
364
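The recovery classes named in this section can be summarized in a
small lookup. The +recovery_class+ helper is hypothetical and covers
only the codes for which this section states a class; it says nothing
about how a particular cluster manager treats the remaining codes.

[source,bash]
--------------------------------------------------------------------------
# Hypothetical helper: print the recovery class that this section
# describes for a given numeric exit code.
recovery_class() {
    case "$1" in
        1|9) echo soft ;;    # OCF_ERR_GENERIC, OCF_FAILED_MASTER
        4|5) echo hard ;;    # OCF_ERR_PERM, OCF_ERR_INSTALLED
        6)   echo fatal ;;   # OCF_ERR_CONFIGURED
        *)   echo "not classified in this section" ;;
    esac
}

for rc in 1 4 5 6 9; do
    echo "$rc: $(recovery_class "$rc")"
done
--------------------------------------------------------------------------
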
365
366== Resource agent structure
367
A typical (shell-based) resource agent contains standard structural
items, in the order listed in this section. This section also
describes the expected behavior of a resource agent with respect to
the various actions it supports, using a fictitious resource agent
named +foobar+ as an example.
373
374=== Resource agent interpreter
375
376Any resource agent implemented as a script must specify its
377interpreter using standard "shebang" (+#!+) header syntax.
378
379[source,bash]
380--------------------------------------------------------------------------
381#!/bin/sh
382--------------------------------------------------------------------------
383
384If a resource agent is written in shell, specifying the generic shell
385interpreter (+#!/bin/sh+) is generally preferred, though not
386required. Resource agents declared as +/bin/sh+ compatible must not
387use constructs native to a specific shell (such as, for example,
388+${!variable}+ syntax native to +bash+). It is advisable to
389occasionally run such resource agents through a sanitization utility
390such as +checkbashisms+.
391
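As an example of avoiding such a construct, indirect variable
expansion can be written portably with +eval+. This is a sketch under
the assumption that the variable holding the parameter name (+param+,
a name chosen for this example) contains a trusted identifier, since
+eval+ executes whatever it is given.

[source,bash]
--------------------------------------------------------------------------
#!/bin/sh
OCF_RESKEY_eggs=12
param="eggs"

# bash only:    name="OCF_RESKEY_${param}"; value=${!name}
# Portable sh:  build the variable name and let eval expand it.
# Only safe when $param is a trusted identifier.
eval "value=\${OCF_RESKEY_${param}}"

echo "eggs=$value"
--------------------------------------------------------------------------
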
392It is considered a regression to introduce a patch that will make a
393previously +sh+ compatible resource agent suitable only for +bash+,
394+ksh+, or any other non-generic shell. It is, however, perfectly
395acceptable for a new resource agent to explicitly define a specific
396shell, such as +/bin/bash+, as its interpreter.
397
398=== Author and license information
399
400The resource agent should contain a comment listing the resource agent
401author(s) and/or copyright holder(s), and stating the license that
402applies to the resource agent:
403
404[source,bash]
405--------------------------------------------------------------------------
406#
407#   Resource Agent for managing foobar resources.
408#
409#   License:      GNU General Public License (GPL)
410#   (c) 2008-2010 John Doe, Jane Roe,
411#                 and Linux-HA contributors
412--------------------------------------------------------------------------
413
414When a resource agent refers to a license for which multiple versions
415exist, it is assumed that the current version applies.
416
417=== Initialization
418
419Any shell resource agent should source the +ocf-shellfuncs+ function
420library. With the syntax below, this is done in terms of
421+$OCF_FUNCTIONS_DIR+, which -- for testing purposes, and also for
422generating documentation -- may be overridden from the command line.
423
424[source,bash]
425--------------------------------------------------------------------------
426# Initialization:
427: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
428. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
429--------------------------------------------------------------------------
430
431=== Functions implementing resource agent actions
432
433What follows next are the functions implementing the resource agent's
434advertised actions. The individual actions are described in detail in
435<<_resource_agent_actions>>.
436
437=== Execution block
438
439This is the part of the resource agent that actually executes when the
440resource agent is invoked. It typically follows a fairly standard
441structure:
442
443[source,bash]
444--------------------------------------------------------------------------
445# Make sure meta-data and usage always succeed
446case $__OCF_ACTION in
447meta-data)	foobar_meta_data
448		exit $OCF_SUCCESS
449		;;
450usage|help)	foobar_usage
451		exit $OCF_SUCCESS
452		;;
453esac
454
455# Anything other than meta-data and usage must pass validation
456foobar_validate_all || exit $?
457
458# Translate each action into the appropriate function call
459case $__OCF_ACTION in
460start)		foobar_start;;
461stop)		foobar_stop;;
462status|monitor)	foobar_monitor;;
463promote)	foobar_promote;;
464demote)		foobar_demote;;
465notify)		foobar_notify;;
466reload)		ocf_log info "Reloading..."
467	        foobar_start
468		;;
469validate-all)	;;
470*)		foobar_usage
471		exit $OCF_ERR_UNIMPLEMENTED
472		;;
473esac
474rc=$?
475
476# The resource agent may optionally log a debug message
477ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION returned $rc"
478exit $rc
479--------------------------------------------------------------------------
480
481
482== Resource agent actions
483
484Each action is typically implemented in a separate function or method
485in the resource agent. By convention, these are usually named
486+<agent>_<action>+, so the function implementing the +start+ action in
487+foobar+ would be named +foobar_start()+.
488
As a general rule, whenever the resource agent encounters an error
that it is not able to recover from, it is permitted to immediately
exit, throw an exception, or otherwise cease execution. Examples of
this include configuration issues, missing binaries, permission
problems, etc. It is not necessary to pass these errors up the call
stack.
494
495It is the cluster manager's responsibility to initiate the appropriate
496recovery action based on the user's configuration. The resource agent
497should not guess at said configuration.
498
499=== +start+ action
500
501When invoked with the +start+ action, the resource agent must start
502the resource if it is not yet running. This means that the agent must
503verify the resource's configuration, query its state, and then start
504it only if it is not running. A common way of doing this would be to
invoke the +validate_all+ and +monitor+ functions first, as in the
506following example:
507
508[source,bash]
509--------------------------------------------------------------------------
510foobar_start() {
511    # exit immediately if configuration is not valid
512    foobar_validate_all || exit $?
513
514    # if resource is already running, bail out early
515    if foobar_monitor; then
516	ocf_log info "Resource is already running"
517	return $OCF_SUCCESS
518    fi
519
520    # actually start up the resource here (make sure to immediately
521    # exit with an $OCF_ERR_ error code if anything goes seriously
522    # wrong)
523    ...
524
525    # After the resource has been started, check whether it started up
526    # correctly. If the resource starts asynchronously, the agent may
527    # spin on the monitor function here -- if the resource does not
528    # start up within the defined timeout, the cluster manager will
529    # consider the start action failed
530    while ! foobar_monitor; do
531	ocf_log debug "Resource has not started yet, waiting"
532	sleep 1
533    done
534
535    # only return $OCF_SUCCESS if _everything_ succeeded as expected
536    return $OCF_SUCCESS
537}
538--------------------------------------------------------------------------
539
540
541=== +stop+ action
542
543When invoked with the +stop+ action, the resource agent must stop the
544resource, if it is running. This means that the agent must verify the
545resource configuration, query its state, and then stop it only if it
546is currently running. A common way of doing this would be to invoke
the +validate_all+ and +monitor+ functions first. It is important to
understand that +stop+ is a force operation -- the resource agent must
do everything in its power to shut down the resource, short of
rebooting the node or shutting it off. Consider the following example:
551
552[source,bash]
553--------------------------------------------------------------------------
554foobar_stop() {
555    local rc
556
557    # exit immediately if configuration is not valid
558    foobar_validate_all || exit $?
559
560    foobar_monitor
561    rc=$?
562    case "$rc" in
563        "$OCF_SUCCESS")
564            # Currently running. Normal, expected behavior.
565            ocf_log debug "Resource is currently running"
566            ;;
567        "$OCF_RUNNING_MASTER")
568            # Running as a Master. Need to demote before stopping.
569            ocf_log info "Resource is currently running as Master"
570	    foobar_demote || \
571                ocf_log warn "Demote failed, trying to stop anyway"
572            ;;
573        "$OCF_NOT_RUNNING")
574            # Currently not running. Nothing to do.
575	    ocf_log info "Resource is already stopped"
576	    return $OCF_SUCCESS
577            ;;
578    esac
579
580    # actually shut down the resource here (make sure to immediately
581    # exit with an $OCF_ERR_ error code if anything goes seriously
582    # wrong)
583    ...
584
585    # After the resource has been stopped, check whether it shut down
586    # correctly. If the resource stops asynchronously, the agent may
587    # spin on the monitor function here -- if the resource does not
588    # shut down within the defined timeout, the cluster manager will
589    # consider the stop action failed
590    while foobar_monitor; do
591	ocf_log debug "Resource has not stopped yet, waiting"
592	sleep 1
593    done
594
595    # only return $OCF_SUCCESS if _everything_ succeeded as expected
596    return $OCF_SUCCESS
597
598}
599--------------------------------------------------------------------------
600
601NOTE: The expected exit code for a successful stop operation is
602+$OCF_SUCCESS+, _not_ +$OCF_NOT_RUNNING+.
603
604IMPORTANT: A failed stop operation is a potentially dangerous
605situation which the cluster manager will almost invariably try to
606resolve by means of node fencing. In other words, the cluster manager
607will forcibly evict from the cluster a node on which a stop operation
608has failed. While this measure serves ultimately to protect data, it
609does cause disruption to applications and their users. Thus, a
610resource agent should make sure that it exits with an error only if
611all avenues for proper resource shutdown have been exhausted.
612
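One common way to exhaust the available avenues is to escalate from a
polite termination request to a forced kill. The following is a sketch
of that idea, not a prescribed implementation. The helper name
+stop_with_escalation+ and the grace period are assumptions, and the
sketch presumes that the resource is a single process with a known
PID. An agent would typically exit with +$OCF_ERR_GENERIC+ only if the
helper still reports failure after the forced kill.

[source,bash]
--------------------------------------------------------------------------
# Hypothetical helper: ask process $1 to terminate, wait up to $2
# seconds, then escalate to SIGKILL. Returns 0 once the process is
# gone, 1 if it survives even SIGKILL.
stop_with_escalation() {
    pid=$1
    grace=$2

    # Assumption: failure to deliver the signal means the process is
    # already gone (the agent is expected to run with enough privileges)
    kill -TERM "$pid" 2>/dev/null || return 0

    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done

    # Still alive after the grace period: escalate
    kill -KILL "$pid" 2>/dev/null
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}

# Demonstration against a throwaway background process
sleep 300 &
demo_pid=$!
rc=0
stop_with_escalation "$demo_pid" 5 || rc=$?
echo "stop helper returned $rc"
--------------------------------------------------------------------------
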
613=== +monitor+ action
614
615The +monitor+ action queries the current status of a resource. It must
616discern between three different states:
617
618* resource is currently running (return +$OCF_SUCCESS+);
619* resource has stopped gracefully (return +$OCF_NOT_RUNNING+);
620* resource has run into a problem and must be considered failed
621  (return the appropriate +$OCF_ERR_+ code to indicate the nature of the
622  problem).
623
624
625[source,bash]
626--------------------------------------------------------------------------
627foobar_monitor() {
628    local rc
629
630    # exit immediately if configuration is not valid
631    foobar_validate_all || exit $?
632
633    ocf_run frobnicate --test
634
635    # This example assumes the following exit code convention
636    # for frobnicate:
637    # 0: running, and fully caught up with master
638    # 1: gracefully stopped
639    # any other: error
640    case "$?" in
641	0)
642            rc=$OCF_SUCCESS
643	    ocf_log debug "Resource is running"
644            ;;
645	1)
646            rc=$OCF_NOT_RUNNING
647	    ocf_log debug "Resource is not running"
648	    ;;
	*)
	    ocf_log err "Resource has failed"
	    exit $OCF_ERR_GENERIC
	    ;;
    esac
653
654    return $rc
655}
656--------------------------------------------------------------------------
657
658Stateful (master/slave) resource agents may use a more elaborate
659monitoring scheme where they can provide "hints" to the cluster
660manager identifying which instance is best suited to assume the
661+Master+ role. <<_specifying_a_master_preference>> explains the
662details.
663
664NOTE: The cluster manager may invoke the +monitor+ action for a
_probe_, which is a test of whether the resource is currently
666running. Normally, the monitor operation would behave exactly the same
667during a probe and a "real" monitor action. If a specific resource
668does require special treatment for probes, however, the +ocf_is_probe+
669convenience function is available in the OCF shell functions library
670for that purpose.
671
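To illustrate what such a check involves, the sketch below is a
simplified stand-in and not the library's own code. It assumes, as is
the case with Pacemaker, that a probe arrives as a +monitor+ action
whose interval meta parameter is +0+. The function name
+foobar_is_probe+ is hypothetical; agents should use +ocf_is_probe+
from the shell function library instead.

[source,bash]
--------------------------------------------------------------------------
# Simplified stand-in for a probe check (assumption: a probe is a
# monitor action with an interval of 0).
foobar_is_probe() {
    [ "$__OCF_ACTION" = "monitor" ] &&
        [ "${OCF_RESKEY_CRM_meta_interval:-}" = "0" ]
}

# Values set by hand here; normally the cluster manager provides them
__OCF_ACTION=monitor
OCF_RESKEY_CRM_meta_interval=0

if foobar_is_probe; then
    echo "this invocation is a probe"
else
    echo "this invocation is a regular monitor"
fi
--------------------------------------------------------------------------
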
672=== +validate-all+ action
673
674The +validate-all+ action tests for correct resource agent
675configuration and a working environment. +validate-all+ should exit
676with one of the following return codes:
677
678* +$OCF_SUCCESS+ -- all is well, the configuration is valid and
679  usable.
680* +$OCF_ERR_CONFIGURED+ -- the user has misconfigured the resource.
681* +$OCF_ERR_INSTALLED+ -- the resource has possibly been configured
682  correctly, but a vital component is missing on the node where
683  +validate-all+ is being executed.
684* +$OCF_ERR_PERM+ -- the resource is configured correctly and is not
685  missing any required components, but is suffering from a permission
686  issue (such as not being able to create a necessary file).
687
688+validate-all+ is usually wrapped in a function that is not only
689called when explicitly invoking the corresponding action, but also --
690as a sanity check -- from just about any other function. Therefore,
691the resource agent author must keep in mind that the function may be
692invoked during the +start+, +stop+, and +monitor+ operations, and also
693during probes.
694
695Probes pose a separate challenge for validation. During a probe (when
696the cluster manager may expect the resource _not_ to be running on the
697node where the probe is executed), some required components may be
698_expected_ to not be available on the affected node. For example, this
699includes any shared data on storage devices not available for reading
700during the probe. The +validate-all+ function may thus need to treat
701probes specially, using the +ocf_is_probe+ convenience function:
702
703[source,bash]
704--------------------------------------------------------------------------
705foobar_validate_all() {
706    # Test for configuration errors first
    if ! ocf_is_decimal "$OCF_RESKEY_eggs"; then
708       ocf_log err "eggs is not numeric!"
709       exit $OCF_ERR_CONFIGURED
710    fi
711
712    # Test for required binaries
713    check_binary frobnicate
714
715    # Check for data directory (this may be on shared storage, so
716    # disable this test during probes)
717    if ! ocf_is_probe; then
       # Quoted on purpose: an empty, unquoted value would turn the
       # test into "[ -d ]", which succeeds
       if ! [ -d "$OCF_RESKEY_datadir" ]; then
          ocf_log err "$OCF_RESKEY_datadir does not exist or is not a directory!"
720          exit $OCF_ERR_INSTALLED
721       fi
722    fi
723
724    return $OCF_SUCCESS
725}
726--------------------------------------------------------------------------
727
728=== +meta-data+ action
729
730The +meta-data+ action dumps the resource agent metadata to standard
731output. The output must follow the metadata format as specified in
732<<_metadata>>.
733
734[source,bash]
735--------------------------------------------------------------------------
foobar_meta_data() {
737    cat <<EOF
738<?xml version="1.0"?>
739<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
740<resource-agent name="foobar">
741  <version>0.1</version>
742  <longdesc lang="en">
743...
744EOF
745}
746--------------------------------------------------------------------------
747
748=== +promote+ action
749
The +promote+ action is optional. It needs to be supported only by
_stateful_ resource agents, that is, agents that distinguish between
752two distinct _roles_: +Master+ and +Slave+. +Slave+ is functionally
753identical to the +Started+ state in a stateless resource agent. Thus,
754while a regular (stateless) resource agent only needs to implement
755+start+ and +stop+, a stateful resource agent must also support the
756+promote+ action to be able to make a transition between the +Started+
757(+Slave+) and +Master+ roles.
758
759[source,bash]
760--------------------------------------------------------------------------
761foobar_promote() {
762    local rc
763
764    # exit immediately if configuration is not valid
765    foobar_validate_all || exit $?
766
767    # test the resource's current state
768    foobar_monitor
769    rc=$?
770    case "$rc" in
771        "$OCF_SUCCESS")
772            # Running as slave. Normal, expected behavior.
773            ocf_log debug "Resource is currently running as Slave"
774            ;;
775        "$OCF_RUNNING_MASTER")
776            # Already a master. Unexpected, but not a problem.
777            ocf_log info "Resource is already running as Master"
            return $OCF_SUCCESS
779            ;;
780        "$OCF_NOT_RUNNING")
781            # Currently not running. Need to start before promoting.
782            ocf_log info "Resource is currently not running"
783            foobar_start
784            ;;
785        *)
786            # Failed resource. Let the cluster manager recover.
787            ocf_log err "Unexpected error, cannot promote"
788            exit $rc
789            ;;
790    esac
791
792    # actually promote the resource here (make sure to immediately
793    # exit with an $OCF_ERR_ error code if anything goes seriously
794    # wrong)
795    ocf_run frobnicate --master-mode || exit $OCF_ERR_GENERIC
796
797    # After the resource has been promoted, check whether the
798    # promotion worked. If the resource promotion is asynchronous, the
799    # agent may spin on the monitor function here -- if the resource
800    # does not assume the Master role within the defined timeout, the
801    # cluster manager will consider the promote action failed.
802    while true; do
803        foobar_monitor
804        if [ $? -eq $OCF_RUNNING_MASTER ]; then
805            ocf_log debug "Resource promoted"
806            break
807        else
808            ocf_log debug "Resource still awaiting promotion"
809            sleep 1
810        fi
811    done
812
813    # only return $OCF_SUCCESS if _everything_ succeeded as expected
814    return $OCF_SUCCESS
815}
816--------------------------------------------------------------------------
817
818=== +demote+ action
819
The +demote+ action is optional. Only _stateful_ resource agents need
to support it, that is, agents that distinguish between two _roles_:
+Master+ and +Slave+. +Slave+ is functionally identical to the
+Started+ state in a stateless resource agent. Thus, while a regular
(stateless) resource agent only needs to implement +start+ and +stop+,
a stateful resource agent must also support the +demote+ action to be
able to make a transition from the +Master+ role to the +Started+
(+Slave+) role.
828
829[source,bash]
830--------------------------------------------------------------------------
831foobar_demote() {
832    local rc
833
834    # exit immediately if configuration is not valid
835    foobar_validate_all || exit $?
836
837    # test the resource's current state
838    foobar_monitor
839    rc=$?
840    case "$rc" in
841        "$OCF_RUNNING_MASTER")
842            # Running as master. Normal, expected behavior.
843            ocf_log debug "Resource is currently running as Master"
844            ;;
845        "$OCF_SUCCESS")
            # Already running as slave. Nothing to do.
            ocf_log debug "Resource is currently running as Slave"
            return $OCF_SUCCESS
849            ;;
850        "$OCF_NOT_RUNNING")
851            # Currently not running. Getting a demote action
852            # in this state is unexpected. Exit with an error
853            # and let the cluster manager recover.
854            ocf_log err "Resource is currently not running"
855            exit $OCF_ERR_GENERIC
856            ;;
857        *)
858            # Failed resource. Let the cluster manager recover.
859            ocf_log err "Unexpected error, cannot demote"
860            exit $rc
861            ;;
862    esac
863
864    # actually demote the resource here (make sure to immediately
865    # exit with an $OCF_ERR_ error code if anything goes seriously
866    # wrong)
867    ocf_run frobnicate --unset-master-mode || exit $OCF_ERR_GENERIC
868
869    # After the resource has been demoted, check whether the
870    # demotion worked. If the resource demotion is asynchronous, the
871    # agent may spin on the monitor function here -- if the resource
872    # does not assume the Slave role within the defined timeout, the
873    # cluster manager will consider the demote action failed.
874    while true; do
875        foobar_monitor
876        if [ $? -eq $OCF_RUNNING_MASTER ]; then
877            ocf_log debug "Resource still demoting"
878            sleep 1
879        else
880            ocf_log debug "Resource demoted"
881            break
882        fi
883    done
884
885    # only return $OCF_SUCCESS if _everything_ succeeded as expected
886    return $OCF_SUCCESS
887}
888--------------------------------------------------------------------------
889
890=== +migrate_to+ action
891
892The +migrate_to+ action can serve one of two purposes:
893
894* Initiate a native _push_ type migration for the resource. In other
895  words, instruct the resource to move _to_ a specific node from the
896  node it is currently running on. The resource agent knows about its
897  destination node via the +$OCF_RESKEY_CRM_meta_migrate_target+ environment
898  variable.
899
900* Freeze the resource in a _freeze/thaw_ (also known as
901  _suspend/resume_) type migration. In this mode, the resource does
902  not need any information about its destination node at this point.
903
904The example below illustrates a push type migration:
905
906[source,bash]
907--------------------------------------------------------------------------
908foobar_migrate_to() {
909    # exit immediately if configuration is not valid
910    foobar_validate_all || exit $?
911
912    # if resource is not running, bail out early
    if ! foobar_monitor; then
        ocf_log err "Resource is not running"
        exit $OCF_ERR_GENERIC
    fi

    # actually migrate the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --migrate \
                       --dest=$OCF_RESKEY_CRM_meta_migrate_target \
                       || exit $OCF_ERR_GENERIC
924    ...
925
926    # only return $OCF_SUCCESS if _everything_ succeeded as expected
927    return $OCF_SUCCESS
928}
929--------------------------------------------------------------------------
930
931In contrast, a freeze/thaw type migration may implement its freeze
932operation like this:
933
934[source,bash]
935--------------------------------------------------------------------------
936foobar_migrate_to() {
937    # exit immediately if configuration is not valid
938    foobar_validate_all || exit $?
939
940    # if resource is not running, bail out early
    if ! foobar_monitor; then
        ocf_log err "Resource is not running"
        exit $OCF_ERR_GENERIC
    fi

    # actually freeze the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --freeze || exit $OCF_ERR_GENERIC
950    ...
951
952    # only return $OCF_SUCCESS if _everything_ succeeded as expected
953    return $OCF_SUCCESS
954}
955--------------------------------------------------------------------------
956
957
958=== +migrate_from+ action
959
960The +migrate_from+ action can serve one of two purposes:
961
962* Complete a native _push_ type migration for the resource. In other
963  words, check whether the migration has succeeded properly, and the
  resource is running on the local node. The resource agent knows
  about the migration source via the
  +$OCF_RESKEY_CRM_meta_migrate_source+ environment variable.
967
968* Thaw the resource in a _freeze/thaw_ (also known as
  _suspend/resume_) type migration. In this mode, the resource usually
  does not need any information about its source node at this point.
971
972The example below illustrates a push type migration:
973
974[source,bash]
975--------------------------------------------------------------------------
976foobar_migrate_from() {
977    # exit immediately if configuration is not valid
978    foobar_validate_all || exit $?
979
980    # After the resource has been migrated, check whether it resumed
981    # correctly. If the resource starts asynchronously, the agent may
982    # spin on the monitor function here -- if the resource does not
983    # run within the defined timeout, the cluster manager will
984    # consider the migrate_from action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not yet migrated, waiting"
        sleep 1
    done
989
990    # only return $OCF_SUCCESS if _everything_ succeeded as expected
991    return $OCF_SUCCESS
992}
993--------------------------------------------------------------------------
994
995In contrast, a freeze/thaw type migration may implement its thaw
996operation like this:
997
998[source,bash]
999--------------------------------------------------------------------------
1000foobar_migrate_from() {
1001    # exit immediately if configuration is not valid
1002    foobar_validate_all || exit $?
1003
    # actually thaw the resource here (make sure to immediately
    # exit with an $OCF_ERR_ error code if anything goes seriously
    # wrong)
    ocf_run frobnicate --thaw || exit $OCF_ERR_GENERIC
1008
1009    # After the resource has been migrated, check whether it resumed
1010    # correctly. If the resource starts asynchronously, the agent may
1011    # spin on the monitor function here -- if the resource does not
1012    # run within the defined timeout, the cluster manager will
1013    # consider the migrate_from action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not yet migrated, waiting"
        sleep 1
    done
1018
1019    # only return $OCF_SUCCESS if _everything_ succeeded as expected
1020    return $OCF_SUCCESS
1021}
1022--------------------------------------------------------------------------
1023
1024
1025=== +notify+ action
1026
With notifications, instances of clones (and of master/slave
resources, which are an extended kind of clone) can inform each other
about their state. When notifications are enabled, certain actions on
any instance of a clone are accompanied by a +pre+ and a +post+
notification.

The following actions trigger notifications:
1033
1034* start
1035* stop
1036* promote
1037* demote
1038
1039The cluster manager invokes the +notify+ operation on _all_ clone
1040instances. For +notify+ operations, additional environment variables
1041are passed into the resource agent during execution:
1042
1043* +$OCF_RESKEY_CRM_meta_notify_type+ -- the notification type (+pre+
1044  or +post+)
1045
1046* +$OCF_RESKEY_CRM_meta_notify_operation+ -- the operation (action)
1047  that the notification is about (+start+, +stop+, +promote+, +demote+
1048  etc.)
1049
1050* +$OCF_RESKEY_CRM_meta_notify_start_uname+ -- node name of the node
1051  where the resource is being started (+start+ notifications only)
1052
1053* +$OCF_RESKEY_CRM_meta_notify_stop_uname+ -- node name of the node
1054  where the resource is being stopped (+stop+ notifications only)
1055
1056* +$OCF_RESKEY_CRM_meta_notify_master_uname+ -- node name of the node
1057  where the resource currently _is in_ the Master role
1058
1059* +$OCF_RESKEY_CRM_meta_notify_promote_uname+ -- node name of the node
1060  where the resource currently _is being promoted to_ the Master role
1061  (+promote+ notifications only)
1062
1063* +$OCF_RESKEY_CRM_meta_notify_demote_uname+ -- node name of the node
1064  where the resource currently _is being demoted to_ the Slave role
1065  (+demote+ notifications only)
1066
1067Notifications come in particularly handy for master/slave resources
1068using a "pull" scheme, where the master is a publisher and the slave a
1069subscriber. Since the master is obviously only available as such when
1070a promotion has occurred, the slaves can use a "pre-promote"
1071notification to configure themselves to subscribe to the right
1072publisher.
1073
1074Likewise, the subscribers may want to unsubscribe from the publisher
1075after it has relinquished its master status, and a "post-demote"
1076notification can be used for that purpose.
1077
1078Consider the example below to illustrate the concept.
1079
1080[source,bash]
1081--------------------------------------------------------------------------
1082foobar_notify() {
1083    local type_op
1084    type_op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"
1085
1086    ocf_log debug "Received $type_op notification."
1087    case "$type_op" in
        'pre-promote')
            ocf_run frobnicate --slave-mode \
                               --master=$OCF_RESKEY_CRM_meta_notify_promote_uname \
                               || exit $OCF_ERR_GENERIC
            ;;
        'post-demote')
            ocf_run frobnicate --unset-slave-mode || exit $OCF_ERR_GENERIC
            ;;
1096    esac
1097
1098    return $OCF_SUCCESS
1099}
1100--------------------------------------------------------------------------
1101
1102NOTE: A master/slave resource agent may support a _multi-master_
1103configuration, where there is possibly more than one master at any
1104given time. If that is the case, then the
+$OCF_RESKEY_CRM_meta_notify_*_uname+ variables may each contain a
space-separated list of host names, rather than a single host name as
shown in the example. Under those circumstances the resource agent
would have to properly iterate over this list.
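
The iteration mentioned in the note above could look roughly like the
following sketch. It is an illustration only: +foobar_subscribe_to+ is
a hypothetical helper standing in for whatever the agent needs to do
per master, and the variable is set by hand here, whereas during a
real +notify+ action the cluster manager provides it.

```shell
# Hypothetical helper: stands in for the agent-specific work that
# must happen once per master (for example, subscribing to it).
foobar_subscribe_to() {
    echo "subscribing to master on node $1"
}

# Set by hand for illustration only; during a real notify action
# the cluster manager provides this variable.
OCF_RESKEY_CRM_meta_notify_promote_uname="alice bob"

# The unquoted expansion is intentional: it lets the shell split the
# space-separated list into one word per node name.
for node in $OCF_RESKEY_CRM_meta_notify_promote_uname; do
    foobar_subscribe_to "$node"
done
```

With a single host name in the variable, the same loop simply runs
once, so an agent written this way handles both configurations.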
1109
1110== Script variables
1111
1112This section outlines variables typically available to resource agents,
1113primarily for convenience purposes. For additional variables
1114available while the agent is being executed, refer to
1115<<_environment_variables>> and <<_return_codes>>.
1116
1117=== +$OCF_RA_VERSION_MAJOR+
1118
1119The major version number of the resource agent API that the cluster
1120manager is currently using.
1121
1122=== +$OCF_RA_VERSION_MINOR+
1123
1124The minor version number of the resource agent API that the cluster
1125manager is currently using.
1126
1127=== +$OCF_ROOT+
1128
1129The root of the OCF resource agent hierarchy. This should never be
1130changed by a resource agent. This is usually +/usr/lib/ocf+.
1131
1132=== +$OCF_FUNCTIONS_DIR+
1133
The directory where the resource agents' shell function library,
1135+ocf-shellfuncs+, resides. This is usually defined in terms of
1136+$OCF_ROOT+ and should never be changed by a resource agent. This
1137variable may, however, be overridden from the command line while
1138testing a new or modified resource agent.
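
The sketch below illustrates how such an overridable default can be
expressed in shell. The +lib/heartbeat+ subdirectory is an assumption
about a typical installation and may differ on your system; the final
+echo+ merely stands in for sourcing the library.

```shell
# Illustration only: OCF_ROOT is normally exported by the cluster
# manager; the value below is the usual location named above.
: "${OCF_ROOT=/usr/lib/ocf}"

# Assign a default only if OCF_FUNCTIONS_DIR is not already set, so a
# value passed in from the command line takes precedence. The
# lib/heartbeat subdirectory is an assumption about the installation.
: "${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}"

echo "would source ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs"
```

Because the assignment only takes effect when the variable is unset, a
developer can point a test run at a modified copy of the library by
setting +OCF_FUNCTIONS_DIR+ in the environment before invoking the
agent.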
1139
1140=== +$OCF_EXIT_REASON_PREFIX+
1141
Used as a prefix when printing error messages from the resource agent.
The shell functions use this prefix automatically, so shell-based
agents do not need to reference it explicitly.
1145
1146=== +$OCF_RESOURCE_INSTANCE+
1147
1148The resource instance name. For primitive (non-clone, non-stateful)
1149resources, this is simply the resource name. For clones and stateful
resources, this is the primitive name, followed by a colon and the
1151clone instance number (such as +p_foobar:0+).
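
If an agent needs the two parts separately, plain shell parameter
expansion is sufficient. The following is a minimal sketch; the
variable names +primitive_name+ and +clone_number+ are made up for
this illustration, and the instance name is set by hand.

```shell
# Set by hand for illustration; the cluster manager normally
# provides this variable.
OCF_RESOURCE_INSTANCE="p_foobar:0"

# Everything before the first colon is the primitive name.
primitive_name="${OCF_RESOURCE_INSTANCE%%:*}"

# A clone instance number is only present if the name has a colon.
case "$OCF_RESOURCE_INSTANCE" in
    *:*) clone_number="${OCF_RESOURCE_INSTANCE##*:}" ;;
    *)   clone_number="" ;;
esac

echo "primitive=$primitive_name clone=$clone_number"
```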
1152
1153=== +$OCF_RESOURCE_TYPE+
1154
1155The resource type of the current resource, e.g. IPaddr2.
1156
1157=== +$OCF_RESOURCE_PROVIDER+
1158
The resource provider, e.g. heartbeat. This variable may not be set by
all cluster managers that implement Resource Agent API version 1.0.
1161
1162=== +$__OCF_ACTION+
1163
1164The currently invoked action. This is exactly the first command-line
1165argument that the cluster manager specifies when it invokes the
1166resource agent.
1167
1168=== +$__SCRIPT_NAME+
1169
1170The name of the resource agent. This is exactly the base name of the
1171resource agent script, with leading directory names removed.
1172
1173=== +$HA_RSCTMP+
1174
1175A temporary directory for use by resource agents. The system startup
1176sequence (on any LSB compliant Linux distribution) guarantees that
1177this directory is emptied on system startup, so this directory will
1178not contain any stale data after a node reboot.
1179
1180== Convenience functions
1181
1182=== Logging: +ocf_log+
1183
1184Resource agents should use the +ocf_log+ function for logging
1185purposes. This convenient logging wrapper is invoked as follows:
1186
1187[source,bash]
1188--------------------------------------------------------------------------
1189ocf_log <severity> "Log message"
1190--------------------------------------------------------------------------
1191
It supports the following severity levels:
1193
1194* +debug+ -- for debugging messages. Most logging configurations
1195  suppress this level by default.
1196* +info+ -- for informational messages about the agent's behavior or
1197  status.
1198* +warn+ -- for warnings. This is for any messages which reflect
1199  unexpected behavior that does _not_ constitute an unrecoverable
1200  error.
1201* +err+ -- for errors. As a general rule, this logging level should
1202  only be used immediately prior to an +exit+ with the appropriate
1203  error code.
1204* +crit+ -- for critical errors. As with +err+, this logging level
1205  should not be used unless the resource agent also exits with an
1206  error code. Very rarely used.
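
To illustrate how these levels are typically chosen, consider the
sketch below. It defines a stand-in +ocf_log+ so that it can run
outside a cluster, and +foobar_check_cache+ is a hypothetical helper
invented for this example.

```shell
# Stand-in for the library function so that the sketch runs outside
# a cluster; the real ocf_log is provided by ocf-shellfuncs.
ocf_log() {
    local severity="$1"; shift
    echo "$severity: $*"
}

# Hypothetical check: warn about a recoverable oddity, and reserve
# err for the message logged right before returning an error code.
foobar_check_cache() {
    if [ ! -d "$1" ]; then
        ocf_log err "cache directory $1 does not exist"
        return 1    # 1 is the value of OCF_ERR_GENERIC
    fi
    if [ ! -w "$1" ]; then
        ocf_log warn "cache directory $1 is read-only, caching disabled"
    fi
    ocf_log debug "cache directory $1 checked"
    return 0
}

foobar_check_cache /nonexistent/foobar-cache \
    || echo "check failed with code $?"
```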
1207
1208=== Testing for binaries: +have_binary+ and +check_binary+
1209
1210A resource agent may need to test for the availability of a specific
1211executable. The +have_binary+ convenience function comes in handy
1212here:
1213
1214[source,bash]
1215--------------------------------------------------------------------------
1216if ! have_binary frobnicate; then
1217   ocf_log warn "Missing frobnicate binary, frobnication disabled!"
1218fi
1219--------------------------------------------------------------------------
1220
1221If a missing binary is a fatal problem for the resource, then the
1222+check_binary+ function should be used:
1223
1224[source,bash]
1225--------------------------------------------------------------------------
1226check_binary frobnicate
1227--------------------------------------------------------------------------
1228
1229Using +check_binary+ is a shorthand method for testing for the
1230existence (and executability) of the specified binary, and exiting
1231with +$OCF_ERR_INSTALLED+ if it cannot be found or executed.
1232
1233NOTE: Both +have_binary+ and +check_binary+ honor +$PATH+ when the
binary to test for is not specified as a full path. It is usually wise
to _not_ test for a full path, as binary installation paths may vary
by distribution or user policy.
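
For readers unfamiliar with +$PATH+-based lookups, the sketch below
shows the general idea using the standard +command -v+ builtin. It is
a rough approximation for illustration, not the library's
implementation, and +sketch_have_binary+ is a made-up name.

```shell
# Rough approximation for illustration, NOT the library's code: this
# only checks whether the name resolves to a command (which includes
# shell builtins and functions) in the current environment.
sketch_have_binary() {
    command -v "$1" >/dev/null 2>&1
}

if sketch_have_binary sh; then
    echo "sh found in PATH"
fi
if ! sketch_have_binary frobnicate-does-not-exist; then
    echo "frobnicate-does-not-exist is missing"
fi
```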
1237
1238=== Executing commands and capturing their output: +ocf_run+
1239
1240Whenever a resource agent needs to execute a command and capture its
1241output, it should use the +ocf_run+ convenience function, invoked as
1242in this example:
1243
1244[source,bash]
1245--------------------------------------------------------------------------
1246ocf_run frobnicate --spam=eggs || exit $OCF_ERR_GENERIC
1247--------------------------------------------------------------------------
1248
1249With the command specified above, the resource agent will invoke
1250+frobnicate --spam=eggs+ and capture its output and
1251exit code. If the exit code is nonzero (indicating an error),
1252+ocf_run+ logs the command output with the +err+ logging severity, and
1253the resource agent subsequently exits.  If the exit code is zero
1254(indicating success), any command output will be logged with the +info+
1255logging severity.
1256
1257If the resource agent wishes to ignore the output of a successful
1258command execution, it can use the +-q+ flag with +ocf_run+. In the
1259example below, +ocf_run+ will only log output if the command exit code
1260is nonzero.
1261
1262[source,bash]
1263--------------------------------------------------------------------------
1264ocf_run -q frobnicate --spam=eggs || exit $OCF_ERR_GENERIC
1265--------------------------------------------------------------------------
1266
1267Finally, if the resource agent wants to log the output of a command
1268with a nonzero exit code with a severity _other_ than error, it may do
1269so by adding the +-info+ or +-warn+ option to +ocf_run+:
1270
1271[source,bash]
1272--------------------------------------------------------------------------
1273ocf_run -warn frobnicate --spam=eggs
1274--------------------------------------------------------------------------
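
The default behavior described in this section (without the +-q+,
+-info+ or +-warn+ options) can be approximated as follows. This is a
simplified sketch for understanding, not the library's implementation;
+sketch_run+ and the stand-in +ocf_log+ are defined only for this
example.

```shell
# Stand-in logger so the sketch runs outside a cluster.
ocf_log() {
    local severity="$1"; shift
    echo "$severity: $*"
}

# Simplified illustration, NOT the library's implementation: run a
# command, capture its combined stdout and stderr, log the output,
# and pass the command's exit code through to the caller.
sketch_run() {
    local output rc
    rc=0
    output=$("$@" 2>&1) || rc=$?
    if [ "$rc" -ne 0 ]; then
        ocf_log err "$output"
    elif [ -n "$output" ]; then
        ocf_log info "$output"
    fi
    return "$rc"
}

sketch_run echo "frobnication complete"
sketch_run sh -c 'echo "cannot frobnicate"; exit 2' \
    || echo "exit code was $?"
```

Because the wrapper returns the command's own exit code, callers can
chain it with +|| exit $OCF_ERR_GENERIC+ exactly as in the examples
above.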
1275
1276=== Locks: +ocf_take_lock+ and +ocf_release_lock_on_exit+
1277
1278Occasionally, there may be different resources of the same type in a
1279cluster configuration that should not execute actions in
1280parallel. When a resource agent needs to guard against parallel
1281execution on the same machine, it can use the +ocf_take_lock+ and
1282+ocf_release_lock_on_exit+ convenience functions:
1283
1284[source,bash]
1285--------------------------------------------------------------------------
1286LOCKFILE=${HA_RSCTMP}/foobar
1287ocf_release_lock_on_exit $LOCKFILE
1288
1289foobar_start() {
1290    ...
1291    ocf_take_lock $LOCKFILE
1292    ...
1293}
1294--------------------------------------------------------------------------
1295
1296+ocf_take_lock+ attempts to acquire the designated +$LOCKFILE+. When
1297it is unavailable, it sleeps a random amount of time between 0 and 1
1298seconds, and retries. +ocf_release_lock_on_exit+ releases the lock
1299file when the agent exits (for any reason).
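
The take, retry, and release-on-exit pattern can be demonstrated in
isolation with the sketch below. Note that it swaps in +mkdir+ as the
atomic locking primitive and a fixed one-second retry, which is not
how the library functions work internally; the function names and the
lock location are hypothetical.

```shell
# Illustration of the take/retry/release-on-exit pattern only. This
# sketch uses "mkdir" as the atomic primitive, which is NOT how the
# library implements its lock files.
sketch_take_lock() {
    until mkdir "$1" 2>/dev/null; do
        sleep 1    # lock is held by someone else; wait and retry
    done
}

sketch_release_lock_on_exit() {
    # Double quotes on purpose: the lock path is expanded now, when
    # the trap is installed, not later when the shell exits.
    trap "rmdir '$1' 2>/dev/null" EXIT
}

# Hypothetical lock location for this demonstration.
LOCKDIR="${TMPDIR:-/tmp}/foobar-sketch.lock.$$"

# Run the critical section in a subshell so that its EXIT trap fires
# as soon as the subshell ends.
(
    sketch_release_lock_on_exit "$LOCKDIR"
    sketch_take_lock "$LOCKDIR"
    echo "lock held: critical section runs here"
)
[ -d "$LOCKDIR" ] || echo "lock released"
```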
1300
1301=== Testing for numerical values: +ocf_is_decimal+
1302
1303Specifically for parameter validation, it can be helpful to test
1304whether a given value is numeric. The +ocf_is_decimal+ function exists
for that purpose:

[source,bash]
1306--------------------------------------------------------------------------
1307foobar_validate_all() {
1308    if ! ocf_is_decimal $OCF_RESKEY_eggs; then
1309        ocf_log err "eggs is not numeric!"
1310        exit $OCF_ERR_CONFIGURED
1311    fi
1312    ...
1313}
1314--------------------------------------------------------------------------
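
To make the notion of "numeric" concrete, the following sketch shows
one plausible reading: a non-empty string consisting of digits only.
This is an approximation written for this guide under that assumption,
not necessarily what the library does, and +sketch_is_decimal+ is a
made-up name.

```shell
# Approximation for illustration, not necessarily the library's code:
# a value counts as decimal if it is non-empty and consists of
# digits only.
sketch_is_decimal() {
    case "$1" in
        ""|*[!0-9]*) return 1 ;;
        *)           return 0 ;;
    esac
}

for value in 42 007 4.2 -1 eggs ""; do
    if sketch_is_decimal "$value"; then
        echo "'$value' is decimal"
    else
        echo "'$value' is not decimal"
    fi
done
```

Under this reading, signed values and fractions are rejected, so a
parameter that legitimately accepts them would need a different test.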
1315
1316=== Testing for boolean values: +ocf_is_true+
1317
1318When a resource agent defines a boolean parameter, the value
1319for this parameter may be specified by the user as +0+/+1+,
1320+true+/+false+, or +on+/+off+. Since it is tedious to test for all
1321these values from within the resource agent, the agent should instead
1322use the +ocf_is_true+ convenience function:
1323
1324[source,bash]
1325--------------------------------------------------------------------------
1326if ocf_is_true $OCF_RESKEY_superfrobnicate; then
1327    ocf_run frobnicate --super
1328fi
1329--------------------------------------------------------------------------
1330
NOTE: If +ocf_is_true+ is used against an empty or nonexistent
1332variable, it always returns an exit code of +1+, which is equivalent
1333to +false+.
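
The kind of test that this function saves the agent author from
writing can be sketched like this. The stand-in below accepts only the
"true" spellings named in this section; the library function may
recognize additional ones, and +sketch_is_true+ is a made-up name.

```shell
# Approximation for illustration: accepts only the "true" spellings
# listed in this section; the library function may accept more.
sketch_is_true() {
    case "$1" in
        1|true|on) return 0 ;;
        *)         return 1 ;;
    esac
}

for value in 1 true on 0 false off ""; do
    if sketch_is_true "$value"; then
        echo "'$value' is treated as true"
    else
        echo "'$value' is treated as false"
    fi
done
```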
1334
1335=== Version comparison: +ocf_version_cmp+
1336
1337A resource agent may want to check the version of software
1338installed. +ocf_version_cmp+ takes care of all the necessary
1339details.
1340
The return codes are:

* +0+ -- the first version is smaller (earlier) than the second
* +1+ -- the two versions are equal
* +2+ -- the first version is greater (later) than the second
* +3+ -- one of the arguments is not recognized as a version string
1347
1348The versions are allowed to contain digits, dots, and dashes.
1349
1350[source,bash]
1351--------------------------------------------------------------------------
local v=$(gooey --version)
ocf_version_cmp "$v" 12.0.8-1
case $? in
    0)  ocf_log err "we do not support version $v, it is too old"
        exit $OCF_ERR_INSTALLED
        ;;
    [12]) ;; # we can work with versions >= 12.0.8-1
    3)  ocf_log err "gooey produced version <$v>, too funky for me"
        exit $OCF_ERR_INSTALLED
        ;;
esac
1363--------------------------------------------------------------------------
1364
1365=== Pseudo resources: +ha_pseudo_resource+
1366
"Pseudo resources" are those where the resource agent does not
actually start or stop something akin to a runnable process, but
merely executes a single action and then needs some way of tracking
whether that action has been executed or not. The +portblock+ resource
agent is an example of this.
1372
1373Resource agents for pseudo resources can use a convenience function,
1374+ha_pseudo_resource+, which makes use of _tracking files_ to keep tabs
1375on the status of a resource. If +foobar+ was designed to manage a
1376pseudo resource, then its +start+ action could look like this:
1377
1378[source,bash]
1379--------------------------------------------------------------------------
1380foobar_start() {
1381    # exit immediately if configuration is not valid
1382    foobar_validate_all || exit $?
1383
1384    # if resource is already running, bail out early
    if foobar_monitor; then
        ocf_log info "Resource is already running"
        return $OCF_SUCCESS
    fi
1389
1390    # start the pseudo resource
1391    ha_pseudo_resource ${OCF_RESOURCE_INSTANCE} start
1392
1393    # After the resource has been started, check whether it started up
1394    # correctly. If the resource starts asynchronously, the agent may
1395    # spin on the monitor function here -- if the resource does not
1396    # start up within the defined timeout, the cluster manager will
1397    # consider the start action failed
    while ! foobar_monitor; do
        ocf_log debug "Resource has not started yet, waiting"
        sleep 1
    done
1402
1403    # only return $OCF_SUCCESS if _everything_ succeeded as expected
1404    return $OCF_SUCCESS
1405}
1406--------------------------------------------------------------------------
1407
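The tracking-file idea itself is simple and can be illustrated with
the stand-in below. It is a sketch of the concept only, not the
library's implementation: +sketch_pseudo_resource+ is a made-up name,
and a throwaway directory takes the place of +$HA_RSCTMP+.

```shell
# Stand-in illustrating the tracking-file idea, NOT the library's
# implementation: "start" creates a file, "stop" removes it, and
# "monitor" reports whether it exists.
sketch_pseudo_resource() {
    local tracking_file="${HA_RSCTMP}/$1"
    case "$2" in
        start)
            touch "$tracking_file"
            ;;
        stop)
            rm -f "$tracking_file"
            ;;
        monitor)
            if [ -f "$tracking_file" ]; then
                return 0    # value of OCF_SUCCESS
            fi
            return 7        # value of OCF_NOT_RUNNING
            ;;
        *)
            return 3        # value of OCF_ERR_UNIMPLEMENTED
            ;;
    esac
}

# Throwaway directory for this demonstration; a real agent uses the
# HA_RSCTMP value provided by the shell function library.
HA_RSCTMP=$(mktemp -d)

sketch_pseudo_resource p_foobar start
sketch_pseudo_resource p_foobar monitor && echo "running"
sketch_pseudo_resource p_foobar stop
sketch_pseudo_resource p_foobar monitor || echo "not running (code $?)"
rm -rf "$HA_RSCTMP"
```

Since the tracking file lives in a directory that is emptied at system
startup, a pseudo resource tracked this way is reported as not running
after a node reboot.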
1408
1409== Conventions
1410
1411This section contains a collection of conventions that have emerged in
1412the resource agent repositories over the years. Following these
1413conventions is by no means mandatory for resource agent authors, but
1414it is a good idea based on the
1415http://en.wikipedia.org/wiki/Principle_of_least_surprise[Principle of
1416Least Surprise] -- resource agents following these conventions will be
1417easier to understand, review, and use than those that do not.
1418
1419=== Well-known parameter names
1420
1421Several parameter names are supported by a number of resource
1422agents. For new resource agents, following these examples is generally
1423a good idea:
1424
1425* +binary+ -- the name of a binary that principally manages the
1426  resource, such as a server daemon
1427* +config+ -- the full path to a configuration file
1428* +pid+ -- the full path to a file holding a process ID (PID)
1429* +log+ -- the full path to a log file
1430* +socket+ -- the full path to a UNIX socket that the resource manages
1431* +ip+ -- an IP address that a daemon binds to
1432* +port+ -- a TCP or UDP port that a daemon binds to
1433
1434Needless to say, resource agents should only implement any of these
1435parameters if they are sensible to use in the agent's context.
1436
1437=== Parameter defaults
1438
1439Defaults for resource agent parameters should be set by initializing
1440variables with the suffix +_default+:
1441
1442[source,bash]
1443--------------------------------------------------------------------------
1444# Defaults
1445OCF_RESKEY_superfrobnicate_default=0
1446
1447: ${OCF_RESKEY_superfrobnicate=${OCF_RESKEY_superfrobnicate_default}}
1448--------------------------------------------------------------------------
1449
1450NOTE: The resource agent should make sure that it sets a default for
1451any parameter not marked as +required+ in the metadata.
1452
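The effect of this idiom can be observed with the small sketch below,
which runs the same assignment once without and once with a
user-supplied value. The +apply_default+ helper exists only for this
demonstration.

```shell
OCF_RESKEY_superfrobnicate_default=0

# Helper for this demonstration only: applies the default and prints
# the resulting parameter value.
apply_default() {
    : "${OCF_RESKEY_superfrobnicate=${OCF_RESKEY_superfrobnicate_default}}"
    echo "superfrobnicate=$OCF_RESKEY_superfrobnicate"
}

# Case 1: parameter not configured, so the default applies.
( unset OCF_RESKEY_superfrobnicate; apply_default )

# Case 2: parameter configured by the user, so it is left alone.
( OCF_RESKEY_superfrobnicate=1; apply_default )
```

Note that the +=+ form of the expansion (without a colon) assigns the
default only when the variable is unset; a parameter that the user
explicitly set to an empty string keeps its empty value.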
1453
1454=== Honoring +PATH+ for binaries
1455
1456When a resource agent supports a parameter designed to hold the name
1457of a binary (such as a daemon, or a client utility for querying
1458status), then that parameter should honor the +PATH+ environment
1459variable. Do not supply full paths. Thus, the following approach:
1460
1461[source,bash]
1462--------------------------------------------------------------------------
1463# Good example -- do it this way
1464OCF_RESKEY_frobnicate_default="frobnicate"
1465: ${OCF_RESKEY_frobnicate="${OCF_RESKEY_frobnicate_default}"}
1466--------------------------------------------------------------------------
1467
1468is much preferred over specifying a full path, as shown here:
1469
1470[source,bash]
1471--------------------------------------------------------------------------
1472# Bad example -- avoid if you can
1473OCF_RESKEY_frobnicate_default="/usr/local/sbin/frobnicate"
1474: ${OCF_RESKEY_frobnicate="${OCF_RESKEY_frobnicate_default}"}
1475--------------------------------------------------------------------------
1476
1477This rule holds for defaults, as well.
1478
1479
1480
1481== Special considerations
1482
1483=== Licensing
1484
1485Whenever possible, resource agent contributors are _encouraged_ to use
1486the GNU General Public License (GPL), version 2 and later, for any new
1487resource agents. The shell functions library does not strictly mandate
1488this, however, as it is licensed under the GNU Lesser General Public
1489License (LGPL), version 2.1 and later (so it can be used by non-GPL
1490agents).
1491
1492The resource agent _must_ explicitly state its own license in the
1493agent source code.
1494
1495
1496=== Locale settings
1497
1498When sourcing +ocf-shellfuncs+ as explained in <<_initialization>>,
1499any resource agent automatically sets +LANG+ and +LC_ALL+ to the +C+
1500locale. Resource agents can thus expect to always operate in the +C+
1501locale, and need not reset +LANG+ or any of the +LC_+ environment
1502variables themselves.
1503
1504
1505=== Testing for running processes
1506
1507For testing whether a particular process (with a known process ID) is
1508currently running, a frequently found method is to send it a +0+
1509signal and catch errors, similar to this example:
1510
1511[source,bash]
1512--------------------------------------------------------------------------
if kill -s 0 $(cat "$daemon_pid_file"); then
    ocf_log debug "Process is currently running"
else
    ocf_log warn "Process is dead, removing pid file"
    rm -f "$daemon_pid_file"
fi
1519--------------------------------------------------------------------------
1520
1521IMPORTANT: An approach far superior to this example is to instead test
1522the _functionality_ of the daemon by connecting to it with a client
1523process, as shown in the example in
1524<<_literal_monitor_literal_action>>.
1525
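Where a signal-based test is used nonetheless, for instance as a cheap
first check, it helps to guard against a missing or empty PID file.
The following sketch shows one way to do so; +foobar_pid_alive+ and
+demo_pid_file+ are hypothetical names chosen for this illustration.

```shell
# Hypothetical helper: succeeds if the PID stored in the given file
# belongs to a process that can be signalled; otherwise removes the
# stale pid file (if any) and fails.
foobar_pid_alive() {
    local pid_file="$1" pid
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    if [ -n "$pid" ] && kill -s 0 "$pid" 2>/dev/null; then
        return 0
    fi
    rm -f "$pid_file"
    return 1
}

# Demonstration using the PID of the current shell, which is
# certainly alive.
demo_pid_file=$(mktemp)
echo $$ > "$demo_pid_file"
foobar_pid_alive "$demo_pid_file" && echo "process is running"
rm -f "$demo_pid_file"
```

Keep in mind that +kill+ can also fail when the process exists but
belongs to a different user, which is one more reason to prefer a
functional test where possible.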
1526
1527=== Specifying a master preference
1528
Stateful (master/slave) resources must set their own _master
preference_, thereby giving the cluster manager a hint as to which
instance is the best one to promote to the +Master+ role.
1532
1533IMPORTANT: It is acceptable for multiple instances to have identical
positive master preferences. In that case, the cluster resource
manager will automatically select an instance to
promote. However, if _all_ instances have the (default) master score
1537of zero, the cluster manager will not promote any instance at
1538all. Thus, it is crucial that at least one instance has a positive
1539master score.
1540
For this purpose, +crm_master+ comes in handy. This convenience
wrapper around +crm_attribute+ sets a node attribute named
1543+master-<<_literal_ocf_resource_instance_literal,$OCF_RESOURCE_INSTANCE>>+
1544for the node it is being executed on, and fills this attribute with
1545the specified value. The cluster manager is then expected to translate
1546this into a promotion score for the corresponding instance, and base
1547its promotion preference on that score.
1548
1549Stateful resource agents typically execute +crm_master+ during the
1550<<_literal_monitor_literal_action,+monitor+>> and/or
1551<<_literal_notify_literal_action,+notify+>> action.
1552
1553The following example assumes that the +foobar+ resource agent can
1554test the application's status by executing a binary that returns
1555certain exit codes based on whether
1556
1557* the resource is either in the master role, or is a slave that is
1558  fully caught up with the master (at any rate, it has current data),
1559  or
1560* the resource is in the slave role, but through some form of
1561  asynchronous replication has "fallen behind" the master, or
1562* the resource has gracefully stopped, or
1563* the resource has unexpectedly failed.
1564
1565[source,bash]
1566--------------------------------------------------------------------------
1567foobar_monitor() {
1568    local rc
1569
1570    # exit immediately if configuration is not valid
1571    foobar_validate_all || exit $?
1572
1573    ocf_run frobnicate --test
1574
1575    # This example assumes the following exit code convention
1576    # for frobnicate:
1577    # 0: running, and fully caught up with master
1578    # 1: gracefully stopped
1579    # 2: running, but lagging behind master
1580    # any other: error
1581    case "$?" in
        0)
            rc=$OCF_SUCCESS
            ocf_log debug "Resource is running"
            # Set a high master preference. The current master
            # will always get this, plus 1. Any current slaves
            # will get a high preference so that if the master
            # fails, they are next in line to take over.
            crm_master -l reboot -v 100
            ;;
        1)
            rc=$OCF_NOT_RUNNING
            ocf_log debug "Resource is not running"
            # Remove the master preference for this node
            crm_master -l reboot -D
            ;;
        2)
            rc=$OCF_SUCCESS
            ocf_log debug "Resource is lagging behind master"
            # Set a low master preference: if the master fails
            # right now, and there is another slave that does
            # not lag behind the master, its higher master
            # preference will win and that slave will become
            # the new master
            crm_master -l reboot -v 5
            ;;
        *)
            ocf_log err "Resource has failed"
            exit $OCF_ERR_GENERIC
            ;;
    esac
1611
1612    return $rc
1613}
1614--------------------------------------------------------------------------
1615
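The mapping from application state to master score can also be kept in one place, which makes it easier to reuse from both the +monitor+ and +notify+ actions. The following sketch assumes the same +frobnicate+ exit code convention as the example above. Both function names are hypothetical, and unlike the example above, this variant removes the master preference for every state other than "running" or "lagging".

```shell
#!/bin/bash
# Hypothetical helper: translate the exit code of "frobnicate --test"
# (assumed convention: 0 = current data, 2 = lagging) into a master
# score. Prints nothing if the instance should not be promoted.
foobar_master_score() {
    case "$1" in
        0) echo 100 ;;  # running with current data
        2) echo 5 ;;    # running, but lagging behind the master
        *) ;;           # stopped or failed: no preference
    esac
}

# Hypothetical helper: set the score for this node with crm_master,
# or delete the attribute if there is no score to set.
foobar_update_master_score() {
    local score
    score=$(foobar_master_score "$1")
    if [ -n "$score" ]; then
        crm_master -l reboot -v "$score"
    else
        crm_master -l reboot -D
    fi
}
```

A +monitor+ implementation could then call +foobar_update_master_score $?+ immediately after running +frobnicate --test+.
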
1616
1617== Testing resource agents
1618
1619This section discusses automated testing for resource agents. Testing
1620is a vital aspect of development; it is crucial both for creating new
1621resource agents, and for modifying existing ones.
1622
1623
1624=== Testing with +ocf-tester+
1625
1626The resource agents repository (and hence, any installed resource
1627agents package) contains a utility named +ocf-tester+. This shell
1628script allows you to conveniently and easily test the functionality of
1629your resource agent.
1630
1631+ocf-tester+ is commonly invoked, as +root+, like this:
1632
1633--------------------------------------------------------------------------
1634ocf-tester -n <name> [-o <param>=<value> ... ] <resource agent>
1635--------------------------------------------------------------------------
1636
1637* +<name>+ is an arbitrary resource name.
1638
1639* You may set any number of +<param>=<value>+ with the +-o+ option,
1640  corresponding to any resource parameters you wish to set for
1641  testing.
1642
1643* +<resource agent>+ is the full path to your resource agent.
1644
1645When invoked, +ocf-tester+ executes all mandatory actions and enforces
1646action behavior as explained in <<_resource_agent_actions>>.
1647
1648It also tests for optional actions. Optional actions must behave as
1649expected when advertised, but do not cause +ocf-tester+ to flag an
1650error if not implemented.
1651
IMPORTANT: +ocf-tester+ does not initiate "dry runs" of actions, nor
does it create resource dummies of any kind. Instead, it exercises the
actual resource agent as-is, even if that involves opening and
closing databases, mounting file systems, starting or stopping virtual
machines, etc. Use with care.
1657
1658For example, you could run +ocf-tester+ on the +foobar+ resource agent
1659as follows:
1660
1661--------------------------------------------------------------------------
1662# ocf-tester -n foobartest \
1663             -o superfrobnicate=true \
1664             -o datadir=/tmp \
1665             /home/johndoe/ra-dev/foobar
1666Beginning tests for /home/johndoe/ra-dev/foobar...
1667* Your agent does not support the notify action (optional)
1668* Your agent does not support the reload action (optional)
1669/home/johndoe/ra-dev/foobar passed all tests
1670--------------------------------------------------------------------------
1671
If the resource agent behaves in ways that are difficult to
understand, as is typical of freshly written software, use the
+-v+ and +-d+ options to produce more output. If that does not
help, instruct +ocf-tester+ to trace the resource agent with
+-X+ (make sure to redirect the output to a file, unless you are a
really fast reader).
1678
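A traced run of the +foobar+ agent might be captured as sketched below. The log file name +/tmp/foobar-trace.log+ is an arbitrary assumption, and redirecting standard error together with standard output is a precaution, since shell traces are normally written to standard error.

--------------------------------------------------------------------------
# ocf-tester -X -n foobartest \
             -o datadir=/tmp \
             /home/johndoe/ra-dev/foobar > /tmp/foobar-trace.log 2>&1
--------------------------------------------------------------------------
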
1679=== Testing with +ocft+
1680
+ocft+ is a testing tool for resource agents. The main difference
from +ocf-tester+ is that +ocft+ can automate the creation of
complex testing environments, including package installation and
arbitrary shell scripting.
1685
1686==== +ocft+ components
1687
1688+ocft+ consists of the following components:
1689
1690* A test case generator (+/usr/sbin/ocft+) -- generates shell
1691  scripts from test case configuration files
1692
1693* Configuration files (+/usr/share/resource-agents/ocft/configs/+) --
1694  a configuration file contains environment setup and test cases
1695  for one resource agent
1696
* Testing scripts (+/var/lib/resource-agents/ocft/cases/+) -- the
  generated shell scripts; normally there is no need to inspect them
1699
1700==== Customizing the testing environment
1701
1702+ocft+ modifies the runtime environment of the resource agent
1703either by changing environment variables (through the interface
1704defined by OCF) or by running ad-hoc shell scripts which can for
1705instance change permissions of a file or unmount a file system.
1706
1707==== How to test
1708
1709You need to know the software (resource) you want to test. Draw a
1710sketch of all interesting scenarios, with all expected and
1711unexpected conditions and how the resource agent should react to
them. Then you need to encode these conditions and the expected
outcomes as +ocft+ test cases. Running +ocft+ is then simple:
1714
1715---------------------------------------
1716# ocft make <RA>
1717# ocft test <RA>
1718---------------------------------------
1719
1720The first subcommand generates the scripts for your test cases
1721whereas the second runs them and checks the outcome.
1722
1723==== +ocft+ configuration file syntax
1724
1725There are four top level options each of which can contain
1726one or more sub-options.
1727
1728===== +CONFIG+ (top level option)
1729
1730This option is global and influences every test case.
1731
1732  ** +AgentRoot+ (sub-option)
1733---------------------------------------
1734AgentRoot /usr/lib/ocf/resource.d/xxx
1735---------------------------------------
1736
Normally, we assume that the resource agent lives under the
+heartbeat+ provider. Use +AgentRoot+ to test an agent that is
distributed by another vendor.
1740
1741  ** +InstallPackage+ (sub-option)
1742---------------------------------------
1743InstallPackage package [package2 [...]]
1744---------------------------------------
1745
1746Install packages necessary for testing. The installation is
1747skipped if the packages have already been installed.
1748
  ** +HangTimeout+ (sub-option)
1750---------------------------------------
1751HangTimeout secs
1752---------------------------------------
1753
The maximum time, in seconds, allowed for a single RA action. If this
timer expires, the action is considered to have failed.
1756
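As an illustration, a +CONFIG+ block combining the three sub-options above might look like the following sketch. The +fortytwo+ provider directory and the +foobar+ package name are carried over from the examples used elsewhere in this guide, the timeout value is an arbitrary assumption, and the two-space indentation of sub-options is a stylistic choice rather than a documented requirement. Real configuration files may need further settings that are not covered here.

---------------------------------------
CONFIG
  AgentRoot /usr/lib/ocf/resource.d/fortytwo
  InstallPackage foobar
  HangTimeout 20
---------------------------------------
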
1757===== +SETUP-AGENT+ (top level option)
1758---------------------------------------
1759SETUP-AGENT
1760  bash commands
1761---------------------------------------
1762
If the RA needs to be initialized before testing, you can put
bash code here for that purpose. The initialization is done only
once. If you need to reinitialize, delete the
+/tmp/.[AGENT_NAME]_set+ stamp file.
1767
1768===== +CASE+ (top level option)
1769---------------------------------------
1770CASE "description"
1771---------------------------------------
1772
1773This is the main building block of the test suite. Each test
1774case is to be described in one +CASE+ top level option.
1775
One case consists of several sub-options, typically followed by the
+RunAgent+ sub-option.
1778
1779  ** +Var+ (sub-option)
1780---------------------------------------
1781Var VARIABLE=value
1782---------------------------------------
1783
Sets an environment variable for the resource agent. These variables
are usually of the form +OCF_RESKEY_xxx+. Note that there must be no
whitespace on either side of the "=" sign.
1787
1788  ** +Unvar+ (sub-option)
1789---------------------------------------
1790Unvar VARIABLE [VARIABLE2 [...]]
1791---------------------------------------
1792
Removes the named environment variables.
1794
1795  ** +Include+ (sub-option)
1796---------------------------------------
1797Include macro_name
1798---------------------------------------
1799
Includes the statements defined in the macro 'macro_name'. See below
for a description of +CASE-BLOCK+.
1802
1803** +Bash+ (sub-option)
1804---------------------------------------
1805Bash bash_codes
1806---------------------------------------
1807
This option sets up the operating system environment: you can insert
arbitrary bash code here to customize the system. Take care not to
make changes from which the system cannot recover.
1811
1812** +BashAtExit+ (sub-option)
1813---------------------------------------
1814BashAtExit bash_codes
1815---------------------------------------
1816
This option restores the operating system environment so that the
next test case can run correctly. Of course, you can also use the
+Bash+ option for that. However, if an error occurs during the test
case, the script quits immediately without running your recovery
code. +BashAtExit+ covers that situation: its code restores the
system environment before the script quits.
1823
1824** +RunAgent+ (sub-option)
1825---------------------------------------
1826RunAgent cmd [ret_value]
1827---------------------------------------
1828
This option runs the resource agent. "cmd" is the action passed to
the resource agent, such as "start", "status", or "stop". The second
parameter is optional: if it is given, the script compares the value
actually returned by the resource agent with this expected value, and
reports a failure if the two differ.
1834
1835It is also possible to execute a suboption on a remote host
1836instead of locally. The protocol used is ssh and the command is
1837run in the background. Just add the +@<ipaddr>+ suffix to the
1838suboption name. For instance:
1839
1840---------------------------------------
1841Bash@192.168.1.100 date
1842---------------------------------------
1843
would run the +date+ program on the host +192.168.1.100+.
1846
NB: It is not yet clear how ssh can be automated, since the testing
environment is not known in advance. One option may be to use
"well-known" host names such as "node2". Also, if the command runs in
the background, it is not clear how its exit code is checked. Finally,
it is an open question whether +Var@node+ makes sense, or whether the
current environment is somehow copied over. This section still needs
an example, and the +ocft+ documentation needs more examples in
general.
1855
1856===== +CASE-BLOCK+ (top level option)
1857---------------------------------------
1858CASE-BLOCK macro_name
1859---------------------------------------
1860
1861The +CASE-BLOCK+ option defines a macro which can be +Include+d
1862in any +CASE+. All +CASE+ suboptions are valid in +CASE-BLOCK+.
1863
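Putting the options described above together, a configuration file for the +foobar+ agent might look like the following sketch. Several details are assumptions made for illustration only: the +datadir+ parameter is taken to be required by +foobar+, the macro names and the +/tmp/foobar-test+ directory are made up, and the expected return values are written as symbolic OCF return code names, which may need to be replaced with whatever form your version of +ocft+ accepts.

---------------------------------------
CONFIG
  AgentRoot /usr/lib/ocf/resource.d/fortytwo
  HangTimeout 20

SETUP-AGENT
  mkdir -p /tmp/foobar-test

CASE-BLOCK required_params
  Var OCF_RESKEY_datadir=/tmp/foobar-test

CASE-BLOCK prepare
  Include required_params
  RunAgent stop

CASE "normal start and stop"
  Include prepare
  RunAgent start OCF_SUCCESS
  RunAgent monitor OCF_SUCCESS
  RunAgent stop OCF_SUCCESS
  RunAgent monitor OCF_NOT_RUNNING

CASE "start fails without datadir"
  Include prepare
  Unvar OCF_RESKEY_datadir
  RunAgent start OCF_ERR_CONFIGURED
---------------------------------------

Each +CASE+ first includes the +prepare+ macro, so that every test starts from a stopped resource with a known set of parameters.
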
1864
1865== Installing and packaging resource agents
1866
1867This section discusses what to do with your resource agent once it is
1868done and tested -- where to install it, and how to include it in either
1869your own application package or in the Linux-HA resource agents
1870repository.
1871
1872=== Installing resource agents
1873
1874If you choose to include your resource agent in your own project, make
1875sure it installs into the correct location. Resource agents should
1876install into the +/usr/lib/ocf/resource.d/<provider>+ directory, where
1877+<provider>+ is the name of your project or any other name you wish to
1878identify the resource agent with.
1879
1880For example, if your +foobar+ resource agent is being packaged as part
1881of a project named +fortytwo+, then the correct full path to your
1882resource agent would be
1883+/usr/lib/ocf/resource.d/fortytwo/foobar+. Make sure your resource
1884agent installs with +0755+ (+-rwxr-xr-x+) permission bits.
1885
1886When installed this way, OCF-compliant cluster resource managers will
1887be able to properly identify, parse, and execute your resource
1888agent. The Pacemaker cluster manager, for example, would map the
1889above-mentioned installation path to the +ocf:fortytwo:foobar+
1890resource type identifier.
1891
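If your project does not already have install rules for the agent, the steps above can be sketched with the standard +install+ utility. The function name +install_resource_agent+ and its optional staging-root argument are hypothetical; they are shown only to illustrate the target path and permission bits.

```shell
#!/bin/bash
# Hypothetical helper: install one agent below the OCF resource.d tree.
#   $1: path to the agent script
#   $2: provider name (for example, fortytwo)
#   $3: optional staging root (DESTDIR style); empty for the live system
install_resource_agent() {
    local agent="$1"
    local provider="$2"
    local destdir="${3:-}"
    local target="${destdir}/usr/lib/ocf/resource.d/${provider}"

    # Create the provider directory (and any missing parents).
    install -d -m 0755 "$target" || return 1

    # Copy the agent into place with -rwxr-xr-x permission bits.
    install -m 0755 "$agent" "$target/$(basename "$agent")"
}
```

For example, +install_resource_agent ./foobar fortytwo+ run as +root+ would place the agent at +/usr/lib/ocf/resource.d/fortytwo/foobar+.
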
1892=== Packaging resource agents
1893
1894When you package resource agents as part of your own project, you
1895should apply the considerations outlined in this section.
1896
1897NOTE: If you instead prefer to submit your resource agent to the
1898Linux-HA resource agents repository, see
1899<<_submitting_resource_agents>> for information on doing so.
1900
1901==== RPM packaging
1902
1903It is recommended to put your OCF resource agent(s) in an RPM
1904sub-package, with the name +<toppackage>-resource-agents+. Ensure that
1905the package owns its provider directory, and depends on the upstream
1906+resource-agents+ package which lays out the directory hierarchy and
1907provides convenience shell functions. An example RPM spec snippet is
1908given below:
1909
1910--------------------------------------------------------------------------
1911%package resource-agents
1912Summary: OCF resource agent for Foobar
1913Group: System Environment/Base
1914Requires: %{name} = %{version}-%{release}, resource-agents
1915
1916%description resource-agents
1917This package contains the OCF-compliant resource agents for Foobar.
1918
1919%files resource-agents
1920%defattr(755,root,root,-)
1921%dir %{_prefix}/lib/ocf/resource.d/fortytwo
1922%{_prefix}/lib/ocf/resource.d/fortytwo/foobar
1923--------------------------------------------------------------------------
1924
1925NOTE: If an RPM spec file contains a +%package+ declaration, then RPM
1926considers this a sub-package which inherits top-level fields such as
1927+Name+, +Version+, +License+, etc. Sub-packages have the top-level
1928package name automatically prepended to their own name. Thus the snippet
1929above would create a sub-package named +foobar-resource-agents+
1930(presuming the package +Name+ is +foobar+).
1931
1932==== Debian packaging
1933
For Debian packages, as for <<_rpm_packaging,RPMs>>, it is
1935recommended to create a separate package holding your resource agents,
1936which then should depend on the +cluster-agents+ package.
1937
1938NOTE: This section assumes that you are packaging with +debhelper+.
1939
1940An example +debian/control+ snippet is given below:
1941
1942--------------------------------------------------------------------------
Package: fortytwo-cluster-agents
1944Priority: extra
1945Architecture: all
1946Depends: cluster-agents
1947Description: OCF-compliant resource agents for Foobar
1948--------------------------------------------------------------------------
1949
1950You will also create a separate +.install+ file. Sticking with the
1951example of installing the +foobar+ resource agent as a sub-package of
1952+fortytwo+, the +debian/fortytwo-cluster-agents.install+ file could
1953consist of the following content:
1954
1955--------------------------------------------------------------------------
1956usr/lib/ocf/resource.d/fortytwo/foobar
1957--------------------------------------------------------------------------
1958
1959=== Submitting resource agents
1960
If you choose not to bundle your resource agent with your own package,
but instead wish to submit it to the upstream
https://github.com/ClusterLabs/resource-agents[ClusterLabs resource
agents repository] on GitHub, please follow the steps outlined in this
section.
1966
1967Create a fork of the
1968https://github.com/ClusterLabs/resource-agents[upstream repository] and
1969clone it with the following commands:
1970
1971--------------------------------------------------------------------------
git clone https://github.com/<your-username>/resource-agents
cd resource-agents
git remote add upstream git@github.com:ClusterLabs/resource-agents.git
git checkout -b <new-branch>
--------------------------------------------------------------------------

Then, copy your resource agent into the +heartbeat+ subdirectory:

--------------------------------------------------------------------------
cd heartbeat
cp /path/to/your/local/copy/of/foobar .
chmod 0755 foobar
cd ..
1983--------------------------------------------------------------------------
1984
1985Next, modify the +Makefile.am+ file in +resource-agents/heartbeat+ and
1986add your new resource agent to the +ocf_SCRIPTS+ list. This will make
1987sure the agent is properly installed.
1988
Lastly, open +Makefile.am+ in +resource-agents/doc/man+ and add
1990+ocf_heartbeat_<name>.7+ to the +man_MANS+ variable. This will
1991automatically generate a resource agent manual page from its metadata,
1992and then install that man page into the correct location.
1993
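The two modifications might look roughly like the excerpts sketched below. The surrounding entries are elided with +...+ and the exact layout of both files is an assumption; follow whatever ordering and line continuation style the existing lists use.

--------------------------------------------------------------------------
# heartbeat/Makefile.am (excerpt, other entries elided)
ocf_SCRIPTS = ... \
              foobar \
              ...

# doc/man/Makefile.am (excerpt, other entries elided)
man_MANS = ... \
           ocf_heartbeat_foobar.7 \
           ...
--------------------------------------------------------------------------
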
Now, add your new resource agent, and the two modifications to the
1995Makefiles, to your changeset:
1996
1997--------------------------------------------------------------------------
1998git add heartbeat/foobar
1999git add heartbeat/Makefile.am
2000git add doc/man/Makefile.am
2001git commit
2002--------------------------------------------------------------------------
2003
2004In your commit message, be sure to include a meaningful description,
2005for example:
2006--------------------------------------------------------------------------
2007High: foobar: new resource agent
2008
2009This new resource agent adds functionality to manage a foobar service.
2010It supports being configured as a primitive or as a master/slave set,
2011and also optionally supports superfrobnication.
2012--------------------------------------------------------------------------
2013
Now push the patch set to your fork on GitHub:
--------------------------------------------------------------------------
git push -u origin <new-branch>
2017--------------------------------------------------------------------------
2018
Create a Pull Request (PR) on GitHub that will be reviewed by the
2020upstream developers.
2021
2022Once your new resource agent has been accepted for merging, one of the
upstream developers will merge the Pull Request into the upstream
2024repository. At that point, you can update your master branch from
2025upstream, and remove your own branch.
2026
2027--------------------------------------------------------------------------
2028git checkout master
2029git fetch upstream
2030git merge upstream/master
git branch -D <new-branch>
2032--------------------------------------------------------------------------
2033
2034=== Maintaining resource agents
2035
2036If you maintain a specific resource agent, or you are making repeated
2037contributions to the codebase, it's usually a good idea to maintain
2038your own _fork_ of the +ClusterLabs/resource-agents+ repository on
2039GitHub.
2040
2041To do so,
2042
2043* https://github.com/signup[Create a GitHub account] if you do not
2044  have one already.
2045* http://help.github.com/fork-a-repo/[Fork] the
2046  https://github.com/ClusterLabs/resource-agents[+resource-agents+
2047  repository].
2048* Clone your personal fork into a local working copy.
2049
2050As you work on resource agents, *please* commit early, and commit
2051often. You can always fold commits later with +git rebase -i+.
2052
2053Once you have made a number of changes that you would like others to
2054review, push them to your GitHub fork and send a post to the
2055+linux-ha-dev+ mailing list pointing people to it.
2056
2057After the review is done, fix up your tree with any requested changes,
2058and then issue a pull request. There are two ways of doing so:
2059
2060* You can use the +git request-pull+ utility to get a pre-populated
2061  email skeleton summarizing your changesets. Add any information you
2062  see fit, and send it to the list. It is a good idea to prefix your
2063  email subject with +[GIT PULL]+ so upstream maintainers can pick the
2064  message out easily.
2065
2066* You can also issue a pull request directly on GitHub. GitHub
2067  automatically notifies upstream maintainers about new pull requests
2068  by email. Please refer to
2069  http://help.github.com/send-pull-requests/[github:help] for details
2070  on initiating pull requests.
2071
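For the first approach, the email skeleton could be generated as sketched below. This assumes that a remote named +upstream+ points to the +ClusterLabs/resource-agents+ repository, and that +<branch>+ has already been pushed to your fork; the fork URL is a placeholder.

--------------------------------------------------------------------------
git fetch upstream
git request-pull upstream/master \
    https://github.com/<your-username>/resource-agents <branch>
--------------------------------------------------------------------------

The command prints a summary of the changes between +upstream/master+ and +<branch>+ to standard output, which you can paste into your email.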