
README

#summary HASTMON -- cluster monitoring daemon.

= Introduction. =

HASTMON is a monitoring daemon that allows a group of hosts to run a
service with automatic failover. These machines are called a cluster,
and each machine is a cluster node. HASTMON is designed for clusters
that work in a Primary-Secondary configuration, which means that only
one of the cluster nodes can be active at any given time. The active
node is called the Primary node; this is the node that runs the
service. The other nodes run as Secondary, ensuring that the service
is not started there. There should also be at least one node acting
as a watchdog: it periodically checks the status of all nodes and
sends complaints to the Secondary nodes if the Primary is not
available. A Secondary node decides to change its role to Primary
when two conditions are met: there is no connection from the Primary,
and there are complaints from the watchdog.
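
The promotion rule above can be sketched as a small shell function.
This is illustrative only, not hastmon source code: the function name,
its arguments, and the threshold handling are assumptions made for the
sketch (the threshold value mirrors the `hastmonctl status` output in
the Examples section).

```shell
#!/bin/sh
# Illustrative sketch of the secondary's promotion rule: become primary
# only when BOTH conditions hold -- the connection from the primary is
# gone AND the watchdog has complained enough times within the window.
# Names and arguments are hypothetical, not hastmon's API.
should_become_primary() {
    primary_connected=$1   # 1 if the connection from the primary is alive
    complaints=$2          # watchdog complaints received in the last window
    threshold=$3           # complaint threshold (e.g. 3)
    [ "$primary_connected" -eq 0 ] && [ "$complaints" -ge "$threshold" ]
}
```

Note that losing the connection alone is not enough: without the
watchdog's complaints a network split between Primary and Secondary
would otherwise produce two Primaries.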

Most of HASTMON's code was taken from the
[http://wiki.freebsd.org/HAST FreeBSD HAST project].
It was developed as a monitoring daemon for a HAST cluster, but it can
be used for other setups.

This software is developed and tested under FreeBSD. Since version
0.3.0 it is supposed to work on NetBSD and Linux too.

= Installation. =

Since version 0.3.0, when support for several platforms was added,
building hastmon requires
[http://sourceforge.net/projects/mk-configure/ mk-configure]
to be installed. mk-configure uses
[http://www.crufty.net/help/sjg/bmake.html bmake] (NetBSD make).

mk-configure and bmake are already packaged on some platforms
(FreeBSD, NetBSD, some Linux distros); if that is not your case, go to
the [http://sourceforge.net/projects/mk-configure/ mk-configure page],
download the sources, and follow the installation instructions.

Once mk-configure is installed, hastmon can be built and installed by
running the following commands:

{{{
cd <path to hastmon sources>
mkcmake
mkcmake install
}}}
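
Before building, it may help to confirm that the prerequisites are on
the PATH. The helper below is just a sketch (the function is not part
of hastmon or mk-configure); the tool names come from the instructions
above.

```shell
#!/bin/sh
# Sketch: verify that each named build tool is available on PATH.
# Lists the missing ones on stderr and returns nonzero if any are
# absent, so a build wrapper can stop early.
require_tools() {
    missing=0
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing: $tool" >&2
            missing=1
        }
    done
    return $missing
}

# Usage before building hastmon:
# require_tools mkcmake bmake || exit 1
```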

= Configuration. =

There should be at least 3 nodes: two that run the service, acting as
Primary and Secondary, and one Watchdog node. The node configuration
is stored in the /etc/hastmon.conf file, which is designed so that
exactly the same file can be (and should be) used on all nodes.
HASTMON can monitor several resources. For every resource a script
must be provided that is used to start/stop the resource and check
its status. See hastmon.conf(5) and the Examples section below for
how to write the configuration file and rc script.

After the nodes are started, their roles are set with the hastmonctl
utility. This utility is also used to check the current status of the
cluster.
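
Since exactly the same file should be used on all nodes, a quick
checksum comparison can catch drift. The helper below is a local
sketch (not a hastmon tool); in practice the second copy would be
fetched from another node, e.g. by running `cksum` over ssh.

```shell
#!/bin/sh
# Sketch: compare two copies of a config file by checksum.
# Returns 0 when the contents are identical, nonzero otherwise.
# Example against a remote node (hostname from the example below):
#   ssh bolek cksum < /etc/hastmon.conf
same_file() {
    a=`cksum < "$1"`
    b=`cksum < "$2"`
    [ "$a" = "$b" ]
}
```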

= Examples. =

In this example two resources will be set up: one is some
application/daemon that may run on only one server, and the other is
a HAST cluster that provides NFS storage.

The cluster runs on three nodes: lolek and bolek, which run the
services (resources) and act as Primary-Secondary, and reksio, which
acts as Watchdog.

Configuration file /etc/hastmon.conf is the same on all nodes:
{{{
resource daemon {
        exec /etc/daemon.sh
        friends lolek bolek reksio

        on lolek {
                remote tcp4://bolek
                priority 0
        }
        on bolek {
                remote tcp4://lolek
                priority 1
        }
        on reksio {
                remote tcp4://lolek tcp4://bolek
        }
}

resource storage {
        exec /etc/storage.sh
        friends lolek bolek reksio

        on lolek {
                remote tcp4://bolek
                priority 0
        }
        on bolek {
                remote tcp4://lolek
                priority 1
        }
        on reksio {
                remote tcp4://lolek tcp4://bolek
        }
}
}}}
Exec scripts, which are stored on all three nodes:

/etc/daemon.sh:
{{{
#!/bin/sh

DAEMON=/etc/rc.d/lpd # :-)

case $1 in
    start)
        ${DAEMON} onestart
        ;;
    stop)
        ${DAEMON} onestop
        ;;
    status)
        ${DAEMON} onestatus
        ;;
    role|connect|disconnect|complain)
        exit 0
        ;;
    *)
        echo "usage: $0 stop|start|status|role|connect|disconnect|complain"
        exit 1
        ;;
esac
}}}

/etc/storage.sh is more complicated:
{{{
#!/bin/sh

PROV=storage
POOL=storage
IF=em0
IP=172.20.68.100
FS=storage/test
MOUNTPOINT=/storage/test
DEV="/dev/hast/${PROV}"

HAST=/etc/rc.d/hastd
HASTCTL=/sbin/hastctl
ZPOOL=/sbin/zpool
ZFS=/sbin/zfs
IFCONFIG=/sbin/ifconfig
MOUNTD=/etc/rc.d/mountd
NFSD=/etc/rc.d/nfsd
MOUNT=/sbin/mount

RUN=0
STOPPED=1
UNKNOWN=2

progname=$(basename $0)
start()
{
    logger -p local0.debug -t "${progname}[$$]" "Starting $PROV..."
    # Check if hastd is started and start it if it is not.
    "${HAST}" onestatus || "${HAST}" onestart

    # If there is a secondary worker process, the remote primary process
    # is still running. We have to wait for it to terminate.
    for i in `jot 30`; do
        pgrep -f "hastd: ${PROV} \(secondary\)" >/dev/null 2>&1 || break
        sleep 1
    done
    if pgrep -f "hastd: ${PROV} \(secondary\)" >/dev/null 2>&1; then
        logger -p local0.error -t "${progname}[$$]" \
            "Secondary process for resource ${PROV} is still running after 30 seconds."
        exit 1
    fi
    logger -p local0.debug -t "${progname}[$$]" "Secondary process is not running."

    # Change role to primary for our resource.
    out=`${HASTCTL} role primary "${PROV}" 2>&1`
    if [ $? -ne 0 ]; then
        logger -p local0.error -t "${progname}[$$]" \
            "Unable to change role to primary for resource ${PROV}: ${out}."
        exit 1
    fi

    # Wait a few seconds for the provider to appear.
    for i in `jot 50`; do
        [ -c "${DEV}" ] && break
        sleep 0.1
    done
    if [ ! -c "${DEV}" ]; then
        logger -p local0.error -t "${progname}[$$]" "Device ${DEV} didn't appear."
        exit 1
    fi
    logger -p local0.debug -t "${progname}[$$]" "Role for resource ${PROV} changed to primary."

    # Import the ZFS pool. Do it forcibly as it remembers the hostid of
    # the other cluster node. Before importing, check the current status
    # of zfs: the script may be called a second time by hastmon (because
    # not all operations were successful on the first run) and zfs may
    # already be here.

    "${ZPOOL}" list | egrep -q "^${POOL} "
    if [ $? -ne 0 ]; then
        out=`"${ZPOOL}" import -f "${POOL}" 2>&1`
        if [ $? -ne 0 ]; then
            logger -p local0.error -t "${progname}[$$]" \
                "ZFS pool import for resource ${PROV} failed: ${out}."
            exit 1
        fi
        logger -p local0.debug -t "${progname}[$$]" "ZFS pool for resource ${PROV} imported."
    fi

    "${ZFS}" mount | egrep -q "^${FS} "
    if [ $? -ne 0 ]; then
        out=`"${ZFS}" mount "${FS}" 2>&1`
        if [ $? -ne 0 ]; then
            logger -p local0.error -t "${progname}[$$]" \
                "ZFS mount for ${FS} failed: ${out}."
            exit 1
        fi
        logger -p local0.debug -t "${progname}[$$]" "ZFS ${FS} mounted."
    fi

    "${IFCONFIG}" "${IF}" alias "${IP}" netmask 0xffffffff

    out=`"${MOUNTD}" onerestart 2>&1`
    if [ $? -ne 0 ]; then
        logger -p local0.error -t "${progname}[$$]" \
            "Can't start mountd: ${out}."
        exit 1
    fi

    out=`"${NFSD}" onerestart 2>&1`
    if [ $? -ne 0 ]; then
        logger -p local0.error -t "${progname}[$$]" \
            "Can't start nfsd: ${out}."
        exit 1
    fi

    logger -p local0.debug -t "${progname}[$$]" "NFS started."
}

stop()
{
    logger -p local0.debug -t "${progname}[$$]" "Stopping $PROV..."

    # Kill the start script if it is still running in the background.
    sig="TERM"
    for i in `jot 30`; do
        pgid=`pgrep -f '/etc/storage.sh start' | head -1`
        [ -n "${pgid}" ] || break
        kill -${sig} -- -${pgid}
        sig="KILL"
        sleep 1
    done
    if [ -n "${pgid}" ]; then
        logger -p local0.error -t "${progname}[$$]" \
            "'/etc/storage.sh start' process for resource ${PROV} is still running after 30 seconds."
        exit 1
    fi
    logger -p local0.debug -t "${progname}[$$]" "'/etc/storage.sh start' is not running."

    "${NFSD}" onestop
    "${MOUNTD}" onestop

    "${IFCONFIG}" "${IF}" -alias "${IP}" netmask 0xffffffff

    if "${HAST}" onestatus; then
        "${ZPOOL}" list | egrep -q "^${POOL} "
        if [ $? -eq 0 ]; then
            # Forcibly export the pool.
            out=`${ZPOOL} export -f "${POOL}" 2>&1`
            if [ $? -ne 0 ]; then
                logger -p local0.error -t "${progname}[$$]" \
                    "Unable to export pool for resource ${PROV}: ${out}."
                exit 1
            fi
            logger -p local0.debug -t "${progname}[$$]" \
                "ZFS pool for resource ${PROV} exported."
        fi
    else
        "${HAST}" onestart
    fi

    # Change role to secondary for our resource.
    out=`${HASTCTL} role secondary "${PROV}" 2>&1`
    if [ $? -ne 0 ]; then
        logger -p local0.error -t "${progname}[$$]" \
            "Unable to change role to secondary for resource ${PROV}: ${out}."
        exit 1
    fi
    logger -p local0.debug -t "${progname}[$$]" \
        "Role for resource ${PROV} changed to secondary."

    logger -p local0.info -t "${progname}[$$]" \
        "Successfully switched to secondary for resource ${PROV}."
}

status()
{
    "${HASTCTL}" status "${PROV}" |
    grep -q '^ *role: *primary *$' &&
    "${ZFS}" list "${POOL}" > /dev/null 2>&1 &&
    "${MOUNT}" | grep -q "${MOUNTPOINT}" &&
    "${NFSD}" onestatus > /dev/null 2>&1 &&
    "${MOUNTD}" onestatus > /dev/null 2>&1 &&
    return ${RUN}

    "${HASTCTL}" status "${PROV}" |
    grep -q '^ *role: *secondary *$' &&
    return ${STOPPED}

    return ${UNKNOWN}
}

case $1 in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        status
        ;;
    role|connect|disconnect|complain)
        exit 0
        ;;
    *)
        echo "usage: $0 stop|start|status|role|connect|disconnect|complain"
        exit 1
        ;;
esac
}}}

Start the hastmon daemon and set up roles on all hosts:
{{{
lolek# hastmon
lolek# hastmonctl role primary all

bolek# hastmon
bolek# hastmonctl role secondary all

reksio# hastmon
reksio# hastmonctl role watchdog all
}}}

Check the nodes' status:
{{{
lolek# hastmonctl status
daemon:
  role: primary
  exec: /etc/daemon.sh
  remoteaddr: tcp4://bolek(connected)
  state: run
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec
storage:
  role: primary
  exec: /etc/storage.sh
  remoteaddr: tcp4://bolek(connected)
  state: run
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec

bolek# hastmonctl status
daemon:
  role: secondary
  exec: /etc/daemon.sh
  remoteaddr: tcp4://lolek(connected)
  state: stopped
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec
storage:
  role: secondary
  exec: /etc/storage.sh
  remoteaddr: tcp4://lolek(connected)
  state: stopped
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec

reksio# hastmonctl status
daemon:
  role: watchdog
  exec: /etc/daemon.sh
  remoteaddr: tcp4://lolek(primary/run) tcp4://bolek(secondary/stopped)
  state: run
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec
storage:
  role: watchdog
  exec: /etc/storage.sh
  remoteaddr: tcp4://lolek(primary/run) tcp4://bolek(secondary/stopped)
  state: run
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec
}}}
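
A simple failover drill, assuming the example cluster above (this is a
hypothetical session, not from the hastmon documentation; the timing
depends on the heartbeat interval and complaint threshold shown in the
status output):

{{{
# Stop the daemon on the primary:
lolek# pkill hastmon

# Once the connection from lolek is lost and reksio's complaints
# exceed the threshold, bolek should take over:
bolek# hastmonctl status
}}}

After the takeover, bolek's resources should report role primary and
state run.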

README.md

# hastmon
Automatically exported from code.google.com/p/hastmon

HASTMON is a monitoring daemon that allows a group of hosts to
run a service with automatic failover. These machines are called
a cluster, and each machine is a cluster node. HASTMON is designed
for clusters that work in a Primary-Secondary configuration, which
means that only one of the cluster nodes can be active at any
given time. See README for details.