README
#summary HASTMON -- cluster monitoring daemon.

= Introduction. =

HASTMON is a monitoring daemon that allows a group of hosts to run a
service with automatic failover. These machines are called a cluster,
and each machine is one cluster node. HASTMON is designed for
clusters that work in a Primary-Secondary configuration, which means
that only one of the cluster nodes can be active at any given
time. The active node is called the Primary node. This is the node
that runs the service. The other nodes run as Secondary, ensuring
that the service is not started there. There should also be at least
one node acting as a watchdog -- it periodically checks the status of
all nodes and sends complaints to the Secondary nodes if the Primary
is not available. A Secondary node decides to change its role to
Primary when two conditions are met: there is no connection from the
Primary, and there are complaints from the watchdog.
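As a sketch, the promotion rule can be written out in shell
(illustrative only -- the variable names are invented and the real
decision is made inside the hastmon daemon):

```shell
#!/bin/sh
# Sketch of the Secondary's promotion rule (not actual hastmon code).
# Both conditions must hold before the Secondary becomes Primary.
primary_connected=0        # 1 if the connection from the Primary is alive
watchdog_complaints=3      # complaints received within the time window
complaint_threshold=3      # complaints required before acting

if [ "${primary_connected}" -eq 0 ] &&
   [ "${watchdog_complaints}" -ge "${complaint_threshold}" ]; then
	echo "promote to primary"
else
	echo "stay secondary"
fi
```

Either condition alone is not enough: a lost connection could be a
network problem on the Secondary's side, and complaints alone mean
the watchdog may simply be partitioned from the Primary.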

Most of HASTMON's code was taken from the
[http://wiki.freebsd.org/HAST FreeBSD HAST project].
It was developed as a monitoring daemon for HAST clusters but can be
used for other setups.

This software is developed and tested under FreeBSD. Since version
0.3.0 it is supposed to work on NetBSD and Linux too.
= Installation. =

Since version 0.3.0, when support for several platforms was added,
hastmon requires
[http://sourceforge.net/projects/mk-configure/ mk-configure]
to be built and installed. mk-configure uses
[http://www.crufty.net/help/sjg/bmake.html bmake] (NetBSD make).

mk-configure and bmake are already packaged on some platforms
(FreeBSD, NetBSD, some Linux distros); if that is not your case, go
to the [http://sourceforge.net/projects/mk-configure/ mk-configure page],
download the sources and read the installation instructions.

Once mk-configure is installed, hastmon can be built and installed by
running the following commands:

{{{
cd <path to hastmon sources>
mkcmake
mkcmake install
}}}
48
= Configuration. =

There should be at least 3 nodes: two that run the service, acting as
Primary and Secondary, and one Watchdog node. The nodes'
configuration is stored in the /etc/hastmon.conf file, which is
designed so that exactly the same file can be (and should be) used on
all nodes. HASTMON can monitor several resources. For every resource
a script should be provided that will be used to start/stop the
resource and check its status. See hastmon.conf(5) and the Examples
section below for how to write the configuration file and the rc
script.

After the nodes are started, their roles are set using the hastmonctl
utility. This utility is also used to check the current status of the
cluster.
63
= Examples. =

In this example two resources will be set up -- one is an
application/daemon that may run on only one server, and the other is
a HAST cluster that provides NFS storage.

The cluster runs on three nodes: lolek and bolek, which run the
services (resources) and act as Primary-Secondary, and reksio, which
acts as Watchdog.

The configuration file /etc/hastmon.conf is the same on all nodes:
{{{
resource daemon {
	exec /etc/daemon.sh
	friends lolek bolek reksio

	on lolek {
		remote tcp4://bolek
		priority 0
	}
	on bolek {
		remote tcp4://lolek
		priority 1
	}
	on reksio {
		remote tcp4://lolek tcp4://bolek
	}
}

resource storage {
	exec /etc/storage.sh
	friends lolek bolek reksio

	on lolek {
		remote tcp4://bolek
		priority 0
	}
	on bolek {
		remote tcp4://lolek
		priority 1
	}
	on reksio {
		remote tcp4://lolek tcp4://bolek
	}
}
}}}
The exec scripts, which are stored on all three nodes:

/etc/daemon.sh:
{{{
#!/bin/sh

DAEMON=/etc/rc.d/lpd # :-)

case $1 in
	start)
		${DAEMON} onestart
		;;
	stop)
		${DAEMON} onestop
		;;
	status)
		${DAEMON} onestatus
		;;
	role|connect|disconnect|complain)
		exit 0
		;;
	*)
		echo "usage: $0 stop|start|status|role|connect|disconnect|complain"
		exit 1
		;;
esac
}}}
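Before wiring an exec script into hastmon it is worth driving each
operation by hand. The sketch below does this against a stand-in
shell function (a hypothetical replacement for /etc/daemon.sh, so the
example is self-contained):

```shell
#!/bin/sh
# Drive an exec script through the operations hastmon invokes.
# exec_script stands in for /etc/daemon.sh; substitute the real
# path when testing on a node.
exec_script() {
	case $1 in
		start|stop|role|connect|disconnect|complain) return 0 ;;
		status) return 1 ;;   # pretend the resource is stopped
		*) return 2 ;;
	esac
}

for op in start status stop complain; do
	exec_script "${op}"
	echo "${op}: exit $?"
done
```

A Secondary node is expected to report a nonzero status exit code, so
`status: exit 1` here is the healthy answer for a stopped resource.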

/etc/storage.sh is more complicated:
{{{
#!/bin/sh

PROV=storage
POOL=storage
IF=em0
IP=172.20.68.100
FS=storage/test
MOUNTPOINT=/storage/test
DEV="/dev/hast/${PROV}"

HAST=/etc/rc.d/hastd
HASTCTL=/sbin/hastctl
ZPOOL=/sbin/zpool
ZFS=/sbin/zfs
IFCONFIG=/sbin/ifconfig
MOUNTD=/etc/rc.d/mountd
NFSD=/etc/rc.d/nfsd
MOUNT=/sbin/mount

RUN=0
STOPPED=1
UNKNOWN=2

progname=$(basename $0)

start()
{
	logger -p local0.debug -t "${progname}[$$]" "Starting $PROV..."
	# Check if hastd is started and start it if it is not.
	"${HAST}" onestatus || "${HAST}" onestart

	# If there is a secondary worker process, the remote primary
	# process is still running. We have to wait for it to terminate.
	for i in `jot 30`; do
		pgrep -f "hastd: ${PROV} \(secondary\)" >/dev/null 2>&1 || break
		sleep 1
	done
	if pgrep -f "hastd: ${PROV} \(secondary\)" >/dev/null 2>&1; then
		logger -p local0.error -t "${progname}[$$]" \
		    "Secondary process for resource ${PROV} is still running after 30 seconds."
		exit 1
	fi
	logger -p local0.debug -t "${progname}[$$]" "Secondary process is not running."

	# Change role to primary for our resource.
	out=`${HASTCTL} role primary "${PROV}" 2>&1`
	if [ $? -ne 0 ]; then
		logger -p local0.error -t "${progname}[$$]" \
		    "Unable to change role to primary for resource ${PROV}: ${out}."
		exit 1
	fi

	# Wait a few seconds for the provider to appear.
	for i in `jot 50`; do
		[ -c "${DEV}" ] && break
		sleep 0.1
	done
	if [ ! -c "${DEV}" ]; then
		logger -p local0.error -t "${progname}[$$]" "Device ${DEV} didn't appear."
		exit 1
	fi
	logger -p local0.debug -t "${progname}[$$]" "Role for resource ${PROV} changed to primary."

	# Import the ZFS pool. Do it forcibly as it remembers the hostid
	# of the other cluster node. Before importing we check the
	# current status of zfs: it might be that the script is called a
	# second time by hastmon (because not all operations were
	# successful on the first run) and zfs is already here.

	"${ZPOOL}" list | egrep -q "^${POOL} "
	if [ $? -ne 0 ]; then
		out=`"${ZPOOL}" import -f "${POOL}" 2>&1`
		if [ $? -ne 0 ]; then
			logger -p local0.error -t "${progname}[$$]" \
			    "ZFS pool import for resource ${PROV} failed: ${out}."
			exit 1
		fi
		logger -p local0.debug -t "${progname}[$$]" "ZFS pool for resource ${PROV} imported."
	fi

	"${ZFS}" mount | egrep -q "^${FS} "
	if [ $? -ne 0 ]; then
		out=`"${ZFS}" mount "${FS}" 2>&1`
		if [ $? -ne 0 ]; then
			logger -p local0.error -t "${progname}[$$]" \
			    "ZFS mount for ${FS} failed: ${out}."
			exit 1
		fi
		logger -p local0.debug -t "${progname}[$$]" "ZFS ${FS} mounted."
	fi

	"${IFCONFIG}" "${IF}" alias "${IP}" netmask 0xffffffff

	out=`"${MOUNTD}" onerestart 2>&1`
	if [ $? -ne 0 ]; then
		logger -p local0.error -t "${progname}[$$]" \
		    "Can't start mountd: ${out}."
		exit 1
	fi

	out=`"${NFSD}" onerestart 2>&1`
	if [ $? -ne 0 ]; then
		logger -p local0.error -t "${progname}[$$]" \
		    "Can't start nfsd: ${out}."
		exit 1
	fi

	logger -p local0.debug -t "${progname}[$$]" "NFS started."
}

stop()
{
	logger -p local0.debug -t "${progname}[$$]" "Stopping $PROV..."

	# Kill the start script if it is still running in the background.
	sig="TERM"
	for i in `jot 30`; do
		pgid=`pgrep -f '/etc/storage.sh start' | head -1`
		[ -n "${pgid}" ] || break
		kill -${sig} -- -${pgid}
		sig="KILL"
		sleep 1
	done
	if [ -n "${pgid}" ]; then
		logger -p local0.error -t "${progname}[$$]" \
		    "'/etc/storage.sh start' process for resource ${PROV} is still running after 30 seconds."
		exit 1
	fi
	logger -p local0.debug -t "${progname}[$$]" "'/etc/storage.sh start' is not running."

	"${NFSD}" onestop
	"${MOUNTD}" onestop

	"${IFCONFIG}" "${IF}" -alias "${IP}"

	if "${HAST}" onestatus; then
		"${ZPOOL}" list | egrep -q "^${POOL} "
		if [ $? -eq 0 ]; then
			# Forcibly export the pool.
			out=`${ZPOOL} export -f "${POOL}" 2>&1`
			if [ $? -ne 0 ]; then
				logger -p local0.error -t "${progname}[$$]" \
				    "Unable to export the pool for resource ${PROV}: ${out}."
				exit 1
			fi
			logger -p local0.debug -t "${progname}[$$]" \
			    "ZFS pool for resource ${PROV} exported."
		fi
	else
		"${HAST}" onestart
	fi

	# Change role to secondary for our resource.
	out=`${HASTCTL} role secondary "${PROV}" 2>&1`
	if [ $? -ne 0 ]; then
		logger -p local0.error -t "${progname}[$$]" \
		    "Unable to change role to secondary for resource ${PROV}: ${out}."
		exit 1
	fi
	logger -p local0.debug -t "${progname}[$$]" \
	    "Role for resource ${PROV} changed to secondary."

	logger -p local0.info -t "${progname}[$$]" \
	    "Successfully switched to secondary for resource ${PROV}."
}

status()
{
	"${HASTCTL}" status "${PROV}" |
	    grep -q '^ *role: *primary *$' &&
	    "${ZFS}" list "${POOL}" > /dev/null 2>&1 &&
	    "${MOUNT}" | grep -q "${MOUNTPOINT}" &&
	    "${NFSD}" onestatus > /dev/null 2>&1 &&
	    "${MOUNTD}" onestatus > /dev/null 2>&1 &&
	    return ${RUN}

	"${HASTCTL}" status "${PROV}" |
	    grep -q '^ *role: *secondary *$' &&
	    return ${STOPPED}

	return ${UNKNOWN}
}

case $1 in
	start)
		start
		;;
	stop)
		stop
		;;
	status)
		status
		;;
	role|connect|disconnect|complain)
		exit 0
		;;
	*)
		echo "usage: $0 stop|start|status|role|connect|disconnect|complain"
		exit 1
		;;
esac
}}}
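hastmon infers a resource's state from the exit code of the script's
status operation; the RUN/STOPPED/UNKNOWN values in /etc/storage.sh
encode that convention (0 means running, 1 stopped, anything else
unknown). A minimal self-contained sketch of the mapping, with a
stand-in for `/etc/storage.sh status`:

```shell
#!/bin/sh
# Map a status exit code to a state name, mirroring the
# RUN/STOPPED/UNKNOWN constants above. fake_status is a stand-in
# for "/etc/storage.sh status" on a Secondary node.
fake_status() { return 1; }   # pretend the resource is stopped

fake_status
case $? in
	0) echo "run" ;;
	1) echo "stopped" ;;
	*) echo "unknown" ;;
esac
```

This is the same mapping that shows up as the `state:` field in the
hastmonctl status output below.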

Start the hastmon daemon and set up the roles on all hosts:
{{{
lolek# hastmon
lolek# hastmonctl role primary all

bolek# hastmon
bolek# hastmonctl role secondary all

reksio# hastmon
reksio# hastmonctl role watchdog all
}}}

Check the nodes' status:
{{{
lolek# hastmonctl status
daemon:
  role: primary
  exec: /etc/daemon.sh
  remoteaddr: tcp4://bolek(connected)
  state: run
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec
storage:
  role: primary
  exec: /etc/storage.sh
  remoteaddr: tcp4://bolek(connected)
  state: run
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec

bolek# hastmonctl status
daemon:
  role: secondary
  exec: /etc/daemon.sh
  remoteaddr: tcp4://lolek(connected)
  state: stopped
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec
storage:
  role: secondary
  exec: /etc/storage.sh
  remoteaddr: tcp4://lolek(connected)
  state: stopped
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec

reksio# hastmonctl status
daemon:
  role: watchdog
  exec: /etc/daemon.sh
  remoteaddr: tcp4://lolek(primary/run) tcp4://bolek(secondary/stopped)
  state: run
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec
storage:
  role: watchdog
  exec: /etc/storage.sh
  remoteaddr: tcp4://lolek(primary/run) tcp4://bolek(secondary/stopped)
  state: run
  attempts: 0 from 5
  complaints: 0 for last 60 sec (threshold 3)
  heartbeat: 10 sec
}}}

README.md
# hastmon
Automatically exported from code.google.com/p/hastmon

HASTMON is a monitoring daemon that allows a group of hosts to
run a service with automatic failover. These machines are called
a cluster, and each machine is one cluster node. HASTMON is
designed for clusters that work in a Primary-Secondary
configuration, which means that only one of the cluster nodes can
be active at any given time. See README for details.