[Linux-HA] STONITH reincarnation pause
jgorman at mmspos.com
Thu Dec 13 07:28:34 MST 2007
I have successfully set up Heartbeat / Xen / LVM / DRBD / LVM
to run three Xen guests on each of a pair of machines. Each
group of guests can fail over to the sister machine as a unit.
The machines are named a1xen and b1xen.
Each group of three Xen guests consists of two MySQL servers
in master-master replication and an application server running
the Point of Sale software that sales clerks log into.
I have instrumented my resource scripts to show what happens
when one node fails. This is the log on b1xen:
Thu Dec 13 01:18:44 AST 2007 stonith reset a1xen
Thu Dec 13 01:18:48 AST 2007 drbddisk vga start
Thu Dec 13 01:18:48 AST 2007 lvm start VolGroupA
Thu Dec 13 01:18:51 AST 2007 xen start a1my1
Thu Dec 13 01:18:51 AST 2007 xen start a1my2
Thu Dec 13 01:18:51 AST 2007 xen start a1asp
Thu Dec 13 01:20:02 AST 2007 xen stop a1my1
Thu Dec 13 01:20:03 AST 2007 xen stop a1my2
Thu Dec 13 01:20:03 AST 2007 xen stop a1asp
Thu Dec 13 01:20:38 AST 2007 lvm stop VolGroupA
Thu Dec 13 01:20:39 AST 2007 drbddisk vga stop
Node a1xen really did fail: I have flaky hardware to
test with for this purpose. Node b1xen correctly
fenced a1xen and took over its services. After a1xen
rebooted, the services correctly migrated back to it.
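
For anyone who wants to check the timings in the log above, here is
a minimal Python sketch that parses it and reports the intervals.
The helper names (parse_line, when) are made up for illustration,
and the time zone field is ignored because every entry carries the
same one.

```python
from datetime import datetime

# The log from b1xen, copied verbatim from this message.
LOG = """\
Thu Dec 13 01:18:44 AST 2007 stonith reset a1xen
Thu Dec 13 01:18:48 AST 2007 drbddisk vga start
Thu Dec 13 01:18:48 AST 2007 lvm start VolGroupA
Thu Dec 13 01:18:51 AST 2007 xen start a1my1
Thu Dec 13 01:18:51 AST 2007 xen start a1my2
Thu Dec 13 01:18:51 AST 2007 xen start a1asp
Thu Dec 13 01:20:02 AST 2007 xen stop a1my1
Thu Dec 13 01:20:03 AST 2007 xen stop a1my2
Thu Dec 13 01:20:03 AST 2007 xen stop a1asp
Thu Dec 13 01:20:38 AST 2007 lvm stop VolGroupA
Thu Dec 13 01:20:39 AST 2007 drbddisk vga stop
"""


def parse_line(line):
    """Split one log line into (timestamp, event text)."""
    # Fields: weekday, month, day, time, zone, year, then the event.
    parts = line.split(None, 6)
    stamp = " ".join(parts[1:4] + [parts[5]])  # zone field is dropped
    return datetime.strptime(stamp, "%b %d %H:%M:%S %Y"), parts[6]


def when(events, text):
    """Return the timestamp of the first event whose text matches exactly."""
    for stamp, event in events:
        if event == text:
            return stamp
    raise KeyError(text)


events = [parse_line(line) for line in LOG.splitlines() if line.strip()]

fenced = when(events, "stonith reset a1xen")
all_started = when(events, "xen start a1asp")
first_stop = when(events, "xen stop a1my1")
all_stopped = when(events, "drbddisk vga stop")

takeover = int((all_started - fenced).total_seconds())
borrowed = int((first_stop - all_started).total_seconds())
total = int((all_stopped - fenced).total_seconds())

print("fence to last guest started:", takeover, "s")
print("guests up on b1xen before failback:", borrowed, "s")
print("fence to failback complete:", total, "s")
```

On this log it reports 7 seconds from the fence to the last guest
start, 71 seconds of service on b1xen before the first guest was
stopped again, and 115 seconds from the fence to the end of the
failback.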
Here is the problem: taking over the services right
away like this doesn't achieve anything except to
bounce the MySQL servers and irritate the users who
log in only to be dumped again one minute later.
What I am looking for is a way to tell the surviving
node to reset the sick node and wait a while to see
if it will come back before taking over its services.
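
To make the idea concrete, here is a rough Python sketch of the
"reset, then wait a while" behaviour. It is not an existing
Heartbeat feature, and where such a check would hook into the
takeover is left open. The names wait_for_peer and tcp_probe, the
grace period, and the choice of a TCP connect as the "is it back?"
test are all assumptions of mine.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Build a check that succeeds when host accepts a TCP connection."""
    def probe():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
    return probe


def wait_for_peer(peer_is_up, grace_seconds, poll_seconds=5.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll peer_is_up() until it succeeds or the grace period expires.

    Returns True if the peer came back in time (no takeover needed),
    False if the grace period ran out (go ahead and take over).
    """
    deadline = clock() + grace_seconds
    while True:
        if peer_is_up():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_seconds, remaining))
```

After the STONITH reset, the survivor could call something like
wait_for_peer(tcp_probe("a1xen", 22), grace_seconds=120) and start
the takeover only if it returns False. Port 22 and the 120 second
grace period are placeholders. The grace period is a trade-off:
it is added to the outage whenever the sick node does not return.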
Master Merchant Systems
P.S. I wrote a nice external/ippower9258 stonith script
to support the IPPower network power controller family.
Is there somewhere I should submit it so that other
people can use it?