[Linux-ha-dev] Re: [Linux-HA] Recovering from "unexpected bad
things" - is STONITH the answer?
Robert Wipfel
rawipfel at novell.com
Wed Nov 7 18:45:47 MST 2007
>>> On Wed, Nov 7, 2007 at 6:25 PM, in message <47326571.4070008 at fitterer.org>,
Yan Fitterer <yan at fitterer.org> wrote:
>>> In addition, I have been thinking of complementing this mechanism with a
>>> disk-based "STONITH" (otherwise known as "poison pill"...) so that the
>>> unreachable node may (if things aren't too badly broken) take its
>>> resources down, and stop the disk heartbeat, which would then allow the
>>> rest of the cluster to consider it having left the cluster safely, and
>>> migrate the resources.
I've missed a lot of the thread, so I offer these comments in the hope they add some value...
The reason we implemented a shared-disk-based communication channel
for cluster split-brain detection, and suicide via poison pill, back in the
late 90s was that the Fibre Channel Arbitrated Loop SANs we had then for
clusters of more than two nodes had no I/O fencing intelligence whatsoever,
and SCSI-3 reservations weren't reliable or even supported in many cases.
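
For anyone who hasn't seen one, here is a rough sketch of the poison pill
idea, not the code we shipped. It assumes a made-up on-disk layout (one
heartbeat slot plus one mailbox slot per node, 512 bytes each) and uses an
ordinary file as a stand-in for the shared LUN. The names SharedDisk,
node_tick and the PILL marker are hypothetical. A real implementation would
also need unbuffered (direct) I/O to the device and a watchdog for the case
where the node is too wedged to read its mailbox.

```python
import os
import struct
import tempfile

SLOT_SIZE = 512                   # assumed: one sector per slot
HEARTBEAT = struct.Struct("<Q")   # tick counter, written only by the slot owner
PILL = b"POISON"                  # hypothetical pill marker


class SharedDisk:
    """Stand-in for a shared LUN: each node gets a heartbeat slot
    followed by a mailbox slot that only its peers write to."""

    def __init__(self, path, nodes):
        self.path = path
        with open(path, "wb") as f:
            f.write(b"\0" * SLOT_SIZE * 2 * nodes)

    def _write(self, offset, data):
        with open(self.path, "r+b") as f:
            f.seek(offset)
            f.write(data.ljust(SLOT_SIZE, b"\0"))
            f.flush()
            os.fsync(f.fileno())  # push the write out of the page cache

    def _read(self, offset):
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(SLOT_SIZE)

    def beat(self, node, tick):
        self._write(node * 2 * SLOT_SIZE, HEARTBEAT.pack(tick))

    def last_beat(self, node):
        return HEARTBEAT.unpack_from(self._read(node * 2 * SLOT_SIZE))[0]

    def send_pill(self, victim):
        self._write((victim * 2 + 1) * SLOT_SIZE, PILL)

    def has_pill(self, node):
        return self._read((node * 2 + 1) * SLOT_SIZE).startswith(PILL)


def node_tick(disk, node, tick):
    """One loop iteration on a node: check the mailbox, then heartbeat.
    Returns False when the node has been told to die; a real node would
    halt or reboot at that point instead of returning."""
    if disk.has_pill(node):
        return False
    disk.beat(node, tick)
    return True


def demo():
    path = os.path.join(tempfile.mkdtemp(), "shared.img")
    disk = SharedDisk(path, nodes=2)
    alive_before = node_tick(disk, 1, tick=1)
    disk.send_pill(1)             # node 0 decides node 1 must leave
    alive_after = node_tick(disk, 1, tick=2)
    return alive_before, alive_after, disk.last_beat(1)
```

Once a node has swallowed its pill its heartbeat counter stops advancing,
which is what lets the survivors conclude it has left and take over its
resources.
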
The rather brutal approach of killing a node, just to be sure it doesn't leak
an I/O onto a shared disk that has since received I/O from other servers in
the same cluster, was the excuse for today's smart storage subsystems
and SAN fabrics that can be programmed to disable the initiator at the target
side...
The rather tricky tendency of file systems to hang the server OS when they
can't be umount'ed reliably can be somewhat worked around by running more
than one kernel on the same server, i.e. run your server applications inside
a VM, and get the unreliable code out of the kernel that runs the cluster software...
Hth,
Robert