[Linux-HA] Recovering from "unexpected bad things" - is STONITH the answer?
lmb at suse.de
Thu Nov 8 07:39:43 MST 2007
On 2007-11-08T14:49:44, Andrew Beekhof <beekhof at gmail.com> wrote:
>> I understand (and so far as that particular logic goes, I agree), but my
>> concern is with the proposal of having some "official" recommendation to
>> use the SSH plugin in production systems. It's simply (at present) just
>> not production quality,
> We could always try and remedy that :-)
Yes, the SSH plugin could be made much better.
(I'm thinking along the lines of the node - possibly the kernel -
sending a "Yes, suicide imminent" packet, after which the rest of the
cluster can assume, after delta-t seconds, that the node will indeed be
stopped - because this would be the next unavoidable operation of the
kernel after sending the packet. Of course, that's again only "extremely"
probable, but the chance should be low enough.)
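A rough sketch of the bookkeeping the surviving nodes would need for this, purely as an illustration. Nothing like it exists in heartbeat today; the class name, the method names and the idea of passing timestamps in explicitly are all my assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SuicideTracker:
    """Hypothetical peer-side record of "suicide imminent" notices."""

    delta_t: float  # seconds to wait after a notice before trusting it
    _notices: Dict[str, float] = field(default_factory=dict)

    def record_notice(self, node: str, now: float) -> None:
        # Keep only the earliest notice per node; a repeat must not
        # push the deadline further out.
        self._notices.setdefault(node, now)

    def presumed_stopped(self, node: str, now: float) -> bool:
        # A node counts as stopped only if it announced its suicide
        # and at least delta_t seconds have passed since then.
        received = self._notices.get(node)
        return received is not None and now - received >= self.delta_t
```

A node that never sent the notice is never presumed stopped here, which is exactly the case where something else has to do the fencing.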
But it is not as reliable as a tested external STONITH device, which
does not rely on cooperation from the failed OS image, nor network
connectivity to it. (Which makes node suicide unusable as split-brain
"node suicide" in response to a core process exiting doesn't hurt and
might make things better, because the other nodes will still fence it if
possible, and that's a good thing.
One problem our current STONITH implementation has is the impossibility
of specifying cascading fencing operations: first try the node suicide (if
ack'ed by the node, that ought to be good enough), and if that is not
possible because the node doesn't reply, fence it externally.
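To make the cascading idea concrete, here is a minimal sketch of the escalation logic. It is not how the current STONITH code works; the function name, the method signature and the two stand-in methods are hypothetical and only there to show the ordering:

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical signature: a fencing method takes a node name and returns
# True only when it can confirm that the node is down.
FenceMethod = Callable[[str], bool]


def fence_with_escalation(
    node: str, methods: Sequence[Tuple[str, FenceMethod]]
) -> Optional[str]:
    """Try each method in order; return the name of the first one that
    confirms the fence, or None if every method failed."""
    for name, method in methods:
        try:
            if method(node):
                return name
        except Exception:
            # An error from a method counts as "not confirmed"; escalate.
            continue
    return None


# Stand-in methods for illustration only.
def request_suicide(node: str) -> bool:
    return False  # pretend the node never acknowledged


def external_stonith(node: str) -> bool:
    return True  # pretend the external device confirmed the reset


used = fence_with_escalation(
    "node2", [("suicide", request_suicide), ("external", external_stonith)]
)
print(used)  # external
```

If no method confirms, the caller gets None back and must still treat the node as unfenced.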
I'm not religious about how the internal, not-yet-recoverable errors are
treated: whether it's node suicide or escalation to a "forced hard stop"
of all other heartbeat processes (_possibly_ followed by a restart)
doesn't matter much, but a fail-fast strategy for such errors is highly
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde