[Linux-HA] observations after some fencing tests in a two node cluster
sebastia at l00-bugdead-prods.de
Thu Nov 8 04:28:05 MST 2007
Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> On Wed, Nov 07, 2007 at 04:43:32PM +0100, Sebastian Reitenbach wrote:
> > Hi all,
> > I did some fencing tests in a two node cluster, here are some details of
> > setup:
> > - use stonith external/ilo for fencing (ssh to ilo board and issue a
> > command)
> > - both nodes are connected via two bridged ethernet interfaces to two
> > redundant switches. The ilo boards are connected to each of the
> > switches.
> > My first observation:
> > - when removing the network cables from the node that is the DC at the
> > moment, it took at least three minutes until the surviving node decided
> > to stonith the disconnected node and to start up the resources that had
> > run on the node without connectivity
> > - when removing the network cables from the node that is not the DC, it
> > was a matter of e.g. 20 seconds until this node fenced the DC and then
> > became DC itself
> This definitely deserves a set of logs, etc (is your hb_report
> operational? :).
humm, yes, with the latest patches (:
ok, I'll reproduce the problem and create a report.
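For completeness, I'd collect it with something along these lines (the time window and destination directory are just placeholders for my test run):

```
# gather logs, CIB and config from both nodes for the test window
hb_report -f "2007/11/08 10:00" -t "2007/11/08 10:30" /tmp/fencing-test
```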
> > Why is there such a difference? The first one takes too long in my eyes
> > to detect the outage, but I hope there are timeout values that I can
> > tweak. Which ones shall I take a look at?
> deadtime in ha.cf.
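For reference, these are the timing knobs in /etc/ha.d/ha.cf; the values below are only illustrative, not recommendations:

```
# /etc/ha.d/ha.cf -- heartbeat timing (example values only)
keepalive 1      # interval between heartbeat packets, in seconds
warntime 5       # warn about late heartbeats after 5 seconds
deadtime 10      # declare a peer dead after 10 seconds of silence
initdead 60      # allow extra grace time at cluster startup
```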
> > Also I recognized the following line in the logfile from the DC in that
> > case:
> > tengine: ... info: extract_event: Stonith/shutdown of <uuid> not matched
> > This line shows up immediately after the DC detects that the other node
> > is unreachable. From then on it takes at least two minutes until the DC
> > fences the other node.
> Looks like a kind of misunderstanding between the CRM and
> stonithd. Again, a report would hopefully reveal what's going on.
> If you could turn debug on, that'd be great. A bugzilla is
> fine too.
I'll do, with above logs attached.
> > The second thing I observed:
> > My stonith is working via ssh to the ilo board of the node that shall be
> > fenced. When I remove the ethernet cables from one node, stonith will try
> > to kill the other node.
> > Take case two from above, removing the cables from the node that is not
> > the DC, where I observed the following:
> > The DC needs some minutes to decide to fence the other node, because of
> > the behaviour observed above. Meanwhile the non-DC node without cables
> > tried to fence the DC; that failed, and the node was in an unclean
> > state until the DC fenced it in the end.
> > Luckily the stonith of the DC failed. Now assume that instead of ssh, a
> > stonith device connected to e.g. a serial port were used as the stonith
> > resource. In that case, the non-DC node would be able to fence the DC,
> > then become the DC itself, starting all resources, mounting all
> > filesystems, ...
> > Meanwhile the DC is restarted. Either heartbeat is not started
> > automatically, and then the cluster is unusable, because the one node
> > that is left has no network. Or, when heartbeat is started automatically,
> > it cannot communicate with the second node, and will assume this one is
> > dead,
> and will insist on resetting it. Which would result in a yo-yo
> machinery. Not entirely useful. This kind of lack of
> communication is obviously detrimental, and that in spite of the
> stonith configured. Right now I don't see a solution to this issue.
> Apart from pingd.
> > and start
> > all its resources, so that e.g. filesystems could be mounted on both
> > nodes.
> > I don't have a hardware fencing device to test my theory, but could that
> > happen or not? Could the usage of some ping nodes, combined with a pingd
> > resource and an external quorumd, help to solve the dilemma?
> A pingd resource with appropriate constraints would help, i.e.
> something like "don't run resources on a node where the pingd attribute
> is not set".
I am already fiddling around with pingd, but I don't seem to get it to
work; see the other thread: "problem with locations depending on pingd"
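For anyone following along, the kind of constraint meant here is a CIB rule that forbids a resource on nodes without connectivity; the resource name and ids below are made up for illustration:

```xml
<rsc_location id="loc_my_resource_connected" rsc="my_resource">
  <rule id="loc_my_resource_connected_rule" score="-INFINITY" boolean_op="or">
    <!-- pingd attribute missing: node never reported connectivity -->
    <expression id="e_pingd_undef" attribute="pingd" operation="not_defined"/>
    <!-- pingd attribute 0: no ping nodes reachable -->
    <expression id="e_pingd_zero" attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
```

This assumes the pingd attribute is being maintained, e.g. via a `ping <ip>` line plus a `respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s` line in ha.cf.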
> > Well, I am running heartbeat 2.1.2-15 on sles10sp1, any hints and
> > suggestions are appreciated.