[Linux-HA] HP DL380 x2 and one raid
lmb at suse.de
Wed Aug 17 02:59:37 MDT 2005
On 2005-08-16T16:41:36, Peter Kruse <pk at q-leap.com> wrote:
> >It's been quite
> >complex, and I'm not sure if it ever worked right...
> no it didn't - not on Linux anyways...
Well, it mostly worked. Just that particular algorithm, I seem to
recall, did not.
> >If you want, I'd certainly encourage a design to deal with this, and
> >then a patch to do so...
> Does that mean heartbeat doesn't have anything like that?
No, heartbeat does not try to second-guess a STONITH failure right now.
The reason why FailSafe had this logic is that on IRIX and their own
hardware, they used the cross-wired system management console, which
essentially has the same problem as the DRAC3 and the iLO: if the node
is well and truly physically down, so is its power switch.
Dealing with this was thus quite an important scenario for them,
because it made up the majority of their node failures.
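To illustrate (untested pseudo-C; all of these names are made up, and
none of this is actual heartbeat or FailSafe code): the heuristic
amounts to treating "fencing device unreachable" as evidence of node
death, but only when the device is known to share power with the node:

enum fence_result {
        FENCE_OK,            /* device confirmed the reset */
        FENCE_DEVICE_DOWN,   /* device itself unreachable */
        FENCE_FAILED         /* device reachable, reset failed */
};

/* Assumed helpers, declared only for the sketch: */
enum fence_result fence_node(const char *node);
int device_shares_power(const char *node);

/* Treat an unreachable built-in device (iLO/DRAC style) as weak
 * evidence that the whole node, fencing device included, has lost
 * power, and therefore that the node really is down. */
int node_confirmed_dead(const char *node)
{
        switch (fence_node(node)) {
        case FENCE_OK:
                return 1;
        case FENCE_DEVICE_DOWN:
                /* Only valid if device and node share a PSU. */
                return device_shares_power(node);
        case FENCE_FAILED:
        default:
                return 0;
        }
}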
heartbeat, on the other hand, mostly deals with power switches which
are external to the node, and some of them, like the excellent WTI NPS
series or the Baytech, even have redundant power supplies themselves.
So, being unable to reach the power switch after a node failure more
likely indicates a configuration problem on your side than anything
about the state of the node itself.
This means that in our "common" case, this specific algorithm would
likely have a negative impact.
But, because it is a rather common issue, and built-in management
interfaces are becoming more pervasive on "common" hardware now too,
handling this in the common code instead of as a special hack in one
plugin seems to be a good idea.
Which is why I'm encouraging you to write up a design for how to fit
this into our fencing subsystem (which is neatly encapsulated in
heartbeat 2.0) and, even better, then write a patch ;-)
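One possible shape for such a design (again just a sketch, not the
actual heartbeat 2.0 plugin interface): let each plugin advertise
whether its device shares power with the node it fences, and keep the
policy itself in the common fencing code:

struct stonith_plugin_info {
        const char *name;
        /* Nonzero if the device draws power from the node it
         * fences (iLO/DRAC style), so "device unreachable" may
         * actually mean "node is powered off". */
        int shares_node_power;
};

The common code could then apply the shared-power heuristic only for
plugins that set the flag, so external switches keep the current,
stricter behaviour.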
> Does it also mean that, in case of a power failure affecting both
> the stonith device and the corresponding node, a failover will never
> succeed, and you have to fail over manually to recover?
Right now this is true. If fencing does not succeed, no resources
requiring fencing will be recovered. Which, if you look at the above
explanation, makes sense for our case, most of the time.
What we still have to implement is cascaded fencing; i.e., first try
node suicide if the node is still reachable (there are some cases
where we must fence/reboot even then), then some power switch, and
then let the admin pull the trigger. That would also go a long way
towards letting us implement stretched clusters better.
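In rough, hypothetical pseudo-C (none of these helpers exist), that
escalation chain would look like this:

#include <stddef.h>

typedef int (*fence_fn)(const char *node);

/* Assumed helpers, each returning 0 on success: */
int try_node_suicide(const char *node);       /* ask node to reset itself */
int try_power_switch(const char *node);       /* external STONITH device */
int wait_for_admin_confirm(const char *node); /* manual confirmation */

/* Walk the escalation chain; stop at the first method that
 * confirms the node has been fenced. */
int cascade_fence(const char *node)
{
        fence_fn chain[] = {
                try_node_suicide,
                try_power_switch,
                wait_for_admin_confirm,
                NULL
        };
        for (size_t i = 0; chain[i] != NULL; i++)
                if (chain[i](node) == 0)
                        return 0;
        return -1; /* nothing worked; resources stay frozen */
}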
Patches are being solicited. ;-)
Lars Marowsky-Brée <lmb at suse.de>
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
	-- Charles Darwin