[Linux-ha-dev] Re: [Linux-HA] Recovering from "unexpected bad things" - is STONITH the answer?

Peter R. Badovinatz tabmowzo at us.ibm.com
Wed Nov 7 09:45:50 MST 2007


Alan Robertson wrote:
> Kevin Tomlinson wrote:
>> On Tue, 2007-11-06 at 10:25 -0700, Alan Robertson wrote:
>>
>>> We now have the ComponentFail test in CTS.  Thanks Lars for getting 
>>> it going!
>>>
>>> And, in the process, it's showing up some kinds of problems that we 
>>> hadn't been looking for before.  A couple examples of such problems 
>>> can be found here:
>>>
>>> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
>>> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
>>>
>>> The question that comes up is this:
>>>
>>> For problems that should "never" happen like death of one of our 
>>> core/key processes, is an immediate reboot of the machine the right 
>>> recovery technique?
>>>
>>> <snip>
> 
> Here's the issue:
> 
> The solution as I see it is to do one of:
> 
>     a) reboot the node and clear the problem with certainty

I'm well aware that saying "yeah, we did it that way" isn't Good Form, 
but, well, we did it that way (we also did 'd', see below).  I refer to 
existing proprietary HA products that I was involved in designing and 
implementing.

What we found was that certain processes were indistinguishable from the 
node itself and their failure was therefore near impossible to deal with 
cleanly.  The problem was, as described here, that the OS and 
applications/services would continue on but the other nodes in the 
cluster would see it as a node failure and take recovery actions (as you 
describe here).

This was admin-controllable, as we did offer something else...

> 
>     b) continue on and risk damaging your disks.
> 
>     c) write some new code to recover from specific cases more
>        gracefully and then test it thoroughly.
> 
>     d) Try and figure out how to propagate the failure to the
>         top layer of the cluster, and hope you get the notice
>         there soon enough so that it can "freeze" the cluster
>         before the code reacts to the apparent failure
>         and begins to try and recover from it.

Our architecture was that our clients were in a 'process group' across 
the cluster (group services) where each was connected to our server 
process via a unix domain socket on the same node.  Across the cluster 
these process group peers were assumed to be controlling resources and 
they mediated recovery actions through group services.

The unix-domain socket breaking was (well, still is) a defined condition 
that the client was told to handle as our process dying.  They were 
told in this case to clean up immediately, on the assumption that their 
peers would conclude that the node had died (and the local process with 
it) and would be doing takeovers.
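
A minimal sketch of that client-side rule, in Python.  Nothing here is 
the actual group services interface: watch_server, on_server_death and 
the demo are made-up names, and socket.socketpair() merely stands in for 
the unix-domain connection to the local server process.

```python
import socket


def watch_server(conn, on_server_death):
    """Read from the local socket; treat EOF or a reset as the server dying."""
    while True:
        try:
            data = conn.recv(4096)
        except (ConnectionResetError, BrokenPipeError):
            data = b""
        if not data:
            # The socket broke: assume peers will see this node as dead
            # and start takeovers, so release local resources right now.
            on_server_death()
            return
        # Normal message handling would go here (omitted in this sketch).


def demo_socket_break():
    # A connected pair of unix-domain sockets (hypothetical stand-in).
    server_side, client_side = socket.socketpair()
    events = []
    server_side.close()  # simulate the server process dying
    watch_server(client_side, lambda: events.append("cleaned-up"))
    client_side.close()
    return events
```

The point of the sketch is only that the cleanup is triggered by the 
socket itself, with no timeout involved on the local side.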

This was meant to allow some hope of avoiding taking the node down for 
real, depending on the application space.  The intent was that the local 
client would notice the death immediately, while the remote ones would 
take some time (i.e., lack of heartbeats, etc.) to notice.
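
To illustrate the window this leaves, here is a rough sketch of a 
heartbeat-based detector as a remote node might run it.  The class, the 
deadtime value and the node name are assumptions for illustration only, 
not taken from any real product.

```python
class HeartbeatMonitor:
    """Declare a node dead once no heartbeat has arrived for `deadtime` seconds."""

    def __init__(self, deadtime):
        self.deadtime = deadtime
        self.last_seen = {}

    def heartbeat(self, node, now):
        # Record the time of the most recent heartbeat from this node.
        self.last_seen[node] = now

    def dead_nodes(self, now):
        # A node counts as dead only after the full deadtime has elapsed.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.deadtime
        )


def demo_detection_window():
    monitor = HeartbeatMonitor(deadtime=10)  # hypothetical value
    monitor.heartbeat("nodeA", now=0)        # last heartbeat before the failure
    early = monitor.dead_nodes(now=5)        # peers have not noticed yet
    late = monitor.dead_nodes(now=11)        # deadtime exceeded, takeover begins
    return early, late
```

Whatever the local client has to clean up must fit inside that deadtime, 
since the peers start takeovers as soon as it expires.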

In cases where the node wasn't taken down, via inittab or similar our 
processes would get automatically restarted, and we'd reintegrate into 
the cluster.  The local client process(es) would be expected to 
reconnect and have to rejoin their group(s).  Our interface manual 
described all of this.

If for some reason our processes couldn't restart (or inittab gave up 
because of too many retries) that node would stay out of the cluster.
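
The client's side of that restart path could look roughly like the 
sketch below.  The function names, the retry limit and the group names 
are all hypothetical; the real interface was the one described in our 
manual, and the retry cap here is just an assumption mirroring inittab 
giving up.

```python
import time


def reconnect_and_rejoin(connect, rejoin, groups, max_attempts=5, delay=0.0):
    """Try to reach the restarted local server, then rejoin every group.

    Returns the connection, or None if the server never came back
    (in which case the node stays out of the cluster).
    """
    for _ in range(max_attempts):
        try:
            conn = connect()
        except OSError:
            time.sleep(delay)  # server not restarted yet; wait and retry
            continue
        for group in groups:
            rejoin(conn, group)  # membership is not restored automatically
        return conn
    return None


def demo_reconnect():
    attempts = {"n": 0}
    joined = []

    def fake_connect():
        # Fails twice, as if the server were still being respawned.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionRefusedError("server not up yet")
        return "conn"

    def fake_rejoin(conn, group):
        joined.append(group)

    conn = reconnect_and_rejoin(fake_connect, fake_rejoin, ["disk", "ip"])
    return conn, attempts["n"], joined
```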

> 
> In the current code, sometimes you'll get behavior (a) and sometimes 
> you'll get behavior (b) and sometimes you'll get behavior (c).
> 
> In the particular case described by bug 1762, failure to reboot the node 
> did indeed start the same resource twice.  In a cluster where you have 
> shared disk (like yours for example), that would probably trash the 
> filesystem.  Not a good plan unless you're tired of your current job 
> ;-).  I'd like to take most/all of the cases where you might get 
> behavior (b) and cause them to use behavior (a).
> 
> If writing correct code and testing it were free, then (c) would 
> obviously be the right choice.
> 
> Quite honestly, I don't know how to do (d) in a reliable way at all. 
> It's much more difficult than it sounds.  Among other reasons, it relies 
> on the components you're telling to freeze things to work correctly. 
> Since resource freezes happen at the top level of the system, and the 
> top layers need all the layers under them to work correctly, getting 
> this right seems to be the kind of approach you could make into your 
> life's work - and still never get it right.

You're right.  In the scheme I described above we (group services) 
simply washed our hands of what our clients (layers above) were able to 
do and get right...  We didn't write those.  We offered this as the only 
thing we could think of, with the hope that some clients could do things 
correctly.  It assumes that the OS, for example, is still working so the 
client can take dependable actions.

And if they weren't confident, they could enable node rebooting in this 
case and let recovery happen 'normally'.

> 
> Case (c) has to be handled on a case by case basis, where you write and 
> test the code for a particular failure case.  IMHO the only feasible 
> _general_ answer is (a).
> 
> There are an infinite number of things that can go wrong.  So, having a 
> reliable and general strategy to deal with the WTF's of the world is a 
> good thing.  Of course, those cases where we have a (c) behavior 
> would not be affected by this change in general policy.
> 
> 

Peter
-- 
Peter R. Badovinatz aka 'Wombat'
STG Telecommunications, Linux & Standards
preferred: tabmowzo at us.ibm.com / alternate: wombat at us.ibm.com
These are my opinions and absolutely not official opinions of IBM, Corp.

