[Linux-ha-dev] Re: [Linux-HA] Recovering from "unexpected bad things" - is STONITH the answer?

Alan Robertson alanr at unix.sh
Tue Nov 6 11:45:17 MST 2007


Kevin Tomlinson wrote:
> On Tue, 2007-11-06 at 10:25 -0700, Alan Robertson wrote:
> 
>> We now have the ComponentFail test in CTS.  Thanks Lars for getting it 
>> going!
>>
>> And, in the process, it's showing up some kinds of problems that we 
>> hadn't been looking for before.  A couple examples of such problems can 
>> be found here:
>>
>> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
>> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
>>
>> The question that comes up is this:
>>
>> For problems that should "never" happen like death of one of our 
>> core/key processes, is an immediate reboot of the machine the right 
>> recovery technique?
>>
>> The advantages of such a choice include:
>>   It is fast
>>   It will invoke recovery paths that we exercise a lot in testing
>>   It is MUCH simpler than trying to recover from all these cases,
>> 	therefore almost certainly more reliable
>>
>> The disadvantages of such a choice include:
>>   It is crude, and very annoying
>>   It probably shouldn't be invoked for single-node clusters (?)
>>   It could be criticized as being lazy
>>   It shouldn't be invoked if there is another simple and correct method
>>   Continual rebooting becomes a possibility...
>>
>> We do not have a policy of doing this throughout the project, what we 
>> have is a few places where we do it.
>>
>> I propose that we should consider making a uniform policy decision for 
>> the project - and specifically decide to use ungraceful reboots as our 
>> recovery method for "key" processes dying (for example: CCM, heartbeat, 
>> CIB, CRM).  It should work for those cases where people don't configure 
>> in watchdogs or explicitly define any STONITH devices, and also 
>> independently of quorum policies - because AFAIK it seems like the right 
>> choice, there's no technical reason not to do so.
>>
>> My inclination is to think that this is a good approach to take for 
>> problems that in our best-guess judgment "shouldn't happen".
>>
>>
>> I'm bringing this to both lists, so that we can hear comments both from
>> developers and users.
>>
>>
>> Comments please...
>>
> 
> 
> I would say the "right thing" would depend on your cluster
> implementation and what is considered the right thing to do for the
> applications that the cluster is monitoring.
> I would propose that this action should be administrator configurable.
> From a user point of view with the cluster that we are implementing we
> would expect any cluster failure (internal) to either get itself back
> and running or just send out an alert "Help me. I'm not working"... as we
> would want our applications to continue running on the nodes. ** We don't
> want a service outage just because the cluster is no longer monitoring
> our applications. **
> We would expect to get a 24x7 call out. Sev1 and then logon to the
> cluster and see what was happening. (configured alerting)
> Our applications only want a service outage if the node itself has
> issues not the Cluster..

Here's the issue:

The solution as I see it is to do one of:

	a) reboot the node and clear the problem with certainty

	b) continue on and risk damaging your disks.

	c) write some new code to recover from specific cases more
	   gracefully and then test it thoroughly.

	d) Try and figure out how to propagate the failure to the
		top layer of the cluster, and hope you get the notice
		there soon enough so that it can "freeze" the cluster
		before the code reacts to the apparent failure
		and begins to try and recover from it.

In the current code, you'll sometimes get behavior (a), sometimes 
behavior (b), and sometimes behavior (c).

In the particular case described by bug 1762, failure to reboot the node 
did indeed result in the same resource being started twice.  In a 
cluster where you have 
shared disk (like yours for example), that would probably trash the 
filesystem.  Not a good plan unless you're tired of your current job 
;-).  I'd like to take most/all of the cases where you might get 
behavior (b) and cause them to use behavior (a).

If writing correct code and testing it were free, then (c) would 
obviously be the right choice.

Quite honestly, I don't know how to do (d) in a reliable way at all. 
It's much more difficult than it sounds.  Among other reasons, it relies 
on the components you're telling to freeze things to work correctly. 
Since resource freezes happen at the top level of the system, and the 
top layers need all the layers under them to work correctly, getting 
this right seems to be the kind of approach you could make into your 
life's work - and still never get it right.

Case (c) has to be handled on a case-by-case basis, where you write and 
test the code for a particular failure case.  IMHO the only feasible 
_general_ answer is (a).

There are an infinite number of things that can go wrong.  So, having a 
reliable and general strategy to deal with the WTF's of the world is a 
good thing.  Of course, those cases where we already have a (c) behavior 
would not be affected by this change in general policy.
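
To make the proposed policy concrete, here is a minimal sketch of the 
decision it implies.  It is Python purely for illustration and is not 
how heartbeat is implemented; the names Recovery, choose_recovery and 
handlers are hypothetical.  The list of key processes is the one given 
in the proposal (CCM, heartbeat, CIB, CRM).  It deliberately ignores the 
single-node and continual-reboot concerns, which would need their own 
handling.

```python
from enum import Enum
from typing import Callable, Dict


class Recovery(Enum):
    REBOOT = "a"            # ungraceful reboot of the node
    SPECIFIC_HANDLER = "c"  # tested, case-specific recovery code


# Key processes named in the proposal; lower-cased for lookup.
KEY_PROCESSES = frozenset({"ccm", "heartbeat", "cib", "crm"})


def choose_recovery(process: str,
                    handlers: Dict[str, Callable[[], None]]) -> Recovery:
    """Pick a recovery action for a key process that died unexpectedly.

    handlers is a hypothetical table of case-specific recovery routines
    that have been written and tested, i.e. the (c) cases.  Anything
    without such a routine falls back to (a), so (b) is never chosen.
    """
    name = process.lower()
    if name not in KEY_PROCESSES:
        raise ValueError(f"{process!r} is not one of the key processes")
    if name in handlers:
        return Recovery.SPECIFIC_HANDLER
    return Recovery.REBOOT
```

With an empty handlers table every key-process death maps to a reboot; 
registering a tested routine for, say, "crm" changes the answer for that 
process only.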


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce

