[Linux-ha-dev] Re: The issue of shutdown

Alan Robertson alanr at unix.sh
Wed Dec 1 06:56:00 MST 2004


Huang Zhen wrote:
> Alan Robertson wrote:
> 
>> You should probably be setting each resource agent you fork as a 
>> process group anyway, because then if you have to kill them, you will 
>> kill ALL of its child processes too...
> 
> Ok, I will change the process group of the RA.
> 
>>
>> But, I still wonder what kind of testing you're doing such that you 
>> shut down heartbeat while things are still being taken over.  This is 
>> not an ideal circumstance.
>>
> 
> The situation is shutting down the last node in the cluster.
> 1. heartbeat -k
> 2. heartbeat sends SIGTERM to CRM process group
> 3. CRM sends several stop operations to LRM to release all the resources 
> it is holding, and exits immediately.

OK.  Then the CRM is broken.  It should wait for all resources to be 
released before it exits.  This is because it MUST take some kind of action 
if a resource fails to stop correctly.  It is not enough to pray for 
correct stop status ;-).

Simple software is good.  This is one of those cases where it is too 
simple.  Unfortunately shutdown is one of the more difficult things about a 
resource manager.  It has been a continual problem in release 1, and not 
just because the current architecture isn't very good.

> 4. Then heartbeat sends SIGTERM to LRM process group, but at this point 
> the child processes of LRM, the stop operations from step 3, are still 
> running.

Yes, I see that.

> Any suggestion? I will be on IRC.

First, I think the CRM MUST monitor "stop" status and wait until everything 
is correctly stopped.  Exiting immediately is a bug.  In fact, in addition 
to being incorrect, it has now forced you twice to work around it 
elsewhere.  That is an additional sign that something is wrong in the CRM.
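
To make "wait until everything is correctly stopped" concrete, here is a 
rough sketch in Python.  It is not CRM or LRM code: the function name 
stop_all_and_wait and the idea of a stop operation being a plain command 
line are assumptions made only for illustration.  It starts every stop 
operation, waits for each one, and reports the resources whose stop failed 
or timed out so the caller can take action before exiting.

```python
import subprocess

def stop_all_and_wait(stop_commands, timeout=30.0):
    """Start every stop operation, then wait for each one to finish.

    stop_commands maps a resource name to an argv list (hypothetical).
    timeout applies to each wait separately, not to the whole shutdown.
    Returns the names of resources whose stop failed or timed out.
    """
    # Launch all stops first so they run in parallel.
    procs = {name: subprocess.Popen(argv)
             for name, argv in stop_commands.items()}
    failed = []
    for name, proc in procs.items():
        try:
            rc = proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()   # stop did not finish in time
            proc.wait()   # reap it so no zombie is left behind
            rc = None
        if rc != 0:
            failed.append(name)
    return failed
```

Only when the returned list is empty is it safe to exit quietly; anything 
else means a resource may still be running and needs further action.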

Secondly, there's no harm in making each resource operation you fork its 
own process group anyway.  You can't really kill them correctly otherwise. 
You only need to kill them when a monitor operation is cancelled or a 
timeout limit is exceeded - but you do need to do it in those cases - and 
you can't kill everything cleanly unless each resource operation is its 
own process group.
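
A minimal sketch of the process-group idea, again in Python rather than the 
LRM's own code.  The names start_operation and kill_operation are 
hypothetical.  One caveat: start_new_session=True makes the child call 
setsid(), which creates a new session as well as a new process group, so it 
does somewhat more than a bare setpgid() would.  Unix only.

```python
import os
import signal
import subprocess

def start_operation(argv):
    # The child calls setsid() before exec, so it becomes the leader of a
    # new process group whose id equals the child's pid.
    return subprocess.Popen(argv, start_new_session=True)

def kill_operation(proc, grace=5.0):
    """Signal the operation's whole process group, children included."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the group is already gone
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # Still alive after the grace period: escalate to SIGKILL.
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()
    return proc.returncode
```

Because the signal goes to the group rather than to a single pid, anything 
the resource agent itself forked is signalled too, as long as it has not 
moved itself into a different process group.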

But, the bug is primarily the CRM's, here.


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce


