multiple failovers with a Stonith device?

Alan Robertson alanr@unix.sh
Mon, 28 Oct 2002 10:30:14 -0700


Aaron Bush wrote:
> I had a problem early this morning (around 4:00am) with heartbeat 
> failing over multiple times on a two-node cluster.  Heartbeat is set to 
> "nice_failback on".  The OS on both nodes is Linux 2.4.5.18-5 (RedHat 
> 7.3).  The hostnames are cctcpa and cctcpb.  cctcpa is the primary and 
> is powered by a WTI RPS10 Power Switch.  The RPS10 is attached to the 
> serial port of cctcpb.  I am using both serial and ethernet (crossover 
> cable) heartbeat channels.  I am running heartbeat-0.4.9.2 on both nodes.
> 
> It appears from the log files that the cctcpa (active) node became 
> bogged down and logged that it and the other node, cctcpb, were both 
> dead.  In the archives I came across a post that mentioned that if the 
> system was too heavily loaded then the following entry might be logged 
> if the local box failed to heartbeat itself after deadtime seconds:
> "ERROR: No local heartbeat. Forcing shutdown"
> Is this true?

Yes.
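
To illustrate the idea only, here is a minimal sketch of a deadtime check. It is not heartbeat's actual code; the function name and the timestamps are made up for the example. It just shows why a node that stalls for longer than deadtime can end up declaring itself dead:

```python
def local_heartbeat_overdue(last_local_beat, now, deadtime):
    """Return True if no local heartbeat was seen within `deadtime` seconds.

    Hypothetical helper for illustration; all arguments are in seconds.
    """
    return (now - last_local_beat) > deadtime


# A node that stalls for 12 seconds trips a 10-second deadtime,
# but not a 90-second one.
stalled_10 = local_heartbeat_overdue(100.0, 112.0, deadtime=10)
stalled_90 = local_heartbeat_overdue(100.0, 112.0, deadtime=90)
print(stalled_10, stalled_90)
```

The same stall is harmless with the longer deadtime, at the cost of slower detection of a node that really has died.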

> It also appears that the RPS10 failed to take down the primary server on 
> the first attempt.  This entry was logged in /var/log/messages (why is 
> this not in the ha-log or ha-debug log files?):

The STONITH library is designed to be independent of heartbeat, so it cannot 
rely on heartbeat's logging methods.
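
Since these messages go to syslog, one way to pull them out is to filter /var/log/messages yourself. The following is a rough sketch, assuming the entries look like the ones quoted in this thread (rejoined onto single lines); the function name and the default marker string are my own choices, not anything the STONITH library provides:

```python
def stonith_failures(lines, marker="Did not find string"):
    """Return syslog lines tagged 'heartbeat:' that contain `marker`.

    Hypothetical helper; `lines` is any iterable of syslog lines.
    """
    return [ln for ln in lines if " heartbeat: " in ln and marker in ln]


# Sample lines copied from the entries quoted in this thread.
sample = [
    "Oct 28 04:01:13 cctcpb heartbeat: Host cctcpa.microcenter.com being rebooted.",
    "Oct 28 04:01:13 cctcpb heartbeat: Did not find string: 'Plug' fromWTI RPS10 Power Switch.",
    "Oct 28 04:01:29 cctcpb heartbeat: Did not find string: 'RPS-10 Ready' fromWTI RPS10 Power Switch.",
]

for ln in stonith_failures(sample):
    print(ln)
```

To run it against the real file, pass an open file object for /var/log/messages instead of the sample list.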

> 
> Oct 28 04:01:13 cctcpb heartbeat: Host cctcpa.microcenter.com being 
> rebooted.
> Oct 28 04:01:13 cctcpb heartbeat: Did not find string: 'Plug' fromWTI 
> RPS10 Power Switch.
> Oct 28 04:01:29 cctcpb heartbeat: Did not find string: 'RPS-10 Ready' 
> fromWTI RPS10 Power Switch.
> 
> The logs show two more calls to the RPS10 to reset the power.  These two 
> attempts were successful and can be confirmed via syslog boot entries.
> 
> I had deadtime set to 10, which from reading the archives is probably 
> way too short.  The system is not in production yet so the load is 
> normally very low.  The only activity on the system around 4:00am is log 
> file rotation (standard RedHat cron entries).  I have adjusted deadtime 
> to be 90 now.

The right deadtime depends on a lot of factors: amount of memory, which 
release of heartbeat, which kernel, how much I/O, and so on.  Sorry :-(.

> I have attached the ha-log entries from both cctcpa and cctcpb.  If 
> anyone on the list can take a look to confirm that the behavior of the 
> two nodes is normal and expected that would be great.

Of course, having the machines shut down isn't normal, but I think you 
understand that one.  Note that the newest Red Hat kernel appears to have 
problems above and beyond "normal" problems.


> I have also
> attached ha.cf (same on both nodes).  haresources on both nodes has the 
> following lines:
> 
> # use cctcpa as primary, use 10.10.1.11 as shared IP
> cctcpa.microcenter.com  10.10.1.11 mpsd logger_alert
> 
> 
> Something else that I noticed today:
> 
> Why all the defunct ifstat and status processes?  The 05:17 entries are 
> from when the system restarted (via Stonith).  The 11:44 entries are 
> from a heartbeat reload on cctcpb (spare node).
> 
> -- ps from cctcpa (active node) after reload on cctcpb (spare node) --
> 
> root      1311     1  0 05:17 ttyS0    00:00:00 heartbeat
> root      1312  1311  0 05:17 ttyS0    00:00:00 heartbeat
> root      1313  1311  0 05:17 ttyS0    00:00:00 heartbeat
> root      1314  1311  0 05:17 ttyS0    00:00:00 heartbeat
> root      1315  1311  0 05:17 ttyS0    00:00:00 heartbeat
> root      1316  1311  0 05:17 ttyS0    00:00:00 heartbeat
> root      1317  1316  0 05:17 ttyS0    00:00:00 [ifstat <defunct>]
> root      1320  1316  0 05:17 ttyS0    00:00:00 [ifstat <defunct>]
> root      1321  1316  0 05:17 ttyS0    00:00:00 [status <defunct>]
> root      1326  1316  0 05:17 ttyS0    00:00:00 [status <defunct>]
> root      1329  1316  0 05:17 ttyS0    00:00:00 [ifstat <defunct>]
> root      1332  1316  0 05:17 ttyS0    00:00:00 [heartbeat <defunct>]
> root      1370  1316  0 05:17 ttyS0    00:00:00 [ip-request <defunct>]
> root      3052  1316  0 11:44 ttyS0    00:00:00 [status <defunct>]
> root      3053  1316  0 11:44 ttyS0    00:00:00 [ifstat <defunct>]
> root      3054  1316  0 11:44 ttyS0    00:00:00 [ifstat <defunct>]
> root      3074  1316  0 11:44 ttyS0    00:00:00 [ifstat <defunct>]
> root      3075  1316  0 11:44 ttyS0    00:00:00 [status <defunct>]
> root      3080  1316  0 11:44 ttyS0    00:00:00 [status <defunct>]
> root      3083  1316  0 11:44 ttyS0    00:00:00 [ifstat <defunct>]

I don't know why these processes are hanging around here.  I know that 
process management in the betas is much different, and generally much 
better, as is STONITH handling.
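
If you want to keep an eye on whether the zombies keep piling up, a small script can tally them by parent PID. This is only a sketch, assuming "ps -ef" style output with the command in the eighth column, as in the listing above; the function name is hypothetical:

```python
from collections import Counter


def defunct_by_parent(ps_lines):
    """Count <defunct> processes per parent PID in `ps -ef` style lines."""
    counts = Counter()
    for line in ps_lines:
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        ppid, cmd = fields[2], fields[7]
        if "<defunct>" in cmd:
            counts[ppid] += 1
    return counts


# A few lines taken from the ps listing in this thread.
sample = [
    "root      1311     1  0 05:17 ttyS0    00:00:00 heartbeat",
    "root      1316  1311  0 05:17 ttyS0    00:00:00 heartbeat",
    "root      1317  1316  0 05:17 ttyS0    00:00:00 [ifstat <defunct>]",
    "root      1321  1316  0 05:17 ttyS0    00:00:00 [status <defunct>]",
    "root      3052  1316  0 11:44 ttyS0    00:00:00 [status <defunct>]",
]

print(defunct_by_parent(sample))
```

In the listing above every defunct entry has PPID 1316, so that one heartbeat process is the one not reaping its children.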


	-- Alan Robertson
	   alanr@unix.sh