multiple failovers with a Stonith device?
Alan Robertson
alanr@unix.sh
Mon, 28 Oct 2002 10:30:14 -0700
Aaron Bush wrote:
> I had a problem early this morning (around 4:00am) with heartbeat
> failing over multiple times on a two node cluster. Heartbeat is set to
> "nice_failback on". The OS on both nodes is Linux 2.4.5.18-5 (RedHat
> 7.3). The hostnames are cctcpa and cctcpb. cctcpa is the primary and
> is powered by a WTI RPS10 Power Switch. The RPS10 is attached to the
> serial port of cctcpb. I am using both serial and ethernet (cross over
> cable) heartbeat channels. I am running heartbeat-0.4.9.2 on both nodes.
>
> It appears from the log files that the cctcpa (active) node became
> bogged down and logged that it and the other node, cctcpb were both
> dead. In the archives I came across a post that mentioned that if the
> system was too heavily loaded then the following entry might be logged
> if the local box failed to heartbeat itself after deadtime seconds:
> "ERROR: No local heartbeat. Forcing shutdown"
> Is this true?
Yes.
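
As a rough illustration of that check (not heartbeat's actual code), the sketch below tracks when each node was last heard from and reports any node, including the local one, that has been silent for longer than deadtime seconds. The class and method names are hypothetical, and the strict "greater than" comparison is an assumption.

```python
class DeadtimeMonitor:
    """Toy model of a deadtime check; names are hypothetical, not heartbeat's API."""

    def __init__(self, local_node, deadtime):
        self.local_node = local_node
        self.deadtime = deadtime      # seconds of silence before a node counts as dead
        self.last_heard = {}          # node name -> time of last heartbeat seen

    def record(self, node, timestamp):
        """Note that a heartbeat from `node` was seen at `timestamp` (seconds)."""
        self.last_heard[node] = timestamp

    def dead_nodes(self, now):
        """Return the nodes that have been silent for more than deadtime seconds."""
        return sorted(
            node for node, seen in self.last_heard.items()
            if now - seen > self.deadtime
        )

    def local_heartbeat_lost(self, now):
        """True when even the local node's own heartbeat is overdue."""
        return self.local_node in self.dead_nodes(now)


monitor = DeadtimeMonitor("cctcpa", deadtime=10)
monitor.record("cctcpa", 0)
monitor.record("cctcpb", 0)
# A heavily loaded box that processes nothing for 11 seconds sees both nodes as dead.
print(monitor.dead_nodes(11), monitor.local_heartbeat_lost(11))
```

In this model a stalled machine declares itself and its peer dead at the same moment, which matches the pattern described in the logs.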
> It also appears that the RPS10 failed to take down the primary server on
> the first attempt. This entry was logged in /var/log/messages (why is
> this not in the ha-log or ha-debug log files?):
The STONITH library is designed to be independent of heartbeat, so it cannot
rely on heartbeat's logging methods.
>
> Oct 28 04:01:13 cctcpb heartbeat: Host cctcpa.microcenter.com being
> rebooted.
> Oct 28 04:01:13 cctcpb heartbeat: Did not find string: 'Plug' fromWTI
> RPS10 Power Switch.
> Oct 28 04:01:29 cctcpb heartbeat: Did not find string: 'RPS-10 Ready'
> fromWTI RPS10 Power Switch.
>
> The logs show two more calls to the RPS10 to reset the power. These two
> attempts were successful and can be confirmed via syslog boot entries.
>
> I had deadtime set to 10, which from reading the archives is probably
> way too short. The system is not in production yet so the load is
> normally very low. The only activity on the system around 4:00am is log
> file rotation (standard RedHat cron entries). I have adjusted deadtime
> to be 90 now.
Whether 10 seconds is too short depends on many factors: the amount of
memory, which release of heartbeat, which kernel, how much I/O, and so
on. Sorry :-(.
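
One way to reason about a deadtime value is to start from the longest heartbeat gap actually observed under load and add a margin. The sketch below assumes the gaps have already been measured somehow; the function name, the factor of 2, and the 10-second floor are illustrative assumptions, not values taken from heartbeat itself.

```python
def suggest_deadtime(gaps_seconds, safety_factor=2.0, minimum=10.0):
    """Suggest a deadtime from observed heartbeat gaps (all values in seconds).

    safety_factor and minimum are illustrative assumptions, not heartbeat defaults.
    """
    if not gaps_seconds:
        raise ValueError("need at least one observed gap")
    worst_gap = max(gaps_seconds)
    # Leave headroom above the worst gap seen so far, but never go below the floor.
    return max(minimum, safety_factor * worst_gap)


# Hypothetical measurements: mostly 2 s apart, with one long stall.
print(suggest_deadtime([2.0, 2.1, 14.5]))
```

A value chosen this way is only as good as the measurements behind it, so it should be revisited once the system carries production load.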
> I have attached the ha-log entries from both cctcpa and cctcpb. If
> anyone on the list can take a look to confirm that the behavior of the
> two nodes is normal and expected that would be great.
Of course, having the machines shut down isn't normal, but I think you
understand that one. Note that the newest Red Hat kernel appears to have
problems above and beyond "normal" problems.
> I have also
> attached ha.cf (same on both nodes). haresources on both nodes has the
> following lines:
>
> # use cctcpa as primary, use 10.10.1.11 as shared IP
> cctcpa.microcenter.com 10.10.1.11 mpsd logger_alert
>
>
> Something else that I noticed today:
>
> Why all the defunct ifstat and status processes? The 05:17 entries are
> from when the system restarted (via Stonith). The 11:44 entries are
> from a heartbeat reload on cctcpb (the spare node).
>
> -- ps from cctcpa (active node) after reload on cctcpb (spare node) --
>
> root 1311 1 0 05:17 ttyS0 00:00:00 heartbeat
> root 1312 1311 0 05:17 ttyS0 00:00:00 heartbeat
> root 1313 1311 0 05:17 ttyS0 00:00:00 heartbeat
> root 1314 1311 0 05:17 ttyS0 00:00:00 heartbeat
> root 1315 1311 0 05:17 ttyS0 00:00:00 heartbeat
> root 1316 1311 0 05:17 ttyS0 00:00:00 heartbeat
> root 1317 1316 0 05:17 ttyS0 00:00:00 [ifstat <defunct>]
> root 1320 1316 0 05:17 ttyS0 00:00:00 [ifstat <defunct>]
> root 1321 1316 0 05:17 ttyS0 00:00:00 [status <defunct>]
> root 1326 1316 0 05:17 ttyS0 00:00:00 [status <defunct>]
> root 1329 1316 0 05:17 ttyS0 00:00:00 [ifstat <defunct>]
> root 1332 1316 0 05:17 ttyS0 00:00:00 [heartbeat <defunct>]
> root 1370 1316 0 05:17 ttyS0 00:00:00 [ip-request <defunct>]
> root 3052 1316 0 11:44 ttyS0 00:00:00 [status <defunct>]
> root 3053 1316 0 11:44 ttyS0 00:00:00 [ifstat <defunct>]
> root 3054 1316 0 11:44 ttyS0 00:00:00 [ifstat <defunct>]
> root 3074 1316 0 11:44 ttyS0 00:00:00 [ifstat <defunct>]
> root 3075 1316 0 11:44 ttyS0 00:00:00 [status <defunct>]
> root 3080 1316 0 11:44 ttyS0 00:00:00 [status <defunct>]
> root 3083 1316 0 11:44 ttyS0 00:00:00 [ifstat <defunct>]
I don't know why these processes are hanging around. I do know that
process management in the betas is much different, and generally much
better, as is STONITH handling.
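
For background, a "<defunct>" entry is a child that has exited but whose parent has not yet collected its exit status with wait(). To see which parent is leaving them behind, a listing like the one quoted above can be summarized per parent PID. This is a minimal sketch that assumes the usual `ps -ef` column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD); the function name is hypothetical.

```python
from collections import Counter


def count_defunct(ps_output):
    """Count defunct processes per (parent PID, command name) in `ps -ef` text."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 7)   # the 8th field keeps the whole command
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue                   # skip header lines and anything malformed
        ppid, cmd = int(fields[2]), fields[7]
        if "<defunct>" in cmd:
            name = cmd.strip("[]").replace("<defunct>", "").strip()
            counts[(ppid, name)] += 1
    return counts


sample = """\
root 1316 1311 0 05:17 ttyS0 00:00:00 heartbeat
root 1317 1316 0 05:17 ttyS0 00:00:00 [ifstat <defunct>]
root 1320 1316 0 05:17 ttyS0 00:00:00 [ifstat <defunct>]
root 1321 1316 0 05:17 ttyS0 00:00:00 [status <defunct>]
"""
for (ppid, name), n in sorted(count_defunct(sample).items()):
    print(ppid, name, n)
```

In the listing above every defunct entry has PID 1316 as its parent, so that one heartbeat process is the one not reaping its children.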
-- Alan Robertson
alanr@unix.sh