[Linux-HA] WARN: Exiting HBREAD process returned rc 10.
Dave Dykstra
dwdha at drdykstra.us
Mon Aug 1 10:37:14 MDT 2005
My computer room experienced a power outage last weekend. The HA-NFS
servers survived it because they're on extra UPSs, but everything else
was out for a little over 10 minutes. The two HA servers are directly
connected on eth1; on eth0 each connects to a gigabit switch that leads
to the rest of the network, and that switch lost power along with
everything else. I am running version 1.99.5.cvs.20050708 from
Ultra Monkey.
The active HA server stayed active for about 40 seconds but then logged
WARN: Exiting HBREAD process 4227 returned rc 10.
and started a heartbeat shutdown. Does anybody have any idea what
could cause that? From reading the source, it looks to me as if all
exits use a code of LSB_EXIT_something, and I don't see any of them
defined to be 10.
The same message appeared on the standby server two seconds later, but
its shutdown kept being delayed. A very similar power outage occurred
on July 10; that time these WARNs did not appear and no heartbeat
shutdown happened, although back then I was running a CVS version of
1.2.3 (the old log messages say heartbeat's version was 1.2.4).
This error might not have been so bad except that when the standby server
attempted to take over, it hung during the start of nfs-kernel-server
until the power came back on. I'm not sure why; I'm planning to run
some experiments at the end of this week to try to figure that out.
Worse, after that, even though the standby server had finished taking
over, when the clients booted up they all got their mounts refused with
messages like
rpc.mountd: refused mount request from 172.18.30.2 for /mnt/home (/): no export entry
which was the real killer and also a mystery, although I assume it was
related to the long-delayed start of nfs-kernel-server.
Below are all the log messages leading up to the WARN message on both sides.
- Dave
*** active server ***
Jul 29 22:21:29 swfs1 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Telling other node that we have more visible ping nodes.
Jul 29 22:21:38 swfs1 heartbeat: [4118]: info: Link swfs2:eth0 dead.
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Link Status update: Link swfs2/eth0 now has status dead
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Asking other side for ping node count.
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Checking remote count of ping nodes.
Jul 29 22:21:38 swfs1 heartbeat: [4118]: WARN: node 172.18.1.1: is dead
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Status update: Node 172.18.1.1 now has status dead
Jul 29 22:21:38 swfs1 heartbeat: [4118]: info: Link 172.18.1.1:172.18.1.1 dead.
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: NS: We are dead. :<
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Link Status update: Link 172.18.1.1/172.18.1.1 now has status dead
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: We are dead. :<
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Asking other side for ping node count.
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Ping node count is balanced.
Jul 29 22:21:38 swfs1 ipfail: [4231]: info: No giveup timer to abort.
Jul 29 22:21:38 swfs1 harc[19003]: info: Running /etc/ha.d/rc.d/status status
Jul 29 22:22:10 swfs1 heartbeat: [4118]: WARN: Exiting HBREAD process 4227 returned rc 10.
Jul 29 22:22:10 swfs1 heartbeat: [4118]: info: Heartbeat shutdown in progress. (4118)
*** standby server ***
Jul 29 22:21:29 swfs2 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down
Jul 29 22:21:38 swfs2 heartbeat: [9802]: WARN: node 172.18.1.1: is dead
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Status update: Node 172.18.1.1 now has status dead
Jul 29 22:21:38 swfs2 heartbeat: [9802]: info: Link 172.18.1.1:172.18.1.1 dead.
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: NS: We are dead. :<
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Link Status update: Link 172.18.1.1/172.18.1.1 now has status dead
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: We are dead. :<
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Asking other side for ping node count.
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Giving up because we were told that we have less ping nodes.
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Delayed giveup in 2 seconds.
Jul 29 22:21:38 swfs2 harc[8160]: info: Running /etc/ha.d/rc.d/status status
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Giving up because we have less visible ping nodes.
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Delayed giveup in 2 seconds.
Jul 29 22:21:38 swfs2 heartbeat: [9802]: info: Link swfs1:eth0 dead.
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Link Status update: Link swfs1/eth0 now has status dead
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: We are dead. :<
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Asking other side for ping node count.
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Ping node count is balanced.
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Aborted delayed giveup (4)
Jul 29 22:21:38 swfs2 ipfail: [9815]: info: No giveup timer to abort.
Jul 29 22:22:12 swfs2 heartbeat: [9802]: WARN: Exiting HBREAD process 9812 returned rc 10.
Jul 29 22:22:12 swfs2 heartbeat: [9802]: WARN: Shutdown delayed until current resource activity finishes.
Jul 29 22:22:16 swfs2 kernel: drbd1: Secondary/Primary --> Secondary/Secondary
Jul 29 22:22:17 swfs2 kernel: drbd0: Secondary/Primary --> Secondary/Secondary
Jul 29 22:22:17 swfs2 heartbeat: [9802]: info: Received shutdown notice from 'swfs1'.
Jul 29 22:22:17 swfs2 heartbeat: [9802]: info: Resources being acquired from swfs1.
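For anyone who wants to check the timing on the two nodes, here is a
minimal Python sketch that measures how long after the eth0 link went
down each heartbeat reported the HBREAD exit. It is only an illustration
under stated assumptions: the four input lines are copied from the logs
above, the year 2005 is assumed because syslog timestamps do not carry
one, and the helper names parse_line and seconds_until_hbread_exit are
made up for this example and are not part of heartbeat.

```python
from datetime import datetime

# Four lines copied from the logs above (link down and HBREAD exit per node).
LOG_LINES = [
    "Jul 29 22:21:29 swfs1 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down",
    "Jul 29 22:22:10 swfs1 heartbeat: [4118]: WARN: Exiting HBREAD process 4227 returned rc 10.",
    "Jul 29 22:21:29 swfs2 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down",
    "Jul 29 22:22:12 swfs2 heartbeat: [9802]: WARN: Exiting HBREAD process 9812 returned rc 10.",
]


def parse_line(line, year=2005):
    """Split a syslog line into (timestamp, host, message).

    The year is an assumption; syslog lines do not include it.
    """
    month, day, clock, host, message = line.split(None, 4)
    stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return stamp, host, message


def seconds_until_hbread_exit(lines):
    """Per host, seconds from 'NIC Link is Down' to the HBREAD exit WARN."""
    link_down = {}
    elapsed = {}
    for line in lines:
        stamp, host, message = parse_line(line)
        if "NIC Link is Down" in message:
            # Remember only the first link-down event seen for each host.
            link_down.setdefault(host, stamp)
        elif "Exiting HBREAD process" in message and host in link_down:
            elapsed[host] = int((stamp - link_down[host]).total_seconds())
    return elapsed


if __name__ == "__main__":
    for host, secs in seconds_until_hbread_exit(LOG_LINES).items():
        print(f"{host}: HBREAD exited {secs}s after eth0 link went down")
```

With the lines above this gives 41 seconds for swfs1 and 43 seconds for
swfs2, which matches the "about 40 seconds" and "two seconds later"
figures in the description.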