[Linux-HA] WARN: Exiting HBREAD process returned rc 10.
Alan Robertson
alanr at unix.sh
Wed Aug 3 10:24:37 MDT 2005
Dave Dykstra wrote:
> My computer room experienced a power outage last weekend. The HA-NFS
> servers survived it because they're on extra UPSs, but everything else
> was out for a little over 10 minutes. The two HA servers have a direct
> connection on eth1, and on eth0 they've got a connection to a gigabit
> switch that goes to the rest of the network, and the switch lost power
> along with everything else. I am running version 1.99.5.cvs.20050708
> from ultramonkey.
>
> The active HA server stayed active for about 40 seconds but then said
>     WARN: Exiting HBREAD process 4227 returned rc 10.
> and started a heartbeat shutdown. Does anybody have any idea of what
> could cause that? It looks to me in the source that all exits use a
> code of LSB_EXIT_something and I don't see any of them defined to be 10.
> The same message appeared on the standby server two seconds later but
> its shutdown kept being delayed. A very similar power outage occurred
> on July 10 and these WARNs did not occur and a heartbeat shutdown did
> not happen, although at that time I was running a CVS version of 1.2.3
> (the old log messages say heartbeat's version was 1.2.4).
>
> This error might not have been so bad except that when the standby server
> attempted to take over it hung during the start of nfs-kernel-server
> until the power came back on. I'm not sure why; I'm planning on doing
> some experiments at the end of this week to try to figure that out.
> Worse, after that, even though the standby server had finished taking
> over, when the clients booted up they all got their mounts refused with
> messages like
>     rpc.mountd: refused mount request from 172.18.30.2 for /mnt/home (/): no export entry
> which was the real killer and also a mystery although I assume it was
> related to the long-delayed start of nfs-kernel-server.
>
> Below are all the log messages leading up to the WARN message on both sides.
>
> - Dave
>
> *** active server ***
> Jul 29 22:21:29 swfs1 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Telling other node that we have more visible ping nodes.
> Jul 29 22:21:38 swfs1 heartbeat: [4118]: info: Link swfs2:eth0 dead.
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Link Status update: Link swfs2/eth0 now has status dead
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Asking other side for ping node count.
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Checking remote count of ping nodes.
> Jul 29 22:21:38 swfs1 heartbeat: [4118]: WARN: node 172.18.1.1: is dead
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Status update: Node 172.18.1.1 now has status dead
> Jul 29 22:21:38 swfs1 heartbeat: [4118]: info: Link 172.18.1.1:172.18.1.1 dead.
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: NS: We are dead. :<
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Link Status update: Link 172.18.1.1/172.18.1.1 now has status dead
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: We are dead. :<
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Asking other side for ping node count.
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: Ping node count is balanced.
> Jul 29 22:21:38 swfs1 ipfail: [4231]: info: No giveup timer to abort.
> Jul 29 22:21:38 swfs1 harc[19003]: info: Running /etc/ha.d/rc.d/status status
> Jul 29 22:22:10 swfs1 heartbeat: [4118]: WARN: Exiting HBREAD process 4227 returned rc 10.
> Jul 29 22:22:10 swfs1 heartbeat: [4118]: info: Heartbeat shutdown in progress. (4118)
>
> *** standby server ***
> Jul 29 22:21:29 swfs2 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down
> Jul 29 22:21:38 swfs2 heartbeat: [9802]: WARN: node 172.18.1.1: is dead
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Status update: Node 172.18.1.1 now has status dead
> Jul 29 22:21:38 swfs2 heartbeat: [9802]: info: Link 172.18.1.1:172.18.1.1 dead.
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: NS: We are dead. :<
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Link Status update: Link 172.18.1.1/172.18.1.1 now has status dead
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: We are dead. :<
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Asking other side for ping node count.
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Giving up because we were told that we have less ping nodes.
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Delayed giveup in 2 seconds.
> Jul 29 22:21:38 swfs2 harc[8160]: info: Running /etc/ha.d/rc.d/status status
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Giving up because we have less visible ping nodes.
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Delayed giveup in 2 seconds.
> Jul 29 22:21:38 swfs2 heartbeat: [9802]: info: Link swfs1:eth0 dead.
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Link Status update: Link swfs1/eth0 now has status dead
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: We are dead. :<
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Asking other side for ping node count.
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Ping node count is balanced.
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: Aborted delayed giveup (4)
> Jul 29 22:21:38 swfs2 ipfail: [9815]: info: No giveup timer to abort.
> Jul 29 22:22:12 swfs2 heartbeat: [9802]: WARN: Exiting HBREAD process 9812 returned rc 10.
> Jul 29 22:22:12 swfs2 heartbeat: [9802]: WARN: Shutdown delayed until current resource activity finishes.
> Jul 29 22:22:16 swfs2 kernel: drbd1: Secondary/Primary --> Secondary/Secondary
> Jul 29 22:22:17 swfs2 kernel: drbd0: Secondary/Primary --> Secondary/Secondary
> Jul 29 22:22:17 swfs2 heartbeat: [9802]: info: Received shutdown notice from 'swfs1'.
> Jul 29 22:22:17 swfs2 heartbeat: [9802]: info: Resources being acquired from swfs1.
Dave:
This may be fixed by the patches I posted for a very similar problem
encountered by Ulrich Thomas. If you are running an app on the dying
machine that is pinging something like mad during the same time
interval, then these patches may be for you.
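
For anyone who wants to check the timing in the quoted logs, here is a
minimal sketch (not part of heartbeat) that measures how long each node
ran between the eth0 link going down and the HBREAD exit. The helper
names, the regular expression, and the year 2005 are assumptions made
for illustration; syslog lines carry no year, so one is borrowed from
the date of this thread.

```python
import re
from datetime import datetime

# Syslog timestamps carry no year; 2005 is an assumption taken from this thread.
ASSUMED_YEAR = 2005

# Optional "> " quote prefix, then "Mon DD HH:MM:SS host message".
LINE_RE = re.compile(r"^(?:>\s*)?(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) (.*)$")


def parse(line):
    """Return (timestamp, host, message), or None if the line is not syslog-like."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, host, message = m.groups()
    when = datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, message


def seconds_until_hbread_exit(lines):
    """Map each host to the seconds between its first NIC-down and first HBREAD exit."""
    link_down = {}
    delays = {}
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        when, host, message = parsed
        if "NIC Link is Down" in message:
            link_down.setdefault(host, when)
        elif "Exiting HBREAD process" in message and host in link_down:
            delays.setdefault(host, (when - link_down[host]).total_seconds())
    return delays


# Four lines copied from the quoted logs.
SAMPLE = [
    "> Jul 29 22:21:29 swfs1 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down",
    "> Jul 29 22:22:10 swfs1 heartbeat: [4118]: WARN: Exiting HBREAD process 4227 returned rc 10.",
    "> Jul 29 22:21:29 swfs2 kernel: e1000: eth0: e1000_watchdog: NIC Link is Down",
    "> Jul 29 22:22:12 swfs2 heartbeat: [9802]: WARN: Exiting HBREAD process 9812 returned rc 10.",
]

if __name__ == "__main__":
    for host, secs in seconds_until_hbread_exit(SAMPLE).items():
        print(f"{host}: {secs:.0f} s from link down to HBREAD exit")
```

On the four sample lines this gives 41 seconds for swfs1 and 43 seconds
for swfs2, which matches the "about 40 seconds" and "two seconds later"
in the report.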
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce