[Linux-HA] passive - passive problem
Alan Robertson
alanr at unix.sh
Fri Aug 5 22:41:02 MDT 2005
Pavol Gono wrote:
> Hi
>
> We did some additional tests with heartbeat and found another
> possible error. Two machines had been running for some hours while
> we triggered intentional failovers with hb_standby, hb_takeover and
> by unplugging cables.
>
> The last operations we wrote down (times from the m-machine):
>
> 17:37:24, m-machine, /usr/lib/heartbeat/hb_standby
> 17:46:17, m-machine, /usr/lib/heartbeat/hb_takeover
> 17:51:56, m-machine, unplugging cable from eth2 interface (ping
> node 172.30.128.1 unreachable)
> 17:52:55, m-machine, plugging cable to eth2 interface
> 17:53:42, m and s machines, unplugging cables from eth3
> interfaces (ping node 10.253.51.1 unreachable)
> !!! now both machines became passive (no virtual interfaces)
> 18:04:49, m and s machines, plugging cables to eth3 interfaces
>
> In this state both machines had all interfaces plugged, both
> ping nodes from both machines were pingable, but machines
> remained passive.
>
> From this time I have stored the almost untouched syslog messages,
> the HB configuration and the list of processes. If you also need the
> complete netstat and lsof output, I can provide it too.
>
> OS: Suse 9.3
> kernel: 2.6.11.4-21.7
> heartbeat: 1.2.3-3.1 (RPM from Suse 9.2)
>
> Next I tried running hb_standby and hb_takeover on both machines;
> these are the results:
>
> m-machine:
> Aug 4 19:00:13 Hig50v3Pm heartbeat: info: Running
> /etc/ha.d/rc.d/hb_takeover hb_takeover
> Aug 4 19:00:13 Hig50v3Pm heartbeat[6546]: WARN: standby
> message [me] from hig50v3ps ignored. Other side is in flux.
> Aug 4 19:03:17 Hig50v3Pm heartbeat[6546]: WARN: standby
> message [me] from hig50v3ps ignored. Other side is in flux.
> Aug 4 19:04:28 Hig50v3Pm heartbeat[6546]: WARN: standby
> message [me] from hig50v3pm ignored. Other side is in flux.
> Aug 4 19:04:28 Hig50v3Pm heartbeat[6546]: WARN: standby
> message [other] from hig50v3ps ignored. Other side is in flux.
>
> s-machine:
> Aug 4 18:52:57 Hig50v3Ps heartbeat[6525]: WARN: No reply to
> standby request. Standby request cancelled.
> Aug 4 18:59:24 Hig50v3Ps heartbeat[29889]: debug:
> notify_world: setting SIGCHLD Handler to SIG_DFL
> Aug 4 18:59:24 Hig50v3Ps heartbeat: info: Running
> /etc/ha.d/rc.d/hb_takeover hb_takeover
> Aug 4 18:59:24 Hig50v3Ps heartbeat: Going standby [all].
> Aug 4 18:59:24 Hig50v3Ps heartbeat[6525]: info: hig50v3ps
> wants to go standby [all]
> Aug 4 18:59:34 Hig50v3Ps heartbeat[6525]: WARN: No reply to
> standby request. Standby request cancelled.
> Aug 4 19:02:28 Hig50v3Ps heartbeat[6525]: info: hig50v3ps
> wants to go standby [all]
> Aug 4 19:02:38 Hig50v3Ps heartbeat[6525]: WARN: No reply to
> standby request. Standby request cancelled.
> Aug 4 19:03:38 Hig50v3Ps heartbeat[6525]: info: hig50v3pm
> wants to go standby [all]
>
> The last thing I did was unplug the crossover cable (used for
> heartbeats) and plug it back in. After this, one machine remained
> active and everything worked OK again.
>
> The configuration is almost the same as in the previous case (ipfail
> did not trigger failover - Wed 08/03), but now we have fixed the
> resource scripts so that they never block.
Your resource scripts must not return success until they have actually
finished.
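To illustrate the point about not reporting success early, here is a minimal sketch of a "start" action that does not block forever but also does not claim success before the resource is really up. It is only an assumption about how such a script might be structured: heartbeat resource scripts are normally init-style shell scripts, and the names start_and_wait, start and is_running are hypothetical, not part of heartbeat.

```python
import time
from typing import Callable


def start_and_wait(start: Callable[[], None],
                   is_running: Callable[[], bool],
                   timeout: float = 30.0,
                   interval: float = 0.5,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> int:
    """Hypothetical 'start' action for a resource script.

    Launches the resource, then polls until is_running() confirms it
    is up. Returns 0 (success) only after that confirmation, and
    1 (failure) if the timeout expires first, so the call never
    blocks indefinitely and never reports success prematurely.
    """
    start()
    deadline = clock() + timeout
    while True:
        if is_running():
            return 0          # resource verified as running
        if clock() >= deadline:
            return 1          # gave up: report failure, not success
        sleep(interval)
```

A script built this way would exit with the returned code, so the cluster manager sees a failure if the resource did not come up within the timeout rather than a success it cannot rely on.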
You could have run into a bug in 1.2.3. I seem to remember a bug like
that. It may be fixed in CVS. But, I'm not 100% sure. The 2.0.0
release kind of took away all my attention from 1.2.x for a VERY long time.
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce