[Linux-HA] passive - passive problem

Alan Robertson alanr at unix.sh
Fri Aug 5 22:41:02 MDT 2005


Pavol Gono wrote:
> Hi
> 
> We did some additional tests with heartbeat and found another
> possible error. Two machines had been running for some hours,
> and we made intentional failovers with hb_standby, hb_takeover
> and by unplugging cables.
> 
> The last operations we wrote down (times from m-machine):
> 
> 17:37:24, m-machine, /usr/lib/heartbeat/hb_standby
> 17:46:17, m-machine, /usr/lib/heartbeat/hb_takeover
> 17:51:56, m-machine, unplugging cable from eth2 interface (ping
> node 172.30.128.1 unreachable)
> 17:52:55, m-machine, plugging cable to eth2 interface
> 17:53:42, m and s machines, unplugging cables from eth3
> interfaces (ping node 10.253.51.1 unreachable)
> !!! now both machines became passive (no virtual interfaces)
> 18:04:49, m and s machines, plugging cables to eth3 interfaces 
> 
> In this state both machines had all interfaces plugged in and
> both ping nodes were pingable from both machines, but the
> machines remained passive.
> 
> From this point I saved the nearly untouched syslog messages,
> the HB configuration, and the list of processes. If you also
> need complete netstat and lsof output, I can provide that too.
> 
> OS: Suse 9.3
> kernel: 2.6.11.4-21.7
> heartbeat: 1.2.3-3.1 (RPM from Suse 9.2)
> 
> Next I tried running hb_standby and hb_takeover on both
> machines; these are the results:
> 
> m-machine:
> Aug  4 19:00:13 Hig50v3Pm heartbeat: info: Running
> /etc/ha.d/rc.d/hb_takeover hb_takeover
> Aug  4 19:00:13 Hig50v3Pm heartbeat[6546]: WARN: standby
> message [me] from hig50v3ps ignored.  Other side is in flux.
> Aug  4 19:03:17 Hig50v3Pm heartbeat[6546]: WARN: standby
> message [me] from hig50v3ps ignored.  Other side is in flux.
> Aug  4 19:04:28 Hig50v3Pm heartbeat[6546]: WARN: standby
> message [me] from hig50v3pm ignored.  Other side is in flux.
> Aug  4 19:04:28 Hig50v3Pm heartbeat[6546]: WARN: standby
> message [other] from hig50v3ps ignored.  Other side is in flux.
> 
> s-machine:
> Aug  4 18:52:57 Hig50v3Ps heartbeat[6525]: WARN: No reply to
> standby request.  Standby request cancelled.
> Aug  4 18:59:24 Hig50v3Ps heartbeat[29889]: debug:
> notify_world: setting SIGCHLD Handler to SIG_DFL
> Aug  4 18:59:24 Hig50v3Ps heartbeat: info: Running
> /etc/ha.d/rc.d/hb_takeover hb_takeover
> Aug  4 18:59:24 Hig50v3Ps heartbeat: Going standby [all].
> Aug  4 18:59:24 Hig50v3Ps heartbeat[6525]: info: hig50v3ps
> wants to go standby [all]
> Aug  4 18:59:34 Hig50v3Ps heartbeat[6525]: WARN: No reply to
> standby request.  Standby request cancelled.
> Aug  4 19:02:28 Hig50v3Ps heartbeat[6525]: info: hig50v3ps
> wants to go standby [all]
> Aug  4 19:02:38 Hig50v3Ps heartbeat[6525]: WARN: No reply to
> standby request.  Standby request cancelled.
> Aug  4 19:03:38 Hig50v3Ps heartbeat[6525]: info: hig50v3pm
> wants to go standby [all]
> 
> The last thing I did was unplug the crossover cable (used for
> heartbeats) and plug it back in. After this one machine
> remained active and everything worked OK again.
> 
> The configuration is almost the same as in the previous case
> (ipfail did not trigger failover - Wed 08/03), but now we have
> fixed the resource scripts so that they never block.

Your resource scripts must not return success until they have actually finished.
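
As a rough sketch of what that means in practice (Python purely for illustration; `wait_until`, `start_resource`, and the `is_running` callback are hypothetical names, not part of heartbeat): launch the service, then poll until it is verifiably up or a deadline passes, and report success only in the first case. The script still never blocks forever, but it also never claims success early.

```python
import subprocess
import sys
import time


def wait_until(check, timeout=30.0, interval=0.5):
    """Poll check() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def start_resource(start_cmd, is_running, timeout=30.0, interval=0.5):
    """Return 0 only once the resource is confirmed up, otherwise 1.

    start_cmd  -- argument list for the (hypothetical) launcher command
    is_running -- callable returning True when the service is really up
    """
    # A non-zero exit from the launcher is already a failure.
    if subprocess.call(start_cmd) != 0:
        print("start command failed", file=sys.stderr)
        return 1
    # Bounded wait: do not report success until the service is verified.
    if not wait_until(is_running, timeout, interval):
        print("resource did not come up in time", file=sys.stderr)
        return 1
    return 0
```

The bounded wait is the point of the design: a timeout keeps the script from hanging, while the non-zero return on timeout tells the caller that the resource is not actually running.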

You could have run into a bug in 1.2.3.  I seem to remember a bug like 
that.  It may be fixed in CVS.  But, I'm not 100% sure.  The 2.0.0 
release kind of took away all my attention from 1.2.x for a VERY long time.


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce

