Failover no longer works after replacing a node

Holger Kiehl Holger.Kiehl@dwd.de
Mon, 2 Apr 2001 13:24:01 +0200 (CEST)


Hello

I have been running heartbeat for a very long time and it has worked very nicely.
However, I recently got a new and better computer and replaced the secondary
node with it. The old secondary node is no longer on the net and is turned off.
The new computer is configured exactly like the one it replaced, except that
it has a newer distribution on it; the heartbeat versions are the same.
My plan was to make the new secondary node the primary node, since it is much
faster. This part still worked: I shut down heartbeat on the former primary
node (and started it again) and the new node took over. However, when I want
to switch back, this no longer works. nice_failback is set, and I tried
heartbeat 0.4.8, 0.4.8i and 0.4.9, always with the same result.

Here are my ha.cf config files. Node diagnostix is the old primary node and
talentix is the new node.

diagnostix:
      debugfile /var/log/ha-debug
      logfile /var/log/ha-log
      keepalive 2
      deadtime 20
      initdead 40
      serial  /dev/ttyS1
      baud    19200
      udpport 1001
      udp     eth0
      nice_failback on
      node    diagnostix
      node    talentix

talentix:
      debugfile /var/log/ha-debug
      logfile /var/log/ha-log
      keepalive 2
      deadtime 20
      initdead 40
      serial  /dev/ttyS0
      baud    19200
      udpport 1001
      udp     eth0
      nice_failback on
      node    talentix
      node    diagnostix
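
Since the two ha.cf files are supposed to match, it may help to compare them mechanically rather than by eye. What follows is only a minimal sketch in Python, not part of heartbeat: the helper names parse_ha_cf and compare are made up for illustration, the file contents are pasted in as strings from the listings above, and treating '#' as the start of a comment is an assumption about the file format.

```python
from collections import defaultdict

# The two ha.cf files as quoted above (indentation removed).
DIAGNOSTIX = """
debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 2
deadtime 20
initdead 40
serial  /dev/ttyS1
baud    19200
udpport 1001
udp     eth0
nice_failback on
node    diagnostix
node    talentix
"""

TALENTIX = """
debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 2
deadtime 20
initdead 40
serial  /dev/ttyS0
baud    19200
udpport 1001
udp     eth0
nice_failback on
node    talentix
node    diagnostix
"""


def parse_ha_cf(text):
    """Map each directive to the list of its values, in file order."""
    directives = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        parts = line.split(None, 1)  # directive, then the rest of the line
        directives[parts[0]].append(parts[1] if len(parts) > 1 else "")
    return dict(directives)


def compare(a, b):
    """Return (value_diffs, order_only) for two parsed configs."""
    value_diffs, order_only = {}, []
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, []), b.get(key, [])
        if sorted(va) != sorted(vb):
            value_diffs[key] = (va, vb)   # the values themselves differ
        elif va != vb:
            order_only.append(key)        # same values, listed in another order
    return value_diffs, order_only


value_diffs, order_only = compare(parse_ha_cf(DIAGNOSTIX), parse_ha_cf(TALENTIX))
print("different values:", value_diffs)
print("same values, different order:", order_only)
```

For the two files above this reports the serial device as the only differing value and the node lines as differing only in order. Whether either of these has anything to do with the failback problem is not something the sketch can tell.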


haresources is the same on both nodes:
      diagnostix	141.38.42.140

Here is what ha-log on talentix had to say when heartbeat is shut down there:
      12:24:37 info: Heartbeat shutdown in progress.
      12:24:37 ERROR: controlfifo2msg: cannot create message
      12:24:37 ERROR: control_process: NULL message
      12:24:37 info: Giving up all HA resources.
      12:24:41 info: Releasing resource group: diagnostix 141.38.42.140
      12:24:42 info: Running /etc/ha.d/resource.d/IPaddr 141.38.42.140 stop
      12:24:47 info: IP Address 141.38.42.140 released
      12:24:48 info: All HA resources relinquished.
      12:24:49 info: Heartbeat shutdown complete.

If I remember correctly, Alan once said those two error messages are harmless.

And here is the diagnostix ha-log at that time:
      12:24:46 info: Running /etc/ha.d/rc.d/shutdone shutdone
      12:26:00 ERROR: TTY write timeout on [/dev/ttyS1] (no connection?)

diagnostix will just sit there and do nothing. If heartbeat is started again
on talentix, the following messages pop up in the talentix ha-log:
      12:26:08 info: Configuration validated. Starting heartbeat 0.4.9
      12:26:08 info: nice_failback is in effect.
      12:26:08 info: heartbeat: version 0.4.9
      12:26:08 info: Heartbeat generation: 12
      12:26:08 notice: Starting serial heartbeat on tty /dev/ttyS0
      12:26:08 notice: UDP heartbeat started on port 1001 interface eth0
      12:26:08 info: Local status now set to: 'up'
      12:26:09 info: Heartbeat restart on node talentix
      12:26:09 info: Link talentix:eth0 up.
      12:26:09 info: Local status now set to: 'active'
      12:26:09 info: Heartbeat restart on node diagnostix
      12:26:09 info: Link diagnostix:/dev/ttyS0 up.
      12:26:09 info: Node diagnostix: status active
      12:26:09 info: Node talentix: status up
      12:26:09 info: Node talentix: status active
      12:26:09 info: Link diagnostix:eth0 up.
      12:26:09 info: remote resource transition completed.
      12:26:09 info: local resource transition completed.
      12:26:09 ERROR: No one owns foreign resources!
      12:26:09 ERROR: No one owns foreign resources!
      12:26:09 info: Running /etc/ha.d/rc.d/ifstat ifstat
      12:26:09 info: Running /etc/ha.d/rc.d/ifstat ifstat
      12:26:09 info: Running /etc/ha.d/rc.d/ifstat ifstat
      12:26:09 info: Running /etc/ha.d/rc.d/status status
      12:26:09 info: Running /etc/ha.d/rc.d/status status
      12:26:09 ERROR: No one owns foreign resources!
      12:26:09 info: Running /etc/ha.d/rc.d/status status
      12:26:09 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys talentix]
      12:26:09 ERROR: No one owns foreign resources!
      12:26:09 info: Resource acquisition completed.
      12:26:11 ERROR: No one owns foreign resources!
      12:26:11 ERROR: No one owns foreign resources!
      12:26:13 ERROR: No one owns foreign resources!
      12:26:13 ERROR: No one owns foreign resources!
      12:26:15 ERROR: No one owns foreign resources!
      12:26:15 ERROR: No one owns foreign resources!
      12:26:17 ERROR: No one owns foreign resources!
      12:26:17 ERROR: No one owns foreign resources!
      12:26:19 ERROR: No one owns foreign resources!
      12:26:19 ERROR: No one owns foreign resources!
      12:26:20 info: Running /etc/ha.d/rc.d/shutdone shutdone

As one can see, I had to stop heartbeat on diagnostix so that talentix would
take over again.
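
To get a quick picture of how often an error repeats in ha-log, the lines can be tallied per message. This is only a rough sketch in Python: the function name summarize_errors is hypothetical, the "HH:MM:SS level: message" layout is assumed from the excerpts quoted above, and the sample input is a short excerpt, not the full log.

```python
import re
from collections import Counter

# A short excerpt of the talentix ha-log quoted above.
LOG_EXCERPT = """\
12:26:09 info: remote resource transition completed.
12:26:09 info: local resource transition completed.
12:26:09 ERROR: No one owns foreign resources!
12:26:09 ERROR: No one owns foreign resources!
12:26:11 ERROR: No one owns foreign resources!
12:26:20 info: Running /etc/ha.d/rc.d/shutdone shutdone
"""

# Assumed line layout: timestamp, level, colon, message.
LINE_RE = re.compile(r"^\s*(\d\d:\d\d:\d\d) (\w+): (.*)$")


def summarize_errors(text):
    """Return {message: (count, first_timestamp, last_timestamp)} for ERROR lines."""
    counts = Counter()
    first_seen, last_seen = {}, {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that does not look like a log line
        stamp, level, message = match.groups()
        if level != "ERROR":
            continue
        counts[message] += 1
        first_seen.setdefault(message, stamp)  # keep the earliest timestamp
        last_seen[message] = stamp             # overwrite with the latest
    return {msg: (n, first_seen[msg], last_seen[msg]) for msg, n in counts.items()}


summary = summarize_errors(LOG_EXCERPT)
for message, (count, first, last) in summary.items():
    print(f"{count:3d}x {first}..{last} {message}")
```

Run against the complete log instead of the excerpt, this would show the time window in which the "No one owns foreign resources!" message keeps repeating.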

I tried to reproduce this on my cluster at home, but there it works. Any idea
what I am doing wrong?

Thanks,
Holger