Failover no longer works after replacing a node
Holger Kiehl
Holger.Kiehl@dwd.de
Mon, 2 Apr 2001 13:24:01 +0200 (CEST)
Hello,
I have been running heartbeat for a very long time and it has worked very nicely.
However, I recently got a new and better computer and replaced the secondary
node with it. The old secondary node is no longer on the net and is turned off.
The new computer is configured exactly like the one it replaced, except that
it runs a newer distribution; the heartbeat versions are the same.
My plan was to make the new node the primary node, since it is much faster.
The first step still worked: I shut down heartbeat on the former primary node
(and started it again), and the new node took over. However, when I want to
switch back, this no longer works. nice_failback is set, and I tried
heartbeat 0.4.8, 0.4.8i and 0.4.9, always with the same result.
Here are my ha.cf config files. Node diagnostix is the old primary node and
talentix is the new node.
diagnostix:
debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 2
deadtime 20
initdead 40
serial /dev/ttyS1
baud 19200
udpport 1001
udp eth0
nice_failback on
node diagnostix
node talentix
talentix:
debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 2
deadtime 20
initdead 40
serial /dev/ttyS0
baud 19200
udpport 1001
udp eth0
nice_failback on
node talentix
node diagnostix
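For anyone who wants to compare the two ha.cf files mechanically, here is a minimal sketch in Python. It is not part of heartbeat; the helper names parse_ha_cf and diff_ha_cf are made up for illustration, and it assumes one "directive value" pair per line with # starting a comment. The order of the node lines is deliberately ignored.

```python
from collections import Counter

# The two ha.cf files exactly as quoted in this post.
DIAGNOSTIX_HA_CF = """\
debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 2
deadtime 20
initdead 40
serial /dev/ttyS1
baud 19200
udpport 1001
udp eth0
nice_failback on
node diagnostix
node talentix
"""

TALENTIX_HA_CF = """\
debugfile /var/log/ha-debug
logfile /var/log/ha-log
keepalive 2
deadtime 20
initdead 40
serial /dev/ttyS0
baud 19200
udpport 1001
udp eth0
nice_failback on
node talentix
node diagnostix
"""


def parse_ha_cf(text):
    """Return a Counter of (directive, value) pairs (hypothetical helper).

    Assumption: one directive per line, '#' starts a comment.
    """
    entries = Counter()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        directive, _, value = line.partition(" ")
        entries[(directive, value.strip())] += 1
    return entries


def diff_ha_cf(a, b):
    """Return the entries found only in a and only in b, ignoring line order."""
    pa, pb = parse_ha_cf(a), parse_ha_cf(b)
    return sorted((pa - pb).elements()), sorted((pb - pa).elements())


only_diag, only_tal = diff_ha_cf(DIAGNOSTIX_HA_CF, TALENTIX_HA_CF)
print("only on diagnostix:", only_diag)
print("only on talentix:  ", only_tal)
```

With the files above, the only difference this reports is the serial device (/dev/ttyS1 on diagnostix, /dev/ttyS0 on talentix).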
haresources is the same on both nodes:
diagnostix 141.38.42.140
Here is what ha-log on talentix had to say when heartbeat is shut down:
12:24:37 info: Heartbeat shutdown in progress.
12:24:37 ERROR: controlfifo2msg: cannot create message
12:24:37 ERROR: control_process: NULL message
12:24:37 info: Giving up all HA resources.
12:24:41 info: Releasing resource group: diagnostix 141.38.42.140
12:24:42 info: Running /etc/ha.d/resource.d/IPaddr 141.38.42.140 stop
12:24:47 info: IP Address 141.38.42.140 released
12:24:48 info: All HA resources relinquished.
12:24:49 info: Heartbeat shutdown complete.
If I remember correctly, Alan once said those two error messages are harmless.
And here is the ha-log on diagnostix at that time:
12:24:46 info: Running /etc/ha.d/rc.d/shutdone shutdone
12:26:00 ERROR: TTY write timeout on [/dev/ttyS1] (no connection?)
diagnostix will just sit there and do nothing. If heartbeat is started again
on talentix, the following messages show up in the ha-log on talentix:
12:26:08 info: Configuration validated. Starting heartbeat 0.4.9
12:26:08 info: nice_failback is in effect.
12:26:08 info: heartbeat: version 0.4.9
12:26:08 info: Heartbeat generation: 12
12:26:08 notice: Starting serial heartbeat on tty /dev/ttyS0
12:26:08 notice: UDP heartbeat started on port 1001 interface eth0
12:26:08 info: Local status now set to: 'up'
12:26:09 info: Heartbeat restart on node talentix
12:26:09 info: Link talentix:eth0 up.
12:26:09 info: Local status now set to: 'active'
12:26:09 info: Heartbeat restart on node diagnostix
12:26:09 info: Link diagnostix:/dev/ttyS0 up.
12:26:09 info: Node diagnostix: status active
12:26:09 info: Node talentix: status up
12:26:09 info: Node talentix: status active
12:26:09 info: Link diagnostix:eth0 up.
12:26:09 info: remote resource transition completed.
12:26:09 info: local resource transition completed.
12:26:09 ERROR: No one owns foreign resources!
12:26:09 ERROR: No one owns foreign resources!
12:26:09 info: Running /etc/ha.d/rc.d/ifstat ifstat
12:26:09 info: Running /etc/ha.d/rc.d/ifstat ifstat
12:26:09 info: Running /etc/ha.d/rc.d/ifstat ifstat
12:26:09 info: Running /etc/ha.d/rc.d/status status
12:26:09 info: Running /etc/ha.d/rc.d/status status
12:26:09 ERROR: No one owns foreign resources!
12:26:09 info: Running /etc/ha.d/rc.d/status status
12:26:09 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys talentix]
12:26:09 ERROR: No one owns foreign resources!
12:26:09 info: Resource acquisition completed.
12:26:11 ERROR: No one owns foreign resources!
12:26:11 ERROR: No one owns foreign resources!
12:26:13 ERROR: No one owns foreign resources!
12:26:13 ERROR: No one owns foreign resources!
12:26:15 ERROR: No one owns foreign resources!
12:26:15 ERROR: No one owns foreign resources!
12:26:17 ERROR: No one owns foreign resources!
12:26:17 ERROR: No one owns foreign resources!
12:26:19 ERROR: No one owns foreign resources!
12:26:19 ERROR: No one owns foreign resources!
12:26:20 info: Running /etc/ha.d/rc.d/shutdone shutdone
As one can see, I had to stop heartbeat on diagnostix so that talentix would
take over again.
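To make a long ha-log like the one above easier to read, repeated messages can be collapsed into counts. The following is only a sketch in Python, not a heartbeat tool: the summarize helper is hypothetical, and it assumes each line has the "HH:MM:SS level: message" shape shown in this post (a real ha-log may carry a longer prefix). The excerpt is a shortened copy of the talentix log above.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpts in this post: "12:26:09 ERROR: text"
LOG_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) (\w+): (.*)$")


def summarize(log_text):
    """Count identical (level, message) pairs and record when each first appeared."""
    counts = Counter()
    first_seen = {}
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed shape
        timestamp, level, message = match.groups()
        key = (level, message)
        counts[key] += 1
        first_seen.setdefault(key, timestamp)
    return counts, first_seen


# Shortened excerpt of the talentix ha-log quoted above.
EXCERPT = """\
12:26:09 info: remote resource transition completed.
12:26:09 info: local resource transition completed.
12:26:09 ERROR: No one owns foreign resources!
12:26:09 ERROR: No one owns foreign resources!
12:26:11 ERROR: No one owns foreign resources!
12:26:20 info: Running /etc/ha.d/rc.d/shutdone shutdone
"""

counts, first_seen = summarize(EXCERPT)
for (level, message), n in counts.most_common():
    if level == "ERROR":
        print(f"{n}x since {first_seen[(level, message)]}: {message}")
```

Run against the full log, this would show how often "No one owns foreign resources!" is repeated and when it first appeared.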
I tried to reproduce this on my cluster at home, but there it works. Any idea
what I am doing wrong?
Thanks,
Holger