Failover no longer works after replacing a node
Holger.Kiehl at dwd.de
Wed Apr 4 05:06:19 MDT 2001
First of all, the problem is now solved: I can again switch over as often
as I want and everything works as expected. All I did was reboot node
diagnostix, and since then everything has been working. Why, I don't know.
See below for a more detailed description of what I did.
On Tue, 3 Apr 2001, Alan Robertson wrote:
> > As one can see I had to stop heartbeat on diagnostix so talentix does take
> > over again.
> > Tried to reproduce this on my cluster at home but there it works. Any idea
> > what I am doing wrong?
> I don't know. I'm not 100% sure what you did.
> Here's what I think you did...
> Shut down diagnostix
> Restarted diagnostix
diagnostix was never shut down; only heartbeat was stopped and then started
again. In fact diagnostix had an uptime of 351 days.
> Shut down talentix
> Restarted talentix
> I'm not sure what "sit there and do nothing" means.
> If you didn't wait long enough (at least deadtime seconds) between the shut
> down and restart of a node, this will cause great confusion. I don't have
> enough logs to see if that's what you did...
Here is a more detailed description of what I did.
At first I had two nodes, diagnostix and praktifix, where diagnostix was always
the primary node, running some very old heartbeat version (I think it was
0.4.6). I then got a new, much more powerful computer, talentix, to replace
praktifix. Praktifix was taken off the net completely and is turned off.
Now talentix had version 0.4.8 of heartbeat and I wanted it to be the master.
Heartbeat was stopped on diagnostix and then started on talentix. Here I ran
into the first difficulty: heartbeat refused to start the services and the
virtual IP address on talentix. Only after putting diagnostix in the
haresources file would the services start on talentix. This seemed strange at
first, but since I had decided to use nice failback it did not matter. Then
heartbeat was upgraded to 0.4.8 on diagnostix and started. Everything was
fine so far. But when heartbeat is stopped on talentix, diagnostix does not
take over. I waited several minutes but nothing happened. If diagnostix has
the services and I stop heartbeat on it, talentix takes over the services, but
never the other way around. I have tried this with 0.4.8, 0.4.8i and 0.4.9,
always with the same behaviour.
After reading Alan's mail about shutting down diagnostix, I decided to
reboot it, and since then failover works. I am absolutely sure that all
heartbeat processes were always gone when heartbeat was stopped. With ifconfig
I always checked that the virtual IP address was gone. So I am a bit puzzled.
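A minimal sketch of the check described above, for anyone who wants to repeat
it after stopping heartbeat on a node. The address is a placeholder (the real
virtual IP is not given here), and the helper names has_heartbeat and
has_address are made up for illustration:

```shell
#!/bin/sh
# Placeholder address (assumption); use the virtual IP from haresources.
VIP="${VIP:-192.0.2.10}"

# True if the text on stdin mentions a heartbeat process.
has_heartbeat() { grep -q 'heartbeat'; }

# True if the text on stdin contains the given address as a whole word.
has_address() { grep -qwF "$1"; }

# "ps -e" lists command names only, so grep does not match itself.
if ps -e 2>/dev/null | has_heartbeat; then
    echo "heartbeat processes are still running"
else
    echo "no heartbeat processes found"
fi

if ifconfig -a 2>/dev/null | has_address "$VIP"; then
    echo "virtual IP $VIP is still configured"
else
    echo "virtual IP $VIP is gone"
fi
```

Both messages should report a clean state before heartbeat is started again
on the other node.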
But please do not put too much effort into trying to solve the problem. It
would be nice to know what went wrong, but since I did most of this a few
months back I am not sure I remember everything correctly. The ha-log is
available and I can send it if needed.