[Linux-ha-dev] Heartbeat 0.45 experiences
Alan Robertson
alanr@bell-labs.com
Mon, 18 Oct 1999 16:09:48 -0600
Thomas Hepper wrote:
>
> Hi,
> On Mon, Oct 18, 1999 at 01:28:18PM -0700, Steve Beattie wrote:
> >
> > 2. If heartbeat has failed over to the backup machine, and then the
> > heartbeat on the backup machine is cleanly stopped, it keeps the
> > resource even though it claims to have relinquished it (i.e. it
> > still has the IP address it took over from the original host).
>
> Same here. Still looking for some more debug output
See my earlier messages on this subject. There's a SuSE bug I identified
earlier. Thomas, could you check to see if that's your situation? These two
things have to be true:
  1. I have to be able to rely on script-name status to give "running" when
     it's up.
  2. I have to be able to rely on ifconfig to give the alias names when
     they're configured.
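
To make the two conditions above concrete, here is a rough sketch of how one
might check them against captured command output. It is not part of heartbeat.
The function names and the regular expressions are assumptions made for
illustration, and they assume the classic ifconfig layout where the interface
name (e.g. eth0:0 for an alias) starts the line. Feed it the text you actually
get from "script-name status" and from ifconfig on your system.

```python
import re
from typing import List

# Assumption: alias interfaces appear as "name:number" (e.g. "eth0:0")
# at the very start of a line in the ifconfig output.
ALIAS_RE = re.compile(r"^([A-Za-z0-9]+:\d+)\b", re.MULTILINE)


def status_reports_running(status_output: str) -> bool:
    """Return True if the status text says 'running' and not 'not running'."""
    if re.search(r"\bnot\s+running\b", status_output, re.IGNORECASE):
        return False
    return re.search(r"\brunning\b", status_output, re.IGNORECASE) is not None


def configured_aliases(ifconfig_output: str) -> List[str]:
    """Return the alias interface names found in captured ifconfig output."""
    return ALIAS_RE.findall(ifconfig_output)
```

If status_reports_running() is False while the resource is up, or
configured_aliases() comes back empty while an alias is configured, then one
of the two conditions does not hold on that machine.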
>
> >
> > 4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
> > corrupts a few files on its filesystem. Good is the primary, bad is
> > the backup. On startup, good does not successfully grab the resource.
> > However, killing the heartbeat on good causes bad to successfully
> > take over. Restarting the heartbeat on good causes bad to relinquish,
> > but again good unsuccessfully attempts to take the resource.
> >
> > Here's the typical sort of log on good:
> >
> > heartbeat: 1999/10/14_15:24:53 info: ***********************
> > heartbeat: 1999/10/14_15:24:53 info: Configuration validated. Starting heartbeat.
> > heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
> > heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
> > heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
> > heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status
> >
> > and then nothing.
>
> Yup the same here ...
>
> This happens here after a longer uptime. After startup on both nodes,
> killing and restarting heartbeat on the master works as expected. The
> next day, the slave will take the resources when heartbeat is stopped on
> the master, but it will not release them after the master is started. The
> result then is that both nodes have the IP address.
>
> So the question is, how to debug this?
I have still never seen this one. Thomas: Is this still happening to you in
0.4.5?
You could try cranking up the debug level on the slave just before restarting
the master. I don't think I've seen your logs for this circumstance...
I still need to see detailed logs from both sides with config files.
What OS are you running? Can I see your (non-auth) config files?
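
When comparing logs from both sides, it can help to interleave them by
timestamp. The following is only a sketch: it assumes every line has the
"heartbeat: YYYY/MM/DD_HH:MM:SS level: message" shape seen in the log quoted
above, and the names parse_line and merge_logs are made up for this example.
Lines that don't match that shape are skipped.

```python
import re
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed line shape, taken from the quoted log excerpt:
#   heartbeat: 1999/10/14_15:24:53 info: Configuration validated. ...
LINE_RE = re.compile(
    r"^heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) (\w+): (.*)$"
)


def parse_line(line: str) -> Optional[Tuple[datetime, str, str]]:
    """Split one log line into (timestamp, lowercased level, message)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y/%m/%d_%H:%M:%S")
    return stamp, match.group(2).lower(), match.group(3)


def merge_logs(
    logs: Dict[str, Iterable[str]]
) -> List[Tuple[datetime, str, str, str]]:
    """Merge per-node logs into one list of (timestamp, node, level, message)."""
    entries = []
    for node, lines in logs.items():
        for line in lines:
            parsed = parse_line(line)
            if parsed is not None:
                stamp, level, message = parsed
                entries.append((stamp, node, level, message))
    # The sort is stable, so lines with equal timestamps keep their input order.
    entries.sort(key=lambda entry: entry[0])
    return entries
```

Call merge_logs() with a dict mapping each node name to the lines of its log
file. The last entry for each node then shows where that side went quiet.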
-- Alan Robertson
alanr@bell-labs.com