[Linux-ha-dev] Heartbeat 0.45 experiences

Alan Robertson alanr@bell-labs.com
Mon, 18 Oct 1999 16:09:48 -0600


Thomas Hepper wrote:
> 
> Hi,
> On Mon, Oct 18, 1999 at 01:28:18PM -0700, Steve Beattie wrote:
> >
> > 2. If heartbeat has failed over to the backup machine, and then the
> >    heartbeat on the backup machine is cleanly stopped, it keeps the
> >    resource even though it claims to have relinquished it (i.e. it
> >    still has the IP address it took over from the original host).
> 
> Same here. Still looking for some more debug output.

See my earlier messages on this subject, where I identified a SuSE bug.
Thomas, could you check to see if that's your situation?  These two
things have to be true:
	I have to be able to rely on "script-name status" to give "running"
	when it's up
	I have to be able to rely on ifconfig to give the alias names when
	they're configured
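
To make those two conditions concrete, here is a minimal Python sketch of
how one might test them by hand.  It is only an illustration, not how
heartbeat itself performs the checks.  The function names are hypothetical,
and it assumes that ifconfig prints each alias as its own stanza (for
example eth0:0) and that a status script prints text containing "running"
when the resource is up.

```python
import re

# Matches both "inet addr:10.0.0.5" (older ifconfig) and "inet 10.0.0.5".
INET_RE = re.compile(r"inet (?:addr:)?(\d+\.\d+\.\d+\.\d+)")


def alias_addresses(ifconfig_output):
    """Map alias interface names (e.g. "eth0:0") to their IPv4 address.

    Assumption: each interface stanza starts in column 0 with its name,
    and aliases are the names that contain a colon.
    """
    aliases = {}
    current = None
    for line in ifconfig_output.splitlines():
        if line and not line[0].isspace():
            # Start of a new stanza; newer ifconfig appends ":" to the name.
            current = line.split()[0].rstrip(":")
            continue
        match = INET_RE.search(line)
        if match and current and ":" in current:
            aliases.setdefault(current, match.group(1))
    return aliases


def reports_running(status_output):
    """True if the status text says "running" and not "not running".

    Assumption: the exact wording of a status script varies by
    distribution, so this is a deliberately simple text check.
    """
    text = status_output.lower()
    return "running" in text and "not running" not in text
```

Feeding the output of ifconfig into alias_addresses() on the node that
should hold the address, and the output of the status script into
reports_running(), shows quickly whether either condition fails on a
given distribution.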
> 
> >
> > 4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
> >    corrupts a few files on its filesystem. Good is the primary, bad is
> >    the backup. On startup, good does not successfully grab the resource.
> >    However, killing the heartbeat on good causes bad to successfully
> >    take over. Restarting the heartbeat on good causes bad to relinquish,
> >    but again good unsuccessfully attempts to take the resource.
> >
> >    Here's the typical sort of log on good:
> >
> >    heartbeat: 1999/10/14_15:24:53 info: ***********************
> >    heartbeat: 1999/10/14_15:24:53 info: Configuration validated.  Starting heartbeat.
> >    heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
> >    heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
> >    heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
> >    heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status
> >
> >    and then nothing.
> 
> Yup the same here ...
> 
> This happens here after a longer uptime. After starting both nodes,
> killing and restarting heartbeat on the master works as expected.
> The next day, the slave takes the resources when heartbeat is stopped on
> the master, but it does not release them after the master is restarted.
> The result then is that both nodes have the IP address.
> 
> So the question is: how do we debug this?

I have still never seen this one.  Thomas: Is this still happening to you in
0.4.5?

You could try cranking up the debug level on the slave just before restarting
the master.  I don't think I've seen your logs for this circumstance...

I still need to see detailed logs from both sides with config files.

What OS are you running?  Can I see your (non-auth) config files?


	-- Alan Robertson
	   alanr@bell-labs.com