[Linux-ha-dev] Heartbeat 0.45 experiences

Alan Robertson alanr@bell-labs.com
Mon, 18 Oct 1999 18:43:18 -0600


This is not in the archives, so I'm resending...

Hi Steve,

Thanks for the detailed report.

Steve Beattie wrote:
> 
> Hi Alan, ha-developers,
> 
> I toyed around with 0.45 release last week, and thought I would report
> my results. Mostly, I am very happy with it; however, I did run into a
> few snags of varying seriousness.
> 
> 1. "/etc/rc.d/init.d/heartbeat restart" only stops heartbeat, it does
>    not actually restart it.

There are lock problems which can occur if the start begins before the stop
has completed.  I suspect that is the trouble.  Try putting a "sleep 2"
between the StopHA and StartHA calls (right around line 257) to see if that
fixes the problem.  There should be a message about locking problems, but it
might get lost in this circumstance.
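
As a rough, untested sketch of that change: the StopHA and StartHA bodies
below are placeholders rather than the real init script code, and RestartHA
and RESTART_DELAY are names made up for illustration.

```shell
#!/bin/sh
# Placeholder stand-ins for the real StopHA/StartHA functions in the
# init script; the actual bodies are not reproduced here.
StopHA()  { echo "stopping heartbeat"; }
StartHA() { echo "starting heartbeat"; }

# Hypothetical wrapper: pause between stop and start so the old process
# has time to exit and release its lock before the new one starts.
RestartHA() {
    StopHA
    sleep "${RESTART_DELAY:-2}"   # defaults to the suggested 2 seconds
    StartHA
}
```

The "restart" case of the script would then call RestartHA instead of calling
StopHA and StartHA back to back.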

> 2. If heartbeat has failed over to the backup machine, and then the
>    heartbeat on the backup machine is cleanly stopped, it keeps the
>    resource even though it claims to have relinquished it (i.e. it
>    still has the IP address it took over from the original host).

This was just reported to me separately.  That earlier instance was due to
strange behavior of the apache script when given a "status" argument.  There
are other possibilities to check out as well.  One is that certain buggy
versions of ifconfig don't show aliases when you just run "ifconfig".  If
ifconfig doesn't report the aliases, heartbeat doesn't know to take them
down.  Logs and config files from the backup machine would help.

Heartbeat relies on the "status" argument to tell whether it holds a
particular resource, and therefore whether it needs to give it up.
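
To illustrate that dependency, here is a minimal sketch of a status-driven
release.  The helper name release_if_held, and the assumption that a held
resource prints the word "running" for "status", are illustrative only and
are not taken from the heartbeat code.

```shell
#!/bin/sh
# Hypothetical helper: give up a resource only if its script says it is held.
# $1 is a resource script (or command) that accepts "status" and "stop".
release_if_held() {
    script=$1
    out=$("$script" status 2>/dev/null)
    case "$out" in
        # Check the negative forms first, since "not running" also
        # contains the word "running".
        *"not running"*|*stopped*) return 0 ;;
        *running*) "$script" stop ;;
    esac
}
```

A script whose "status" prints nothing recognizable makes this logic skip the
"stop" entirely, so the resource stays where it is.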

> 3. One of my test machines is a laptop with a PCMCIA ethernet card.
>    When I yanked the card out, heartbeat failed over to the other
>    machine just fine, but when I put my NIC back in, the alias
>    interface was not recreated. Heartbeat was running on my laptop
>    the entire time, and was attempting to send out heartbeats on the
>    interface that no longer existed.
> 
>    While I can see that laptops are unlikely HA hardware, I can
>    foresee using PCMCIA cards as hot-swappable devices. Something to
>    think about, though I can understand a response of "not our problem."

I highly recommend that you consider redundant heartbeat media, for this and
other reasons.  The current code does not handle this case correctly.
Pulling the card has roughly the same effect as pulling the cable.  This is
our problem, and it's on the TODO list on the web, but we can't handle it the
way you expect right now.

> 4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
>    corrupts a few files on its filesystem. Good is the primary, bad is
>    the backup. On startup, good does not successfully grab the resource.
>    However, killing the heartbeat on good causes bad to successfully
>    take over. Restarting the heartbeat on good causes bad to relinquish,
>    but again good unsuccessfully attempts to take the resource.
> 
>    Here's the typical sort of log on good:
> 
>    heartbeat: 1999/10/14_15:24:53 info: ***********************
>    heartbeat: 1999/10/14_15:24:53 info: Configuration validated.  Starting heartbeat.
>    heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
>    heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
>    heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
>    heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status
> 
>    and then nothing.
> 
>    Occasionally, I would see something like this in good's logs when bad
>    would start up:
> 
>    heartbeat: 1999/10/14_14:32:04 info: ***********************
>    heartbeat: 1999/10/14_14:32:04 info: Configuration validated.  Starting heartbeat.
>    heartbeat: 1999/10/14_14:32:04 notice: UDP heartbeat started on port 1001 interface eth0
>    heartbeat: 1999/10/14_14:32:04 error: Cannot open /proc/ha/.control: No such file or directory
>    heartbeat: 1999/10/14_14:32:14 error: string2msg: no MSG_START
>    heartbeat: 1999/10/14_14:32:14 error: Bad message is: [9^FÞÍ]
> 
>    Nothing of interest was found in the debug log.
> 
>    All of this went away as soon as I stopped using bad, and started
>    using a third machine with good.
> 
>    In this instance, the software failed me. It was unable to detect
>    that one of my machines was mildly insane; furthermore, the way the
>    problems manifested themselves, I was starting to believe the problem
>    was with the machine good, until I looked in bad's /var/log/messages
>    and saw disk errors.
> 
>    Also note that I am using md5 authentication, and good did not complain
>    about bad's packets failing authentication (these would have shown up
>    in the debug log). Which begs me to ask: what is the security model
>    behind the authentication scheme? What sort of threats are you
>    attempting to prevent by using it?

If you crank up the debug level (with several SIGUSR1 signals), then you can
see what's going on here.  I seem to recall that I log packets with invalid
auth information.
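
A small helper for sending the signal repeatedly might look like the
following.  The name raise_debug, the default count of 3, and the
DEBUG_SIGNAL_DELAY pause are all assumptions.  The pause is there because
pending instances of the same standard signal are not queued, so signals sent
back to back can be merged into one.

```shell
#!/bin/sh
# Hypothetical helper: send SIGUSR1 to a process several times.
# $1 is the target pid, $2 is how many signals to send (default 3).
raise_debug() {
    pid=$1
    count=${2:-3}
    i=0
    while [ "$i" -lt "$count" ]; do
        kill -USR1 "$pid" || return 1      # stop if the process is gone
        i=$((i + 1))
        sleep "${DEBUG_SIGNAL_DELAY:-1}"   # give the target time to handle it
    done
}
```

Call it with the pid of the running heartbeat process, for example
"raise_debug 1234 3", where 1234 is only a placeholder pid.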
 
>    If necessary, I can recreate the situation with bad, though I don't
>    have a lot of time that I can allocate to it.
> 
> 5. Oh yeah, the proc module does not compile under 2.0.36, which is what
>    all my machines in my testbed are running.

I mangled Volker's code, and it won't compile correctly on old kernels.  I
don't think the fix is complicated, but I don't have any way to test it
here.

> Hope this is of use to you. Let me know if I can provide you with more
> information. Thanks.
> 
> Steve

I very much appreciate the report.  Let me know what you find out.

 
	-- Alan Robertson
	   alanr@bell-labs.com