[Linux-ha-dev] Heartbeat 0.45 experiences

Steve Beattie steve@wirex.net
Mon, 18 Oct 1999 13:28:18 -0700 (PDT)


Hi Alan, ha-developers,

I toyed around with the 0.45 release last week, and thought I would report
my results. Mostly, I am very happy with it; however, I did run into a
few snags of varying seriousness.

1. "/etc/rc.d/init.d/heartbeat restart" only stops heartbeat; it does 
   not actually start it again.

2. If heartbeat has failed over to the backup machine, and heartbeat 
   on the backup machine is then cleanly stopped, the backup keeps the 
   resource even though it claims to have relinquished it (i.e., it 
   still holds the IP address it took over from the original host).

3. One of my test machines is a laptop with a PCMCIA ethernet card. 
   When I yanked the card out, heartbeat failed over to the other 
   machine just fine, but when I put my NIC back in, the alias 
   interface was not recreated. Heartbeat was running on my laptop 
   the entire time, and was attempting to send out heartbeats on the
   interface that no longer existed.  

   While I can see that laptops are unlikely HA hardware, I can 
   foresee using PCMCIA cards as hot-swappable devices. Something to 
   think about, though I can understand a response of "not our problem."

4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
   corrupts a few files on its filesystem. Good is the primary, bad is
   the backup. On startup, good does not successfully grab the resource.
   However, killing the heartbeat on good causes bad to successfully
   take over. Restarting the heartbeat on good causes bad to relinquish,
   but good again fails to take the resource. 

   Here's the typical sort of log on good:

   heartbeat: 1999/10/14_15:24:53 info: ***********************
   heartbeat: 1999/10/14_15:24:53 info: Configuration validated.  Starting heartbeat.
   heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
   heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
   heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
   heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status

   and then nothing. 

   Occasionally, I would see something like this in good's logs when bad
   would start up:

   heartbeat: 1999/10/14_14:32:04 info: ***********************
   heartbeat: 1999/10/14_14:32:04 info: Configuration validated.  Starting heartbeat.
   heartbeat: 1999/10/14_14:32:04 notice: UDP heartbeat started on port 1001 interface eth0
   heartbeat: 1999/10/14_14:32:04 error: Cannot open /proc/ha/.control: No such file or directory
   heartbeat: 1999/10/14_14:32:14 error: string2msg: no MSG_START
   heartbeat: 1999/10/14_14:32:14 error: Bad message is: [9^FÞÍ]

   Nothing of interest was found in the debug log.

   All of this went away as soon as I stopped using bad, and started
   using a third machine with good. 

   In this instance, the software failed me. It was unable to detect
   that one of my machines was mildly insane; furthermore, the way the
   problems manifested led me to believe the fault lay with the machine
   good, until I looked in bad's /var/log/messages and saw disk errors. 

   Also note that I am using md5 authentication, and good did not complain
   about bad's packets failing authentication (these would have shown up
   in the debug log). That leads me to ask: what is the security model
   behind the authentication scheme? What sort of threats are you
   attempting to prevent by using it?

   If necessary, I can recreate the situation with bad, though I don't
   have a lot of time that I can allocate to it.

5. Oh yeah, the proc module does not compile under 2.0.36, which is what
   all the machines in my testbed are running.
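
To make the authentication question in item 4 concrete, here is a minimal
sketch, in Python rather than heartbeat's C, of the behaviour I was
expecting: a packet that lacks the start marker is rejected as badly
framed, and a well-framed packet whose keyed MD5 digest does not match is
reported as an authentication failure. Everything in it is an assumption
made for illustration: the marker strings, the "auth=" line, the function
names sign() and check(), and the use of HMAC-MD5. None of it is taken
from the heartbeat source or its real wire format.

```python
import hashlib
import hmac

# Hypothetical framing markers and field name, chosen for illustration only;
# they are NOT taken from heartbeat's source or wire format.
MSG_START = b">>>\n"
MSG_END = b"<<<\n"
AUTH_PREFIX = b"auth="


def sign(payload: bytes, key: bytes) -> bytes:
    """Frame the payload and append an HMAC-MD5 digest line (assumed scheme)."""
    digest = hmac.new(key, payload, hashlib.md5).hexdigest().encode("ascii")
    return MSG_START + payload + AUTH_PREFIX + digest + b"\n" + MSG_END


def check(packet: bytes, key: bytes) -> str:
    """Classify a received packet as 'ok', 'bad-frame', or 'bad-auth'."""
    # Framing check: a packet without both markers is never parsed further.
    if not (packet.startswith(MSG_START) and packet.endswith(MSG_END)):
        return "bad-frame"
    body = packet[len(MSG_START):-len(MSG_END)]

    # Split off the last "auth=" line; everything before it is the payload.
    payload, sep, digest_line = body.rpartition(AUTH_PREFIX)
    if not sep:
        return "bad-auth"

    # Recompute the digest over the payload and compare in constant time.
    expected = hmac.new(key, payload, hashlib.md5).hexdigest().encode("ascii")
    if not hmac.compare_digest(digest_line.strip(), expected):
        return "bad-auth"
    return "ok"
```

With a check of this kind, garbage like the bytes in the "Bad message" log
line above would be classified as "bad-frame", while a packet altered after
signing (or signed with a different key) would come back as "bad-auth", so
the two failure modes would show up separately in the logs.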

Hope this is of use to you. Let me know if I can provide you with more
information. Thanks.

Steve