alex at incredimail.com
Thu Oct 24 11:17:46 MDT 2002
----- Original Message -----
From: "Alan Robertson" <alanr at unix.sh>
To: "Alex Kramarov" <alex at incredimail.com>
Cc: <linux-ha at muc.de>
Sent: Thursday, October 24, 2002 7:05 PM
Subject: Re: Heartbeat failure
> Alex Kramarov wrote:
> > This thing just happened again, on another cluster, where I hadn't
> > changed the deadtime before. But this failure depicts the scenario that
> > I described:
> > w5 detects that w4 dies
> > w5 takes over the vip
> > w5 detects that w4 is back
> > w5 releases the vip
> > w4 doesn't detect that ANYTHING happened, so the traffic to the VIP
> > still flows to w5 !!!
> Until the ARP cache times out. Usually a few minutes.
As far as I know, the ARP cache will not time out while traffic is trying to
flow to the IP - am I correct? In my situation, service was interrupted for
more than half an hour, and didn't resume automatically until heartbeat was
restarted.
> > THIS IS A VERY SERIOUS BUG which results in denial of service until
> > heartbeat is manually restarted on w4, which can be simply resolved by
> > having the heartbeat that owns the VIP send gratuitous ARPs from time
> > to time ...
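[Editorial note: for readers unfamiliar with the mechanism being discussed, a
gratuitous ARP is an ARP request in which the sender and target IP are both
the address being announced (here, the VIP), so every host on the segment
refreshes its cached MAC for that address. The sketch below builds such a
frame in Python; the helper name, MAC, and addresses are illustrative, and
this is not Heartbeat's actual implementation.]

```python
import socket
import struct

def build_gratuitous_arp(src_mac: bytes, vip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request:
    sender IP == target IP == the VIP, broadcast to the whole segment."""
    bcast = b"\xff" * 6
    vip_bytes = socket.inet_aton(vip)
    # Ethernet II header: dst=broadcast, src=our MAC, ethertype 0x0806 (ARP)
    eth = bcast + src_mac + struct.pack("!H", 0x0806)
    # ARP header: htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4,
    # op=1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += src_mac + vip_bytes          # sender MAC / sender IP (the VIP)
    arp += b"\x00" * 6 + vip_bytes      # target MAC unknown / target IP = VIP
    return eth + arp

# Actually sending the frame needs a raw packet socket bound to the
# interface, which requires root:
# s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
# s.bind(("eth0", 0))
# s.send(build_gratuitous_arp(b"\x00\x11\x22\x33\x44\x55", "10.0.0.100"))
```

Repeating this broadcast periodically (or at least once on takeover and once
on release) is the behavior the poster is asking for, since it forces stale
client ARP entries to update without waiting for a cache timeout.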
> As was mentioned before, this is fixed in the beta code, and your patch
> doesn't fix the problem for all users. You are free to make that patch to
> your version if you like.
> This bug will not be fixed in 0.4.9.2. The only thing that would make me
> turn the crank and create 0.4.9.3 would be a serious, exploitable security
> hole in 0.4.9.2 discovered before 0.5 comes out in a month or two.
> That branch of code is dead, and the necessary changes to do this right are
> somewhat painful to make in that version.
> So, do this:
> 1) Set deadtime as recommended in the documentation.
> If you are still concerned about this possibility then
> a) add a STONITH device to your configuration
> or b) run beta version 0.4.9e.
> Do step (1) and either (a) or (b) and your problem should go away
> completely. If you do step (1) and one of (a) or (b) and you still have
> problems, then
> we'll fix it. I would encourage you to test 0.4.9e to ensure it restarts
> both sides reliably every time.
Thank you. Finally I see that you understand my idea, and you have given me a
concise answer. I will test 0.4.9e in my test cluster in hopes that it works
for me ... and I wonder what scenario could exist where my suggestion
wouldn't fix the problem (where the problem would be fixed by restarting
both sides after a split brain)?