heartbeat failure.
David Lang
david.lang@digitalinsight.com
Tue, 22 Oct 2002 13:15:40 -0700 (PDT)
On Tue, 22 Oct 2002, Alan Robertson wrote:
> > Once again, I am not talking about the exact moment that the machine
> > detects there has been a split brain and tries to recover. I am talking
> > about when machine 1 detects a problem and takes over the IP because machine 1
> > doesn't hear anything from machine 2, but machine 2 hears machine 1 fine and
> > assumes that everything is fine (or at least not critical). This can be
> > caused by some weird iptables config; I did that more than once in testing,
> > where machine 2 hears machine 1, but not vice versa. Now, machine 1 will
> > take over the IP, and then release it when the situation normalises, but machine
> > 2 doesn't send ARPs because it doesn't know anything happened, so you end up
> > where machine 2 has the IP, but everyone has the MAC of machine 1 for it!
> > In this situation the only solution is for both machines to send ARPs
> > periodically for their active IPs, so after such a scenario machine 2 will refresh
> > the ARP cache of other machines and remedy the problem!
> >
> > Excuse me if I am not making any sense repeating this for the 3rd time, but
> > this does make a lot of sense to me!
>
> OK... Sorry to be so dense...
>
> NOW I remember why I put the "both sides restart" logic into the betas ;-)
>
> It was to fix this problem. If *both* sides restart, then this problem goes
> away. This has been discussed on the list before, and it is believed to
> work correctly in the beta versions.
>
> What the code does now is it delays restarting for a bit, which gives the
> other machine a chance to see it and also restart.
>
> The reason only one machine restarted was that Machine A restarted as
> soon as it heard a packet from machine B. If it had not sent any packets to
> B between the time connectivity was restored and the time it heard the
> packet from machine B, then B wouldn't know that A had returned, and
> wouldn't restart.
>
> Since it now delays a little before restarting (IIRC), this problem doesn't
> occur.
Alan,
but when they both start talking, do they tell each other that they both
need to restart, or do you have them both independently decide to restart?
If they each independently decide to restart, then Alex's problem still
isn't solved.
The scenario is:

start:
  sys1 active
  sys2 standby
problem happens (sys2 cannot hear sys1, but sys1 can hear sys2):
  sys1 active
  sys2 takeover -> active (sends out ARPs last, so it wins)
problem solved:
  sys1 active (no change; it doesn't know there was ever a problem)
  sys2 split-brain detected -> restart -> standby
result:
  sys1 active
  sys2 standby

But the last system to send out ARPs was sys2, so the service is down until
the ARP timeout happens.
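
To make the sequence above concrete, here is a minimal sketch in Python. It is a toy model only, not heartbeat code: the names (`Node`, `Network`, `run_scenario`), the IP, and the MAC addresses are all made up for illustration. The `periodic_arp` option models Alex's proposal that each machine periodically re-sends ARPs for its active IPs.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for one cluster member.
    name: str
    mac: str
    active: bool = False


class Network:
    """Toy model of what the other machines on the LAN have cached."""

    def __init__(self):
        self.arp_cache = {}  # ip -> mac, as seen by clients/routers

    def gratuitous_arp(self, node, ip):
        # The last sender wins: every listener overwrites its cache entry.
        self.arp_cache[ip] = node.mac


def run_scenario(periodic_arp=False):
    ip = "10.0.0.100"  # assumed service IP, illustration only
    net = Network()
    sys1 = Node("sys1", "aa:aa:aa:aa:aa:01", active=True)
    sys2 = Node("sys2", "aa:aa:aa:aa:aa:02")

    # start: sys1 active, sys2 standby
    net.gratuitous_arp(sys1, ip)

    # problem happens: sys2 cannot hear sys1, so it takes over and ARPs last
    sys2.active = True
    net.gratuitous_arp(sys2, ip)

    # problem solved: sys2 detects split-brain and restarts into standby;
    # sys1 never noticed anything, so it sends no ARP
    sys2.active = False

    if periodic_arp:
        # Alex's proposal: active nodes re-announce their IPs periodically
        for node in (sys1, sys2):
            if node.active:
                net.gratuitous_arp(node, ip)

    owner = next(n for n in (sys1, sys2) if n.active)
    cache_correct = net.arp_cache[ip] == owner.mac
    return owner.name, cache_correct


if __name__ == "__main__":
    print(run_scenario(periodic_arp=False))
    print(run_scenario(periodic_arp=True))
```

Without the periodic refresh, the model ends with sys1 active while the cached MAC still belongs to sys2; with it, the cache is corrected on the next announcement.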
David Lang