alanr at unix.sh
Tue Oct 22 14:15:32 MDT 2002
Alex Kramarov wrote:
>>Alex Kramarov wrote:
>>>>>... I have
>>>>>been running these machines for many months, but last week I upgraded to
>>>>>the latest kernel Red Hat released - it has Ingo Molnar's latest
>>>>>enhancements - this may also be the cause. But this only means that I
>>>I am not talking about 2 machines serving the same services at the same
>>>time. I am talking about this: if a machine has taken over an IP address and
>>>issued a gratuitous ARP for it, it should do so again every x minutes as long
>>>as it holds that IP. So if some problem messed up the ARP caches of
>>>other machines during split brain recovery, the situation would remedy
>>>itself without human intervention.
>>I believe the split brain recovery is doing the right thing. It already
>>sends several ARPs - AFTER the other side has shut down the service. If you
>>want to send more, then change the IPaddr script to send more. Right now it
>>sends five ARPs - two seconds apart.
>>I have never heard of any cases where human intervention was required. I
>>think some people did have to raise the number of ARPs due to bugs in
> Once again, I am not talking about the exact moment that the machine
> detects there has been a split brain and tries to recover. I am talking
> about when machine 1 detects a problem and takes over the IP because machine 1
> doesn't hear anything from machine 2, but machine 2 hears machine 1 fine and
> assumes that everything is fine (or at least not critical) - this can be
> caused by some weird iptables config - I did that more than once in testing,
> where machine 2 hears machine 1, but not vice versa. Now, machine 1 will
> take over the IP, and then release it when the situation normalises, but machine
> 2 doesn't send an ARP because it doesn't know anything happened, so you end up
> where machine 2 has the IP, but everyone has the MAC of machine 1 for it!
> In this situation the only solution is for both machines to send ARPs
> periodically for their active IPs, so after such a scenario machine 2 will
> refresh the ARP cache of other machines, and remedy the problem!
> Excuse me if I am not making any sense repeating this for the 3rd time, but
> this does make a lot of sense to me!
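To make the periodic-refresh idea quoted above concrete, here is a minimal Python sketch of what "send a gratuitous ARP every x minutes" could look like. It is only an illustration, not what heartbeat or the IPaddr script actually does. The function names, the interface name, and the 300-second interval are all assumptions made up for the example, and the sending part is Linux-only and needs root.

```python
import socket
import struct
import time


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a raw Ethernet frame holding a gratuitous ARP request for `ip`."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    # Ethernet header: broadcast destination, our MAC, EtherType 0x0806 (ARP).
    eth = broadcast + mac_bytes + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # In a gratuitous ARP the sender IP and the target IP are the same address.
    arp += mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes
    return eth + arp


def refresh_forever(ifname: str, mac: str, ip: str, interval_s: float = 300.0) -> None:
    """Hypothetical helper: re-announce `ip` on `ifname` every `interval_s` seconds.

    Uses an AF_PACKET raw socket, so it only works on Linux and only as root.
    """
    frame = build_gratuitous_arp(mac, ip)
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        while True:
            sock.send(frame)
            time.sleep(interval_s)
```

Something like `refresh_forever("eth0", "00:11:22:33:44:55", "192.168.0.10")` would then run for as long as the node holds the address. A real implementation would also have to stop as soon as the address is released.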
OK... Sorry to be so dense...
NOW I remember why I put the "both sides restart" logic into the betas ;-)
It was to fix this problem. If *both* sides restart, then this problem goes
away. This has been discussed on the list before, and it is believed to
work correctly in the beta versions.
What the code does now is it delays restarting for a bit, which gives the
other machine a chance to see it and also restart.
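As a toy illustration of why the delay helps (this is a simplified model, not the heartbeat code, and the function name and timing parameters are invented for the example): machine A always restarts once it hears B, but B only restarts if A manages to send it a packet before A goes down.

```python
def nodes_that_restart(restart_delay: float, a_next_packet: float) -> set:
    """Toy model of the restart race after connectivity returns.

    Time 0 is the moment A hears a packet from B.
    restart_delay: seconds A waits before restarting (assumed parameter).
    a_next_packet: seconds until A's next packet to B (assumed parameter).
    """
    restarted = {"A"}  # A has heard B, so A restarts in every case.
    # B only learns that A is back if A sends a packet before A restarts.
    if a_next_packet < restart_delay:
        restarted.add("B")
    return restarted


# No delay: A restarts at once and B never notices.
print(nodes_that_restart(restart_delay=0.0, a_next_packet=1.0))
# With a delay longer than the packet interval, both sides restart.
print(nodes_that_restart(restart_delay=5.0, a_next_packet=1.0))
```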
The reason only one machine restarted was that machine A restarted as
soon as it heard a packet from machine B. If it had not sent any packets to
B between the time connectivity was restored and the time it heard the
packet from machine B, then B wouldn't know that A had returned, and
Since it now delays a little before restarting (IIRC), this problem doesn't
-- Alan Robertson
alanr at unix.sh