sganesan at iperia.com
Mon Oct 21 18:54:17 MDT 2002
I ran into this in March or April, I think, and it was fixed in 0.4.(was it
8.x?). If you need a current rpm built from source on either rh7.2 or 7.3,
give me a shout.
On Mon, 2002-10-21 at 16:32, Alex Kramarov wrote:
> ----- Original Message -----
> From: "Alan Robertson" <alanr at unix.sh>
> To: "Alex Kramarov" <alex at incredimail.com>
> Cc: <linux-ha at muc.de>
> Sent: Monday, October 21, 2002 10:17 PM
> Subject: Re: heartbeat failure.
> > Alex Kramarov wrote:
> > > Hello.
> > >
> > > I have successfully run heartbeat for months now, and last week I
> > > upgraded to the latest 0.4.9.2 rpm because of the security problem. The
> > > cluster is an LVS active-active setup; the boxes are named f1 and f2.
> > >
> > > An hour ago, I detected that the VIPs on one of the machines had died.
> > > I was able to log in to both directors; all processes were up and all
> > > IPs were assigned correctly. No traffic was going through f2, though.
> > > After a manual restart of heartbeat on f2, the traffic went back to normal.
> > >
> > > The problem is that I cannot determine what caused it. I am using
> > > 3 network interfaces for heartbeat, but according to the logs, f1 detected
> > > that eth0 died, and went crazy:
> > >
> > > Oct 21 18:43:27 f1 heartbeat: WARN: node f2: is dead
> > > Oct 21 18:43:27 f1 heartbeat: info: Link f2:eth0 dead.
> > > Oct 21 18:43:27 f1 heartbeat: WARN: Cluster node f2 returning
> > > partition
> > > Oct 21 18:43:27 f1 heartbeat: info: Heartbeat shutdown in
> > > (1038)
> > Any given workload on a particular kernel and heartbeat version and memory
> > size will have a particular minimum deadtime. You need to find out what
> > that time is and configure your systems accordingly - or you will run into
> > the problem you ran into.
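
For reference, the deadtime Alan mentions is tuned in /etc/ha.d/ha.cf. A minimal sketch follows; the values are purely illustrative and should be set against the minimum deadtime you actually measure under load, as he describes:

```
# /etc/ha.d/ha.cf -- illustrative values only, tune for your workload
keepalive 2      # seconds between heartbeat packets
warntime 10      # log a warning when a heartbeat is this late
deadtime 30      # declare the peer dead after this long with no heartbeat
initdead 120     # extra allowance while the cluster is still booting
```

Setting warntime well below deadtime gives you "late heartbeat" warnings in the logs, which is one way to discover how close your cluster runs to its real minimum deadtime before it ever false-fails.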
> > Heartbeat *should* restart on both sides again. Is that what happened?
> No, as you can see from the logs, f1 restarted, but f2 only wrote:
> Oct 21 18:43:19 f2 heartbeat: info: Running /etc/ha.d/rc.d/shutdone shutdone
> Oct 21 18:43:22 f2 heartbeat: info: Heartbeat restart on node f1
> Oct 21 18:43:22 f2 heartbeat: info: Node f1: status up
> Oct 21 18:43:22 f2 heartbeat: info: Running /etc/ha.d/rc.d/status status
> Oct 21 18:43:23 f2 heartbeat: info: Node f1: status active
> I have run into such a situation when playing with iptables on these
> systems: one node thinks it is taking over, the other does nothing; then the
> first hears the heartbeat again and releases the resources, but the other
> doesn't know that anything happened, so it doesn't send a gratuitous ARP
> request, and the other machines keep sending packets to the node that sent
> ARP last. I have to restart heartbeat on the second node in such cases. This
> seems similar, only I didn't touch anything at the time - maybe the network
> was a little overloaded, but no more than the highest peak I have ever had
> ... I have been running these machines for many months, but last week I
> upgraded f1 to the latest kernel Red Hat released - it has the latest Ingo
> Molnar scheduler enhancements - this may also be the cause, but that only
> means I will have to adapt ... Still, since I have seen such situations
> before, I thought even then that any of these issues could be solved if the
> active heartbeat daemon would send gratuitous ARP requests on its interfaces
> once a minute, for example; then these issues would cure themselves within
> one minute, resulting in only a small amount of downtime. I believe this
> wouldn't be too hard to implement - just add a timer to the code and an
> option to the config file ...
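
The periodic re-announcement Alex proposes can be sketched outside heartbeat itself. This is a hypothetical illustration, not heartbeat code: it assumes the iputils `arping` tool, whose -U flag sends an unsolicited (gratuitous) ARP, and the interface/VIP names are placeholders.

```shell
#!/bin/sh
# Hypothetical sketch: re-announce a VIP with gratuitous ARP every
# INTERVAL seconds, so neighbours' ARP caches recover even after a
# missed takeover. Assumes iputils `arping` (-U = unsolicited ARP).

# send_garp_loop COUNT IFACE VIP [INTERVAL]
send_garp_loop() {
    count=$1 iface=$2 vip=$3 interval=${4:-60}
    i=0
    while [ "$i" -lt "$count" ]; do
        # Broadcast a gratuitous ARP for the VIP on the given interface.
        arping -U -c 1 -I "$iface" "$vip"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

In real use this would run unbounded as a small daemon (the once-a-minute timer Alex suggests); heartbeat also ships its own send_arp helper, which could be invoked the same way in place of arping.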