alanr at unix.sh
Mon Oct 21 14:17:51 MDT 2002
Alex Kramarov wrote:
> I have successfully run heartbeat for months now, and last week I
> upgraded to the latest 0.4.9.2 rpm because of the security problem. The
> cluster is an LVS active-active setup. The boxes are named f1 and f2.
> An hour ago, I detected that the VIPs on one of the machines had died. I
> was able to log in to both directors, and I saw all processes up; all IPs
> were assigned correctly. No traffic was going through f2, though.
> After a manual restart of heartbeat on f2, the traffic went back to normal.
> The problem is that I cannot determine what caused it. I am using
> 3 network interfaces for heartbeat, but according to the logs, f1 detected
> that f2's eth0 died, and went crazy:
> Oct 21 18:43:27 f1 heartbeat: WARN: node f2: is dead
> Oct 21 18:43:27 f1 heartbeat: info: Link f2:eth0 dead.
> Oct 21 18:43:27 f1 heartbeat: WARN: Cluster node f2 returning after
> Oct 21 18:43:27 f1 heartbeat: info: Heartbeat shutdown in progress.
The situation you observed is called "split brain" or a cluster partition.
In most cases it is caused by process scheduling issues, not by
From the faqntips document:
10. How to tune heartbeat on a heavily loaded system?
10. Since the default probably isn't reasonable for most Linux systems
under heavy load (sorry!), here is a suggestion:
Set deadtime to 60 seconds or higher
Set warntime to whatever you *want* your deadtime to be.
Run your system under heavy load for a few weeks.
Look at your logs for the longest time either system went without
hearing a heartbeat.
Set your deadtime to 1.5-2 times that amount. Set warntime to
Continue to monitor logs for warnings about long heartbeat times.
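The arithmetic in the "1.5-2 times" step can be sketched roughly as follows. This is only an illustration, not part of heartbeat: the function name suggest_deadtime and the sample gap values are made up, and it assumes you have already read the heartbeat gaps (in seconds) out of your own logs by hand or with your own script.

```python
import math


def suggest_deadtime(observed_gaps_s, low_factor=1.5, high_factor=2.0):
    """Return (low, high) candidate deadtime values in whole seconds.

    observed_gaps_s: gaps between heartbeats, in seconds, as read from
    the logs while the system ran under heavy load (hypothetical input).
    The factors follow the "1.5-2 times" rule of thumb quoted above.
    """
    if not observed_gaps_s:
        raise ValueError("need at least one observed gap")
    longest = max(observed_gaps_s)
    # Round up so the suggestion never falls below the rule of thumb.
    return math.ceil(longest * low_factor), math.ceil(longest * high_factor)


# Hypothetical gaps (seconds) collected over a few weeks under load.
gaps = [2.0, 3.5, 11.0, 7.25]
low, high = suggest_deadtime(gaps)
print(f"longest gap seen: {max(gaps)} s")
print(f"suggested deadtime range: {low} to {high} seconds")
```

Whatever value you pick from that range would then go on the deadtime line in ha.cf. Rerun the numbers whenever the kernel, heartbeat version, memory size, or workload changes, since the longest gap depends on all of them.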
The beta releases generally are a little better in this regard than earlier
releases. But, fundamentally, the scheduling issue seems to be related to
memory pressure issues in the kernel - and we have relatively little control
Any given workload on a particular kernel and heartbeat version and memory
size will have a particular minimum deadtime. You need to find out what
that time is and configure your systems accordingly - or you will run into
the problem you ran into.
Heartbeat *should* restart on both sides again. Is that what happened?
-- Alan Robertson
alanr at unix.sh