[Linux-HA] Is this heartbeat behaviour correct ?
david.lang at digitalinsight.com
Mon Aug 8 11:55:31 MDT 2005
ipfail doesn't mark the node as down if it can't reach its destination,
because it doesn't know whether the problem is with the destination or not
(imagine the situation where both nodes are fully functional but the thing
you are pinging goes down; you don't want both nodes to shut down).
Instead, the ipfail data is just part of the data used to decide whether the
node is healthy or not.
Each node has a count of how many ipfail targets it can reach. If all
nodes can reach the same number of targets, then ipfail has no effect on
deciding whether a node is good or not. However, if one node can reach more
targets than the other node, it is considered healthier, and the one that
can't reach as many targets gets shut down.
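To make the comparison concrete, here is a minimal sketch in Python of the
rule as described above. It only illustrates the decision, it is not
ipfail's actual code; the function name ipfail_verdict and the returned
strings are made up for this example.

```python
def ipfail_verdict(local_reachable, peer_reachable):
    """Compare how many ping targets each node can reach.

    Hypothetical model of the rule described in this mail, not ipfail itself.
    """
    if local_reachable == peer_reachable:
        # Equal counts (including 0 and 0): ipfail does not influence the decision.
        return "no effect"
    if local_reachable > peer_reachable:
        # The local node is considered healthier.
        return "peer gives up"
    return "local gives up"


# One ping target configured, both nodes lose it at the same time: a tie.
print(ipfail_verdict(0, 0))
# Only the peer loses the ping target: the local node is healthier.
print(ipfail_verdict(1, 0))
```

Note that both nodes reaching zero targets is also a tie, so under this
rule ipfail has nothing to say when the ping target is unreachable from
both sides.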
However, this is only a side note to your test.
In your test you are shutting off all connectivity.
When you do this, each node thinks that it's in good shape and all other
nodes have failed; since the other nodes have all failed it will take
The only way to prevent this is either to have enough communications
redundancy so that not everything fails at the same time and the nodes
will know that they are both still up and what each one can reach, or to
implement STONITH, which means that if both systems think that the other
has failed they both try to shut the other down, and whoever sends the
shutdown message first wins (the other one shuts down, the winning node
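
The two situations above can be modelled roughly as follows. This is a toy
Python sketch under simplifying assumptions, not how heartbeat or any
STONITH device is implemented; Node, after_deadtime and stonith_race are
hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    hears_peer: bool     # True if at least one heartbeat path still works
    has_resources: bool  # True if this node is running the service


def after_deadtime(node):
    """A node that hears no heartbeat assumes the peer is dead and takes over."""
    if not node.hears_peer:
        return Node(node.name, node.hears_peer, True)
    return node


def stonith_race(first_shooter, names):
    """Map each node name to True if it survives the race.

    Assumption: the node whose shutdown request lands first stays up,
    every other node is shut down.
    """
    return {name: name == first_shooter for name in names}


# All cables pulled: neither node hears the other, so both end up
# running the service.
node1 = after_deadtime(Node("node1", hears_peer=False, has_resources=True))
node2 = after_deadtime(Node("node2", hears_peer=False, has_resources=False))
print(node1.has_resources and node2.has_resources)

# With STONITH, only the node that shoots first keeps running.
print(stonith_race("node2", ["node1", "node2"]))
```

With one surviving heartbeat path (hears_peer=True) the standby node never
takes over in this model, which is the point of the redundancy option.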
On Mon, 8 Aug 2005, Boris Berger wrote:
> Date: Mon, 8 Aug 2005 14:24:00 +0200
> From: Boris Berger <boris.berger at capgemini.com>
> Reply-To: General Linux-HA mailing list <linux-ha at lists.linux-ha.org>
> To: linux-ha at lists.linux-ha.org
> Subject: Re: [Linux-HA] Is this heartbeat behaviour correct ?
> I don't think there is any problem with the configuration of the heartbeat
> messages broadcasting
> on eth0 and eth1; we have tested it again by replacing "," with a space or tab
> (bcast eth1 eth0)
> and by swapping eth0 and eth1, and it's OK (messages are broadcast on the
> two interfaces).
> But the problem still exists: when we disconnect the three network cables
> (the "internal" link between
> the two nodes AND the two cables from the nodes to the "external" LAN), both
> nodes take back the service.
> Is this normal? I thought ipfail should prevent this: ipfail (on both
> nodes) pings an external host
> which is no longer reachable as the cables are disconnected.
> Thanks a lot,
>> From : linux-ha-bounces at lists.linux-ha.org
> To : "General Linux-HA mailing list" linux-ha at lists.linux-ha.org
> Cc :
> Date : Fri, 05 Aug 2005 22:37:35 -0600
> Subject : Re: [Linux-HA] Is this heartbeat behaviour correct ?
>> Boris Berger wrote:
>>> Thanks for your answer. Are these particular settings in the ha.cf file
>>> correct ?
>>> # Both nodes broadcast on both network cards (here : eth0 and eth1)
>>> bcast eth1,eth0
>>> ## communication port : 694 : the nodes broadcast towards the whole
>>> udpport 694
>>> # Use port 694 for bcast or ucast communications . This is the default
>>> # port and the official one registered at the IANA, organisation
>>> # for assigning new IP addresses
>>> Is there something to change in there or elsewhere ? Here is our
>>> ha.cf file :
>>> bcast eth1,eth0
>>> debugfile /var/log/ha-debug
>>> logfile /var/log/ha-log
>>> logfacility local0
>>> keepalive 2
>>> deadtime 10
>>> warntime 6
>>> initdead 60
>>> udpport 694
>>> node EEPCLU1
>>> node EEPCLU2
>>> auto_failback on
>>> respawn hacluster /usr/lib/heartbeat/ipfail
>>> ping EEPNFS
>>> ---------- Initial Header -----------
>>>> From : linux-ha-bounces at lists.linux-ha.org
>>> To : "General Linux-HA mailing list"
>>> linux-ha at lists.linux-ha.org
>>> Cc :
>>> Date : Fri, 05 Aug 2005 08:17:53 -0600
>>> Subject : Re: [Linux-HA] Is this heartbeat behaviour correct ?
>>> Boris Berger wrote:
>>>> Hello all,
>>>> I have tested a 2 node active/passive Heartbeat cluster.
>>>> To check the connection of each node in the external network,
>>>> ipfail is active with a ping towards a third machine, as
>>>> specified in ha.cf file :
>>>> respawn hacluster /usr/lib/heartbeat/ipfail
>>>> ping theThirdMachine
>>>> Before performing the tests, we have the initial situation :
>>>> - Heartbeat is running on both nodes,
>>>> - one service (apache) is running on node 1,
>>>> - no service is running on node 2,
>>>> as specified in the haresources file :
>>>> node1 addrIpServ1 apache
>>>> Now I cut simultaneously :
>>>> - the direct connection between the 2 nodes
>>>> - the connection between node 1 and the third machine
>>>> - the connection between node 2 and the third machine
>>>> Then, one can notice in the log that :
>>>> - Apache does not stop on node 1
>>>> - Apache starts on node 2.
>>>> So Apache is now running on both nodes.
>>>> Now, if I reestablish :
>>>> - EITHER the connection between node 1 and the third machine ONLY
>>>> - OR the connection between node 2 and the third machine ONLY
>>>> then nothing special is happening, so Apache is still running on both
>>>> Do you know if this is normal behaviour ? And how can this be
>>> It can most probably be explained as a multiple failure you haven't
>>> configured heartbeat to deal with. In other words, a configuration
>>> When you restore the direct connection (the only one you are
>>> heartbeating over, I strongly suspect), it will restart heartbeat on
>>> both sides.
>>> If you want that to work, you need to tell heartbeat to send heartbeats
>>> over all (both?) interfaces - not just the direct connection.
>> I actually didn't think that "," was valid as a separator. But if it
>> didn't give you an error, then I guess it must be OK. But, maybe it's
>> Could you try it again with a space or tab instead of the "," (comma)?
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare