[Linux-HA] Issues with simple failover setup
dejanmm at fastmail.fm
Mon Jan 5 03:42:42 MST 2009
On Sun, Jan 04, 2009 at 10:04:58AM +0000, Stephen Nelson-Smith wrote:
> I am running Heartbeat 2.3 on CentOS 5.2. I have 2 nodes - both
> apache servers. All I want to achieve is a simple failover:
> In the case where one of the two nodes is running httpd, if the
> running node experiences a failure - httpd is stopped, or the machine
> stops responding (ie the network has been lost or the machine down
> hard), fail over to the second node.
> I seem to have achieved this when starting with a fresh install. I
> have defined two resources:
> <primitive class="ocf" id="IPaddr_10_0_0_53"
> provider="heartbeat" type="IPaddr">
> <op id="IPaddr_10_0_0_53_mon" interval="5s"
> name="monitor" timeout="5s"/>
> <instance_attributes id="IPaddr_10_0_0_53_inst_attr">
> <nvpair id="IPaddr_10_0_0_53_attr_0" name="ip"
> <primitive class="lsb" id="httpd_2" provider="heartbeat" type="httpd">
> <op id="httpd_2_mon" interval="20s" name="monitor" timeout="10s"/>
> As I understand it, the IP (primitive type="IPaddr") has a monitor set
> to fire every 5 seconds and time out after 5 seconds, and it has one
> attribute, the IP address itself.
> The httpd, primitive type="httpd", really just refers to the
> /etc/init.d/httpd script, since it is of class="lsb". It only has a
> single operation and no attributes - the operation is a monitor which
> fires every 20 seconds, and will time out after 10 seconds. For an
> init script, the monitor just consists of running the script as
> "/etc/init.d/httpd status" and looking for "running" in the response.
> I've defined one constraint:
> <rsc_colocation id="web_same" from="IPaddr_10_0_0_53"
> to="httpd_2" score="INFINITY"/>
> The IP address and the httpd are preferred to run on the same
> machine, with INFINITE priority - in other words, they MUST run on the
> same machine.
> This should have the effect of forcing the migration of both resources together.
> I've modified default-resource-stickiness and
> default-resource-failure-stickiness:
> <nvpair id="cib-bootstrap-options-default-resource-stickiness"
> name="default-resource-stickiness" value="1000"/>
> <nvpair id="cib-bootstrap-options-default-resource-failure-stickiness"
> name="default-resource-failure-stickiness" value="-6001"/>
> AIUI, these two options define how the CRM and the LRM handle failures
> and failovers.
> The default-resource-stickiness is the score given to each active
> resource on the active node, leading to a default score of 2000 for
> the active node and 0 for the inactive node.
> When there is a failure, the failure-stickiness score is applied, and
> since it's negative, it should lower the score on the failed (active)
> node to below 0, triggering a failover.
> If the second node fails as well, that node will be taken negative,
> leaving no nodes capable of running the resources. If a node reboots,
> it should reset its score to 0, or it can be manually reset by running
> "crm_failcount -D -r httpd_2" on the previously-failed node.
> So far so good. Do please correct my understanding if I've gone wrong.
No, everything looks ok. Just don't ask me to calculate the scores.
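
For illustration only, the simplified arithmetic from the quoted
description can be sketched in a few lines of Python. This models the
poster's own reading (stickiness per active resource, plus
failure-stickiness per recorded failure); it is not how the CRM
actually computes scores, and the function names are made up for this
sketch.

```python
# Simplified model of the scoring described in the quoted text.
# NOT the real CRM algorithm; names below are hypothetical.
STICKINESS = 1000           # default-resource-stickiness
FAILURE_STICKINESS = -6001  # default-resource-failure-stickiness


def node_score(active_resources, failcount):
    """Stickiness for each active resource plus the penalty per failure."""
    return active_resources * STICKINESS + failcount * FAILURE_STICKINESS


def fails_over(active_score, standby_score):
    """Assumed rule: resources move once the active node scores lower."""
    return active_score < standby_score


healthy = node_score(active_resources=2, failcount=0)  # IP + httpd running
failed = node_score(active_resources=2, failcount=1)   # after one failure
standby = node_score(active_resources=0, failcount=0)  # idle second node

print(healthy, failed, standby, fails_over(failed, standby))
```

Under these assumptions a single failure takes the active node from
2000 to -4001, which is below the idle node's 0, so one failure would
be enough to move both resources.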
> Live test below:
> Ok - so taking my cluster, erasing the cib with cibadmin -E, and
> rebooting both nodes. I've not got httpd starting by default on
> either machine, so when they come up, I will start httpd on one
> machine. Interestingly the result of cibadmin -E seems to have been
> that cibadmin -Q now times out,
> so I've hacked around a bit deleting
> /var/lib/heartbeat/crm/cib.xml and trying to load it, by making the
> admin_epoch bigger than that which seemed to be there (though from
> where I know not).
Fiddling with cib.xml is allowed only when heartbeat/CRM is not
running. Otherwise, and that's preferred, use the CRM tools
(crm_resource, cibadmin, etc).
> $ crm_resource -W -r httpd_2
> seems to show that httpd_2 is running on node2, and I can confirm
> this. I don't know how this happened, as I didn't start apache, but
> it has happened...
> So - if I shutdown httpd on node 2, it should failover, and it does.
> So, now apache is running on node 1, and node 2 should have a score of
> -6001 as it failed. This is reflected in the failcount on node 2.
> I shouldn't be able to move the resource back to node2 - it still has
> a failure count > 0.
> However, it seems I can - using crm_resource -M -r httpd_2 -H node2
This inserts a -INFINITY location constraint...
> Ok - resetting the failcount to 0. The cluster should be in the same
> state it was before - let's try to kill apache.
> This time, apache seems to have restarted on node 2, and there was no
> failover. I don't understand this. The failcount has gone back up to
> 1, but the resource hasn't moved.
... which prevents it from ever starting on this node again.
crm_resource should have printed a warning about it.
> Let's try to kill it again. Same again - it gets restarted on node 2.
> The failure count hasn't gone to 2. Killing it one more time gives
> the same behaviour. Oh well... let's try to move the resource to node 1.
> Fine - that works with crm_resource, and now the cluster claims apache
> is on node 1. I concur.
crm_resource -U removes the -INFINITY constraint, hence now the
cluster should start to behave as you expect it.
> Let's reset failure count for good measure.
> Now let's try killing apache on node1. Once again, apache gets
> restarted on node1, but there's no failover.
> So - what's going on - what have I got wrong? Also could someone
> please tell me the canonical way to reset the cluster, and import a
> new cib.xml?
cibadmin -R -x cib.xml should do (perhaps cibadmin -E before,
can't recall anymore). It may happen that, if your old resource
names don't exist in the new configuration, there will be some
remnants in the status section of the CIB. Those can be removed
by crm_resource -C or by restarting heartbeat.
Or stop the cluster, remove cib.xml and cib.xml.sig on all nodes
(from /var/lib/heartbeat/crm), copy new cib.xml to all nodes,
start cluster. Use crm_verify to make sure that your cib.xml is
valid.
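
As a rough sketch of the first (online) variant above, the following
Python helper only builds and prints the command lines in order; it
does not run anything. The helper name and its parameters are
hypothetical, and the "-x" option for crm_verify and the "-r"/"-H"
options for crm_resource -C are assumptions that should be checked
against the tools' help output on your Heartbeat version.

```python
import shlex


def reload_commands(cib_path, stale_resources=(), node=None):
    """Build (but do not run) the commands for replacing the live CIB.

    Hypothetical helper: verify the file, replace the CIB, then clean up
    any resources that no longer exist in the new configuration.
    """
    cmds = [
        ["crm_verify", "-x", cib_path],      # assumed flag: check the file first
        ["cibadmin", "-R", "-x", cib_path],  # replace the CIB, as suggested above
    ]
    for rsc in stale_resources:
        cmd = ["crm_resource", "-C", "-r", rsc]  # clear leftover status entries
        if node:
            cmd += ["-H", node]
        cmds.append(cmd)
    return cmds


# Dry run: print the commands so they can be reviewed before use.
for cmd in reload_commands("cib.xml", stale_resources=["httpd_2"], node="node2"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing instead of executing keeps the sketch safe to try on a
machine without a cluster; the resource and node names in the example
call are taken from this thread purely as placeholders.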