[Linux-HA] Issues with simple failover setup
dk at in-telegence.net
Mon Jan 5 04:14:06 MST 2009
Dejan Muhamedagic wrote:
> On Sun, Jan 04, 2009 at 10:04:58AM +0000, Stephen Nelson-Smith wrote:
>> I am running Heartbeat 2.3 on CentOS 5.2. I have 2 nodes - both
>> apache servers. All I want to achieve is a simple failover:
>> In the case where one of the two nodes is running httpd, if the
>> running node experiences a failure - httpd is stopped, or the machine
>> stops responding (ie the network has been lost or the machine down
>> hard), fail over to the second node.
>> I seem to have achieved this when starting with a fresh install. I
>> have defined two resources:
>> <primitive class="ocf" id="IPaddr_10_0_0_53"
>> provider="heartbeat" type="IPaddr">
>> <op id="IPaddr_10_0_0_53_mon" interval="5s"
>> name="monitor" timeout="5s"/>
>> <instance_attributes id="IPaddr_10_0_0_53_inst_attr">
>> <nvpair id="IPaddr_10_0_0_53_attr_0" name="ip"
>> value="10.0.0.53"/>
>> </instance_attributes>
>> </primitive>
>> <primitive class="lsb" id="httpd_2" provider="heartbeat" type="httpd">
>> <op id="httpd_2_mon" interval="20s" name="monitor" timeout="10s"/>
>> </primitive>
>> As I understand it, the IPaddr primitive has a monitor set to fire
>> every 5 seconds and time out after 5 seconds, and it has one
>> attribute, the IP address itself.
>> The httpd, primitive type="httpd", really just refers to the
>> /etc/init.d/httpd script, since it is of class="lsb". It has a
>> single operation and no attributes - the operation is a monitor which
>> fires every 20 seconds and will time out after 10 seconds. For an
>> init script, the monitor just consists of running the script as
>> "/etc/init.d/httpd status" and looking for "running" in the response.
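In other words, the lsb monitor boils down to something like this (illustrative shell only; the status line is simulated here rather than taken from a live httpd):

```shell
# Simulated status line; on a real node this would come from running:
#   /etc/init.d/httpd status
status_output="httpd (pid 1234) is running..."

# The monitor succeeds if the status output mentions "running"
if echo "$status_output" | grep -q running; then
    echo "monitor: httpd running"
else
    echo "monitor: httpd stopped"
fi
```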
>> I've defined one constraint:
>> <rsc_colocation id="web_same" from="IPaddr_10_0_0_53"
>> to="httpd_2" score="INFINITY"/>
>> The IP address and the httpd are preferred to run on the same
>> machine, with INFINITE priority - in other words, they MUST run on the
>> same machine.
>> This should have the effect of forcing the migration of both resources together.
>> I've modified default-resource-stickiness and
>> default-resource-failure-stickiness:
>> <nvpair id="cib-bootstrap-options-default-resource-stickiness"
>> name="default-resource-stickiness" value="1000"/>
>> <nvpair id="cib-bootstrap-options-default-resource-failure-stickiness"
>> name="default-resource-failure-stickiness" value="-6001"/>
>> AIUI, these two options define how the CRM and the LRM handle failures
>> and failovers.
>> The default-resource-stickiness is the score given to each active
>> resource on the active node, leading to a default score of 2000 for
>> the active
>> node and 0 for the inactive node.
>> When there is a failure, the failure-stickiness score is applied, and
>> since it's negative, it should lower the score on the failed (active)
>> node to below 0, triggering a failover to the other node.
>> If the second node fails as well, its score will also go negative,
>> leaving no nodes capable of running the resources. If a node reboots,
>> its score should reset to 0, or it can be manually reset by running
>> "crm_failcount -D -r httpd_2" on the previously-failed node.
>> So far so good. Do please correct my understanding if I've gone wrong.
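The scoring arithmetic described above can be sketched numerically. This is a simplification (the real CRM placement logic weighs more inputs), using the two cluster option values quoted above:

```python
# Sketch of the stickiness arithmetic from the thread (simplified).
STICKINESS = 1000           # default-resource-stickiness, per active resource
FAILURE_STICKINESS = -6001  # default-resource-failure-stickiness, per failure

def node_score(active_resources, failures):
    """Score a node gets: stickiness for each resource already active
    there, plus the failure penalty for each recorded failure."""
    return active_resources * STICKINESS + failures * FAILURE_STICKINESS

# Two resources (IPaddr + httpd) active on node1, none on node2:
print(node_score(2, 0))  # 2000 -> node1 preferred
print(node_score(0, 0))  # 0    -> node2
# After one failure on node1 its score drops below zero,
# so the resources fail over to node2:
print(node_score(2, 1))  # -4001
```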
> No, everything looks ok. Just don't ask me to calculate the
> stickiness :)
>> Live test below:
>> Ok - so taking my cluster, erasing the cib with cibadmin -E, and
>> rebooting both nodes. I've not got httpd starting by default on
>> either machine, so when they come up, I will start httpd on one
>> machine. Interestingly the result of cibadmin -E seems to have been
>> that cibadmin -Q now times out,
> Shouldn't happen.
>> so I've hacked around a bit deleting
>> /var/lib/heartbeat/crm/cib.xml and trying to load it, by making the
>> admin_epoch bigger than that which seemed to be there (though from
>> where I know not).
> Fiddling with cib.xml is allowed only when heartbeat/CRM is not
> running. Otherwise, and that's preferred, use the CRM tools
> (crm_resource, cibadmin, etc).
>> $ crm_resource -W -r httpd_2
>> seems to show that httpd_2 is running on node2, and I can confirm
>> this. I don't know how this happened, as I didn't start apache, but
>> it has happened...
>> So - if I shutdown httpd on node 2, it should failover, and it does.
>> So, now apache is running on node 1, and node 2 should have a score of
>> -6001 as it failed. This is reflected in the failcount on node 2.
>> I shouldn't be able to move the resource back to node2 - it still has
>> a failure count > 0.
>> However, it seems I can - using crm_resource -M -r httpd_2 -H node2
> This inserts a -INFINITY location constraint...
Nope, with -H, it inserts an INFINITY (no minus) location constraint,
which overrides the numeric -6001 (or whatever it had at that point).
This forces httpd_2 to run on node2.
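For reference, "crm_resource -M -r httpd_2 -H node2" effectively adds a location constraint like the following to the CIB (the ids shown here are illustrative; the tool typically names them along the lines of cli-prefer-httpd_2):

```xml
<rsc_location id="cli-prefer-httpd_2" rsc="httpd_2">
  <rule id="cli-prefer-rule-httpd_2" score="INFINITY">
    <expression id="cli-prefer-expr-httpd_2" attribute="#uname"
                operation="eq" value="node2"/>
  </rule>
</rsc_location>
```

Until this constraint is removed (crm_resource -U), it pins httpd_2 to node2 regardless of failures.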
>> Ok - resetting the failcount to 0. The cluster should be in the same
>> state it was before - let's try to kill apache.
>> This time, apache seems to have restarted on node 2, and there was no
>> failover. I don't understand this. The failcount has gone back up to
>> 1, but the resource hasn't moved.
> ... which prevents it from even again starting on this node.
> crm_resource should have printed a warning about it.
See above: Now node2 has +INFINITY, so httpd failure will not have any
effect on the score as failure stickiness is just a numeric value
(INFINITY - number = INFINITY).
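That saturating arithmetic can be sketched as follows (a simplification: 1000000 is the value the CRM uses for INFINITY, and -INFINITY wins over +INFINITY in the real implementation):

```python
# Sketch of the CRM's saturating score arithmetic: INFINITY is a large
# finite constant, and any sum involving it saturates at +/-INFINITY.
INFINITY = 1000000

def add_scores(a, b):
    if a >= INFINITY or b >= INFINITY:
        # -INFINITY takes precedence over +INFINITY
        if a <= -INFINITY or b <= -INFINITY:
            return -INFINITY
        return INFINITY
    return max(-INFINITY, min(INFINITY, a + b))

print(add_scores(INFINITY, -6001))  # 1000000: the failure penalty has no effect
print(add_scores(100, -6001))       # -5901: ordinary scores add normally
```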