[Linux-HA] Failure to start resource makes it impossible to fail back
Andrew Beekhof
beekhof at gmail.com
Tue Nov 13 00:37:14 MST 2007
On Nov 13, 2007, at 12:15 AM, Anders Brownworth wrote:
> Hi,
>
> I have a primary / backup v2.0.8 setup monitoring OpenSer and 2 IP
> addresses.
>
> If I make a mistake in a config file for a resource that is being
> controlled by Linux-HA (OpenSer) and for whatever reason the
> resource dies and a restart is attempted, the restart will fail and
> the resource will migrate to the backup node as expected. However,
> once I fix the problem so the resource could start again on the
> primary, I can never get Linux-HA to migrate the resource back.
>
> I don't think this has anything to do with scoring because when I
> don't break my config files and manually kill the service 13 times
> on box01 (the reason for 13 is in my included cib.xml) the resource
> migrates from box01 to box02 as expected. Setting the fail count
> back below 13 causes the service to migrate back, also as expected.
>
> However, trying to fail back to a system that previously had broken
> OpenSer config files that have now been fixed, I can't get them to
> come back no matter how low I set the fail count. Is there another
> variable or INFINITY constraint somewhere that gets set when a
> resource fails to start that makes the resources stay away? What can
> I do when I want Linux-HA to re-try migration of the service back to
> a recently hand-fixed primary?
Prior to the latest interim build, start failures were always fatal and
required the use of crm_resource -C to make the node eligible again.
As of the last interim release, just make sure
start-failure-is-fatal=false is set and use crm_failcount as you have
below for "normal" failures.
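The two paths above could be sketched roughly as follows. This is only an illustration, not a tested procedure: the node and resource names (box01, OpenSer) come from the original post, the run/failback_cmds helpers and the APPLY variable are made up for this example, and it assumes start-failure-is-fatal is set as a cluster property in crm_config like the other options in the posted CIB. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: prints the commands unless APPLY=1 is set in the environment.
# NODE and RSC default to the names used in the original post.
NODE="${NODE:-box01}"
RSC="${RSC:-OpenSer}"

# Hypothetical helper: run the command when APPLY=1, otherwise just echo it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

failback_cmds() {
    # Older builds: a failed start is fatal, so clean up the resource's
    # status on the node to make it eligible again.
    run crm_resource -C -r "$RSC" -H "$NODE"

    # Newer interim builds (assumption: set as a crm_config cluster property).
    run crm_attribute -t crm_config -n start-failure-is-fatal -v false

    # "Normal" failures: set the fail count back to 0, then read it back.
    run crm_failcount -v 0 -U "$NODE" -r "$RSC"
    run crm_failcount -G -U "$NODE" -r "$RSC"
}

failback_cmds
```

Run it as-is to review the commands, then re-run with APPLY=1 on a cluster node once they look right for your build.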
> Additionally, I followed the advice under "Resetting Failure Counts"
> in the V2 FAQ ( http://linux-ha.org/v2/faq ) where it suggests:
>
> crm_failcount -D -U nodeA -r my_rsc
>
> Rather than reset the failure count, this just torches it in such a
> way that you can't even read it with the query command given in the
> next step of the same example. I found statically setting the count
> back to 0 with:
>
> crm_failcount -v 0 -U box01 -r OpenSer
>
> worked much better and allowed me to push resources back and forth
> just by moving the fail count up and down.
>
> Thanks.
>
> -Anders
>
> <cib admin_epoch="1" have_quorum="true" num_peers="1"
> cib_feature_revision="1.3" ignore_dtd="false" ccm_transition="3"
> generated="true" dc_uuid="9052abe5-87ee-4400-a008-c5f13205e94b"
> epoch="15" num_updates="606" cib-last-written="Mon Nov 12 22:37:10
> 2007">
> <configuration>
> <crm_config>
> <cluster_property_set id="cluster-property-set">
> <attributes>
> <nvpair id="short_resource_names"
> name="short_resource_names" value="true"/>
> <nvpair id="pe-input-series-max" name="pe-input-series-max"
> value="-1"/>
> <nvpair id="default-resource-stickiness"
> name="default-resource-stickiness" value="10"/>
> <nvpair id="default-resource-failure-stickiness"
> name="default-resource-failure-stickiness" value="-10"/>
> </attributes>
> </cluster_property_set>
> </crm_config>
> <nodes>
> <node id="9052abe5-87ee-4400-a008-c5f13205e94b" uname="box01"
> type="normal"/>
> <node id="47658455-4da2-48d4-a8da-419b2f93f039" uname="box02"
> type="normal"/>
> </nodes>
> <resources>
> <group id="IPaddr2_OpenSer_group">
> <primitive id="IPaddr2-10.1.53.235" class="ocf"
> type="IPaddr2" provider="heartbeat">
> <operations>
> <op id="ipaddr2-10.1.53.235-monitor" name="monitor"
> interval="5s" timeout="3s"/>
> </operations>
> <instance_attributes id="IPaddr2-10.1.53.235-attributes">
> <attributes>
> <nvpair id="ipaddr2-10.1.53.235-ip" name="ip"
> value="10.1.53.235"/>
> <nvpair id="ipaddr2-10.1.53.235-broadcast"
> name="broadcast" value="10.1.53.255"/>
> <nvpair id="ipaddr2-10.1.53.235-cidr_netmask"
> name="cidr_netmask" value="24"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="IPaddr2-10.1.53.236" class="ocf"
> type="IPaddr2" provider="heartbeat">
> <operations>
> <op id="ipaddr2-10.1.53.236-monitor" name="monitor"
> interval="5s" timeout="3s"/>
> </operations>
> <instance_attributes id="IPaddr2-10.1.53.236-attributes">
> <attributes>
> <nvpair id="ipaddr2-10.1.53.236-ip" name="ip"
> value="10.1.53.236"/>
> <nvpair id="ipaddr2-10.1.53.236-broadcast"
> name="broadcast" value="10.1.53.255"/>
> <nvpair id="ipaddr2-10.1.53.236-cidr_netmask"
> name="cidr_netmask" value="24"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive id="OpenSer" class="ocf" type="OpenSer"
> provider="bandwidth.com">
> <operations>
> <op id="openser-start" name="start" timeout="5s"/>
> <op id="openser-stop" name="stop" timeout="3s"/>
> <op id="openser-monitor" name="monitor" interval="10s"
> timeout="3s">
> <instance_attributes id="monitor_10s">
> <attributes>
> <nvpair id="openser-monitor-ip" name="ip"
> value="127.0.0.1"/>
> </attributes>
> </instance_attributes>
> </op>
> </operations>
> </primitive>
> </group>
> </resources>
> <constraints>
> <rsc_location id="OpenSer_resource_location" rsc="OpenSer">
> <rule id="rule_box01" score="100">
> <expression id="expression_uname_eq_box01"
> attribute="#uname" operation="eq" value="box01"/>
> </rule>
> <rule id="rule_box02" score="10">
> <expression id="expression_uname_eq_box02"
> attribute="#uname" operation="eq" value="box02"/>
> </rule>
> </rsc_location>
> </constraints>
> </configuration>
> </cib>