[Linux-HA] Failure to start resource makes it impossible to fail back

Andrew Beekhof beekhof at gmail.com
Tue Nov 13 00:37:14 MST 2007


On Nov 13, 2007, at 12:15 AM, Anders Brownworth wrote:

> Hi,
>
> I have a primary / backup v2.0.8 setup monitoring OpenSer and 2 IP  
> addresses.
>
> If I make a mistake in a config file for a resource that is being  
> controlled by Linux-HA (OpenSer) and for whatever reason the  
> resource dies and a restart is attempted, the restart will fail and  
> the resource will migrate to the backup node as expected. However  
> once I fix the problem so the resource could start again on the  
> primary, I can never get Linux-HA to migrate the resource back.
>
> I don't think this has anything to do with scoring because when I  
> don't break my config files and manually kill the service 13 times  
> on box01 (the reason for 13 is in my included cib.xml) the resources  
> migrate from box01 to box02 as expected. Setting the fail count  
> back below 13 causes the service to migrate back, also as expected.
>
> However, trying to fail back to a system that previously had broken  
> OpenSer config files that have now been fixed, I can't get them to  
> come back no matter how low I set the fail count. Is there another  
> variable or INFINITY constraint somewhere that gets set when a  
> resource fails to start that makes the resources stay away? What can  
> I do when I want Linux-HA to re-try migration of the service back to  
> a recently hand fixed primary?

Prior to the latest interim build, start failures were always fatal and  
required the use of crm_resource -C to make the node eligible again.

As of the last interim release, just make sure  
start-failure-is-fatal=false is set and use crm_failcount as you have  
below for "normal" failures.
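
For illustration, the option could be set in the crm_config section  
along these lines. This is only a sketch: it assumes a build recent  
enough to recognise start-failure-is-fatal, the nvpair id is an  
arbitrary name chosen here, and the enclosing property set is the one  
from the cib quoted below.

```xml
<!-- Sketch only: the nvpair id is a hypothetical name, and the option
     is only honoured by builds that support start-failure-is-fatal. -->
<cluster_property_set id="cluster-property-set">
  <attributes>
    <!-- Treat a failed start like any other failure instead of
         banning the resource from the node outright. -->
    <nvpair id="start-failure-is-fatal" name="start-failure-is-fatal" value="false"/>
  </attributes>
</cluster_property_set>
```

With that in place, a failed start should simply count towards the  
fail count, so lowering the count with crm_failcount is enough to let  
the resource return.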

> Additionally, I followed the advice under "Resetting Failure Counts"  
> in the V2 FAQ ( http://linux-ha.org/v2/faq ) where it suggests:
>
> crm_failcount -D -U nodeA -r my_rsc
>
> Rather than reset the failure count, this just torches it in such a  
> way that you can't even read it with the query command given in the  
> next step of the same example. I found statically setting the count  
> back to 0 with:
>
> crm_failcount -v 0 -U box01 -r OpenSer
>
> worked much better and allowed me to push resources back and forth  
> just by moving the fail count up and down.
>
> Thanks.
>
> -Anders
>
> <cib admin_epoch="1" have_quorum="true" num_peers="1"  
> cib_feature_revision="1.3" ignore_dtd="false" ccm_transition="3"  
> generated="true" dc_uuid="9052abe5-87ee-4400-a008-c5f13205e94b"  
> epoch="15" num_updates="606" cib-last-written="Mon Nov 12 22:37:10  
> 2007">
>  <configuration>
>    <crm_config>
>      <cluster_property_set id="cluster-property-set">
>        <attributes>
>          <nvpair id="short_resource_names"  
> name="short_resource_names" value="true"/>
>          <nvpair id="pe-input-series-max" name="pe-input-series-max"  
> value="-1"/>
>          <nvpair id="default-resource-stickiness"  
> name="default-resource-stickiness" value="10"/>
>          <nvpair id="default-resource-failure-stickiness"  
> name="default-resource-failure-stickiness" value="-10"/>
>        </attributes>
>      </cluster_property_set>
>    </crm_config>
>    <nodes>
>      <node id="9052abe5-87ee-4400-a008-c5f13205e94b" uname="box01"  
> type="normal"/>
>      <node id="47658455-4da2-48d4-a8da-419b2f93f039" uname="box02"  
> type="normal"/>
>    </nodes>
>    <resources>
>      <group id="IPaddr2_OpenSer_group">
>        <primitive id="IPaddr2-10.1.53.235" class="ocf"  
> type="IPaddr2" provider="heartbeat">
>          <operations>
>            <op id="ipaddr2-10.1.53.235-monitor" name="monitor"  
> interval="5s" timeout="3s"/>
>          </operations>
>          <instance_attributes id="IPaddr2-10.1.53.235-attributes">
>            <attributes>
>              <nvpair id="ipaddr2-10.1.53.235-ip" name="ip"  
> value="10.1.53.235"/>
>              <nvpair id="ipaddr2-10.1.53.235-broadcast"  
> name="broadcast" value="10.1.53.255"/>
>              <nvpair id="ipaddr2-10.1.53.235-cidr_netmask"  
> name="cidr_netmask" value="24"/>
>            </attributes>
>          </instance_attributes>
>        </primitive>
>        <primitive id="IPaddr2-10.1.53.236" class="ocf"  
> type="IPaddr2" provider="heartbeat">
>          <operations>
>            <op id="ipaddr2-10.1.53.236-monitor" name="monitor"  
> interval="5s" timeout="3s"/>
>          </operations>
>          <instance_attributes id="IPaddr2-10.1.53.236-attributes">
>            <attributes>
>              <nvpair id="ipaddr2-10.1.53.236-ip" name="ip"  
> value="10.1.53.236"/>
>              <nvpair id="ipaddr2-10.1.53.236-broadcast"  
> name="broadcast" value="10.1.53.255"/>
>              <nvpair id="ipaddr2-10.1.53.236-cidr_netmask"  
> name="cidr_netmask" value="24"/>
>            </attributes>
>          </instance_attributes>
>        </primitive>
>        <primitive id="OpenSer" class="ocf" type="OpenSer"  
> provider="bandwidth.com">
>          <operations>
>            <op id="openser-start" name="start" timeout="5s"/>
>            <op id="openser-stop" name="stop" timeout="3s"/>
>            <op id="openser-monitor" name="monitor" interval="10s"  
> timeout="3s">
>              <instance_attributes id="monitor_10s">
>                <attributes>
>                  <nvpair id="openser-monitor-ip" name="ip"  
> value="127.0.0.1"/>
>                </attributes>
>              </instance_attributes>
>            </op>
>          </operations>
>        </primitive>
>      </group>
>    </resources>
>    <constraints>
>      <rsc_location id="OpenSer_resource_location" rsc="OpenSer">
>        <rule id="rule_box01" score="100">
>          <expression id="expression_uname_eq_box01"  
> attribute="#uname" operation="eq" value="box01"/>
>        </rule>
>        <rule id="rule_box02" score="10">
>          <expression id="expression_uname_eq_box02"  
> attribute="#uname" operation="eq" value="box02"/>
>        </rule>
>      </rsc_location>
>    </constraints>
>  </configuration>
> </cib>
>


