[Linux-HA] RE:Re: Re: Failover of resource
Andrew Beekhof
beekhof at gmail.com
Wed Jul 11 08:56:11 MDT 2007
On 7/11/07, Taldevkar, Chetan <chetan.taldevkar at patni.com> wrote:
> Thanks Andrew,
>
> I have now switched to an OCF resource and exit with $OCF_ERR_GENERIC.
> But the monitor still continues to run on the same node for some time. I
> have tried the options given in the forced_failover link, but it does not
> fail over to another node immediately.
_attach_ the result of "cibadmin -Q" when the cluster is in this state
and I will have a look.
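
Something along these lines should do for capturing it. This is only a rough sketch: dump_cib and the default output path are placeholder names made up for illustration, and only "cibadmin -Q" itself is the real command.

```shell
#!/bin/sh
# Sketch: save the live CIB to a file so it can be attached to a mail.
# dump_cib and the default path are hypothetical names, not part of heartbeat.
dump_cib() {
    out="${1:-/tmp/cib-failed-state.xml}"
    # Bail out early if the cluster CLI tools are not installed or not in PATH.
    if ! command -v cibadmin >/dev/null 2>&1; then
        echo "cibadmin not found in PATH" >&2
        return 1
    fi
    # -Q queries the running cluster and prints the CIB as XML.
    cibadmin -Q > "$out" || return 1
    # Print the file name so the caller knows what to attach.
    echo "$out"
}
```

Run it while the monitor is still failing on the old node, e.g. "dump_cib /tmp/cib.xml", and attach the resulting file.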
>
> Can you please check my cib.xml and let me know whether anything is
> missing from it? This will help me a lot. I have to report the maximum
> time taken to fail over by tomorrow. Currently it is not consistent:
> sometimes it takes 30 seconds, sometimes 25, 44, etc.
>
>
> ----cib xml-------------
>
> <cib admin_epoch="0" have_quorum="true" ignore_dtd="false" num_peers="2"
> cib_feature_revision="1.3" ccm_transition="2" generated="true"
> dc_uuid="1c3fdfbd-ee55-47e3-a8c2-52f34a5c5553" epoch="32"
> num_updates="822" cib-last-written="Wed Jul 11 19:29:54 2007">
> <configuration>
> <crm_config>
> <cluster_property_set id="cib-bootstrap-options">
> <attributes>
> <nvpair id="symmetric_cluster" name="symmetric_cluster"
> value="true"/>
> <nvpair id="no_quorum_policy" name="no_quorum_policy"
> value="stop"/>
> <nvpair id="default_resource_stickiness"
> name="default_resource_stickiness" value="0"/>
> <nvpair id="default_resource_failure_stickiness"
> name="default_resource_failure_stickiness" value="-1000"/>
> <nvpair name="default-resource-failure-stickiness"
> id="cib-bootstrap-options-default-resource-failure-stickiness"
> value="-1001"/>
> <nvpair name="last-lrm-refresh"
> id="cib-bootstrap-options-last-lrm-refresh" value="1184156965"/>
> <nvpair id="cib-bootstrap-options-default-action-timeout"
> name="default-action-timeout" value="10s"/>
> </attributes>
> </cluster_property_set>
> </crm_config>
> <nodes>
> <node id="1c3fdfbd-ee55-47e3-a8c2-52f34a5c5553"
> uname="wl1.patni.com" type="normal"/>
> <node id="5426e37c-9469-40a3-813c-eebeb0b7c6a0"
> uname="testconfig.patni.com" type="normal"/>
> </nodes>
> <resources>
> <group ordered="true" collocated="true" resource_stickiness="500"
> restart_type="ignore" id="group_org">
> <primitive class="ocf" type="IPaddr" provider="heartbeat"
> id="res_vip">
> <instance_attributes id="res_vip_instance_attrs">
> <attributes>
> <nvpair id="d9d9e988-8d3d-434c-aae6-9cdd9e90e354"
> name="ip" value="172.20.1.94"/>
> <nvpair id="7e9d6497-e010-4d1b-b89e-f856913a1bc2"
> name="netmask" value="24"/>
> <nvpair id="d5427493-a54f-485f-a6d4-adb6843aee5a"
> name="nic" value="eth0"/>
> </attributes>
> </instance_attributes>
> </primitive>
> <primitive class="ocf" type="ttsvc.sh" provider="heartbeat"
> id="res_ttsvc">
> <instance_attributes id="res_ttsvc_instance_attrs">
> <attributes/>
> </instance_attributes>
> <operations>
> <op name="start" description="start tt" start_delay="1s"
> disabled="false" role="Started" prereq="nothing" on_fail="stop"
> id="5a535fd1-7c17-43e4-aa88-3acaadb3c90e" timeout="3s"/>
> <op name="monitor" description="check" start_delay="0"
> disabled="false" role="Started" prereq="nothing" on_fail="restart"
> interval="2s" id="a89ffb28-711f-4fab-9bfc-d6a9e7fcd580" timeout="3s"/>
> <op id="25caed10-606b-42e5-8fed-aa08e7bfa2df" name="stop"
> description="stop op" timeout="1s" start_delay="1s" disabled="false"
> role="Started" prereq="nothing" on_fail="ignore"/>
> </operations>
> </primitive>
> </group>
> </resources>
> <constraints>
> <rsc_location id="place_testconfig" rsc="group_org">
> <rule id="prefered_testconfig" score="1500">
> <expression id="e1" attribute="#uname" operation="eq"
> value="testconfig.patni.com"/>
> </rule>
> </rsc_location>
> <rsc_location id="place_wl1config" rsc="group_org">
> <rule id="prefered_wl1config" score="1000">
> <expression id="e2" attribute="#uname" operation="eq"
> value="wl1.patni.com"/>
> </rule>
> </rsc_location>
> </constraints>
> </configuration>
> </cib>
>
> ------------
>
>
> >
> >
> > monitor is the correct name
> > (the LRM will magically change the action to status for heartbeat and
> > lsb scripts)
> >
> > Both options are not working. They continue to execute the
> > > script even though it returned "stopped".
> >
> > more information?
> >
> > <Chetan> What I mean here is that, on getting an error during execution
> > of the monitor operation, I echo "Error: stopped" and then exit 9.
>
> it's not a good idea to make up return codes.
> please read:
> http://linux-ha.org/LSBResourceAgent
> and in particular follow the link through to the specification of how
> an LSB resource is required to behave.
>
> >
> > My understanding is that after seeing the 'stopped' string, linux-ha
> > will trigger on_fail='stop' and invoke the stop part of my script.
>
> no, only return codes count. see above page.
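
To make that concrete, here is a rough sketch of a status action that only uses exit codes the LSB spec defines (0 = running, 3 = not running). tt_is_running and TT_CHECK_CMD are made-up placeholders for whatever check the script really performs. I don't know what ttsvc.sh actually runs, so treat this as an illustration only.

```shell
#!/bin/sh
# Sketch of an LSB-style "status" action. tt_is_running and TT_CHECK_CMD are
# hypothetical placeholders; replace them with the real service check.
TT_CHECK_CMD="${TT_CHECK_CMD:-true}"

tt_is_running() {
    # Assumption: the check command exits 0 when the service is healthy.
    $TT_CHECK_CMD >/dev/null 2>&1
}

tt_status() {
    if tt_is_running; then
        echo "running"
        return 0    # LSB: program is running or service is OK
    fi
    echo "stopped"  # informational only; the cluster does not parse this text
    return 3        # LSB: program is not running
}
```

The status branch of the script would then end with "tt_status; exit $?" instead of a made-up code such as 9.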
>
> > After invoking
> > stop operation of the script I am expecting linux-ha to start the
> > failover operation on another node and invoke start operation on the
> > resource.
> > But actual behavior is as below.
> > 1. The monitor operation continues on the same node for some time,
> > around 20 seconds (it varies). Only then does it start the resource on
> > the other node.
> >
> > Is it possible to avoid this? Can I achieve failover on the first
> > occurrence of an error during the monitor operation?
>
> http://linux-ha.org/v2/faq/forced_failover
>
> >
> > One more observation: if I use on_fail='fence' without stonith
> > enabled, failover occurs in less time.
> >
> > Will the choice of resource type (OCF, heartbeat) give different results?
> >
> > My requirement is as below :
> >
> > There is no shared device between the two nodes. Each node has a
> > TimesTen datastore. One is in active mode and the other in standby
> > mode. If the active instance of TimesTen fails, then scripts need to
> > be executed on the standby node to make it active.
> >
> > </Chetan>
> >
> >
> > at startup, we will call your script to check that it is not already
> > running in the cluster.
> > is this what you are talking about or something else?
> >
> >
> > Am I wrong in choosing resource type?
> > >
> > > What should I set on_fail to? (I tried stop, restart, block.)
> >
> >
> >
> > It depends on what you're trying to achieve.
> >
> >
> > I am not
> > > using fence as my understanding is that it will reboot the failed
> > > machine, which I don't want, or is there an option not to reboot?
> > >
> > > What option should I use with on_fail to stop the monitor/status
> > > operation in case it fails in first instance?
> >
> >
> > on_fail is irrelevant here. as the page i referred you to indicates,
> > you need to set default_resource_failure_stickiness
> >
> > Thanks again,
> > > Chetan
> > >