[Linux-HA] Failover not working as I expected
Jerome Yanga
jyanga at esri.com
Mon Jan 26 15:23:03 MST 2009
Andrew,
I apologize for sending my previous email abruptly.
I have followed your recommendation and installed Pacemaker.
Here is my config.
Packages Installed:
heartbeat-2.99.2-6.1
heartbeat-common-2.99.2-6.1
heartbeat-debug-2.99.2-6.1
heartbeat-ldirectord-2.99.2-6.1
heartbeat-resources-2.99.2-6.1
libheartbeat2-2.99.2-6.1
libpacemaker3-1.0.1-3.1
pacemaker-1.0.1-3.1
pacemaker-debug-1.0.1-3.1
pacemaker-pygui-1.4-11.9
pacemaker-pygui-debug-1.4-11.9
ha.cf:
# Logging
debug 1
use_logd false
logfacility daemon
# Misc Options
traditional_compression off
compression bz2
coredumps true
# Communications
udpport 691
bcast eth1 eth0
autojoin any
# Thresholds (in seconds)
keepalive 1
warntime 6
deadtime 10
initdead 15
ping 10.50.254.254
crm respawn
apiauth mgmtd uid=root
respawn root /usr/lib/heartbeat/mgmtd -v
cib.xml:
<cib admin_epoch="0" validate-with="pacemaker-1.0" crm_feature_set="3.0" have-quorum="1" epoch="57" dc-uuid="5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" num_updates="0" cib-last-written="Mon Jan 26 13:57:32 2009">
<configuration>
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.0.1-node: 6fc5ce8302abf145a02891ec41e5a492efbe8efe"/>
</cluster_property_set>
</crm_config>
<nodes>
<node id="5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" uname="nomen.esri.com" type="normal">
<instance_attributes id="nodes-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e">
<nvpair id="standby-5e3e3c2d-55e7-4c51-90be-5c4a1912bf3e" name="standby" value="off"/>
</instance_attributes>
</node>
<node id="27f54ec3-b626-4b4f-b8a6-4ed0b768513c" uname="rubric.esri.com" type="normal">
<instance_attributes id="nodes-27f54ec3-b626-4b4f-b8a6-4ed0b768513c">
<nvpair id="standby-27f54ec3-b626-4b4f-b8a6-4ed0b768513c" name="standby" value="off"/>
</instance_attributes>
</node>
</nodes>
<resources>
<group id="Directory_Server">
<meta_attributes id="Directory_Server-meta_attributes">
<nvpair id="Directory_Server-meta_attributes-collocated" name="collocated" value="true"/>
<nvpair id="Directory_Server-meta_attributes-ordered" name="ordered" value="true"/>
<nvpair id="Directory_Server-meta_attributes-resource_stickiness" name="resource_stickiness" value="100"/>
</meta_attributes>
<primitive class="ocf" id="VIP" provider="heartbeat" type="IPaddr">
<instance_attributes id="VIP-instance_attributes">
<nvpair id="VIP-instance_attributes-ip" name="ip" value="10.50.26.250"/>
</instance_attributes>
<operations id="VIP-ops">
<op id="VIP-monitor-5s" interval="5s" name="monitor" timeout="5s"/>
</operations>
</primitive>
<primitive class="ocf" id="ECAS" provider="esri" type="ecas">
<operations id="ECAS-ops">
<op id="ECAS-monitor-3s" interval="3s" name="monitor" timeout="3s"/>
</operations>
<meta_attributes id="ECAS-meta_attributes">
<nvpair id="ECAS-meta_attributes-target-role" name="target-role" value="Started"/>
</meta_attributes>
</primitive>
<primitive class="ocf" id="FDS_Admin" provider="esri" type="fdsadm">
<operations id="FDS_Admin-ops">
<op id="FDS_Admin-monitor-3s" interval="3s" name="monitor" timeout="3s"/>
</operations>
</primitive>
</group>
</resources>
<constraints>
<rsc_location id="cli-prefer-Directory_Server" rsc="Directory_Server">
<rule id="cli-prefer-rule-Directory_Server" score="INFINITY" boolean-op="and">
<expression id="cli-prefer-expr-Directory_Server" attribute="#uname" operation="eq" value="rubric.esri.com" type="string"/>
</rule>
</rsc_location>
<rsc_location id="cli-prefer-FDS_Admin" rsc="FDS_Admin">
<rule id="cli-prefer-rule-FDS_Admin" score="INFINITY" boolean-op="and">
<expression id="cli-prefer-expr-FDS_Admin" attribute="#uname" operation="eq" value="nomen.esri.com" type="string"/>
</rule>
</rsc_location>
</constraints>
</configuration>
</cib>
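To make the constraints section above easier to read at a glance, here is a minimal sketch that lists each rsc_location rule in a CIB fragment. The embedded XML is a copy of the constraints posted above; the helper name `location_rules` is hypothetical and is not part of any Pacemaker tool.

```python
import xml.etree.ElementTree as ET

# Copy of the <constraints> section from the cib.xml above.
CONSTRAINTS_XML = """
<constraints>
  <rsc_location id="cli-prefer-Directory_Server" rsc="Directory_Server">
    <rule id="cli-prefer-rule-Directory_Server" score="INFINITY" boolean-op="and">
      <expression id="cli-prefer-expr-Directory_Server" attribute="#uname"
                  operation="eq" value="rubric.esri.com" type="string"/>
    </rule>
  </rsc_location>
  <rsc_location id="cli-prefer-FDS_Admin" rsc="FDS_Admin">
    <rule id="cli-prefer-rule-FDS_Admin" score="INFINITY" boolean-op="and">
      <expression id="cli-prefer-expr-FDS_Admin" attribute="#uname"
                  operation="eq" value="nomen.esri.com" type="string"/>
    </rule>
  </rsc_location>
</constraints>
"""


def location_rules(xml_text):
    """Return (resource, score, attribute, operation, value) for each rule expression."""
    root = ET.fromstring(xml_text)
    rules = []
    for loc in root.iter("rsc_location"):
        for rule in loc.findall("rule"):
            for expr in rule.findall("expression"):
                rules.append((loc.get("rsc"), rule.get("score"),
                              expr.get("attribute"), expr.get("operation"),
                              expr.get("value")))
    return rules


for rsc, score, attr, op, value in location_rules(CONSTRAINTS_XML):
    print(f"{rsc}: score {score} when {attr} {op} {value}")
```

For the configuration above this lists two INFINITY preferences that point at different nodes: the Directory_Server group on rubric.esri.com and its member FDS_Admin on nomen.esri.com.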
I still have the same issues that I had when I was running only heartbeat 2.1.3-1. My concerns are still as follows:
01) When a node comes back up after a restart of heartbeat, resources get bounced when it rejoins the cluster.
02) Stopping one resource in a group does not fail over the group to the other node.
Help.
Regards,
Jerome
-----Original Message-----
From: linux-ha-bounces at lists.linux-ha.org [mailto:linux-ha-bounces at lists.linux-ha.org] On Behalf Of Andrew Beekhof
Sent: Tuesday, January 20, 2009 1:33 PM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Failover not working as I expected
On Tue, Jan 20, 2009 at 21:48, Jerome Yanga <jyanga at esri.com> wrote:
> Dominik,
>
> Per your request, attached is my current configuration.
>
> To reiterate, the following are still concerns:
>
> 01) Resources get bounced when Nomen rejoins the cluster.
> 02) Group failover will not work as hoped.
>
> As for resource monitoring, I believe that the customized init scripts are working properly; however, my being a noob may mean otherwise. I have tested the init scripts so that when a resource fails, the service is restarted. After seeing that the init script is working, I have set the "On Fail" value to "stop" instead of "restart".
>
> Moreover, I have tried varying the group scores by changing the resource_stickiness and the resource_failure_stickiness values.
I would highly encourage you to upgrade to the latest stable series of
Pacemaker.
The whole failure stickiness nonsense has been completely dropped in
favor of something that's actually usable.
http://clusterlabs.org/wiki/Install
http://clusterlabs.org/wiki/Documentation <-- look for the 1.0 version
of configuration explained
> However, I have not been able to consistently fail over the group by stopping one of the resources. During the testing, I tried using the equation below from the site you provided in your previous email.
>
> node = (constraint-score) + (num_group_resources * resource_stickiness) + (failcount * (resource_failure_stickiness) )
>
> Unfortunately, the scores do not seem to follow this equation when I verify them using showscores.sh. The following values were assigned to the Directory_Server group during this testing.
>
> resource_stickiness=100
> resource_failure_stickiness=-500
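As a rough sketch, the equation quoted above can be evaluated directly. The function name `group_score` and its parameter names are made up for illustration, and a constraint score of 0 is an assumption. The formula is the one quoted from the ScoreCalculation page, so it describes the heartbeat 2.1.x scoring model and not necessarily what Pacemaker 1.0 computes.

```python
def group_score(constraint_score, num_group_resources, resource_stickiness,
                failcount, resource_failure_stickiness):
    """Evaluate the quoted ScoreCalculation equation for one node."""
    return (constraint_score
            + num_group_resources * resource_stickiness
            + failcount * resource_failure_stickiness)


# Values from this thread: Directory_Server has three members
# (VIP, ECAS, FDS_Admin), resource_stickiness=100 and
# resource_failure_stickiness=-500. The constraint score of 0 is assumed.
for failcount in range(3):
    score = group_score(0, 3, 100, failcount, -500)
    print(f"failcount={failcount}: score={score}")
```

Under these assumptions the equation gives 300 with no failures, -200 after one failure, and -700 after two. If showscores.sh reports different numbers, the constraint score and the failcount actually in effect are the inputs to compare first.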
>
> I have also attempted to use the crm_failcount command to make sure that the scores get reset before I fail any resource, but showscores.sh seems to show that the command is not working.
>
> I have also tried to change the cib.xml file manually to assign the values above to default-resource-stickiness and default-resource-failure-stickiness respectively, but after doing so, all the resources seem to disappear. (Good thing I had created a copy of the cib.xml file.)
>
> By the way, I have changed the values back to the following:
>
> resource_stickiness=100
> resource_failure_stickiness=-100
>
> Help.
>
> Regards,
> Jerome
>
>
>
>
> -----Original Message-----
> From: linux-ha-bounces at lists.linux-ha.org [mailto:linux-ha-bounces at lists.linux-ha.org] On Behalf Of Dominik Klein
> Sent: Monday, January 19, 2009 11:31 PM
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Failover not working as I expected
>
> Jerome Yanga wrote:
>> Dominik,
>>
>> Thank you very much. Adding "resource-stickiness" and getting rid of the constraint helped a lot. The resources do not go back to Nomen anymore when its heartbeat is started again (the resources stay with Rubric). However, the resources still get bounced once Nomen joins the cluster. Is there any way to keep the resources from bouncing when Nomen rejoins the cluster?
>
> Please share your current configuration.
>
>> I have also observed another issue. As you have seen in my cib.xml, I have created a group called Directory_Server. In this group, there are three resources, namely VIP, ECAS and FDS_Admin. If I manually turn off any of these resources, I would like the group resource, Directory_Server, to fail over to the other node. Is there a configuration that will do this? Currently, if one of the three resources goes down, it stays down and the rest continue running. All three resources need to be up and running for our applications to work properly.
>
> Sounds like you're not doing any resource monitoring. Read up on that
> and configure it. The ScoreCalculation page might be handy to understand
> how things work: http://www.linux-ha.org/ScoreCalculation
>
> Regards
> Dominik
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems