[Linux-HA] Groups vs colocations.... etc
Andreas Kurz
akurz at sms.at
Wed Dec 6 09:38:27 MST 2006
Andrew Beekhof wrote:
> On 11/28/06, Andreas Kurz <akurz at sms.at> wrote:
>> Serge Dubrouski wrote:
>> > Most clusterware products, at least those that I've worked with
>> > (Veritas VCS, RedHat ClusterSuite, HP ServiceGuard, etc.) consider
>> > resources in a group dependent on each other. Upper resources depend
>> > on lower ones, like a DB depending on the Filesystem with its data
>> > files. That means that if the Filesystem fails, the DB has to be
>> > restarted. And Heartbeat works exactly like this if you have a group
>> > with the collocated property set to "true". As I understand it, that's
>> > completely right. If you don't want that dependency, exclude your NFS
>> > filesystem from the group but add a colocation constraint between
>> > that group and the separate NFS resource. That might help.
>> >
>> > As for stickiness I personally don't like how it's implemented in
>> > Heartbeat, I'd prefer having a simple property
>> > "number_of_fails_before_failover".
>
> which doesn't in any way affect this scenario (groups) because you've
> still got one part of the group trying to stay where it is and the
> other trying to move. at least with the scoring the CRM gets some
> hint as to which part of the group it should take the most notice of.
>
> there is a comment further down which talks about one resource being
> "buggy"... expecting cluster software to magically compensate for
> inherently broken resources is unrealistic.
You are right, of course! I only wanted to produce some errors for the
test-scenario ;-)
>> eg:
>>
>> a group with 5 resources, 2 nodes
>> location constraint: score 1 for node1, score 10 for node2
>> resource stickiness: 10
>> failure stickiness: 5
>>
>> resource failed over to node1 because of an unexpected server hang of
>> node2, node2 up again (I assume the location scores working correctly
>> ;-) )
>>
>> node1: 5*(1) + 5*(10) = 55
>> node2: 5*(10) = 50
>>
>> ok ... resource stays on node1
>>
>> one resource is buggy, heartbeat starts to stop/start it
>>
>> restart1:
>>
>> node1: 5*1 + 5*10 - 1*5 = 50
>> node2: 5*10 = 50
>>
>> ok ... resource stays on node1
So with equal scores the group is moved away because of the "lower load"
of node2? Is this computed by the number of resources running on each node?
>>
>> restart2:
>> node1: 5*1 + 5*10 - 2*5 = 45
>> node2: 5*10 = 50
>>
>> takeover, after 1 local restart, am I right?
>
> you tell me - try ptest and see what it does.
OK. The group is moved away when either the combined score of node1 is
lower than node2 or if the score for one resource is negative.
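
The arithmetic quoted above can be replayed with a small sketch. It only
models the simplified sums used in this thread (location score and
stickiness per group member, minus a penalty per failure); it is not the
CRM's actual placement algorithm, and the function and parameter names
are made up for illustration.

```python
def group_score(n_resources, location, stickiness=0,
                failure_penalty=0, failcount=0, active=False):
    """Combined score of a group on one node, as summed in this thread.

    Hypothetical helper: every group member contributes the node's
    location score, members also contribute the resource stickiness on
    the node where the group currently runs, and each recorded failure
    on that node subtracts the failure stickiness once.
    """
    score = n_resources * location
    if active:
        score += n_resources * stickiness
    score -= failcount * failure_penalty
    return score


# Values from the example: 5 resources, location 1 (node1) / 10 (node2),
# resource stickiness 10, failure stickiness 5, group running on node1.
node2 = group_score(5, location=10)
for failures in range(3):
    node1 = group_score(5, location=1, stickiness=10,
                        failure_penalty=5, failcount=failures, active=True)
    print(f"failures={failures}: node1={node1} node2={node2}")
```

In this model node1 drops below node2 at the second failure, which
matches the takeover described under "restart2".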
>
>> resource group is on node2,
>
>> failcount reset on node1:
>
> the failcount is never reset automatically
I did it manually ;-)
>> node1: 5*1 = 5
>> node2: 5*10 + 5*10 = 100
>>
>> hmm ... that's a problem, or have I missed something?
>
> why is this a problem?
>
>> that would lead to about 20 local restarts before a failover to node1
>> happens ....
Not so many, but more than on the other node with the lower scores. The
group fails over when the local score for the failing resource is negative.
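
As a rough check of that, here is a sketch that counts local failures
until one of the two conditions I observed is met (combined group score
lower than on the other node, or the failing resource's own local score
negative). It assumes the same simplified sums as in the example above
and is not taken from the CRM code; all names are hypothetical.

```python
def failures_before_failover(n_resources, location_here, location_other,
                             stickiness, failure_penalty):
    """Count failures of one resource until the group would move away.

    Assumption (from the observations in this mail, not from the CRM
    source): the group moves when its combined local score is lower than
    the score on the other node, or when the failing resource's own
    local score becomes negative.
    """
    if failure_penalty <= 0:
        raise ValueError("failure_penalty must be positive")
    per_resource = location_here + stickiness   # one member, running here
    group_here = n_resources * per_resource     # whole group, running here
    group_other = n_resources * location_other  # whole group, other node
    failures = 0
    while True:
        failures += 1
        resource_score = per_resource - failures * failure_penalty
        group_score = group_here - failures * failure_penalty
        if group_score < group_other or resource_score < 0:
            return failures


# Group on node1 (location 1), other node has location 10.
print(failures_before_failover(5, 1, 10, stickiness=10, failure_penalty=5))
# Group on node2 (location 10), other node has location 1.
print(failures_before_failover(5, 10, 1, stickiness=10, failure_penalty=5))
```

With the example values the group leaves node1 because of the combined
score, but leaves node2 because the failing resource's own score goes
negative long before the combined score falls below node1's.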
>
> so choose different values
> or don't apply the rsc_location preference to every member of the group
I tried to configure instance_attributes for the group with different
resource_failure_stickiness values but without success, the rule never
matches:
ptest[27606]: 2006/12/06_17:30:43 debug: debug2: test_rule:rules.c
Testing rule higher_failure_stickiness_rule
ptest[27606]: 2006/12/06_17:30:43 debug: debug2: test_expression:rules.c
Expression test failed on all ndoes
ptest[27606]: 2006/12/06_17:30:43 debug: debug3: test_rule:rules.c
Expression higher_failure_stickiness_rule/test failed
ptest[27606]: 2006/12/06_17:30:43 debug: debug3: unpack_attr_set:rules.c
Adding attributes from lower_failure_stickiness_inst
<instance_attributes id="higher_failure_stickiness_inst" score="100">
  <rule id="higher_failure_stickiness_rule" boolean_op="and">
    <expression attribute="#uname" operation="eq"
                value="sms-nfs-02" id="test"/>
  </rule>
  <attributes>
    <nvpair id="higher_failure_stickiness_id"
            name="resource_failure_stickiness" value="-10"/>
  </attributes>
</instance_attributes>
<instance_attributes id="lower_failure_stickiness_inst" score="10">
  <attributes>
    <nvpair id="lower_failure_stickiness_id"
            name="resource_failure_stickiness" value="-1"/>
  </attributes>
</instance_attributes>
Andrew, do you have a hint why this is not working? The group is
currently running on the node sms-nfs-02. I tried the same with a time
based rule and it worked.
Regards,
Andi
>
>>
>> If I am completely wrong please correct me!
>>
>> Regards,
>> Andreas
>>
>> >
>> > On 11/28/06, Andre van der Vlies <andre at vandervlies.xs4all.nl> wrote:
>> >>
>> >> Andreas Kurz wrote:
>> >> > Andre van der Vlies wrote:
>> >> >> Andrew Beekhof wrote:
>> >> >>>> So, given:
>> >> >>>> IPaddr_1
>> >> >>>> IPaddr_2
>> >> >>>> NFS_1
>> >> >>>> NFS_2
>> >> >>>> PG
>> >> >>>>
>> >> >>>> there's no way I can prevent NFS_2 and PG from being stopped and
>> >> >>>> started if NFS_1 fails, make NFS_1 retry 5 times and if it doesn't
>> >> >>>> succeed the whole group needs to failover... :-/
>> >> >>>
>> >> >>> not in a group.
>> >> >>> but groups are only a syntactic shortcut for a bunch of colocation
>> >> >>> and ordering constraints.
>> >> >>>
>> >> >>> so don't use a group and don't make NFS_2 depend on NFS_1
>> >> >>>
>> >> >>
>> >> >> Sorry, I still don't get it...
>> >> >>
>> >> >> I've got 5 resources.
>> >> >> I make constraints to start them in the right order (1, 2, 3, 4, 5)
>> >> >> I make constraints to get them start on the same node...
>> >> >
>> >> > That's what a group implies, you don't need to make them 'by hand'
>> >> > or if you prefer it that way you can disable all constraints from
>> >> > the group.
>> >> > Then your group is only a naming convention for your convenience.
>> >> >
>> >> >>
>> >> >> As a bonus I can do stuff with the stickiness of a resource. For
>> >> >> instance resource 3 fails and is retried 5 times before it fails
>> >> >> over to another node; which makes all the other resources migrate...
>> >> >>
>> >> >
>> >> > Yes, because of the colocation constraints.
>> >> >
>> >> >> But....
>> >> >> If I put those 5 resources in a group (colocation, order), I can
>> >> >> only use the stickiness of the last resource in the group. None of
>> >> >> the others seems to have any vote in the matter. And if a 'midlist'
>> >> >> resource fails all lower resources are stopped and started....
>> >> >
>> >> > The stickiness, no matter if it's the 'resource_failure_stickiness'
>> >> > or the 'resource_stickiness', is bound to a resource independent
>> >> > from where the resource is defined in the group.
>> >> >
>> >>
>> >> Okay.
>> >>
>> >> > All resources in a group are bound together by the colocation
>> >> > constraints so a failing resource has influence on the whole group
>> >> > and the score of the group. The sum of all scores of all resources
>> >> > in a group decides on which node the whole group has to run. So if
>> >> > you define a failure stickiness every failing resource lowers the
>> >> > group score.
>> >> >
>> >>
>> >> That has been my reasoning too... My experience tells me otherwise
>> >>
>> >> > Because the ordering constraints are per default symmetric they
>> >> > imply also a stop_before and not only the defined start_before
>> >> > constraint, and I think it makes sense most of the time ... but it
>> >> > can also be disabled.
>> >> >
>> >>
>> >> Hmmm.... How do I do this exactly?
>> >>
>> >> > Hope that helps ;-)
>> >> >
>> >>
>> >> A bit. I have been reasoning along the same path. The behaviour of my
>> >> cluster is (very) different from what I expected...
>> >>