[Linux-HA] Adding third node turns all resources unmanaged

Gerard Petersen gerard at gp-net.nl
Mon Jul 28 01:34:50 MDT 2008


Dear Andrew,

Nice one ... But I'm into python and not into C coding... ;-)

Seriously, were my conclusions far off? I'm a bit at a loss here.

Thanx again.

Regards,

Gerard.

Andrew Beekhof wrote:
> On Mon, Jul 28, 2008 at 08:56, Gerard Petersen <gerard at gp-net.nl> wrote:
>> Dear Andrew,
>>
>> Thanx for your response.
>>
>> I see two options/conclusions on which I would like your feedback:
>>
>> - Enable stonith, so that the attempt to start the resources on the third
>> node is 'naturally' disabled and the cluster software moves them back to
>> the first two nodes.
>>
>> - Install Xen (and drbd) on the third node, so the cluster software gets a
>> chance to run some commands and get a proper answer showing that the
>> resources don't belong there.
> 
> I think you missed the most preferable option... fix the RA to return
> OCF_NOT_INSTALLED in such cases and send us a patch :-)
> 
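
A minimal sketch of that suggestion in Python, for illustration only. The real
resource agents are shell scripts, so this is not the Xen RA itself: the names
monitor, required_tools, which and list_domains are hypothetical, and "xm" is
assumed to be the tool the agent depends on. The numeric values are the
standard OCF exit codes, in which code 5 is spelled OCF_ERR_INSTALLED.

```python
import shutil

# Standard OCF exit codes (only the ones this sketch needs).
OCF_SUCCESS = 0        # resource is running
OCF_ERR_GENERIC = 1    # generic failure
OCF_ERR_INSTALLED = 5  # required software is not installed on this node
OCF_NOT_RUNNING = 7    # resource is cleanly stopped


def monitor(domain_name, required_tools=("xm",), which=shutil.which,
            list_domains=None):
    """Return an OCF exit code for a monitor/probe of one Xen domain.

    `which` and `list_domains` are injectable so the logic can be tried
    without Xen; `list_domains` is a hypothetical callable returning the
    names of the domains currently running on this node.
    """
    # Without the tooling the resource cannot run here, so say so
    # instead of reporting "running" (0) or a generic error (1).
    for tool in required_tools:
        if which(tool) is None:
            return OCF_ERR_INSTALLED

    running = list_domains() if list_domains is not None else []
    return OCF_SUCCESS if domain_name in running else OCF_NOT_RUNNING
```

On a node without Xen, monitor("server128", which=lambda tool: None) returns 5
rather than 0, so a probe would not report the resource as active there.
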
>>
>> Kind regards,
>>
>> Gerard.
>>
>> Andrew Beekhof wrote:
>>> On Thu, Jul 24, 2008 at 16:29, Gerard Petersen <gerard at gp-net.nl> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm trying to add a third node to a working two-node cluster with
>>>> resources in the form of mirrored Xen (and underlying drbd) virtual
>>>> servers. The two-node setup works great and as expected. (On failure, the
>>>> drbd mirrors switch master/slave roles, XenU's migrate automatically,
>>>> etc.) The goal is to manually spread master/slave combinations of the
>>>> XenU's over the three physical nodes.
>>>>
>>>> The third node has already been added to the heartbeat config and is in
>>>> standby mode. We have constraints in place (full log and config will
>>>> follow) that use the +INF, 'zero' and -INF values as Master location,
>>>> Slave location and 'Never' location constraints, respectively.
>>>>
>>>> When we take the third node online, where according to the constraints
>>>> the current XenU's are not allowed, the resources are somehow all moved to
>>>> the third node, where no xen or drbd is present yet. It seems some of the
>>>> constraints are completely ignored. We have tried this, among other
>>>> things, with the symmetric_cluster value True and False, but no luck.
>>>>
>>>> Furthermore, the log shows that the resources become 'too active', and
>>>> after that they become unmanaged.
>>>>
>>> When a new node joins the cluster, we check to see if it's running any
>>> of the cluster resources.
>>> These checks occur regardless of any location constraints (precisely
>>> so that we can enforce them for you).
>>>
>>> What can happen, however, is that these checks fail.
>>> Sometimes they fail because the service was unexpectedly found to be
>>> active on the node.
>>> Sometimes it's because the resource agent (or the software it tries to
>>> talk to) isn't installed.
>>>
>>> In your case, it seems the RA is misbehaving and incorrectly telling
>>> the cluster that the resources are active, e.g.:
>>>             <lrm_rsc_op id="server128_monitor_0" operation="monitor"
>>> crm-debug-origin="build_active_RAs"
>>> transition_key="15:10:c195d63f-e91f-4162-8454-f6dde2c71ef1"
>>> transition_magic="0:0;15:10:c195d63f-e91f-4162-8454-f6dde2c71ef1"
>>> call_id="6" crm_feature_set="2.0" rc_code="0" op_status="0"
>>> interval="0" op_digest="78122685b830dcb8197c65561be6d6a5"/>
>>>
>>> rc_code="0" being the relevant piece of information
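
To pull the relevant fields out of such an entry, something like the following
Python sketch could be used. It assumes the entry is available as a standalone
XML string; SAMPLE is an abridged copy of the entry quoted above, and
summarize_op and OCF_RC_MEANING are hypothetical names, not part of any
heartbeat tool.

```python
import xml.etree.ElementTree as ET

# Meaning of the OCF exit codes that matter in this thread.
OCF_RC_MEANING = {
    0: "success (for a monitor: resource is running)",
    1: "generic error",
    5: "not installed",
    7: "not running",
}

# Abridged copy of the lrm_rsc_op entry from the status section.
SAMPLE = (
    '<lrm_rsc_op id="server128_monitor_0" operation="monitor" '
    'crm-debug-origin="build_active_RAs" call_id="6" '
    'rc_code="0" op_status="0" interval="0"/>'
)


def summarize_op(xml_text):
    """Return (id, operation, rc_code, meaning) for one lrm_rsc_op element."""
    op = ET.fromstring(xml_text)
    rc = int(op.get("rc_code"))
    meaning = OCF_RC_MEANING.get(rc, "other")
    return op.get("id"), op.get("operation"), rc, meaning


op_id, operation, rc, meaning = summarize_op(SAMPLE)
print(op_id, operation, rc, meaning)
```

For the probe above this reports rc 0, i.e. the RA claimed the resource was
running on a node that has no Xen installed.
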
>>>
>>> The cluster then thinks that the service is active on more than one
>>> node and tries to recover.
>>> But the RA then compounds the initial problem by failing to stop the
>>> service:
>>>
>>>             <lrm_rsc_op id="server128_stop_0" operation="stop"
>>> crm-debug-origin="build_active_RAs"
>>> transition_key="25:11:c195d63f-e91f-4162-8454-f6dde2c71ef1"
>>> transition_magic="0:1;25:11:c195d63f-e91f-4162-8454-f6dde2c71ef1"
>>> call_id="12" crm_feature_set="2.0" rc_code="1" op_status="0"
>>> interval="0" op_digest="78122685b830dcb8197c65561be6d6a5"/>
>>>
>>> Again, rc_code="1" is the part indicating failure,
>>>
>>> at which point the cluster can do nothing (since stonith is disabled).
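
The sequence described above can be summarised as a small decision function.
This is only a simplified model of the explanation in this thread, not how the
cluster software is actually implemented; next_action and its parameters are
hypothetical names.

```python
def next_action(active_nodes, stop_failed, stonith_enabled, max_active=1):
    """Model what the cluster can do after probing a resource.

    active_nodes:    nodes whose probe reported the resource as running
    stop_failed:     True if the recovery 'stop' returned a failure code
    stonith_enabled: True if the cluster may fence a node
    max_active:      how many nodes may run the resource at once
    """
    # Probes agree with the configuration: nothing to recover.
    if len(active_nodes) <= max_active:
        return "ok"
    # Active in too many places: recover by stopping the resource first.
    if not stop_failed:
        return "stop, then start where the constraints allow"
    # The stop failed: the only safe way forward is to fence the node.
    if stonith_enabled:
        return "fence the node, then recover"
    # No fencing available: the cluster has to leave the resource alone.
    return "unmanaged"


print(next_action(["server011", "server010"], stop_failed=True,
                  stonith_enabled=False))
```

With the situation from this thread (resource reported active on two nodes,
stop failing, stonith disabled) the model ends in "unmanaged".
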
>>>
>>>
>>>> Some notes to clarify the setup (and make the log more readable):
>>>>
>>>> We run heartbeat version 2.1.3-5~bpo40+1 from debian backports. At the
>>>> time of testing, one node was still on 2.1.3-2~bpo40+1.
>>>>
>>>> Physical nodes:
>>>> server010 (still to be added)
>>>> server011
>>>> server012
>>>>
>>>> Virtual servers (the resources):
>>>> server128 - server133
>>>>
>>>> All resources have constraints allowing a primary role on server011 and
>>>> a secondary role on server012 (or vice versa), and are not allowed on
>>>> server010.
>>>>
>>>> # Attached files are:
>>>>
>>>> - cleancib.xml
>>>> The one we started off with.
>>>>
>>>> - fullcib.xml
>>>> The most recent full dump (with counters etc. added by the cluster
>>>> software itself).
>>>>
>>>> - syslog.clusterlog.080722.full(.tgz)
>>>> A cleaned-up syslog in which, with different values for symmetric_cluster,
>>>> the trail can be followed of how all resources became too active and ended
>>>> up unmanaged on server010.
>>>>
>>>> - syslog.clusterlog.080722.part(.tgz)
>>>> A stripped version of the previous one with only one trail, hopefully
>>>> isolating enough information for easier analysis.
>>>>
>>>> It looks like the behaviour deviates from what the docs describe in
>>>> relation to the symmetric_cluster directive, or it's just a very ugly
>>>> typo somewhere .. :-)
>>>>
>>>> I sincerely hope somebody can pinpoint the weak spot.
>>>>
>>>> Thanx a lot!!
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> Gerard.
>>>>
>>>>
>>>>
>>>> --
>>>> ~
>>>> ~
>>>> :wq!
>>>> _______________________________________________
>>>> Linux-HA mailing list
>>>> Linux-HA at lists.linux-ha.org
>>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>>> See also: http://linux-ha.org/ReportingProblems
>>>>
>>>>
>>>
>>>
>>
>> --
>> >>> urls
>> {'fun':  'www.zonderbroodje.nl',  'tech':  'www.gp-net.nl'}
>>
>>
> 

-- 
 >>> urls
{'fun':  'www.zonderbroodje.nl',  'tech':  'www.gp-net.nl'}


