[Linux-HA] Adding third node turns all resources unmanaged

Gerard Petersen gerard at gp-net.nl
Mon Jul 28 00:56:15 MDT 2008


 Dear Andrew,

Thanks for your response.

I see two options/conclusions on which I would like your feedback:

- Enable stonith, so that the attempt to start the resources on the third 
node is 'naturally' cut off and the resources are moved back to the 
first two nodes by the cluster software.

- Install Xen (and drbd) on the third node, so the cluster software 
gets a chance to run its commands and get a proper answer, and can 
see that the resources don't belong there.
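
To make the second option concrete, here is a rough Python sketch of how I
understand a well-behaved probe. It is not the actual Xen RA: the `probe`
function, the `domain_is_running` callback and the choice of `xm` as the tool
to look for are all assumptions for illustration. The return values follow
the OCF convention (0 = running, 7 = not running).

```python
import shutil

# OCF return codes used by this sketch.
OCF_SUCCESS = 0       # resource is running
OCF_NOT_RUNNING = 7   # resource is cleanly stopped

def probe(domain_is_running, tool="xm", which=shutil.which):
    """Hypothetical probe for one Xen resource on the local node.

    domain_is_running: callable returning True if the domain is up here.
    tool:              management command to look for (assumed to be 'xm').
    which:             lookup function, injectable so it can be faked.
    """
    if which(tool) is None:
        # Without the management tool nothing can be running on this node,
        # so the honest answer is "not running" rather than "active".
        return OCF_NOT_RUNNING
    return OCF_SUCCESS if domain_is_running() else OCF_NOT_RUNNING
```

With an answer of 7 from server010, the cluster would have no reason to think
the resources are active there.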


Kind regards,

Gerard.

Andrew Beekhof wrote:
> On Thu, Jul 24, 2008 at 16:29, Gerard Petersen <gerard at gp-net.nl> wrote:
>   
>> Hi all,
>>
>> I'm trying to add a third node to a working two-node cluster with resources
>> in the form of mirrored Xen (and underlying drbd) virtual servers. The two
>> node setup works great and as expected. (On failure, the drbd mirrors
>> switch master/slave roles, XenU's migrate automatically, etc). The goal is
>> to manually spread master/slave combinations of the XenU's over the three
>> physical nodes.
>>
>> The third node is already added to the heartbeat config, and is in standby
>> mode. We have constraints in place (full log and config will follow) that
>> work with the +INF, 'zero' and -INF values, respectively as Master location,
>> Slave location and 'Never' location constraints.
>>
>> When we take the third node online, where the current XenU's according to
>> the constraints are not allowed, the resources somehow all are moved to
>> the third node, where no xen or drbd is present yet. It seems some of the
>> constraints are completely ignored. We have tried this, among other
>> things, with the symmetric_cluster value True and False, but no luck.
>>
>> Furthermore the log shows that the resources become 'too active', and after
>> that they become unmanaged.
>>     
>
> When a new node joins the cluster, we check to see if its running any
> of the cluster resources.
> These checks occur regardless of any location constraints (precisely
> so that we can enforce them for you).
>
> What can happen however, is that these checks may fail.
> Sometimes they fail because the service was unexpectedly found to be
> active on the node.
> Sometimes it's because the resource agent (or the software it tries to
> talk to) isn't installed.
>
> in your case, it seems the RA is misbehaving and incorrectly telling
> the cluster that the resources are active
> eg.
>              <lrm_rsc_op id="server128_monitor_0" operation="monitor"
> crm-debug-origin="build_active_RAs"
> transition_key="15:10:c195d63f-e91f-4162-8454-f6dde2c71ef1"
> transition_magic="0:0;15:10:c195d63f-e91f-4162-8454-f6dde2c71ef1"
> call_id="6" crm_feature_set="2.0" rc_code="0" op_status="0"
> interval="0" op_digest="78122685b830dcb8197c65561be6d6a5"/>
>
> rc_code="0" being the relevant piece of information
>
> The cluster then thinks that the service is active on more than one
> node and tries to recover.
> But the RA then compounds the initial problem by failing to stop the service:
>
>              <lrm_rsc_op id="server128_stop_0" operation="stop"
> crm-debug-origin="build_active_RAs"
> transition_key="25:11:c195d63f-e91f-4162-8454-f6dde2c71ef1"
> transition_magic="0:1;25:11:c195d63f-e91f-4162-8454-f6dde2c71ef1"
> call_id="12" crm_feature_set="2.0" rc_code="1" op_status="0"
> interval="0" op_digest="78122685b830dcb8197c65561be6d6a5"/>
>
> again, rc_code="1" being the part indicating failure.
>
> at which point the cluster can do nothing (since stonith is disabled)
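
To check my reading of the two entries above, here is a minimal Python sketch
(not part of heartbeat) that classifies an `lrm_rsc_op` entry by its
`operation`, `interval` and `rc_code`. The `classify` helper is my own, and
the sample strings are the entries quoted above, trimmed to the attributes the
sketch uses. It assumes the usual OCF return codes and that a monitor with
interval 0 is the probe run when a node joins.

```python
import xml.etree.ElementTree as ET

# OCF return codes (only the subset that matters here).
OCF_SUCCESS = 0          # monitor: resource is running
OCF_NOT_RUNNING = 7      # monitor: resource is cleanly stopped
OCF_RUNNING_MASTER = 8   # monitor: resource is running in master role

def classify(op_xml):
    """Return a short verdict for one <lrm_rsc_op/> status entry."""
    op = ET.fromstring(op_xml)
    name = op.get("operation")
    rc = int(op.get("rc_code"))
    # A monitor with interval 0 is treated as the join-time probe.
    is_probe = name == "monitor" and int(op.get("interval", "0")) == 0
    if is_probe:
        if rc in (OCF_SUCCESS, OCF_RUNNING_MASTER):
            return "probe says ACTIVE"
        if rc == OCF_NOT_RUNNING:
            return "probe says not running"
        return "probe failed (rc=%d)" % rc
    if rc == OCF_SUCCESS:
        return "%s ok" % name
    return "%s FAILED (rc=%d)" % (name, rc)

PROBE = ('<lrm_rsc_op id="server128_monitor_0" operation="monitor" '
         'rc_code="0" op_status="0" interval="0"/>')
STOP = ('<lrm_rsc_op id="server128_stop_0" operation="stop" '
        'rc_code="1" op_status="0" interval="0"/>')

print(classify(PROBE))  # probe says ACTIVE
print(classify(STOP))   # stop FAILED (rc=1)
```

So the probe on server010 claims the resource is active, and the stop that
should clean this up then fails.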
>
>   
>> Some notes to clarify the setup (and make the log more readable):
>>
>> We run heartbeat version 2.1.3-5~bpo40+1 from debian backports. At the
>> time of testing, one node was still on 2.1.3-2~bpo40+1.
>>
>> Physical nodes:
>> server010 (still to be added)
>> server011
>> server012
>>
>> Virtual servers (the resources):
>> server128 - server133
>>
>> All resources have constraints allowing a primary role on server011 and
>> a secondary role on server012 (or vice versa), and are not allowed on
>> server010.
>>
>> # Attached files are:
>>
>> - cleancib.xml
>> The one we started off with.
>>
>> - fullcib.xml
>> The most recent full dump (with counters etc. added by the cluster
>> software itself).
>>
>> - syslog.clusterlog.080722.full(.tgz)
>> A cleaned up syslog wherein, with different values for symmetric_cluster,
>> the trail can be followed of how all resources became too active and ended
>> up unmanaged on server010
>>
>> - syslog.clusterlog.080722.part(.tgz)
>> A stripped version of the previous one with only one trail, hopefully
>> isolating enough information for easier analysis.
>>
>> It looks like the behaviour deviates from what the docs describe in
>> relation to the symmetric_cluster directive, or it's just a very ugly typo
>> somewhere .. :-)
>>
>> I sincerely hope somebody can pinpoint the weak spot.
>>
>> Thanx a lot!!
>>
>>
>> Kind regards,
>>
>> Gerard.
>>
>>
>>
>> --
>> ~
>> ~
>> :wq!
>> _______________________________________________
>> Linux-HA mailing list
>> Linux-HA at lists.linux-ha.org
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>>
>>     
>
>   


-- 
>>> urls
{'fun':  'www.zonderbroodje.nl',  'tech':  'www.gp-net.nl'} 



