[Linux-ha-dev] hb report - troubles on 4 node cluster (ccm chaos)

Andrew Beekhof beekhof at gmail.com
Mon Feb 11 07:18:33 MST 2008


On Feb 10, 2008, at 9:38 PM, Andreas Mather1 wrote:

> Hi all,
>
> Please find attached a hb_report for a problem I experienced when
> implementing heartbeat.

Logs are only included for one node, which is odd.

>
>
> The environment:
> It's an asymmetric 4 node cluster, running heartbeat 2.1.3. All nodes
> share a couple of filesystems, all GPFS formatted. Services include
> WebSphere (modified RA), DB2 (modified RA), vsftpd (Xinetd), samba, nfs,
> MCS (self-written RA), and IHS, and are put in 4 groups (filesvc, mcs,
> was, db). Dejan is also familiar with the setup.
> OS: SLES 9.3 (x86_64)
> heartbeat: built via ./ConfigureMe package
>
>
> The Problem:
> In general, everything works fine (crm_standby works for every node,
> etc.), but when I simulate a power loss of one node (via IBM RSA)*, a
> cluster split occurs when this node rejoins. Suddenly, on every node,
> crm_mon shows the node it is running on as 'online' while reporting the
> other nodes as 'OFFLINE'. After 1 - 2 min. the cluster is fully
> operational again (all nodes have found each other again), but it seems
> that every resource gets restarted.

I can easily believe this would happen.

With all the membership changes I can see in the logs, you lost quorum,
at which point
    <nvpair id="cib-bootstrap-options-no_quorum-policy" name="no_quorum-policy" value="stop"/>
kicked in and all your resources would have been stopped.

Then later, when the ccm sorted itself out and the membership returned  
to normal, the resources would have been started again.
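
To illustrate that sequence, here is a minimal Python sketch. It assumes a plain majority rule (a partition has quorum only when it sees more than half of the configured nodes), which is a simplification and not heartbeat's actual code. The function names are hypothetical, and only the "stop" policy comes from your configuration; "ignore" is included purely for contrast.

```python
def has_quorum(visible_nodes: int, total_nodes: int) -> bool:
    """Assumed rule: quorum requires a strict majority of configured nodes."""
    return 2 * visible_nodes > total_nodes


def resource_action(visible_nodes: int, total_nodes: int,
                    no_quorum_policy: str = "stop") -> str:
    """Return what happens to resources in a partition (hypothetical helper)."""
    if has_quorum(visible_nodes, total_nodes):
        return "run"
    if no_quorum_policy == "stop":
        # Matches the configured value: resources are stopped without quorum.
        return "stop"
    if no_quorum_policy == "ignore":
        # Shown for contrast only: carry on as if quorum were present.
        return "run"
    raise ValueError(f"policy not modelled in this sketch: {no_quorum_policy}")


if __name__ == "__main__":
    # During the membership chaos each node sees only itself: 1 of 4.
    print(resource_action(visible_nodes=1, total_nodes=4))
    # Once the membership is back to normal: 4 of 4.
    print(resource_action(visible_nodes=4, total_nodes=4))
```

Under these assumptions a node that sees only itself in a 4 node cluster gets "stop", and "run" again once all four members are visible, which matches the restart of every resource that you observed.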


No idea why the CCM went haywire though.
Probably best to log a bug.

>
>
> Please let me know if I can provide further information.
>
> Thanks,
>
> Andreas
>
>
> * Sorry, I forgot to test what happens when I just stop and start
> heartbeat on that node - that would be useful too, I think... :(
>
>
>
>
> (See attached file: report_1.tar.gz)
>
> Mit freundlichen Grüßen / Best regards
>
> Andreas MATHER
> ESLT - Enterprise Services for Linux Technologies
>
> IBM Austria, Obere Donaustrasse 95, 1020 Vienna
> Phone : +43-1-21145/4799
> Fax: +43-1-21145/8888
> e-mail: andreas.mather at at.ibm.com
>
> IBM Österreich Internationale Büromaschinen Gesellschaft m.b.H.
> Sitz: Wien
> Firmenbuchgericht: Handelsgericht Wien, FN 80000y
> <report_1.tar.gz>
> _______________________________________________
> Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/


