[Linux-ha-dev] hb report - troubles on 4 node cluster
Andreas Mather1
andreas.mather at at.ibm.com
Mon Feb 18 11:55:16 MST 2008
Hello,
I just wanted to let you know that my issues with the cluster are solved
now.
Here's what I did:
*) raised debug to 1
*) logged everything to one logfile instead of two (still using syslog)
*) changed mcast entries to ucast entries in ha.cf
*) cleaned up my customized db2 and WAS_generic RAs
(they now return OCF_NOT_RUNNING on the monitor operation instead of
OCF_ERR_INSTALLED on nodes which don't run the resources)
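
To illustrate the RA change, here is a rough sketch only. The real db2 and
WAS_generic agents are shell scripts and look different. The exit code
values 0, 5 and 7 are the standard OCF ones; the function name monitor()
and the pidfile-based check are assumptions made up for this example.

```python
import os

# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_INSTALLED = 5  # listed for reference; monitor() below never returns it
OCF_NOT_RUNNING = 7


def monitor(pidfile):
    """Hypothetical monitor action: report whether the resource runs here.

    A node that simply does not run the resource answers OCF_NOT_RUNNING
    rather than OCF_ERR_INSTALLED.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # No usable pidfile: treat the resource as cleanly stopped here.
        return OCF_NOT_RUNNING
    if pid <= 0:
        return OCF_NOT_RUNNING
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return OCF_NOT_RUNNING
    except PermissionError:
        pass  # the process exists but belongs to another user
    return OCF_SUCCESS
```

With this shape, a probe on a node that has never started the resource
reports "not running" instead of an installation error.
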
None of these changes sounds like it could have prevented the strange
behaviour I had before, but something helped...
The strange messages (late heartbeats, 'link down' when link should be
still up) in my logs also vanished...
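
For anyone making the same mcast-to-ucast change, here is a small sketch
that prints the ucast lines for ha.cf, assuming the usual
"ucast <dev> <peer-ip>" form. The interface names eth0 and eth2 are the
ones from the logs; the addresses are placeholders from the documentation
ranges, not the real cluster IPs, and the helper ucast_lines() is made up
for this example.

```python
def ucast_lines(peers_by_iface):
    """Return ha.cf 'ucast <dev> <peer-ip>' lines, one per interface and peer."""
    lines = []
    for iface, addresses in peers_by_iface.items():
        for addr in addresses:
            lines.append(f"ucast {iface} {addr}")
    return lines


# Placeholder addresses (documentation ranges), one per cluster node.
peers = {
    "eth0": ["192.0.2.11", "192.0.2.12", "192.0.2.21", "192.0.2.22"],
    "eth2": ["198.51.100.11", "198.51.100.12", "198.51.100.21", "198.51.100.22"],
}

print("\n".join(ucast_lines(peers)))
```

This gives four ucast lines per interface, matching the "four ucast
directives" per mcast directive suggested in the quoted reply below. As far
as I know, heartbeat ignores a ucast line that points at the local node, so
the same ha.cf should be usable on all four nodes, but please check that
against the ha.cf documentation for your version.
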
Thanks for your time and hints!
Andreas
IBM Österreich Internationale Büromaschinen Gesellschaft m.b.H.
Sitz: Wien
Firmenbuchgericht: Handelsgericht Wien, FN 80000y
linux-ha-dev-bounces at lists.linux-ha.org wrote on 02/11/2008 07:25:46 PM:
> Hi Andreas,
>
> On Sun, Feb 10, 2008 at 09:38:45PM +0100, Andreas Mather1 wrote:
> > Hi all,
> >
> > Please find attached a hb_report for a problem I experienced when
> > implementing heartbeat.
> >
> > The environment:
> > It's an asymmetric 4 node cluster, running heartbeat 2.1.3. All nodes share
> > a couple of filesystems, all GPFS formatted. Services include WebSphere
> > (modified RA), DB2 (modified RA), vsftpd (Xinetd), samba, nfs, MCS (self
> > written RA), IHS and are put in 4 groups (filesvc, mcs, was, db). Dejan is
> > also familiar with the setup.
> > OS: SLES 9.3 (x86_64)
> > heartbeat: built via ./ConfigureMe package
> >
> >
> > The Problem:
> > In general, everything works fine (crm_standby works for every node, etc.),
> > but when I simulate a power loss of one node (via IBM RSA)*, a cluster
> > split occurs when this node rejoins. Suddenly, on every node, crm_mon shows
> > the node it is running on as 'online' while reporting the other nodes as
> > 'OFFLINE'. After 1 - 2 min. the cluster is fully operational again (all
> > nodes have found each other again), but it seems that every resource gets
> > restarted.
> >
> > Please let me know if I can provide further information.
>
> From the log on rbxw02:
>
> Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 up.
> Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 up.
> Feb 10 19:10:32 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 dead.
> Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 dead.
> Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 dead.
> Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 dead.
> Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 dead.
> Feb 10 19:15:06 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxw01 returning after partition.
> Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 up.
> Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 up.
> Feb 10 19:15:07 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxd01 returning after partition.
> Feb 10 19:15:07 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 up.
>
> Strange timestamps. Which node went down? And when? Also,
> rbxd02:eth0 was not reported as down, and rbxw01:eth0 and rbxd01:eth0
> were not reported as up: probably at some point rbxw02:eth0 went down.
> It would be interesting to see logs from the other nodes. Don't know why
> hb_report didn't pack them.
>
> Two extra nodes went DC around 19:13 for about two minutes, which
> means that there were three partitions: w02 together with d02, w01
> alone, and d01 alone. Note that none of them had quorum.
>
> Looks like a network problem, but an awkward one. Don't know how
> it got disrupted this much. Perhaps you could try with unicast:
> replace each mcast directive with four ucast directives.
>
> Cheers,
>
> Dejan
>
> > Thanks,
> >
> > Andreas
> >
> >
> > * Sorry, I forgot to test what happens when I just stop and start
> > heartbeat on that node - would be useful too, I think... :(
> >
> >
> >
> >
> > (See attached file: report_1.tar.gz)
> >
> > Mit freundlichen Grüßen / Best regards
> >
> > Andreas MATHER
> > ESLT - Enterprise Services for Linux Technologies
> >
> > IBM Austria, Obere Donaustrasse 95, 1020 Vienna
> > Phone : +43-1-21145/4799
> > Fax: +43-1-21145/8888
> > e-mail: andreas.mather at at.ibm.com
> >
> > IBM Österreich Internationale Büromaschinen Gesellschaft m.b.H.
> > Sitz: Wien
> > Firmenbuchgericht: Handelsgericht Wien, FN 80000y
>
>
> > _______________________________________________________
> > Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > Home Page: http://linux-ha.org/
>