[Linux-ha-dev] hb report - troubles on 4 node cluster
Dejan Muhamedagic
dejanmm at fastmail.fm
Tue Feb 12 08:15:53 MST 2008
Hi,
On Tue, Feb 12, 2008 at 10:51:04AM +0100, Andreas Mather1 wrote:
> Hi Dejan,
>
> As suggested, I've filed a bug report and also attached the remaining logs
> there:
> http://developerbugs.linux-foundation.org/show_bug.cgi?id=1831
OK. Though I don't think it's a bug. And definitely not a ccm
bug. If there's a bug, then it's in the comm layer.
> > Strange timestamps. Which node went down? And when? Also,
> > rbxd02:eth0 was not reported as down and rbxw01:eth0 rbxd01:eth0
> > not as up: probably at some point rbxw02:eth0 went down. It would
> > be interesting to see logs from the other nodes. Don't know why
> > hb_report didn't pack them.
> >
> > Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 up.
> > Feb 10 19:10:32 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 dead.
> > Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 dead.
>
> The stopped node was rbxd02, which I stopped shortly after heartbeat was
> started (after I experienced the problem, I cleaned up the /var/lib/crm/
> directory, copied the cib.xml there again, started up the whole cluster and
> tried the same procedure again).
> Maybe I messed up the times given to hb_report, but when I look at
> this, I guess rbxw01 was the last node to join the cluster (19:10).
> Then I shut down rbxd02 (19:10) and it rejoined at 19:13 - when things
> started to go wrong.
>
> Additionally, it's hard for me to believe that the rejoining rbxd02 can
> cause real network problems between, say, rbxd01 and rbxw02...
I understand that, but those Heartbeat (here I mean the
communication layer) messages are unequivocal. If it is a bug at
all, it could only be a heartbeat communication problem, but if
it manifests itself only as links going down or up, it is really
hard to say whether the links really changed state or heartbeat
went berserk. Unless there's further evidence, I'd go with the
former. Did you check dmesg and other logs for unusual signs? In
particular, whether there was anything about ethernet.
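As a rough aid for reading these link messages, here is a minimal
Python sketch that replays the "Link ... up/dead" lines from the
rbxw02 log quoted further down and keeps the last state reported
for each link. It is only an illustration, not part of heartbeat
or hb_report; the function name and the regular expression are my
own assumptions, based solely on the log format seen in this thread.

```python
import re

# Link messages copied from the rbxw02 log quoted in this thread.
LOG = """\
Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 up.
Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 up.
Feb 10 19:10:32 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 dead.
Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 dead.
Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 dead.
Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 dead.
Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 dead.
Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 up.
Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 up.
Feb 10 19:15:07 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 up.
"""

# Hypothetical pattern: matches "info: Link <node>:<iface> up." or "... dead."
LINK_RE = re.compile(r"info: Link ([^:\s]+):(\S+) (up|dead)\.")


def last_link_states(log_text):
    """Return {(node, iface): state} with the last state logged per link."""
    states = {}
    for line in log_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            node, iface, state = match.groups()
            states[(node, iface)] = state  # later lines overwrite earlier ones
    return states


states = last_link_states(LOG)
still_dead = sorted(link for link, state in states.items() if state == "dead")
print("links last reported dead:", still_dead)
print("rbxd02:eth0 ever mentioned:", ("rbxd02", "eth0") in states)
```

On this excerpt the only links left in the "dead" state are
rbxd01:eth0 and rbxw01:eth0, and rbxd02:eth0 never appears at all,
which is what made the sequence look odd in the first place.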
Notes:
1. Late heartbeats: This is further evidence that there's
probably something wrong with the network. These delays are
extraordinarily long. But note that they are not at 19:13.
/var/tmp/all_logs/rbxd01/ha-log:Feb 10 18:25:06 rbxd01 heartbeat: [24686]: WARN: Late heartbeat: Node rbxw01: interval 82200 ms
/var/tmp/all_logs/rbxd01/ha-log:Feb 10 18:25:06 rbxd01 heartbeat: [24686]: WARN: Late heartbeat: Node rbxw02: interval 81700 ms
/var/tmp/all_logs/rbxd01/ha-log:Feb 10 19:15:06 rbxd01 heartbeat: [24686]: WARN: Late heartbeat: Node rbxw02: interval 129310 ms
/var/tmp/all_logs/rbxd01/ha-log:Feb 10 19:15:06 rbxd01 heartbeat: [24686]: WARN: Late heartbeat: Node rbxw01: interval 128810 ms
/var/tmp/all_logs/rbxw01/ha-log:Feb 10 18:25:06 rbxw01 heartbeat: [1435]: WARN: Late heartbeat: Node rbxd01: interval 81100 ms
/var/tmp/all_logs/rbxw01/ha-log:Feb 10 18:25:06 rbxw01 heartbeat: [1435]: WARN: Late heartbeat: Node rbxw02: interval 82110 ms
/var/tmp/all_logs/rbxw01/ha-log:Feb 10 19:15:06 rbxw01 heartbeat: [1435]: WARN: Late heartbeat: Node rbxw02: interval 129120 ms
/var/tmp/all_logs/rbxw01/ha-log:Feb 10 19:15:07 rbxw01 heartbeat: [1435]: WARN: Late heartbeat: Node rbxd01: interval 130380 ms
/var/tmp/all_logs/rbxw02/ha-log:Feb 10 18:25:06 rbxw02 heartbeat: [22769]: WARN: Late heartbeat: Node rbxd01: interval 81200 ms
/var/tmp/all_logs/rbxw02/ha-log:Feb 10 18:25:06 rbxw02 heartbeat: [22769]: WARN: Late heartbeat: Node rbxw01: interval 82110 ms
/var/tmp/all_logs/rbxw02/ha-log:Feb 10 19:15:06 rbxw02 heartbeat: [22769]: WARN: Late heartbeat: Node rbxw01: interval 129170 ms
/var/tmp/all_logs/rbxw02/ha-log:Feb 10 19:15:07 rbxw02 heartbeat: [22769]: WARN: Late heartbeat: Node rbxd01: interval 130270 ms
2. "Not installed" failures:
rbxd01/ha-log:Feb 10 14:44:47 rbxd01 crmd: [17577]: ERROR: process_lrm_event: LRM operation was-inst_monitor_0 (call=12, rc=5) Error not installed
rbxd01 is not supposed to run WAS? Then the monitor operation
should really return "not running" there. The same goes for db2
on the rbxw hosts. We already had a discussion about that.
3. Xinetd problem:
rbxw01/ha-log:Feb 10 15:05:52 rbxw01 Xinetd[1512]: [1523]: ERROR:
Service descriptor /etc/xinetd.d/vsftpd not found!
vsftpd never runs on rbxw hosts? I guess that here Xinetd should
also report stopped instead of an error. I'll take a look at
that.
4. Unknown subsystem?:
Feb 10 18:21:00 rbxw01 crmd: [1450]: ERROR: send_msg_via_ipc:
Unknown Sub-system (29070_crm_resource)... discarding message.
Whatever happened here. Perhaps Andrew could say a bit more.
5. apache: interesting, I wonder how this came about:
rbxw02/ha-log:Feb 10 15:02:08 rbxw02 apache[22080]: [22114]:
ERROR: /opt/IBMIHS/bin/httpd: error while loading shared
libraries: libaprutil-0.so.0: cannot open shared object file: No
such file or directory
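For the late heartbeats in note 1, a small script makes it easier
to see when the warnings cluster. The following is a minimal
Python sketch, assuming the grep-style "path:log line" format
shown above; the regular expression and the function name are
hypothetical and only four of the twelve lines are included as
sample input.

```python
import re
from collections import defaultdict

# Four of the "Late heartbeat" lines from note 1, copied verbatim.
LOG = """\
/var/tmp/all_logs/rbxd01/ha-log:Feb 10 18:25:06 rbxd01 heartbeat: [24686]: WARN: Late heartbeat: Node rbxw01: interval 82200 ms
/var/tmp/all_logs/rbxd01/ha-log:Feb 10 19:15:06 rbxd01 heartbeat: [24686]: WARN: Late heartbeat: Node rbxw02: interval 129310 ms
/var/tmp/all_logs/rbxw01/ha-log:Feb 10 19:15:07 rbxw01 heartbeat: [1435]: WARN: Late heartbeat: Node rbxd01: interval 130380 ms
/var/tmp/all_logs/rbxw02/ha-log:Feb 10 18:25:06 rbxw02 heartbeat: [22769]: WARN: Late heartbeat: Node rbxd01: interval 81200 ms
"""

# Hypothetical pattern: timestamp (to the minute), reporting host,
# late node and interval in milliseconds.
LATE_RE = re.compile(
    r"(\w{3} +\d+ \d\d:\d\d):\d\d (\S+) heartbeat: \[\d+\]: "
    r"WARN: Late heartbeat: Node (\S+): interval (\d+) ms"
)


def late_by_minute(log_text):
    """Group late-heartbeat warnings by the minute they were logged."""
    bursts = defaultdict(list)
    for line in log_text.splitlines():
        match = LATE_RE.search(line)
        if match:
            minute, reporter, node, ms = match.groups()
            bursts[minute].append((reporter, node, int(ms)))
    return dict(bursts)


bursts = late_by_minute(LOG)
for minute in sorted(bursts):
    worst = max(ms for _, _, ms in bursts[minute])
    print(f"{minute}: {len(bursts[minute])} warnings, worst {worst / 1000:.1f} s")
```

Run over the complete logs, this kind of grouping should show the
two bursts at 18:25 and 19:15 and nothing at 19:13.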
> Two questions:
> 1.) When I get the next maintenance window, what can I do to get more
> debugging or otherwise helpful information ('debug' in ha.cf, hb_report,
> additional logs, ...)?
debug's always good. hb_report should collect everything that
matters.
> 2.) Now there's a mailing list discussion and a bug report. To ease the
> discussion, one should be closed I think - which one?
It's better to keep the discussion here until we find out if the
suspicions are substantiated.
Cheers,
Dejan
> Mit freundlichen Grüßen / Best regards
>
> Andreas MATHER
> ESLT - Enterprise Services for Linux Technologies
>
> IBM Austria, Obere Donaustrasse 95, 1020 Vienna
> Phone : +43-1-21145/4799
> Fax: +43-1-21145/8888
> e-mail: andreas.mather at at.ibm.com
>
> IBM Österreich Internationale Büromaschinen Gesellschaft m.b.H.
> Sitz: Wien
> Firmenbuchgericht: Handelsgericht Wien, FN 80000y
>
>
> Dejan Muhamedagic <dejanmm at fastmail.fm>
> Sent by: linux-ha-dev-bounces at lists.linux-ha.org
> 02/11/2008 07:25 PM
> Please respond to: High-Availability Linux Development List
>   <linux-ha-dev at lists.linux-ha.org>
> To: High-Availability Linux Development List
>   <linux-ha-dev at lists.linux-ha.org>
> Subject: Re: [Linux-ha-dev] hb report - troubles on 4 node cluster
>
> Hi Andreas,
>
> On Sun, Feb 10, 2008 at 09:38:45PM +0100, Andreas Mather1 wrote:
> > Hi all,
> >
> > Please find attached a hb_report for a problem I experienced when
> > implementing heartbeat.
> >
> > The environment:
> > It's an asymmetric 4 node cluster, running heartbeat 2.1.3. All nodes
> > share a couple of filesystems, all GPFS formatted. Services include
> > WebSphere (modified RA), DB2 (modified RA), vsftpd (Xinetd), samba, nfs,
> > MCS (self-written RA) and IHS, and are put in 4 groups (filesvc, mcs,
> > was, db). Dejan is also familiar with the setup.
> > OS: SLES 9.3 (x86_64)
> > heartbeat: built via ./ConfigureMe package
> >
> >
> > The Problem:
> > In general, everything works fine (crm_standby works for every node,
> > etc.), but when I simulate a power loss of one node (via IBM RSA)*, a
> > cluster split occurs when this node rejoins. Suddenly, on every node,
> > crm_mon shows the node it is running on as 'online' while reporting the
> > other nodes as 'OFFLINE'. After 1 - 2 min. the cluster is fully
> > operational again (all nodes have found each other again), but it seems
> > as if every resource gets restarted.
> >
> > Please let me know, if I can provide further information.
>
> From the log on rbxw02:
>
> Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 up.
> Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 up.
> Feb 10 19:10:32 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 dead.
> Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 dead.
> Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 dead.
> Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 dead.
> Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 dead.
> Feb 10 19:15:06 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxw01 returning after partition.
> Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 up.
> Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 up.
> Feb 10 19:15:07 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxd01 returning after partition.
> Feb 10 19:15:07 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 up.
>
> Strange timestamps. Which node went down? And when? Also,
> rbxd02:eth0 was not reported as down and rbxw01:eth0 rbxd01:eth0
> not as up: probably at some point rbxw02:eth0 went down. It would
> be interesting to see logs from the other nodes. Don't know why
> hb_report didn't pack them.
>
> Two extra nodes went DC around 19:13 for about two minutes, which
> means that there were three partitions: w02 together with d02,
> w01 on its own, and d01 on its own. Note that none of them had
> quorum.
>
> Looks like a network problem, but an awkward one. Don't know how
> it got disrupted this much. Perhaps you could try with unicast:
> replace each mcast directive with four ucast directives.
>
> Cheers,
>
> Dejan
>
> > Thanks,
> >
> > Andreas
> >
> >
> > * Sorry, I forgot to test what happens, when I just stop and start
> > heartbeat on that node - would be useful too, I think... :(
> >
> >
> >
> >
> > (See attached file: report_1.tar.gz)
>
>
> > _______________________________________________________
> > Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > Home Page: http://linux-ha.org/