[Linux-HA] tracking down reason for crash?
Miles Fidelman
mfidelman at meetinghouse.net
Tue Jun 8 11:46:13 MDT 2010
Hi Folks,
I just finished constructing a 2-node HA cluster, with this basic
configuration:
Debian Lenny
Xen xen-3.0-x86_32p (Debian package: 2.6.26-2-xen-686)
Disk stack:
-- md-based RAID - RAID1 for /boot and Dom0's / and swap; RAID6 for LVM
based volumes for DomUs
-- LVM
-- DRBD 8.0.14 (current package for Debian Lenny)
Heartbeat (current package for Debian Lenny - looks like heartbeat
2.1.3), running w/ crm on
Several Debian Lenny DomUs (all PVs) - one of which is semi-production,
the others are experimental
I got pretty much everything working 3 days ago, and all seems to be
working, EXCEPT that 1 or 2 times per day, the system crashes - and the
crash of one node seems to take down the 2nd node (not exactly what one
wants in a failover environment).
The nodes both auto-restart, and the production server comes back up,
but still... I'd like to track down what's happening.
A few data points:
- for a couple of crashes, one of the RAID6 arrays started resyncing on
reboot -- the LVM and DRBD volumes above it came up during resync, but
things were a lot slower; after the most recent crash (about an hour
ago), the RAID array was fine after reboot
- I've been running sar to track load, and there does not seem to have
been any noticeable change in system load, leading up to the crashes (on
either Dom0 or on the production DomU)
- unfortunately, there was not much in my logs that looked helpful;
here's what I've reconstructed:
This is my backup machine, and it seems to have crashed first - these
are the Dom0 syslog entries that bracket the crash and reboot:
Jun 8 12:10:01 server3 /USR/SBIN/CRON[25253]: (root) CMD (if [ -x /usr/bin/vnstat ] && [ `ls /var/lib/vnstat/ | wc -l` -ge 1 ]; then /usr/bin/vnstat -u; fi)
Jun 8 12:11:42 server3 smartd[4216]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 116 to 117
Jun 8 12:11:42 server3 smartd[4216]: Device: /dev/sdb, SMART Usage Attribute: 194 Temperature_Celsius changed from 121 to 122
Jun 8 12:11:42 server3 smartd[4216]: Device: /dev/sdc, SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 119
Jun 8 12:11:42 server3 smartd[4216]: Device: /dev/sdc, SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 119
------ crash seems to have happened here ----------
Jun 8 12:15:46 server3 kernel: imklog 3.18.6, log source = /proc/kmsg started.
These are the syslog entries from Dom0 on the "production server":
12:14:11-21 <lots of kernel messages re. DRBD losing connection,
changing device states>
12:14:40 - <bunches of messages from heartbeat, crmd, cib - ending with
the next three lines>
Jun 8 12:14:41 server2 cib: [4676]: info: cib_ccm_msg_callback: LOST: server3
Jun 8 12:14:41 server2 cib: [4676]: info: cib_ccm_msg_callback: PEER: server2
12:14:41 server2 crmd: [4680]: info: do_election_count_vote: Updated voted hash for server2 to vote
------ looks like the primary node crashed here --------
Jun 8 12:16:14 server2 kernel: imklog 3.18.6, log source = /proc/kmsg started.
So..... looks like something happened on my backup node, heartbeat
noticed it properly on the primary node, but instead of simply
continuing along, it crashed and restarted. At that point everything
came back up, but.....
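
To pin down the timing more precisely than eyeballing the logs, something like the rough Python sketch below could bracket each crash by pairing a syslog restart marker with the last timestamped entry before it. This is only a sketch under assumptions: the function names (parse_ts, crash_windows) are made up for illustration, the year is assumed to be 2010 because syslog omits it, and the "imklog" restart line is assumed to be the first entry logged after each reboot, as it is in the excerpts above.

```python
from datetime import datetime

def parse_ts(line, year=2010):
    """Parse a leading 'Mon D HH:MM:SS' syslog timestamp; return None if absent.

    Wrapped continuation lines (no timestamp) simply yield None.
    The year is an assumption, since classic syslog lines do not carry one.
    """
    parts = line.split()
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None

def crash_windows(lines, marker="imklog"):
    """Return (last_entry_before_restart, restart_entry) timestamp pairs.

    A line containing `marker` is treated as the first entry after a reboot.
    """
    windows = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # continuation line or unparseable entry
        if marker in line and prev is not None:
            windows.append((prev, ts))
        prev = ts
    return windows

if __name__ == "__main__":
    # Two entries taken from the server3 excerpt above.
    sample = [
        "Jun 8 12:11:42 server3 smartd[4216]: Device: /dev/sdc, SMART Usage",
        "Jun 8 12:15:46 server3 kernel: imklog 3.18.6, log source = /proc/kmsg",
    ]
    for before, after in crash_windows(sample):
        print(before, "->", after, "gap:", after - before)
```

Run against both nodes' syslogs, this would give one window per node and crash. For the excerpts above, the server3 window opens at 12:11:42 and the server2 window at 12:14:41, which is consistent with the backup node going quiet first.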
So... several questions to the group:
1. Any thoughts on why the crash of one node led to the other node
crashing?
1.a. anything I might look at to glean more details (though the logs
seem sort of sparse)?
1.b. any kind of logging and/or diagnostics I should turn on to capture
more details the next time around?
2. Not quite a heartbeat question, but any thoughts on diagnostics I can
turn on to try to capture the original crash event?
3. Right now, my production server (DomU) will normally run on one
server, then come up on the backup server if the primary server fails.
But... as soon as the primary server comes back up, the DomU migrates
back - and in these events, the timing is such that it only partially
comes up on the backup server before the migration back starts. Somehow
this doesn't seem that healthy. So....
3.a. How do I set things so that, after a primary-node crash, the DomU
comes up on the backup machine, and stays there?
3.b. As above, but if the backup node fails, and the primary node comes
back up, it goes back.
I.e., the desired state is: run where you are, migrate on a crash, stay
there unless that node crashes or you're told to migrate
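
To make that policy unambiguous, here is a small Python model of the behaviour I'm after. It is not Heartbeat or CRM configuration, just a sketch of the intended placement rule; the function name (place), the node labels, and the event names ("crash", "recover", "migrate") are all made up for illustration.

```python
def place(events, nodes=("primary", "backup"), start="primary"):
    """Return where the DomU runs after each event under a 'stay put' policy.

    events: sequence of (kind, node) pairs, kind in
            {"crash", "recover", "migrate"} (hypothetical event names).
    """
    up = set(nodes)
    current = start
    history = []
    for kind, node in events:
        if kind == "crash":
            up.discard(node)
            if current == node:
                # Fail over only when the node we are running on goes down.
                survivors = [n for n in nodes if n in up]
                current = survivors[0] if survivors else None
        elif kind == "recover":
            up.add(node)
            if current is None:
                # Nothing was running anywhere; start on the recovered node.
                current = node
            # Otherwise stay put: no automatic fail-back.
        elif kind == "migrate":
            # Explicit operator request; honoured only if the target is up.
            if node in up:
                current = node
        history.append(current)
    return history

if __name__ == "__main__":
    events = [
        ("crash", "primary"),    # DomU should move to backup
        ("recover", "primary"),  # DomU should stay on backup
        ("crash", "backup"),     # DomU should move to primary
        ("recover", "backup"),   # DomU should stay on primary
    ]
    print(place(events))
```

The key point is the "recover" branch: a node coming back does not pull the DomU back to it, which is exactly what my current setup gets wrong.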
Thanks very much,
Miles Fidelman
--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra