[Linux-HA] heartbeat failed ... split brain?
Jan-Frode Myklebust
janfrode at parallab.uib.no
Wed Mar 17 01:58:09 MST 2004
We just ran the first production test of our heartbeat setup. It
failed, and we ended up with the same disks mounted on two nodes.
Luckily we were able to shut down one of the nodes before the
filesystem got corrupted.

Our setup:
2x Dell 2650, connected to a NexSan ATABoy2 via separate SCSI channels.
We're running White Box Enterprise Linux (a RHEL 3.0 clone) and
heartbeat 1.2.0, with the meatware stonith plugin. The heartbeats go
over a serial cable and the public ethernet interface.

# grep -v ^# ha.cf
logfacility local0
deadtime 30
warntime 5
baud 19200
serial /dev/ttyS0
udpport 694
bcast eth0
auto_failback off
stonith_host * meatware hanfs1.ii.uib.no hanfs2.ii.uib.no
node hanfs1.ii.uib.no
node hanfs2.ii.uib.no
# grep -v ^# haresources
hanfs1.ii.uib.no 129.177.16.239 Filesystem::-LHANFS::/export/hanfsstud::ext3::usrquota nfslock nfs adsmsched
hanfs2.ii.uib.no 129.177.16.242 Filesystem::-LHASMB::/export/hasmb::ext3::usrquota smb

hanfs1 was serving the NFS resource, and hanfs2 was serving the SMB
resource. We then tried to hand the NFS resource over cleanly from
hanfs1 to hanfs2 by running 'hb_standby all' on hanfs1. The unmount on
hanfs1 failed, but hanfs2 mounted the /export/hanfsstud filesystem
anyway!

Heartbeat logs from hanfs1:
heartbeat[2273]: info: hanfs1.ii.uib.no wants to go standby [all]
heartbeat[2273]: info: standby: hanfs2.ii.uib.no can take our all resources
heartbeat[3332]: info: give up all HA resources (standby).
heartbeat: info: Releasing resource group: hanfs1.ii.uib.no 129.177.16.239 Filesystem::-LHANFS::/export/hanfsstud::ext3::usrquota nfslock nfs adsmsched
heartbeat: info: Running /etc/init.d/adsmsched stop
heartbeat: info: Running /etc/init.d/nfs stop
rpc.mountd: Caught signal 15, un-registering and exiting.
nfs: rpc.mountd shutdown succeeded
kernel: lockd: couldn't shutdown host module!
kernel: nfsd: last server has exited
kernel: nfsd: unexporting all filesystems
nfs: nfsd shutdown succeeded
nfs: rpc.rquotad shutdown succeeded
nfs: Shutting down NFS services: succeeded
heartbeat: info: Running /etc/init.d/nfslock stop
rpc.statd[3119]: Caught signal 15, un-registering and exiting.
nfslock: rpc.statd shutdown succeeded
rpc.statd[3119]: Caught signal 15, un-registering and exiting.
nfslock: rpc.statd shutdown succeeded
heartbeat: info: Running /etc/ha.d/resource.d/Filesystem -LHANFS /export/hanfsstud ext3 usrquota stop
sshd(pam_unix)[3201]: session closed for user root
heartbeat: ERROR: Couldn't unmount /export/hanfsstud
heartbeat: ERROR: Return code 1 from /etc/ha.d/resource.d/Filesystem
heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 129.177.16.239 stop
heartbeat: info: /sbin/route -n del -host 129.177.16.239
heartbeat: info: /sbin/ifconfig eth0:0 down
heartbeat: info: IP Address 129.177.16.239 released
heartbeat: info: Releasing resource group: hanfs2.ii.uib.no 129.177.16.242 Filesystem::-LHASMB::/export/hasmb::ext3::usrquota smb
heartbeat: info: Running /etc/init.d/smb stop
smb: smbd shutdown failed
smb: nmbd shutdown failed
heartbeat: ERROR: Return code 1 from /etc/init.d/smb
heartbeat: info: Running /etc/ha.d/resource.d/Filesystem -LHASMB /export/hasmb ext3 usrquota stop
heartbeat: WARNING: Filesystem /export/hasmb not mounted?
heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 129.177.16.242 stop
heartbeat[3332]: info: all HA resource release completed (standby).
heartbeat[2273]: info: Local standby process completed [all].
heartbeat[2273]: WARN: 1 lost packet(s) for [hanfs2.ii.uib.no] [422547:422549]
heartbeat[2273]: info: remote resource transition completed.
heartbeat[2273]: info: No pkts missing from hanfs2.ii.uib.no!
heartbeat[2273]: info: remote resource transition completed.
heartbeat[2273]: info: No pkts missing from hanfs2.ii.uib.no!
heartbeat[2273]: info: Other node completed standby takeover of all resources.
When we noticed that /export/hanfsstud was mounted on both nodes, we
quickly shut down hanfs2, and luckily were able to do so without
damaging the /export/hanfsstud filesystem (it seems).

So, any ideas why this could happen? I have no idea why hanfs1 was
unable to unmount the filesystem, but shouldn't heartbeat rather panic
than continue when that happened?
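
To narrow down the failed unmount, something like the following could
be run on hanfs1 right after the "Couldn't unmount" error. It is only a
rough sketch: is_mounted and report_busy are hypothetical helper names,
not part of heartbeat, and it assumes awk and fuser (from psmisc) are
installed. It checks the mount table and, if the filesystem is still
mounted, lists the processes that fuser reports as using it.

```shell
#!/bin/sh
# Hypothetical helpers, not part of heartbeat.

# Succeed if the directory $1 is listed as a mount point.
# $2 is the mount table to read (defaults to /proc/mounts).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# If $1 is still mounted, list the processes using that filesystem.
report_busy() {
    if is_mounted "$1" "$2"; then
        echo "$1 is still mounted; processes using it:"
        fuser -vm "$1"
    else
        echo "$1 is not mounted"
    fi
}
```

For example, "report_busy /export/hanfsstud" on the node that is
giving up the resource.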
Any more information I could provide?
-jf