[Linux-HA] deadtime, warntime, and drbd
Jason Joines
support at bus.okstate.edu
Fri Mar 4 08:43:21 MST 2005
I recently experienced the "Cluster node returning after partition"
problem described in FAQ #12. I have two nodes and two resource groups,
and each node is the preferred node for one group. Nodea is the preferred
node for drbd0, its filesystem, an IP address, and samba. Nodeb is the
preferred node for drbd1. Both are connected to a public 100 Mbps switch
via eth0 and a private 1 Gbps switch via eth1.
At the time this occurred, nodea was serving smb requests to a large
number of clients via eth0. I had mounted drbd1 on nodeb, exported it
via NFS, and was rapidly copying the entire filesystem of another box to
it via eth1. Apparently the load got high enough on nodeb that
communication between the nodes failed and mass confusion ensued (at
least that's what I can make of the logs). Eventually nodeb rebooted
itself, the drbd devices went into either StandAlone or Disconnected mode,
and I had to manually tell nodea to take the smb resource group back.
My timing settings in ha.cf at the time were:
keepalive 1
deadtime 16
Following the FAQ suggestion, I have upped deadtime to 64 and set
warntime to 16 so I can watch the logs for a while. However, I'm unsure
how my drbd timing settings are interacting with this. They were, and
at the moment still are:
connect-int 8
ping-int 4
timeout 20
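For reference, with the heartbeat changes described above, the timing section of ha.cf would look roughly like this. The comments are annotations added here rather than part of the original file, and the order of the directives is an assumption:

```
# Heartbeat timing directives in ha.cf (values in seconds).
# Interval between heartbeat packets.
keepalive 1
# Log a late-heartbeat warning after this long without a packet.
warntime 16
# Declare the peer dead after this long without a packet.
deadtime 64
```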
Any suggestions for modifying these settings to be more in tune with
heartbeat?
Jason Joines