[Linux-ha-dev] How to get EVMS failover to work with heartbeat 1.2.3 for a 2-node cluster
yixiong.zou at intel.com
Tue Dec 14 15:00:37 MST 2004
Someone in the chat room asked how to get ccm quorum and EVMS
to work together. A couple of days ago we did a demo using EVMS
in our lab, and we had to solve exactly the same quorum-related
problems. So here are some tips that I'd like to share; I hope
they are useful.
1. what we used
1) heartbeat 1.2.3, because EVMS was written using glib
instead of glib2, and last I checked there were no plans to
port it to glib2.
2) EVMS. I think the latest release should work.
3) a two node cluster.
2. original ha.cf config file
stonith ssh /etc/ha.d/rpc.conf
3. our problem
We know that EVMS requires ccm to have quorum in order to work,
but ccm does not gain quorum when only one node is started.
That means that if only the primary or only the stand-by node
is started, evms_failover does not work. However, once
both nodes have been started, fail over and fail back work fine.
4. the definition of quorum
To solve the problem of not having quorum when only one node
is started, I looked into the ccm code. In ccm you have quorum
when either of these is true:
1) MORE THAN half of the nodes are in this membership
2) You know FOR SURE that you are the only node that is alive
This definition is correct, and it explains why failover and
failback worked fine once both nodes had been started: ccm
received the T_STONITH message sent out by heartbeat, so ccm
knew that the other node WAS in fact dead.
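The two rules can be restated as a small shell function. This is only an illustrative sketch, not code taken from ccm: the name has_quorum and its three arguments are made up for this example, and it assumes the two rules are alternatives.

```shell
#!/bin/sh
# Hypothetical helper (not part of ccm or heartbeat): restates the quorum rules.
#   $1 = total number of nodes configured in the cluster
#   $2 = number of nodes in the current membership
#   $3 = 1 if every node outside the membership is known dead
#        (for example, it was STONITHed), otherwise 0
has_quorum() {
    total=$1
    members=$2
    others_dead=$3
    if [ $((members * 2)) -gt "$total" ]; then
        # Rule 1: strictly more than half of the nodes are members.
        echo yes
    elif [ "$members" -eq 1 ] && [ "$others_dead" -eq 1 ]; then
        # Rule 2: we know for sure that we are the only node alive.
        echo yes
    else
        echo no
    fi
}
```

For a 2-node cluster, "has_quorum 2 1 0" is the single-node startup case described in section 3: one of two nodes is not more than half, and nothing is known about the peer, so there is no quorum.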
So we found the first bug: when you first start heartbeat
on either the active or the stand-by node, ccm is only
respawned by heartbeat after the other node has been STONITHed,
so ccm never sees the T_STONITH message.
5. Our "hack" solution.
We did not have the time and resources to really fix this bug,
since we were under pressure to get the demo done. So here's
what we did: we took the "respawn ccm" line out of ha.cf
and started ccm on our own.
/etc/init.d/heartbeat start; sleep 1; /usr/lib/heartbeat/ccm &
We thought that by doing so ccm would receive the
T_STONITH and give us quorum. And we did get quorum, but only
about once in 10 tries.
6. The second bug
Looking into ccm's code, we found that after ccm receives
the T_STONITH msg it does not re-calculate the membership.
All it does is update the node status. So unless some other
event triggers a membership change, you do not gain quorum.
Fortunately we found a workaround for our demo: make sure ccm
receives the T_STONITH msg before it converges and finalizes
the membership for the first time.
/etc/init.d/heartbeat start; sleep 3; /usr/lib/heartbeat/ccm &
In our tests, we got quorum very consistently by sleeping
either 3 or 4 seconds before starting ccm.
Notice that the number of seconds you sleep is closely related
to the initdead time. You want to make sure that you don't
sleep longer than initdead; otherwise the stonith will already
have happened and ccm will not get quorum.
Also, how fast ccm converges depends on the keepalive timer.
You want to make sure that ccm receives the T_STONITH
before the membership converges.
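To avoid picking a sleep that is too long, the delay can be checked against initdead before anything is started. The following is a rough sketch, not something we ran for the demo: read_initdead and check_ccm_delay are made-up helper names, and it assumes ha.cf contains a line of the form "initdead <seconds>" with a plain whole number of seconds.

```shell
#!/bin/sh
# Sketch only: both function names below are hypothetical.

# Print the initdead value from an ha.cf-style file given as $1.
# Assumes "initdead <seconds>" with no unit suffix on the value.
read_initdead() {
    awk '$1 == "initdead" { print $2; exit }' "$1"
}

# Succeed only if the planned delay ($1, in seconds) is shorter than
# the initdead value found in the config file ($2).
check_ccm_delay() {
    delay=$1
    initdead=$(read_initdead "$2")
    [ -n "$initdead" ] && [ "$delay" -lt "$initdead" ]
}

# Possible use, with the same commands as above (left commented out here):
#   if check_ccm_delay 3 /etc/ha.d/ha.cf; then
#       /etc/init.d/heartbeat start; sleep 3; /usr/lib/heartbeat/ccm &
#   fi
```

This only guards the upper bound. The lower bound still has to be found by trial, since it depends on how quickly ccm converges with your keepalive setting.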
I have now filed both bugs in linux-ha's bugzilla,
so they should get fixed eventually.
I hope this is helpful for those who have the same problem.
Yixiong Zou (yixiong.zou at intel.com)
Open Source Technology Center
All views expressed in this email are those of the individual sender.