[Linux-ha-dev] tracking resource groups in heartbeat
Wed, 29 Mar 2000 11:59:36 -0800
I have been thinking of a way of getting around the problem of multiple
nodes owning a resource after a media failure. It seems that for this
problem the main issue is that though nodes are aware of whether or not
they have contact with each other, they are not necessarily aware of what
resources the other nodes currently have.
I think that every node having omniscient knowledge of which resources are
on which nodes of the cluster is bad: too much state, too much of the time.
Rather, I am thinking of a mechanism where, if there is a chance of a
resource being in an unknown state, the status of the resource can be
requested. This might occur when node A notices node B has just come back
up. In the case of a media failure, node B would also notice that node A
has just come up.
Here is a first cut.
To find out which node owns a resource, a resource-request could be sent.
This would contain the name of the resource, as well as auth, timestamp and
sequence number information.
All nodes should reply (except the originating node - presumably it knows
the status of the resources it owns).
The resource-reply should contain the resource name and status, as well as
auth, timestamp and sequence number information. In addition, information
for tie-breaking should be included: who the node thinks is the master,
the time since the resource was last obtained/given up, and perhaps a
random number as a last-resort tie-breaker.
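To make the two messages concrete, here is a rough sketch in Python. It is
only an illustration, not heartbeat code: the field names, the opaque auth
string, and above all the order in which the tie-breakers are applied
(self-believed master first, then longest time held, then the random number,
then node name) are my assumptions, since none of that is fixed above.

```python
from dataclasses import dataclass, field
import random


@dataclass
class ResourceRequest:
    resource: str      # resource name, e.g. an entry from haresources
    sender: str        # originating node (hypothetical field name)
    auth: str          # authentication data, treated as opaque here
    timestamp: float
    seq: int


@dataclass
class ResourceReply:
    resource: str
    sender: str
    owned: bool            # status: does the sender hold the resource?
    auth: str
    timestamp: float
    seq: int
    believed_master: str   # who the sender thinks is the master
    held_for: float        # seconds since the resource was obtained/given up
    nonce: int = field(default_factory=lambda: random.getrandbits(32))


def tie_break_key(reply):
    # Assumed precedence, not specified in the proposal:
    # 1. a node that believes itself to be master,
    # 2. the longer time since the resource last changed hands,
    # 3. the larger random number,
    # 4. the node name, so the result is always deterministic.
    return (reply.believed_master == reply.sender,
            reply.held_for, reply.nonce, reply.sender)


def pick_owner(claims):
    """Return the node that should keep a resource claimed more than once."""
    return max(claims, key=tie_break_key).sender
```

Every node that sees the same set of replies computes the same winner, which
is the property the tie-breaking information is there to provide.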
A resource-request would be sent out for each resource a node is eligible
to own (from haresources) when it sees another node come online. Once
resource-replies are received from all nodes that are up (the node should
know which nodes it thinks are up), it should be able to decide whether
to give up or take over the resource. Once this decision is made, a
resource-reply should be sent out, so all nodes can know the state of the
resource. If the state of the resource is still inconsistent (in particular,
owned more than once), then the other nodes affected should notice this and
send a fresh resource-request.
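
The decision step for a single resource might look roughly like the sketch
below (Python, illustration only). The function name, its arguments and the
five return values are hypothetical. To keep it short, the real tie-breaking
information is replaced by a placeholder rule, "lowest node name wins", which
is purely an assumption of this sketch.

```python
def decide(me, i_own, up_nodes, replies):
    """Decide what node `me` should do with one resource.

    me       -- this node's name
    i_own    -- True if this node currently holds the resource
    up_nodes -- names of the nodes this node believes are up
    replies  -- dict: node name -> True if that node claims the resource
    Returns "wait", "keep", "take", "release" or "leave".
    """
    others = set(up_nodes) - {me}
    if not others <= set(replies):
        # A node thought to be up has not replied yet.
        return "wait"

    owners = {node for node in others if replies[node]}
    if i_own:
        owners.add(me)

    # Placeholder tie-break (assumption): lowest name among the claimants,
    # or among all up nodes when nobody claims the resource.
    winner = min(owners) if owners else min(others | {me})

    if winner == me:
        return "keep" if i_own else "take"
    return "release" if i_own else "leave"


# Media failure heals: both a and b hold the resource.
print(decide("a", True, {"a", "b"}, {"b": True}))
print(decide("b", True, {"a", "b"}, {"a": True}))
```

After acting on the result, the node would send out its own resource-reply
so the others can see the new state, and anyone who still sees the resource
owned more than once starts again with a fresh resource-request.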