[Linux-ha-dev] A possible approach for i/o fencing

Stephen C. Tweedie sct@redhat.com
Mon, 6 Mar 2000 23:24:53 +0000 (GMT)


Hi,

On Mon, 06 Mar 2000 11:43:48 -0800, David Brower <dbrower@us.oracle.com>
said:

>> First, if we could rely 100% on nodes outside the quorate cluster to
>> enforce this, there wouldn't be any need for fencing.

> I'm not sure I go along with this.  The only parties that can be issuing
> fence commands are those that are already in cluster transition modes
> of operation.  

Yes, but that doesn't mean they are all running in the _same_
transition.

> They are already admitting that something is going on, and fencing may
> need to be done, and that someone is going to issue fences.  What we
> are trying to protect ourselves from is the insane nodes that have not
> entered a transition mode, and may be doing writes, or sleeping with
> writes queueing.  

Yes, the example usually given being a node which goes to sleep for a
while for whatever reason, then comes back suddenly, submitting whatever
it had in its write queue to the disk.  And...

> They will not ever be issuing fencing commands, because they aren't in
> the transition code.  

...there is no reason why such a node sleep cannot occur during
transition processing.  After all, problems tend to come in bunches.
What happens if a node is doing cluster recovery, it dies, the rest of
the cluster recovers in turn from that death (bumping the cluster
incarnation of course), and then the dead node suddenly recovers,
sending forth bogus fence commands?

> Thus, I don't believe we need to particularly protect ourselves from
> grossly incorrect fencing by non quorum members, 

I think we do, since we simply cannot make guarantees about faulty
nodes.

>> Secondly, the cluster model I outlined at the January cluster meeting
>> explicitly separates local membership from quorum: quorum is just
>> another resource that the cluster manages.  
...

> I think I've tried to support that model, but not called it out quite
> so clearly.  This is why GRITS is agnostic about the number of groups,
> and the resources controlled by the group.  There is the hidden
> assumption (that ought to be explicit) that the group and gritty
> resources are always a quorum group and the instantiation of the
> access policy for members and non-members to those resources.  

I'm more concerned about the implicit assumption that quorum group ==
membership group, and quorum incarnation == membership incarnation.  We
can make a guarantee that quorum incarnations increase monotonically,
but group membership incarnation identifiers have to behave rather
differently if you allow non-quorate partitions to have incarnation
numbers.

>> Agreed, but that raises a different question: do you need generation
>> numbers at all, then?  If cluster software (a) only ever performs
>> fencing during cluster recovery, and then (b) only if it holds quorum,
>> then the only risk (other than of buggy cluster software) is if cluster
>> recovery takes so long on one faulty node that its own fencing request
>> gets overtaken by another cluster transition elsewhere, and it gets
>> evicted before it notices.

See my argument above: we have to be prepared for fence commands coming
from non-quorate nodes in certain failure modes. :(
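
To make that failure mode concrete, here is a minimal sketch, assuming the
fenced resource remembers the highest quorate generation it has accepted
and refuses fence requests tagged with an older one.  FencedResource and
its fence() method are hypothetical names for illustration, not part of
GRITS or any existing interface.

```python
class FencedResource:
    """Hypothetical fenced resource that tracks the newest generation seen."""

    def __init__(self, generation=0):
        self.generation = generation  # highest generation accepted so far
        self.fenced = set()           # nodes currently fenced off

    def fence(self, sender_generation, target):
        """Apply a fence request unless it comes from a superseded generation."""
        if sender_generation < self.generation:
            return False              # stale sender: ignore the request
        self.generation = sender_generation
        self.fenced.add(target)
        return True


resource = FencedResource(generation=7)
# Node A stalls mid-recovery at generation 7; the rest of the cluster
# evicts it and moves on to generation 8.
accepted_new = resource.fence(8, "nodeA")
# Node A wakes up and replays its old request to fence node B.
accepted_stale = resource.fence(7, "nodeB")
```

The stale request is dropped only because the resource still knows it has
seen generation 8, which is exactly why the state held in the resource
matters.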

>> Nasty.  If this happens, it's not clear that you _can_ do the right
>> thing, unless you can rely on persistent generation numbers in the
>> fenced resource.  If the resource is (say) a network switch with no
>> persistent state, the only alternative is for it to broadcast for
>> generation numbers on startup and take the highest one offered, before
>> accepting any fencing instructions.

> Yes; this is where I am unconvinced that non-persistent store is 
> adequate.  We need to keep thinking about this.

Indeed.  However, if our switch does the broadcast for the active
quorate generation number and gets a number of replies representing
enough votes for quorum, then it has a pretty good idea that things are
OK!  That's not hard to do: the "what's my generation number?" message
just has to have a reply like "it's xyzzy, and I have N votes, and
quorum is M."  That's enough to bootstrap the generation numbers
reliably.
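
A rough sketch of that bootstrap, assuming each reply carries the three
fields from the message above and that replies naming the same generation
and quorum threshold can simply have their votes summed.  The Reply tuple
and bootstrap_generation() are hypothetical names, and how the switch
collects the replies is left out.

```python
from collections import namedtuple

# Hypothetical reply format: "it's xyzzy, and I have N votes, and quorum is M."
Reply = namedtuple("Reply", "generation votes quorum")


def bootstrap_generation(replies):
    """Return the highest generation backed by a quorum of votes, else None."""
    votes_by_claim = {}
    for r in replies:
        # Assumption: replies agreeing on (generation, quorum) pool their votes.
        key = (r.generation, r.quorum)
        votes_by_claim[key] = votes_by_claim.get(key, 0) + r.votes
    quorate = [gen for (gen, quorum), votes in votes_by_claim.items()
               if votes >= quorum]
    return max(quorate) if quorate else None


# One confused node claims generation 15; nodes holding three votes say 12.
replies = [Reply(15, 1, 3), Reply(12, 1, 3), Reply(12, 2, 3)]
chosen = bootstrap_generation(replies)
```

Here the lone claim of 15 is ignored because it is not backed by enough
votes, so the switch would settle on 12.  With no quorate set of replies
the function returns None and the switch should keep refusing fence
instructions.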

Of course, we'd have to have a similar kind of behaviour even when
setting fencing during a cluster transition, since the first quorate
transition after a cluster power cycle may find the GRITS resource
without a prior generation number, and so it will have to "authenticate"
the new generation number with the same sort of quorum evidence.  Once
the resource has a generation number, the monotonic advance assumption
is enough to authenticate future quorate transitions.
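
As a sketch of those two phases, assuming the resource starts with no
generation number after a power cycle: the first transition must present
quorum evidence, and later ones only need to advance the number.
GritsResource and accept_transition() are hypothetical names for
illustration, not an existing API.

```python
class GritsResource:
    """Hypothetical resource with no persistent generation number."""

    def __init__(self):
        self.generation = None  # unknown after a power cycle

    def accept_transition(self, new_generation, votes=0, quorum=None):
        """Accept a new generation, demanding quorum evidence the first time."""
        if self.generation is None:
            # No prior number to compare against: require proof of quorum.
            if quorum is None or votes < quorum:
                return False
            self.generation = new_generation
            return True
        # A number is already held: rely on monotonic advance alone.
        if new_generation <= self.generation:
            return False
        self.generation = new_generation
        return True


res = GritsResource()
first_without_evidence = res.accept_transition(5)
first_with_evidence = res.accept_transition(5, votes=3, quorum=3)
stale = res.accept_transition(4)
later = res.accept_transition(6)
```

In this sketch the first call is refused, the second establishes
generation 5, the stale generation 4 is refused, and generation 6 is
accepted without any further evidence.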

>> Actually, shared scsi has the rather nice property that you can use a
>> combination of reservation and on-disk storage to do a lot of this
>> _without_ direct negotiation between hosts.  ...

> Yes this is true.  I have been hoping, however, to keep dancing until
> we decide that the persistent storage is an essential requirement.  

Either that or presenting evidence of quorum as above, I guess.  It
starts to feel unnecessarily complex, but I can't see a simpler way of
making the guarantees solid.

--Stephen