cluster layering

Stephen C. Tweedie sct@redhat.com
Wed, 24 Mar 1999 16:53:00 +0000 (GMT)


Hi,

On Wed, 24 Mar 1999 11:08:18 -0500, "Peter J. Braam" <braam@cs.cmu.edu>
said:

> Can we talk a bit more about the layering.   I'm thinking about a lock
> manager and a caching protocol for  SAN file systems - ala VMS with some
> changes.  Clearly the DLM will sit above the connection manager. Both the
> DLM and the connection manager will sit on top of a communications layer
> (note that UDP could be a substrate exporting the communications layer
> interface, but should not be identified with it).  So I suppose that while a
> large part of the connection manager can work in user land, a component will
> live in the kernel.

The intelligence of the connection manager should live in user space.  I
don't want those complex protocols in the kernel, and I want to be able
to just kill the membership process and let the cluster reform if
something really bad happens.  There may be some comms code in the
kernel, though.

> I'd like to understand the following example in some more detail.  Suppose
> our cluster has three hosts as members,
> A, B and C, connected by, for example, memory channel.  Suppose that A and B
> share a disk over SCSI, and that host C is working with files on the shared
> disk, and C is doing its communication through A.  So A is mastering the
> locks on some part of the file system on the shared disk, and C holds such a
> lock.

OK

> Now a state change happens: A dies.  We would like for B to take over from
> A, so that C can continue using the disk.  From Coda I know that C can find
> out about A's disappearance in multiple ways:

> 1.  the membership component connection manager notices it first

> 2.  an error is returned by the lower layers of the file & I/O system on C
> when doing I/O between C and  A and the disk

The way I'm dealing with this is that the connection manager is layered
on top of the cluster communications mechanism.  Any IO error reported
by the comms necessarily triggers an immediate cluster transition.  All
further IO over _any_ local cluster comms channels is suspended until
things resolve.
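
As a rough illustration of that rule, here is a minimal Python sketch.
The class and method names (ClusterComms, report_io_error, resume) are
hypothetical and not taken from any real implementation; the sketch only
shows one error on one channel suspending traffic on every channel.

```python
class CommsSuspended(Exception):
    """Raised when a send is attempted during a cluster transition."""


class ClusterComms:
    """Hypothetical comms layer: any IO error suspends all channels."""

    def __init__(self):
        self.suspended = False
        self.transitions = 0

    def report_io_error(self, channel):
        # The first error starts a transition; errors reported while a
        # transition is already running do not start another one.
        if not self.suspended:
            self.suspended = True
            self.transitions += 1

    def send(self, channel, msg):
        # The check is not per channel: an error anywhere blocks everything.
        if self.suspended:
            raise CommsSuspended("comms suspended during cluster transition")
        return (channel, msg)

    def resume(self):
        # Assumed to be called once the connection manager has settled
        # on a new membership.
        self.suspended = False


comms = ClusterComms()
comms.send("A", "ping")
comms.report_io_error("A")       # error on the channel to A ...
try:
    comms.send("B", "ping")      # ... also blocks the channel to B
    blocked = False
except CommsSuspended:
    blocked = True
comms.resume()
```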

Each cluster-internal service has a priority; we only allow them to
resume comms once all higher priorities have already completed the
transition.  The connection manager has the highest priority: on any comms
error, all other comms is suspended until the connection manager
determines that the cluster has reached a new equilibrium.  We can then
recover lower priority services one by one: the namespace service, the
lock manager, and then any shared data services (cluster RAID and
cluster config databases).
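
The ordering described above can be sketched as follows.  This is only
an illustration: the Service class, the numeric priorities and the
run_transition helper are assumptions, and a real recover step would
rebuild service state rather than just flip a flag.

```python
class Service:
    """Hypothetical cluster-internal service with a recovery priority."""

    def __init__(self, name, priority):
        self.name = name
        self.priority = priority      # lower number = higher priority
        self.comms_enabled = True

    def suspend(self):
        self.comms_enabled = False

    def recover(self, membership):
        # A real service would rebuild its state for the new membership.
        self.comms_enabled = True


def run_transition(services, membership):
    """Suspend every service, then recover them in priority order."""
    for svc in services:
        svc.suspend()
    order = []
    for svc in sorted(services, key=lambda s: s.priority):
        svc.recover(membership)
        order.append(svc.name)
    return order


services = [
    Service("lock manager", 2),
    Service("connection manager", 0),
    Service("cluster RAID", 3),
    Service("namespace service", 1),
]
recovery_order = run_transition(services, membership={"B", "C"})
```

Here the connection manager recovers first and the shared data service
last, whatever order the services were registered in.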

When we start the transition process, we generate a transition-begin
event, which may be seen by the actual applications running on top of
all of this.  Only once the transition of the cluster-internal services
is complete do we send them a full transition event, and at that point
we can failover, migrate or restart any other user services that need to
be changed after the transition.
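
A small sketch of that two-event sequence, as seen by an application.
The EventBus class and the event name "transition-complete" are
hypothetical stand-ins; the text above only names the transition-begin
event and a later "full transition event".

```python
class EventBus:
    """Hypothetical channel that delivers cluster events to applications."""

    def __init__(self):
        self.listeners = []

    def subscribe(self, listener):
        self.listeners.append(listener)

    def emit(self, event):
        for listener in self.listeners:
            listener(event)


def cluster_transition(bus, recover_internal_services):
    # Applications hear about the transition as soon as it starts ...
    bus.emit("transition-begin")
    # ... but the full event is held back until every cluster-internal
    # service has finished its own recovery.
    recover_internal_services()
    bus.emit("transition-complete")


seen = []
bus = EventBus()
bus.subscribe(seen.append)
cluster_transition(bus, lambda: seen.append("internal-recovery"))
```

An application would typically pause on the first event and only fail
over, migrate or restart user services on the second.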

> A. How do you envision that the connection manager gains control when the
> retries have failed a few times?

In general, that depends on whether or not the service doing the
retries is a core cluster service.  If it is (a cluster filesystem or
cluster-RAID device, for example), then its communications are
necessarily bound to the cluster state anyway.

If it is not, then it just keeps retrying, confident that it will get
told by the cluster services when any transition is complete.

> B. How can the connection manager restart the operation initiated by C's I/O
> subsystem, in effect replacing A by B, after the transition in the cluster
> has completed?

On a cluster transition begin, the core cluster services are all told
"Stop updating NOW".  When each core service begins recovery (and
remember that we necessarily have a new, stable cluster state at this
point because the connection manager has already recovered), it is up to
the service at that point to checkpoint any transactions it had in
progress, based on the new cluster state.
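
One possible shape for such a service, as a sketch only.  The text does
not say what checkpointing a transaction involves; the replay-or-abort
policy, the CoreService class and its method names are all assumptions
made for illustration.

```python
class CoreService:
    """Hypothetical core service that freezes updates during a transition."""

    def __init__(self):
        self.accepting_updates = True
        self.in_progress = []         # list of (txn_id, target_node)
        self.replayed = []
        self.aborted = []

    def begin_update(self, txn_id, target_node):
        if not self.accepting_updates:
            raise RuntimeError("updates are frozen during a cluster transition")
        self.in_progress.append((txn_id, target_node))

    def on_transition_begin(self):
        # "Stop updating NOW": refuse new updates until recovery is done.
        self.accepting_updates = False

    def on_recover(self, members):
        # Membership is already stable at this point, so each in-progress
        # transaction can be resolved against it (assumed policy: replay
        # if the target survived, abort otherwise).
        for txn_id, target in self.in_progress:
            if target in members:
                self.replayed.append(txn_id)
            else:
                self.aborted.append(txn_id)
        self.in_progress = []
        self.accepting_updates = True


svc = CoreService()
svc.begin_update("t1", "A")
svc.begin_update("t2", "B")
svc.on_transition_begin()
svc.on_recover(members={"B", "C"})    # A has left the cluster
```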

> I envision something like the following.   Each resource (think of the disk)
> has a name and a storage group associated with it (the storage group would
> be {A,B}).  When we get a lock, we also get a preferred server for the
> resource.  If I/O fails, with ETIMEOUT, we (i) trap the error, (ii) detect
> it is a cluster resource (iii) ask the connection manager to give us a new
> preferred server, and retry.

No, we really don't want the error recovery stuff to spill over between
layers like this.  If the resource fails, we necessarily initiate
controlled recovery, and at the recovery time, we know precisely what
the new cluster state is.

> Where do we trap the error?  In the buffer cache which fails during
> flushing? It probably cannot be done in the file system above it,
> since that merely writes to the buffer cache.  Also, it seems like the
> context in which this happens is possibly not the context of (e.g.)
> the writer in the file system, but instead the context of another
> process which needs memory.  So what is the layer here?  It looks like
> the communications layer or the lock manager exports state (namely the
> preferred servers) to the buffer cache.

The cluster block device is responsible for all of this.  For HA, we
want cluster-RAID1 anyway, and the same mechanism which is required to
implement that will work just perfectly for the single-disk case as well
as the multiple-copy case.

--Stephen