cluster layering
David Brower
dbrower@us.oracle.com
Wed, 24 Mar 1999 08:40:34 -0800
The situation Peter is describing here is often called
"proxy i/o", and is the sort of thing done by IBM's
VXD or VSD layer on AIX. In AIX-land, this sort of
failure is also handled by the _event_ subsystem,
which propagates information about important cluster-wide
transitions. I tried to send a picture of the IBM
reference model out a couple of weeks ago, but don't
know if anyone got it.
Essentially, there's a way for interested
subsystems to get callbacks when something happens. You
clearly need to handle errors correctly, but it's less
clear that driving a cluster transition directly on
receipt of an error is the preferable thing to do. I
could see an i/o error generating an event into the CM
to tell it to take another look around quickly to see
what might be broken.
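
To make that concrete, here is a rough sketch in Python of what I
have in mind. It is not the AIX event interface; every name in it
(EventBus, ClusterManager, "io_error", "membership_change", the
probe function) is made up for illustration. The point is only that
the i/o error is a hint which makes the CM re-probe, and the CM
alone decides whether a transition happens.

```python
class EventBus:
    """Hypothetical event subsystem: subscribers get callbacks per event type."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, event_type, callback):
        self._subscribers.setdefault(event_type, []).append(callback)

    def publish(self, event_type, **details):
        # Copy the list so a callback may subscribe/publish while we iterate.
        for callback in list(self._subscribers.get(event_type, [])):
            callback(event_type, details)


class ClusterManager:
    """Hypothetical CM: treats an i/o error as a hint to re-check membership."""

    def __init__(self, bus, members, probe):
        self.bus = bus
        self.members = set(members)
        self.probe = probe  # assumed: probe(host) -> True if host answers
        bus.subscribe("io_error", self.on_io_error)

    def on_io_error(self, event_type, details):
        # The error itself does not drive a transition; it only
        # triggers an early look around.
        self.check_membership()

    def check_membership(self):
        departed = {m for m in self.members if not self.probe(m)}
        if departed:
            self.members -= departed
            # Only now do interested subsystems hear about a transition.
            self.bus.publish(
                "membership_change",
                members=sorted(self.members),
                departed=sorted(departed),
            )
```

An i/o layer would then call bus.publish("io_error", host="A") after
its retries run out, and a lock manager or file system would
subscribe to "membership_change" rather than react to the raw error.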
-dB
"Peter J. Braam" wrote:
>
> Stephen,
>
> Can we talk a bit more about the layering? I'm thinking about a lock
> manager and a caching protocol for SAN file systems - a la VMS with some
> changes. Clearly the DLM will sit above the connection manager. Both the
> DLM and the connection manager will sit on top of a communications layer
> (note that UDP could be a substrate exporting the communications layer
> interface, but should not be identified with it). So I suppose that while a
> large part of the connection manager can work in user land, a component will
> live in the kernel.
>
> I'd like to understand the following example in some more detail. Suppose
> our cluster has three hosts as members,
> A, B and C, connected by, for example, Memory Channel. Suppose that A and B
> share a disk over SCSI, and that host C is working with files on the shared
> disk, and C is doing its communication through A. So A is mastering the
> locks on some part of the file system on the shared disk, and C holds such a
> lock.
>
> Now a state change happens: A dies. We would like for B to take over from
> A, so that C can continue using the disk. From Coda I know that C can find
> out about A's disappearance in multiple ways:
>
> 1. the membership component of the connection manager notices it first
>
> 2. an error is returned by the lower layers of the file & I/O system on C
> when doing I/O between C and A and the disk
>
> Let's look at case 2 (I believe that 1 is slightly simpler). C's I/O
> subsystem will do some retries before it decides that something pretty bad
> has happened. I'd like to raise the following two questions for discussion:
>
> A. How do you envision that the connection manager gains control when the
> retries have failed a few times?
>
> B. How can the connection manager restart the operation initiated by C's I/O
> subsystem, in effect replacing A by B, after the transition in the cluster
> has completed?
>
> I envision something like the following. Each resource (think of the disk)
> has a name and a storage group associated with it (the storage group would
> be {A,B}). When we get a lock, we also get a preferred server for the
> resource. If I/O fails with ETIMEDOUT, we (i) trap the error, (ii) detect
> that it is a cluster resource, (iii) ask the connection manager to give us a
> new preferred server, and retry.
>
> Where do we trap the error? In the buffer cache which fails during
> flushing? It probably cannot be done in the file system above it, since that
> merely writes to the buffer cache. Also, it seems like the context in
> which this happens is possibly not the context of (e.g.) the writer in the
> file system, but instead the context of another process which needs memory.
> So what is the layer here? It looks like the communications layer or the
> lock manager exports state (namely the preferred servers) to the buffer
> cache.
>
> We could also make a clearer separation, and build a disk "class driver".
> The file system talks with that device and the disk class driver is in turn
> a customer of the buffer cache. Is this perhaps preferable in a future
> with NAS etc?
>
> Also note that we are asking for a lot of action while flushing buffers -
> namely to reconfigure the cluster and lock database and then try again. In
> particular, we need to have sufficient memory to spare to run your user
> level programs to reconfigure stuff.
>
> Just some thoughts. Are yours going in the same direction?
>
> - Peter -
>
> ----- Original Message -----
> From: Stephen C. Tweedie <sct@redhat.com>
> To: <alanr@bell-labs.com>
> Cc: Tom Vogt <tv@wlwonline.de>; Linux-HA mailing list <linux-ha@muc.de>;
> Stephen Tweedie <sct@redhat.com>
> Sent: Wednesday, March 24, 1999 9:07 AM
> Subject: Re: udp broadcast
>
> > Hi,
> >
> > On Tue, 23 Mar 1999 07:11:15 -0700, alanr@bell-labs.com said:
> >
> > >> em... can anyone tell me how I listen on the broadcast address
> > >> without a need for root privileges? is that possible? if not,
> > >> what's the recommended solution?
> >
> > > I *think* that's required.
> >
> > No, the udp side of my cluster-comms code already does neighbourhood
> > discovery automatically using broadcast, all from an unprivileged
> > daemon. It just sets the SO_BROADCAST socket option for sending, and
> > binds to the local host adapter's IP address for receiving.
> >
> > > However, you should note that the HA subsystem needs lots of privileges
> > > anyway because it has to do things only trusted users can do (like
> > > change IP configurations, reboot machines, mount filesystems, etc.)
> >
> > The HA subsystem needs to be layered very carefully. The layer which
> > keeps track of the cluster state needs no privileges, but the layer
> > which runs user-level startup/failover scripts obviously needs to be
> > able to run as the appropriate user for each service.
> >
> > --Stephen
> >