cluster layering
Peter J. Braam
braam@cs.cmu.edu
Wed, 24 Mar 1999 11:48:29 -0500 (EST)
Hi Dave,
On Wed, 24 Mar 1999, David Brower wrote:
> The situation Peter is describing here is often called
> "proxy i/o", and is the sort of thing done by IBM's
> VXD or VSD layer on AIX. In AIX-land, this sort of
> failure is also handled by the _event_ subsystem,
> which propagates information about important cluster-wide
> transitions. I tried to send a picture of the IBM
> reference model out a couple of weeks ago, but don't
> know if anyone got it.
Got it. Thanks!
>
> Essentially, there's a way for interested
> subsystems to get callbacks when something happens. It's
> probably the case that you need to handle errors correctly,
> but it's less clear that driving cluster transition on
> receipt of an error is the preferable thing to do.
Uhm. Well, I think we probably all agree that
1. recovery from the situation I described is desirable
2. the node mastering the lock will have to change
If we accept 2, then we have no choice but to rebuild the DLM state - I think
that's a cluster transition.
What do you have in mind when you say "look around quickly"?
- Peter -
> I
> could see an i/o error generating an event into the CM
> to tell it to take another look around quickly to see
> what might be broken.
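To make the callback idea concrete, here is a minimal sketch of an event registry in which an I/O error only nudges the CM to probe a suspect node, rather than driving a cluster transition directly. Everything in it (EventBus, ToyCM, the event names, the shape of the info dict) is hypothetical and invented for illustration; it does not mirror the AIX event subsystem or any existing code.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# Hypothetical event names, chosen only for this sketch.
IO_ERROR = "io_error"
MEMBERSHIP_CHANGE = "membership_change"

Handler = Callable[[str, dict], None]


class EventBus:
    """Minimal publish/subscribe registry for cluster events."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, info: dict) -> int:
        """Call every handler registered for `event`; return how many ran."""
        handlers = list(self._handlers[event])
        for handler in handlers:
            handler(event, info)
        return len(handlers)


class ToyCM:
    """Toy connection manager: an I/O error adds a suspect to probe, nothing more."""

    def __init__(self, bus: EventBus) -> None:
        self.suspects: List[str] = []
        bus.subscribe(IO_ERROR, self.on_io_error)

    def on_io_error(self, event: str, info: dict) -> None:
        node = info["node"]
        if node not in self.suspects:
            self.suspects.append(node)  # "take another look" at this node


bus = EventBus()
cm = ToyCM(bus)
# C's I/O layer reports a timeout while talking to A.
bus.publish(IO_ERROR, {"node": "A", "errno": "ETIMEDOUT"})
```

The point of the split is that the I/O path only reports; whether the report leads to a transition stays a decision of the CM.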
>
> -dB
>
> "Peter J. Braam" wrote:
> >
> > Stephen,
> >
> > Can we talk a bit more about the layering. I'm thinking about a lock
> > manager and a caching protocol for SAN file systems - ala VMS with some
> > changes. Clearly the DLM will sit above the connection manager. Both the
> > DLM and the connection manager will sit on top of a communications layer
> > (note that UDP could be a substrate exporting the communications layer
> > interface, but should not be identified with it). So I suppose that while a
> > large part of the connection manager can work in user land, a component will
> > live in the kernel.
> >
> > I'd like to understand the following example in some more detail. Suppose
> > our cluster has three hosts as members,
> > A,B and C connected by, for example memory channel. Suppose that A and B
> > share a disk over SCSI, and that host C is working with files on the shared
> > disk, and C is doing its communication through A. So A is mastering the
> > locks on some part of the file system on the shared disk, and C holds such a
> > lock.
> >
> > Now a state change happens: A dies. We would like for B to take over from
> > A, so that C can continue using the disk. From Coda I know that C can find
> > out about A's disappearance in multiple ways:
> >
> > 1. the membership component of the connection manager notices it first
> >
> > 2. an error is returned by the lower layers of the file & I/O system on C
> > when doing I/O between C and A and the disk
> >
> > Let's look at case 2 (I believe that 1 is slightly simpler). C's I/O
> > subsystem will do some retries before it decides that something pretty bad
> > has happened. I'd like to raise the following two questions for discussion:
> >
> > A. How do you envision that the connection manager gains control when the
> > retries have failed a few times?
> >
> > B. How can the connection manager restart the operation initiated by C's I/O
> > subsystem, in effect replacing A by B, after the transition in the cluster
> > has completed?
> >
> > I envision something like the following. Each resource (think of the disk)
> > has a name and a storage group associated with it (the storage group would
> > be {A,B}). When we get a lock, we also get a preferred server for the
> > resource. If I/O fails with ETIMEDOUT, we (i) trap the error, (ii) detect that
> > it is a cluster resource, (iii) ask the connection manager to give us a new
> > preferred server, and retry.
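The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: ToyConnectionManager, cluster_io, report_failure and the "shared-disk" name are all hypothetical, the timeout is taken to be the POSIX errno ETIMEDOUT, and the real question of where in the kernel the error gets trapped is not addressed at all.

```python
import errno
from typing import Callable, Dict, List, Optional, Set


class ToyConnectionManager:
    """Toy CM: maps each resource to its storage group and tracks failed hosts."""

    def __init__(self, storage_groups: Dict[str, List[str]]) -> None:
        self.storage_groups = storage_groups
        self.dead: Set[str] = set()

    def is_cluster_resource(self, resource: str) -> bool:
        return resource in self.storage_groups

    def preferred_server(self, resource: str) -> Optional[str]:
        """First member of the storage group not known to be dead, else None."""
        for host in self.storage_groups.get(resource, []):
            if host not in self.dead:
                return host
        return None

    def report_failure(self, resource: str, host: str) -> Optional[str]:
        """Mark `host` as failed and return a new preferred server, if any."""
        self.dead.add(host)
        return self.preferred_server(resource)


def cluster_io(cm: ToyConnectionManager, resource: str,
               do_io: Callable[[Optional[str]], str], retries: int = 3) -> str:
    """Run do_io(server), failing over within the storage group on timeouts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    clustered = cm.is_cluster_resource(resource)
    server = cm.preferred_server(resource) if clustered else None
    if clustered and server is None:
        raise OSError(errno.EIO, f"no live server for {resource}")
    while True:
        last_exc: Optional[OSError] = None
        for _ in range(retries):
            try:
                return do_io(server)
            except OSError as exc:
                if exc.errno != errno.ETIMEDOUT:
                    raise          # some other error: not handled here
                last_exc = exc     # (i) trap the timeout and retry
        assert last_exc is not None
        if not clustered:          # (ii) only cluster resources fail over
            raise last_exc
        server = cm.report_failure(resource, server)  # (iii) new preferred server
        if server is None:
            raise last_exc


# Usage: A is dead, B takes over without the caller noticing.
cm = ToyConnectionManager({"shared-disk": ["A", "B"]})


def fake_io(server: Optional[str]) -> str:
    if server == "A":
        raise OSError(errno.ETIMEDOUT, "A is not answering")
    return f"written via {server}"


result = cluster_io(cm, "shared-disk", fake_io)
```

Note that the sketch runs in the caller's context and ignores the DLM rebuild entirely; in the buffer-cache case discussed here, neither of those simplifications would hold.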
> >
> > Where do we trap the error? In the buffer cache which fails during
> > flushing? It probably cannot be done in the file system above it, since that
> > merely writes to the buffer cache. Also, it seems like the context in
> > which this happens is possibly not the context of (e.g.) the writer in the
> > file system, but instead the context of another process which needs memory.
> > So what is the layer here? It looks like the communications layer or the
> > lock manager exports state (namely the preferred servers) to the buffer
> > cache.
> >
> > We could also make a clearer separation, and build a disk "class driver".
> > The file system talks with that device and the disk class driver is in turn
> > a customer of the buffer cache. Is this perhaps preferable in a future
> > with NAS etc?
> >
> > Also note that we are asking for a lot of action while flushing buffers -
> > namely to reconfigure the cluster and lock database and then try again. In
> > particular, we need to have sufficient memory to spare to run your user
> > level programs to reconfigure stuff.
> >
> > Just some thoughts. Are yours going in the same direction?
> >
> > - Peter -
> >
> > ----- Original Message -----
> > From: Stephen C. Tweedie <sct@redhat.com>
> > To: <alanr@bell-labs.com>
> > Cc: Tom Vogt <tv@wlwonline.de>; Linux-HA mailing list <linux-ha@muc.de>;
> > Stephen Tweedie <sct@redhat.com>
> > Sent: Wednesday, March 24, 1999 9:07 AM
> > Subject: Re: udp broadcast
> >
> > > Hi,
> > >
> > > On Tue, 23 Mar 1999 07:11:15 -0700, alanr@bell-labs.com said:
> > >
> > > >> em... can anyone tell me how I listen on the broadcast address without a
> > > >> need for root privileges? is that possible? if not, what's the
> > > >> recommended solution?
> > >
> > > > I *think* that's required.
> > >
> > > No, the udp side of my cluster-comms code already does neighbourhood
> > > discovery automatically using broadcast, all from an unprivileged
> > > daemon. It just sets the SO_BROADCAST socket option for sending, and
> > > binds to the local host adapter's IP address for receiving.
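As a rough illustration of the approach described above, the following sketch sets SO_BROADCAST on a sending socket and binds a separate receiving socket to a caller-supplied local address. The helper names, DISCOVERY_PORT and the payload are hypothetical and are not taken from the cluster-comms code. Whether a socket bound to a unicast address actually receives broadcast datagrams varies by platform, so on some systems the receiver may need to bind to the wildcard or broadcast address instead; that part is an assumption to verify locally.

```python
import socket

# Hypothetical port; on typical Unix systems ports >= 1024 need no root.
DISCOVERY_PORT = 50000


def make_sender() -> socket.socket:
    """UDP socket that is allowed to send to a broadcast address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return s


def make_receiver(local_ip: str, port: int = DISCOVERY_PORT) -> socket.socket:
    """UDP socket bound to the given local address for receiving announcements."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((local_ip, port))
    return s


def announce(sender: socket.socket, broadcast_ip: str,
             port: int = DISCOVERY_PORT, payload: bytes = b"hello") -> int:
    """Send one discovery datagram; returns the number of bytes sent."""
    return sender.sendto(payload, (broadcast_ip, port))
```

A daemon would call make_receiver() with its adapter's address, then periodically call announce() with the subnet broadcast address and read replies with recvfrom(). None of this requires a privileged port or a raw socket.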
> > >
> > > > However, you should note that the HA subsystem needs lots of privileges
> > > > anyway because it has to do things only trusted users can do (like
> > > > change IP configurations, reboot machines, mount filesystems, etc.)
> > >
> > > The HA subsystem needs to be layered very carefully. The layer which
> > > keeps track of the cluster state needs no privileges, but the layer
> > > which runs user-level startup/failover scripts obviously needs to be
> > > able to run as the appropriate user for each service.
> > >
> > > --Stephen
> > >
>