cluster layering

Peter J. Braam braam@cs.cmu.edu
Wed, 24 Mar 1999 11:48:29 -0500 (EST)


Hi Dave,


On Wed, 24 Mar 1999, David Brower wrote:

> The situation Peter is describing here is often called
> "proxy i/o", and is the sort of thing done by IBM's
> VXD or VSD layer on AIX.  In AIX-land, this sort of
> failure is also handled by the _event_ subsystem, 
> which propagates information about important cluster-wide
> transitions.  I tried to send a picture of the IBM 
> reference model out a couple of weeks ago, but don't
> know if anyone got it.

Got it. Thanks!

> 
> Essentially, there's a way for interested
> subsystems to get callbacks when something happens.  It's
> probably the case that you need to handle errors correctly,
> but it's less clear that driving cluster transition on
> receipt of an error is the preferable thing to do.  
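
To make the callback idea concrete, here is a minimal sketch, not the AIX event subsystem itself. It assumes a hypothetical in-process registry (`ClusterEvents`) and a toy connection manager that reacts to an I/O error by re-checking membership rather than forcing a transition. All class names, event names and the `is_alive` probe are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Iterable, List, Set

Callback = Callable[[str, dict], None]


class ClusterEvents:
    """Hypothetical event registry: subsystems subscribe to named events."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, event: str, callback: Callback) -> None:
        self._subscribers[event].append(callback)

    def publish(self, event: str, details: dict) -> int:
        """Call every subscriber of `event`; return how many were called."""
        callbacks = list(self._subscribers.get(event, ()))
        for callback in callbacks:
            callback(event, details)
        return len(callbacks)


class ConnectionManager:
    """Toy CM: an I/O error only prompts a fresh look at the membership."""

    def __init__(self, events: ClusterEvents, members: Iterable[str],
                 is_alive: Callable[[str], bool]) -> None:
        self.events = events
        self.members: Set[str] = set(members)
        self.is_alive = is_alive  # assumed liveness probe, e.g. a heartbeat check
        events.subscribe("io_error", self.on_io_error)

    def on_io_error(self, event: str, details: dict) -> None:
        # Re-probe all members; announce a change only if someone is gone.
        dead = {m for m in self.members if not self.is_alive(m)}
        if dead:
            self.members -= dead
            self.events.publish(
                "membership_change",
                {"left": sorted(dead), "members": sorted(self.members)},
            )
```

In this arrangement the I/O path only reports what it saw; whether a transition follows is decided by the connection manager after it has re-checked the cluster.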

Uhm. Well, I think we probably all agree that 

1. recovery from the situation I described is desirable
2. that the node mastering the lock will have to change

If we accept 2. then we have no choice but to rebuild the DLM state - I think
that's a cluster transition.

What do you have in mind when you say "look around quickly"?

- Peter -

> I
> could see an i/o error generating an event into the CM
> to tell it to take another look around quickly to see 
> what might be broken.
> 
> -dB
> 
> "Peter J. Braam" wrote:
> > 
> > Stephen,
> > 
> > Can we talk a bit more about the layering?  I'm thinking about a lock
> > manager and a caching protocol for SAN file systems - a la VMS with some
> > changes.  Clearly the DLM will sit above the connection manager. Both the
> > DLM and the connection manager will sit on top of a communications layer
> > (note that UDP could be a substrate exporting the communications layer
> > interface, but should not be identified with it).  So I suppose that while a
> > large part of the connection manager can work in user land, a component will
> > live in the kernel.
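
The layering described above can be sketched as follows. This is only an illustration under stated assumptions: `CommsLayer` is a hypothetical interface, `LoopbackComms` is an in-memory stand-in for a real substrate such as UDP or memory channel, and the two managers are reduced to a heartbeat and a lock request. None of these names come from existing code.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Set, Tuple

Handler = Callable[[str, bytes], None]  # called with (sender, payload)


class CommsLayer(ABC):
    """Hypothetical communications-layer interface used by the CM and the DLM."""

    @abstractmethod
    def register(self, channel: str, handler: Handler) -> None: ...

    @abstractmethod
    def send(self, dest: str, channel: str, payload: bytes) -> None: ...


class LoopbackComms(CommsLayer):
    """In-memory substrate; a UDP substrate would export the same interface."""

    def __init__(self, node: str, network: Dict[str, "LoopbackComms"]) -> None:
        self.node = node
        self.network = network  # shared table: node name -> its comms endpoint
        self.handlers: Dict[str, Handler] = {}
        network[node] = self

    def register(self, channel: str, handler: Handler) -> None:
        self.handlers[channel] = handler

    def send(self, dest: str, channel: str, payload: bytes) -> None:
        self.network[dest].handlers[channel](self.node, payload)


class ConnectionManager:
    """Sits directly on the comms layer and tracks which peers it has heard from."""

    def __init__(self, comms: CommsLayer) -> None:
        self.comms = comms
        self.seen: Set[str] = set()
        comms.register("cm", self._on_heartbeat)

    def heartbeat(self, dest: str) -> None:
        self.comms.send(dest, "cm", b"alive")

    def _on_heartbeat(self, sender: str, payload: bytes) -> None:
        self.seen.add(sender)


class LockManager:
    """Sits above the CM but sends its own traffic over the same comms layer."""

    def __init__(self, comms: CommsLayer, cm: ConnectionManager) -> None:
        self.comms = comms
        self.cm = cm
        self.granted: List[Tuple[str, str]] = []
        comms.register("dlm", self._on_request)

    def request(self, master: str, resource: str) -> None:
        self.comms.send(master, "dlm", resource.encode())

    def _on_request(self, sender: str, payload: bytes) -> None:
        # Grant only to nodes the connection manager currently knows about.
        if sender in self.cm.seen:
            self.granted.append((sender, payload.decode()))
```

The point of the sketch is that neither manager knows which substrate is underneath; swapping `LoopbackComms` for a UDP-backed class would not change them.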
> > 
> > I'd like to understand the following example in some more detail.  Suppose
> > our cluster has three hosts as members,
> > A, B and C, connected by, for example, memory channel.  Suppose that A and B
> > share a disk over SCSI, and that host C is working with files on the shared
> > disk, and C is doing its communication through A.  So A is mastering the
> > locks on some part of the file system on the shared disk, and C holds such a
> > lock.
> > 
> > Now a state change happens: A dies.  We would like for B to take over from
> > A, so that C can continue using the disk.  From Coda I know that C can find
> > out about A's disappearance in multiple ways:
> > 
> > 1.  the membership component of the connection manager notices it first
> > 
> > 2.  an error is returned by the lower layers of the file & I/O system on C
> > when doing I/O between C and A and the disk
> > 
> > Let's look at case 2 (I believe that 1 is slightly simpler).  C's I/O
> > subsystem will do some retries before it decides that something pretty bad
> > has happened.  I'd like to raise the following two questions for discussion:
> > 
> > A. How do you envision that the connection manager gains control when the
> > retries have failed a few times?
> > 
> > B. How can the connection manager restart the operation initiated by C's I/O
> > subsystem, in effect replacing A by B, after the transition in the cluster
> > has completed?
> > 
> > I envision something like the following.   Each resource (think of the disk)
> > has a name and a storage group associated with it (the storage group would
> > be {A,B}).  When we get a lock, we also get a preferred server for the
> > resource.  If I/O fails, with ETIMEOUT, we (i) trap the error, (ii) detect
> > it is a cluster resource, (iii) ask the connection manager to give us a new
> > preferred server, and retry.
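
Steps (i) to (iii) above can be sketched as a retry loop. This is a user-level illustration only, not kernel code and not an existing API: `ConnectionManager.new_preferred_server`, `cluster_io` and the `do_io` callable are hypothetical, and the sketch uses the POSIX errno `ETIMEDOUT` where the text says ETIMEOUT.

```python
import errno
from typing import Callable, Dict, List, Set


class ConnectionManager:
    """Hypothetical CM that knows each resource's storage group, e.g. {A, B}."""

    def __init__(self, storage_groups: Dict[str, List[str]]) -> None:
        self.storage_groups = storage_groups
        self.failed: Set[str] = set()

    def new_preferred_server(self, resource: str, failed_server: str) -> str:
        # A real CM would complete a cluster transition before answering.
        self.failed.add(failed_server)
        for server in self.storage_groups[resource]:
            if server not in self.failed:
                return server
        raise OSError(errno.EIO, f"no live server left for {resource}")


def cluster_io(resource: str, preferred: str,
               do_io: Callable[[str, str], str],
               cm: ConnectionManager, max_failovers: int = 3) -> str:
    """Run do_io(server, resource), failing over to a new server on timeout."""
    server = preferred
    for _ in range(max_failovers + 1):
        try:
            return do_io(server, resource)
        except OSError as exc:
            # (i) trap the error; anything other than a timeout is passed up
            if exc.errno != errno.ETIMEDOUT:
                raise
            # (ii) only cluster resources have a storage group to fail over to
            if resource not in cm.storage_groups:
                raise
            # (iii) ask the connection manager for a new preferred server
            server = cm.new_preferred_server(resource, server)
    raise OSError(errno.ETIMEDOUT, f"gave up on {resource}")
```

With a storage group of `["A", "B"]` and a `do_io` that times out against A, the call is retried against B and its result returned. The sketch leaves open the question raised in the next paragraph, namely in which layer and in whose context this loop would actually run.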
> > 
> > Where do we trap the error?  In the buffer cache which fails during
> > flushing? It probably cannot be done in the file system above it, since that
> > merely writes to the buffer cache.   Also, it seems like the context in
> > which this happens is possibly not the context of (e.g.) the writer in the
> > file system, but instead the context of another process which needs memory.
> > So what is the layer here?  It looks like the communications layer or the
> > lock manager exports state (namely the preferred servers) to the buffer
> > cache.
> > 
> > We could also make a clearer separation, and build a disk "class driver".
> > The file system talks with that device and the disk class driver is in turn
> > a customer of the buffer cache.  Is this perhaps preferable in a future
> > with NAS etc?
> > 
> > Also note that we are asking for a lot of action while flushing buffers -
> > namely to reconfigure the cluster and lock database and then try again.  In
> > particular, we need to have sufficient memory to spare to run your user
> > level programs to reconfigure stuff.
> > 
> > Just some thoughts.  Are yours going in the same direction?
> > 
> > - Peter -
> > 
> > ----- Original Message -----
> > From: Stephen C. Tweedie <sct@redhat.com>
> > To: <alanr@bell-labs.com>
> > Cc: Tom Vogt <tv@wlwonline.de>; Linux-HA mailing list <linux-ha@muc.de>;
> > Stephen Tweedie <sct@redhat.com>
> > Sent: Wednesday, March 24, 1999 9:07 AM
> > Subject: Re: udp broadcast
> > 
> > > Hi,
> > >
> > > On Tue, 23 Mar 1999 07:11:15 -0700, alanr@bell-labs.com said:
> > >
> > > >> em... can anyone tell me how I listen on the broadcast address without a
> > > >> need for root privileges? is that possible? if not, what's the
> > > >> recommended solution?
> > >
> > > > I *think* that's required.
> > >
> > > No, the udp side of my cluster-comms code already does neighbourhood
> > > discovery automatically using broadcast, all from an unprivileged
> > > daemon.  It just sets the SO_BROADCAST socket option for sending, and
> > > binds to the local host adapter's IP address for receiving.
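
The socket setup described above can be sketched as follows. This is not the cluster-comms code itself; the helper names are hypothetical. It sets `SO_BROADCAST` on the sending socket and binds the receiving socket to a given local address. No privileges are needed as long as the port is outside the privileged range (below 1024). Whether a socket bound to a specific unicast address also receives broadcast datagrams depends on the platform; binding to the wildcard address is a common alternative.

```python
import socket


def make_broadcast_sender() -> socket.socket:
    """UDP socket that is allowed to send to a broadcast address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    return sock


def make_receiver(local_ip: str, port: int) -> socket.socket:
    """UDP socket bound to one local adapter address for receiving."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((local_ip, port))
    return sock
```

A daemon would then call something like `sender.sendto(b"hello", (broadcast_addr, port))`, where `broadcast_addr` and `port` are site-specific values not given in this thread.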
> > >
> > > > However, you should note that the HA subsystem needs lots of privileges
> > > > anyway because it has to do things only trusted users can do (like
> > > > change IP configurations, reboot machines, mount filesystems, etc.)
> > >
> > > The HA subsystem needs to be layered very carefully.  The layer which
> > > keeps track of the cluster state needs no privileges, but the layer
> > > which runs user-level startup/failover scripts obviously needs to be
> > > able to run as the appropriate user for each service.
> > >
> > > --Stephen
> > >
>