[ENBD] Re: NBD Observations

Peter T. Breuer ptb@it.uc3m.es
Fri, 8 Sep 2000 18:43:29 +0200 (MET DST)


"A month of sundays ago Paul Flinders wrote:"
> "Peter T. Breuer" <ptb@it.uc3m.es> writes:
> > "A month of sundays ago Paul Flinders wrote:"
> > > I also note that I can access a block device which has no client
> > > running on it, mke2fs reported trying to build a 1073741823 block
> > 
> > 10 gigabytes? 
> 
> No, 1 giga*block* (i.e about 512 gigabytes)

Aha. That probably is the default size.  It's always been that way.
You want all read/write accesses to die when the client is in this
state ... what has happened is that the device has registered with the
kernel, so the kernel is passing requests to its queue, but the
device has no clients able to treat requests, so they just stay there.

I probably agree with you that under certain conditions the device
should error out requests, but it is hard to decide what those are.
Since that's policy, surely it should be outside the kernel! There
are ioctls that will error out the device. The client uses
them in response to USR1 and USR2 calls (I seem to recall they're
NBD_RESET and NBD_HARD_RESET).

I think that the device ougt to error out requests when it is loaded.
Let's see.  The NBD_ENABLED flags are both unset on every slot.  You can
see by catting /proc/nbdinfo.  It'll give the state at the top. But
I think the device will set the INITIALISED flag. That's bad, because
that enables the device queue and the kernel will pass us requests.
Oh great. The INITIALISED flag is set by the first open of the
device. This probably a strategic error. I wonder why that is. Can't 
I tell that it should be opened otherwise? In any case, I can error
out requests coming from the kernel that are aimed at a device that
has no clients:

        if (!(lo->flags & NBD_INITIALISED)) {
           NBD_FAIL ("Device not initialised.");
        }
+       if (lo->nslot <= 0) {
+          NBD_FAIL ("Device has no clients registered yet.");
+       }
+       if (lo->aslot <= 0) {
+          NBD_FAIL ("Device presently has no active clients.");
+       }

> > driver loaded).  Do you mean accessing it when there is a driver loaded
> > but before a client has told the kernel that it's ready to start up?
> 
> Yes, that's the state that I intended to be in, I had just insmodded
> nbd.o, but not started a client.

OK. I think the above driver change addresses that. I can't avoid the
kernel putting the requests on the queue, but I can error them.

> Due to some set of circumstances that I don't (yet) understand the
> main client thread had gotten itself into an unkillable state

I don't really understand how that can hapen either. I suspect it's
a send or receive on a socket.

> What I'm driving at is that although I can get nbd to work under
> fairly ideal circumstances it is alarmingly easy to do things in the
> wrong order or with the wrong command line parameters and get a
> totally stuck machine. I'm fairly keen to get this working and put

Well, I agree. But then it's a kernel driver (though a lot of it is in
userspace). And there truly are a lot of error paths. I agree that
work should be done to remove the error situations, whether they be
human or machine caused, but it comes after getting the thing working
in the first place! (I still have to find out what causes buffer
poisoning after a while on 2.4.0test1 kernels .. maybe I have to update
my kernel and it will stop doing that ...)

> some effort in (coding if needs be - although "officially" I can't
> spend time on it) but I think that the whole thing needs to be more
> robust and "user friendly".

I would heartily welcome any contributions! I've implemented the
code changes you've suggested thus far.

> > > command lines were
> > > server: nbd-server 4017 /dev/hda3 -b 2048 -i nbd
> > > client: nbd-client -i nbd rat-1 4017 rat-1 /dev/nda
> > > 
> > > I guess that the signature might be something to do with it - have I
> > > got the command lines correct?
> > 
> > It looks right, except that I thought options were supposed to come at
> > the end. Let me see ... well, the code looks like it accepts them
> > anywhere. It tries to interpret each arg as an int, then if it fails
> > and the arg does not begin with a -, it thinks it's a host or device
> > name. If it begins with a - it treats it as an option.
> 
> If I put the -i nbd at the end of the client command line I get
> 
> # nbd-client rat-1 4017 rat-1  /dev/nda -i nbd
> nbd-client: client says host name -1 is rat-1
> nbd-client: client says port -1 is 4017
> nbd-client: client says host name 0 is rat-1
> nbd-client: client says port 0 is 4017
> nbd-client: client says host name 1 is /dev/nda
> nbd-client: client says port 1 is 4017
> nbd-client: client cmdline returned error

Ah .. OK. I see. It can't tell the difference between a hostname
and a device name except by position. It thought the /dev/nda
was a hostname. The device name has to be last.

> Usage: host port .. [host port ..] .. [-b blksize] [-t timeout]nbd_device [mirror_device ..]
> (error 9)

Incidentally, does the client really accept a signature option? From
that it doesn't seem to!  No.  Looking at the code, the client does not
have a signature option.  The signature is sent over from the server,
being an identifier for the resource.  So ...  that looks like a manpage
error.  Right.  Fixed.  Argh.  Also there weas a -l01 option listed.
Also wrong and gone.  At least none of them appeared in the text, "just"
in the synopsis.

So in this case the response is right. What happened when you put the
-i elsewhere? It should also have told you to go away ...

But quite right, the usage message needs to have more in it ...
changed to reflect reality.

        printf(
            "Usage: host port alias [alias ..] .. "
            "[-b blksize] [-t timeout] [-j journal] [-r] "
#ifdef USING_SSL
            "[-cert certfile] [-key keyfile] "
            "[-CApath CApath] [-CAfile CAfile] "
            "[-verify depth] "
#endif
            "nbd_device [mirror_device ..]\n"


> > > It should produce a warning message and exit (you have to kill it
> > > anyway so I can't see why it shouldn't do that cleanly itself).
> > 
> > I'll have to check this.  There's something wrong here.
> 
> OK

Still looking.

> > > If there is an nbd-client then the device driver should have a timeout 
> > > and return EIO if it expires.
> > 
> > Well, each request is timed and retracted back to the main queue if
> > not serviced. Then I would prefer to hold them until a client is
> > reconnected.
> 
> My experience is that mke2fs is still hung, even if I re-start a
> client.

Well, I know that I have noticed the mke2fs thing, and after checking
what it was doing with strace, realized that it was waiting for
responses and just kept plugging away at echo -n 0 >/proc/nbdinfo
until it came back complaining. It was a while ago, because afterwards
I implemented the "stay disabled for 5s" rule.

> If the device can queue requests until a client connects then it may
> well be inappropriate to send back an error immediately but it might
> be a safe default setting (i.e you have to explicitly turn on a "wait
> for a client" option.

The trouble is that a client may be there but something else (the net?)
may not. Yes, I agree that there should be a protocol for erroring
out all remaining requests, but what? 

I've now got a check of the count of active clients in the kernels
request queue. But there is a second queue - the device queue - to
which requests are moved by the driver. Clients look at that queue
when asking for requests to treat. It's possible that there will be old
requests on that queue that won't be serviced when there are no
clients. What should I do? Perhaps error them out when the active
client count falls to zero? That I can do. But maybe it's only a
temporary glitch! Somebody may restart the daemons. In that case
erroring out the pending requests seems to be the wrong thing.

In any case, they can be errored out from userland with a USR1 signal
or a echo -n 0 >/proc/nbdinfo.

> > > What about the case that the server device/file is read/write - does
> > > the journal file hold only transactions that have not been confirmed
> > > by the server or all writes.
> > 
> > At the moment the journal file holds all writes. If the resource is
> > RW, writes are also passed across to it. If it is RO they are not.
> > 
> > > Can the journal file be used to queue writes while the server is
> > > offline - that would be really useful if the server has to be rebooted
> > > for any reason.
> > 
> > That is the intention. I'll have to give some thought to how to
> > implement it though (well, I just have. Mark the block as transfered
> > in the bitmap when it's been acked by the server. And maintain a list
> > of outstanding blocks). I believe I have to get the hashmap (i.e. map
> > indirection) working first. The codes there, waiting for me to link it
> > in.
> 
> I think that the journalling changes tat you describe will be useful.

So do I. It's just work. And it is very careful work.

Peter