[ENBD] Re: NBD Observations
Peter T. Breuer
ptb@it.uc3m.es
Fri, 8 Sep 2000 20:40:39 +0200 (MET DST)
"A month of sundays ago Paul Flinders wrote:"
> I agree that the policy should be outside the kernel where possible
>
> How about:
> - If the nbd module has been loaded but nbd-client never run
> I/O requests, geometry requests, whatever should always return
> EIO (we don't know basic things about the device at this point
> so accepting a request seems a bad idea and we don't have a user
> space to defer to).
This is (now) so for read/write. ioctls will still return sensibly.
Well, hang on, let me make some size requests return INVAL or
NODEV (which?) .... arrgh. I see why NBD_INITIALISED was set on the
first successful open request and not earlier - I can't block all
ioctls because set_size and set_blksize and set_sig have to be able to
run. Grr ... this is impossible. Whichever measure I use, it blocks
the ioctl that sets the final brick in place. Hmm. OK, I've blocked
BLKGETSIZE until set_size has run, and blocked GET_BLKSIZE until
set_blksize has run. There is a geometry ioctl that will still run
but return nonsense.
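A minimal sketch of that gating, not the driver's actual code. The flag and request names (DEV_SIZE_SET, REQ_GET_SIZE and so on) are hypothetical stand-ins for the real ioctl numbers, and ENODEV is an assumed answer to the "which?" above. The setters always run; each getter is refused until its own setter has run.

```c
#include <errno.h>

/* Hypothetical per-device flags; the real driver's names may differ. */
enum {
    DEV_SIZE_SET    = 1 << 0,  /* set_size has run */
    DEV_BLKSIZE_SET = 1 << 1   /* set_blksize has run */
};

/* Hypothetical request codes standing in for the real ioctl numbers. */
enum req { REQ_SET_SIZE, REQ_SET_BLKSIZE, REQ_GET_SIZE, REQ_GET_BLKSIZE };

struct dev {
    unsigned flags;
    unsigned long size;
    unsigned long blksize;
};

/* Returns 0 on success or a negative errno value. */
int dev_ioctl(struct dev *d, enum req r, unsigned long *arg)
{
    switch (r) {
    case REQ_SET_SIZE:            /* setters are never blocked */
        d->size = *arg;
        d->flags |= DEV_SIZE_SET;
        return 0;
    case REQ_SET_BLKSIZE:
        d->blksize = *arg;
        d->flags |= DEV_BLKSIZE_SET;
        return 0;
    case REQ_GET_SIZE:            /* blocked until set_size has run */
        if (!(d->flags & DEV_SIZE_SET))
            return -ENODEV;       /* assumption: ENODEV rather than EINVAL */
        *arg = d->size;
        return 0;
    case REQ_GET_BLKSIZE:         /* blocked until set_blksize has run */
        if (!(d->flags & DEV_BLKSIZE_SET))
            return -ENODEV;
        *arg = d->blksize;
        return 0;
    }
    return -EINVAL;               /* unknown request */
}
```

Keeping one flag per setter avoids the trap of a single "initialised" bit that would also block the ioctl that sets it.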
> - If nbd-client has run (thus setting device size etc) but
> subsequently detached there should be two options - either
> return EIO (the default) or wait (with timeout) for a client
> to connect. Secondary to this nbd-client should detach and
> warn if invoked with no routes
Well. If nbd-client exits gracefully it should probably shut down the
device OK (I know that I can get it shut down just fine). The problem
is what if no graceful exit happens. Then the kernel probably will
never find out about it. It didn't really know that a particular
process was interested in it .. it just knows that some process
connected to the appropriate minor with the right signature. It doesn't
know when it's gone away for good. Uh .. wait. Maybe it can tell. The
daemon registered a buffer with it. It can check the buffer to see if
it's still a valid memory area. That might be possible. Bug me about
that.
Another approach is, once the last slot has been cleared properly,
to error out any requests still outstanding, as you suggest.
> - If a client is attached pass the request to it and wait for
Clients are the proactive elements in this exchange. The kernel
is "just" code waiting to be executed by them, when they make an
ioctl call.
> a response. A timeout should "protect" the kernel but should
> be longer than any of the default client timeouts so it is
> normally nbd-client that sets policy.
This is problematic.
> That way the situation should be that the block device will function
> as expected (assuming that the network is OK and there is a server etc)
> OR will at least tell you that more work is required.
>
> I agree mke2fs will be hard to dissuade because it doesn't distinguish
> between EINVAL (which is what I think should be returned if you try to
> seek past the end of a block device) and EIO so it will resort to a
How about ENODEV! NBD returns that under some circumstances.
> binary search, however it should only do 10 iterations or so on
> examination of the code. Possibly it would be worth filing a bug
> report against e2fsprogs. In fact if we can reliably get an error out
> I think I may do just that.
> I think anything that needs info from user space should error at this
> point though.
Well, it's a little more complicated.
> > I don't really understand how that can happen either. I suspect it's
> > a send or receive on a socket.
>
> Most likely it did something that made the kernel want to flush a
> block out on an nbd device (I'd run mke2fs so there should have been
> plenty of dirty pages/writes queued).
This doesn't hang a client. A client enters the kernel with an ioctl
looking for a request to treat. If there isn't one, it waits for at
most 5s in the kernel and exits or treats a new arrival. If there is
one, it picks it up and resurfaces and sends it across the net. I see
no opportunity to hang except in a send/recv on the socket.
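The cycle just described can be sketched roughly as below. This is a user-space model only. The two callbacks are hypothetical: one stands in for the ioctl that waits up to 5s in the kernel, the other for the send/recv on the socket. The fake_* functions exist only so the sketch runs on its own.

```c
#include <stddef.h>

enum { FAILED = -1, TIMED_OUT = 0, GOT_REQUEST = 1 };

struct request { int id; };

/* Hypothetical: models the ioctl that waits (at most ~5s) for a request. */
typedef int (*wait_fn)(struct request *out, void *ctx);
/* Hypothetical: models sending the request over the net and getting a reply. */
typedef int (*xfer_fn)(const struct request *req, void *ctx);

/* Runs max_rounds cycles; returns requests treated, or -1 on failure. */
int client_loop(wait_fn wait_for_request, xfer_fn transfer,
                void *ctx, int max_rounds)
{
    int treated = 0;
    for (int i = 0; i < max_rounds; i++) {
        struct request req;
        int rc = wait_for_request(&req, ctx);
        if (rc == FAILED)
            return -1;
        if (rc == TIMED_OUT)
            continue;             /* resurfaced empty-handed; go round again */
        if (transfer(&req, ctx) != 0)
            return -1;            /* the only step that touches the socket */
        treated++;
    }
    return treated;
}

/* In-memory stand-ins so the sketch is self-contained. */
struct fake { int pending; int sent; };

int fake_wait(struct request *out, void *ctx)
{
    struct fake *f = ctx;
    if (f->pending == 0)
        return TIMED_OUT;
    out->id = f->pending--;
    return GOT_REQUEST;
}

int fake_xfer(const struct request *req, void *ctx)
{
    struct fake *f = ctx;
    (void)req;
    f->sent++;
    return 0;
}
```

In this model the wait step is bounded by construction, so the transfer step is the only place left where a client could block indefinitely.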
> Actually isn't there a general possibility of a deadlock here. If
> memory is low the kernel will want to flush some buffers some of which
> may be nbd blocks. That wakes the client which may need to page in
> code and/or data (we're low on memory so it could have been paged out)
> in which case it will be put to sleep waiting for buffers to be released
> which isn't going to happen because there's an "unresponsive" block
> device gumming up the works.
I superstitiously believe that you cannot deadlock between kernel and
userspace. But, yes, the above argument makes me very ill. However,
I believe the nbd-client protocol code is always paged in. It's doing
i/o, so it gets priority, and it dives into the kernel and out again
continuously. Yes, it would be better to pin the code pages. Urrrr ..
I think one can do that with mmap. But wait, so long as the kernel can
find 4096 bytes of memory at a time, it can execute the nbd code. It
must be able to page out 4096 bytes of something .. it flushes buffers
long before coming under severe memory pressure.
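For the pinning idea, a minimal sketch using mlockall(2), which is a different call from the mmap mentioned above. The function name pin_client_memory is hypothetical. The call needs privilege or a large enough RLIMIT_MEMLOCK, so the sketch reports failure instead of assuming success.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Try to lock all current and future pages of this process into RAM
 * so that the client never has to page its own code or data back in.
 * Returns 0 on success, or the (positive) errno value on failure.
 */
int pin_client_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return errno;   /* typically EPERM or ENOMEM without privilege */
    return 0;
}
```

A daemon would call this once at startup, before it first enters the kernel to fetch requests, and decide for itself whether to carry on unpinned if the call fails.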
> desire/pressure to "get the thing working" but one of the things that
> distinguishes a generally useable piece of software from "a neat hack"
> is handling errors sensibly and gracefully. Unfortunately doing so
> can often double the size of the code and quadruple its complexity.
Yes .. I am looking for generic solutions, not particular ones. That is
why I don't protect against "0 connections", for example. In my
opinion, 0 is a perfectly valid number. One should not treat it like an
error. Possibly one should shout at the user: you requested N
connections!
> > Incidentally, does the client really accept a signature option? From
> > that it doesn't seem to! No. Looking at the code, the client does not
> > have a signature option. The signature is sent over from the server,
> > being an identifier for the resource. So ... that looks like a manpage
> > error. Right. Fixed. Argh. Also there was a -l01 option listed.
> > Also wrong and gone. At least none of them appeared in the text, "just"
> > in the synopsis.
>
> No, I don't think that it does in which case nbd-client should
> probably not accept the switch.
As far as I can tell, it should complain:
    if (*argv[i] == '-') {
        /* option */
        int j; int success = 0;
        for (j = 0; options[j].x; j++) {
            ...
        } /* end for j */
        if (!success)
            return 9;
and back at the caller ...
    if (err = cmdline(self, argc, argv), err) {
        MSG("client cmdline returned error\n");
        usage(err);
        return err;
    }
> I guess that the basic problem is that the kernel is written from the
> perspective that block reads/writes rarely, if ever, fail.
This is true. You'll find equally untested things happening when you
cut a few disk cables.
> > The trouble is that a client may be there but something else (the net?)
> > may not. Yes, I agree that there should be a protocol for erroring
> > out all remaining requests, but what?
>
> Not sure. However if there's a user space nbd-client hooked up it
> should have the job of coping with network outages etc.
It does. It's just hard to test the error paths.
> > clients. What should I do? Perhaps error them out when the active
> > client count falls to zero? That I can do. But maybe it's only a
> > temporary glitch! Somebody may restart the daemons. In that case
> > erroring out the pending requests seems to be the wrong thing.
>
> Make it a settable option.
It could be done. I'll think about it. But it needs a better approach.
Essentially the client wants to set a dead-man switch. But what if
it triggers wrongly? Then we have a live client trying to talk to a
dead kernel.
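To make the dead-man switch concrete, here is a minimal sketch and not a proposal for the driver. All names (struct deadman, deadman_kick and so on) are hypothetical. Time is passed in explicitly so the logic can be followed without a clock.

```c
#include <stdbool.h>
#include <time.h>

struct deadman {
    time_t last_seen;   /* last time a client checked in */
    time_t timeout;     /* seconds of silence tolerated; 0 disables */
};

/* Arm the switch: the client promises to check in every `timeout` seconds. */
void deadman_arm(struct deadman *d, time_t now, time_t timeout)
{
    d->last_seen = now;
    d->timeout = timeout;
}

/* Called each time a client enters the kernel. */
void deadman_kick(struct deadman *d, time_t now)
{
    d->last_seen = now;
}

/* True once the client has been silent for longer than the timeout. */
bool deadman_expired(const struct deadman *d, time_t now)
{
    return d->timeout != 0 && now - d->last_seen > d->timeout;
}
```

The wrong-trigger worry shows up directly here: a client that is merely slow, say stuck in a long send, looks exactly like a dead one once the timeout passes.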
> > In any case, they can be errored out from userland with a USR1 signal
> > or an echo -n 0 >/proc/nbdinfo.
>
> So long as that works reliably.
USR1 works fine. USR2 is risky.
Peter