[ENBD] New to ENBD, having troubles

Peter T. Breuer enbd@lists.community.tummy.com
Wed, 23 Jan 2002 00:53:29 +0100 (MET)


"A month of sundays ago Christopher Eveland wrote:"
> Sorry for the long post, but I'm trying to get enbd set up and running on
> some machines I have, and can't quite figure out whats going wrong.  I can
> build, and do the basic make test, but I seem to hang the machine (or hose
> it anyway) when I try to do more complex things like mkfs.

There have been a couple of reports of this lately. Maybe it's
something I've done.

> beowulf1:~/nbd-2.4.27> uname -a
> Linux beowulf1 2.4.16 #4 SMP Fri Dec 21 16:18:42 EST 2001 i686 unknown

OK.

> I'm running version 2.4.27, I also tried 2.4.26.  I set the SMP flag in the
> makefile to match the kernel.

You SURE you also tried 2.4.26? If you see the symptoms in BOTH 2.4.26
and 2.4.27 then that rules out something I've done wrong in 2.4.27.

> Anyway, after doing the make, I do the make test, and go through the
> checklist.  The module is loaded, I can see the server and client procs on

Can you run mke2fs at that point?

> I can even do some small things with the device (I'm using /dev/ndf if this

ndf? Hmmm ... I have never tried that. It is quite possible that there
is a bug with the higher-numbered devices, simply _because_ I have never
tried that. There was once such a bug, I recall.

> case, since I had set up the others to auto mount on boot, obviously getting
> ahead of myself... anyway, a-e are all turned off), like use dd to copy 512
> bytes onto a device (such as /dev/hda3) and then compare to the original 512
> bytes: they match.  But if I try to do something "big", like mkfs, it seems
> to hang up.

mke2fs is unique in that it tries to write _outside_ a partition, in
order to do a binary search for its limits.
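
To make the binary search concrete, here is a minimal C sketch of that
kind of size probing. It is not the mke2fs code: probe_size, probe_fn,
fake_dev and fake_valid are names made up for illustration, and the
"device" is simulated in memory so the search can be followed without
real hardware. A real probe would attempt a one-byte transfer at the
given offset and report whether it succeeded.

```c
#include <stdint.h>

/* Hypothetical probe callback: returns nonzero if byte offset `off`
 * lies inside the device, zero if an access there fails. */
typedef int (*probe_fn)(uint64_t off, void *ctx);

/* Simulated device, for illustration only. */
struct fake_dev {
    uint64_t size;              /* device size in bytes */
};

static int fake_valid(uint64_t off, void *ctx)
{
    const struct fake_dev *d = ctx;
    return off < d->size;
}

/* Find the device size in bytes using nothing but offset probes. */
static uint64_t probe_size(probe_fn valid, void *ctx)
{
    uint64_t low = 0, high = 1024;

    if (!valid(0, ctx))
        return 0;               /* nothing accessible at all */

    /* Double `high` until it lands outside the device. */
    while (valid(high, ctx)) {
        low = high;
        if (high > UINT64_MAX / 2) {
            /* Assumption: the very last offset is never valid. */
            high = UINT64_MAX;
            break;
        }
        high *= 2;
    }

    /* Invariant: `low` is inside the device, `high` is outside. */
    while (high - low > 1) {
        uint64_t mid = low + (high - low) / 2;
        if (valid(mid, ctx))
            low = mid;
        else
            high = mid;
    }
    return high;                /* first invalid offset == size */
}
```

Each probe that lands past the end of the device is, from the server's
point of view, an out-of-range request, so a search like this only works
if those requests come back as errors.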

> For instance, after doing "make test", I try "mke2fs /dev/ndf" as per the
> online instructions.  As soon as I try mke2fs, I get the following on the
> console:

You are on the server side? And the server dies? I don't understand how
you can be on the server side ...

> nbd-server  1316: slavesighandler server (0) activates slave sighandler for
> signal 11
> nbd-server  1318: slavesighandler server (2) activates slave sighandler for
> signal 11
> nbd-server  1318: server (2) sighandler terminates slave 1318 safely
> nbd-server  1316: server (0) sighandler terminates slave 1316 safely
> nbd-server  1315: server (-1) relaunches child after SIGCHLD
> nbd-server  1315: server (-1) slave pid 1318 is down, launching new
> nbd-server  1324: server (2) set default signal handlers for slave server
> 1324
> nbd-server  1315: server (-1) launched slave pid 1324
> nbd-server  1315: server (-1) slave pid 1316 is down, launching new
> nbd-server  1325: server (0) set default signal handlers for slave server
> 1325
> nbd-server  1315: server (-1) launched slave pid 1325

> nbd-netserver  1178: recv_reply client (1) net_recv_reply reports bad magic
> 0x0 instead of 0x67446698 with error 0x0 handle 0xbffff1ec flags 134544497
> cmd 1 len 0 sector 4294967295


Well, the server died. Getting nonsense back from a dead socket is a
bit strange, but I've seen stranger things.

> magic 0xbffff31c error 0xbffff320 handle 0xbffff324 flags 0xbffff328
> nbd-server  1317: newproto Not enough magic in packet. Breaking off.
> nbd-server  1319: newproto Not enough magic in packet. Breaking off.

And very sensible too.
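
The sort of magic check being reported in those messages can be sketched
as below. This is an illustration under assumptions, not the ENBD code:
the two-field header layout, the big-endian wire order and the names
reply_hdr, be32 and parse_reply are all hypothetical. Only the magic
value 0x67446698 comes from the client log above.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLY_MAGIC 0x67446698u  /* value taken from the client log */

/* Hypothetical reply header, for illustration only. */
struct reply_hdr {
    uint32_t magic;
    uint32_t error;
};

/* Decode a 32-bit big-endian value (assumed wire order). */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 on success, -1 on a short packet, -2 on bad magic. */
static int parse_reply(const unsigned char *buf, size_t len,
                       struct reply_hdr *out)
{
    if (len < 8)
        return -1;              /* not enough bytes for a header */
    out->magic = be32(buf);
    out->error = be32(buf + 4);
    if (out->magic != REPLY_MAGIC)
        return -2;              /* break off rather than trust the rest */
    return 0;
}
```

A buffer of zeros, which is roughly what the client above reports
receiving, fails the check with -2 before any other field is used.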


> And at this point my load shoots up to about 3 or 4, and while I can

Well, load counts processes waiting on i/o, so the number itself is not
really significant. It might be 4 client daemons struggling to complete
i/o transactions to a dead server. It might be mke2fs.

> interact somewhat with the machine, I can't seem to shut it down nicely, to
> get the load to go back down, I have to pretty much reset the machine.

echo 0 > /proc/nbdinfo. Isn't this prominent enough in the man page?


> So I'm having trouble interpreting this.  If anyone has some suggestions, or
> can point me to something to look at, I'd appreciate it.  Thanks,

It looks like a straightforward mistake of mine in the 2.4.27 server,
dying instead of returning an error on out-of-range requests. May I ask
WHAT you are serving? And why is the 2.4.26 server behaving the same
way? Are you sure it is?

Can you make sure that the device is some small size, like 8MB? If it's
as straightforward as I believe, then it's a simple case of a one-line
change in a user-level routine. What I'm surprised about is the lack of
diagnostics .... say! Are you using 2.4.26a or 2.4.26? Because 2.4.26a
shares the driver with 2.4.27. So if both behave the same way, then
the indication is that it is the driver. And I DO recall removing
a range check from the driver ...

ummm, it would be in the code that takes stuff off the kernel queue ...
do_nbd_request ... yes. It's still there!

          
        PARANOIA_BEGIN;
        if (lo->magic != NBD_DEV_MAGIC) {
            NBD_DEBUG (1, "nd%s is not magical!\n", lo->devnam);
            NBD_FAIL ("nbd[] is not magical!\n");
        }
        if (req->nr_sectors > lo->max_sectors) {
            NBD_FAIL ("oversize request\n");
        }
        PARANOIA_END;
        if (req->sector + req->nr_sectors > lo->sectors) {
            NBD_FAIL ("overrange request\n");
        }

Now I am stumped. Would you mind turning on the paranoia just above
that range check? Then nothing too big can even get out of the client.
But it should be just fine as it is, because the kernel guarantees
not to pass anything bigger than the max_sectors we registered, and in
any case mke2fs only passes overrange requests, not oversize requests.
Maybe it's an ext2fs underrun?
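
The difference between the two failures can be sketched in plain C
outside the kernel. This is an illustration rather than the driver code:
dev_limits, req_check and check_request are hypothetical names, and the
two fields merely mirror lo->sectors and lo->max_sectors from the
snippet above.

```c
#include <stdint.h>

enum req_check {
    REQ_OK,
    REQ_OVERSIZE,               /* more sectors than one request may carry */
    REQ_OVERRANGE               /* runs past the end of the device */
};

/* Hypothetical stand-in for the relevant fields of `lo`. */
struct dev_limits {
    uint64_t sectors;           /* device size in sectors */
    uint32_t max_sectors;       /* largest single request, in sectors */
};

static enum req_check check_request(const struct dev_limits *lo,
                                    uint64_t sector, uint32_t nr_sectors)
{
    if (nr_sectors > lo->max_sectors)
        return REQ_OVERSIZE;

    /* Same test as sector + nr_sectors > sectors, written as a
     * subtraction so that the sum cannot wrap around. */
    if (sector > lo->sectors || nr_sectors > lo->sectors - sector)
        return REQ_OVERRANGE;

    return REQ_OK;
}
```

With an 8MB device (16384 sectors of 512 bytes) and max_sectors set to
an arbitrary 255, a request for 8 sectors starting at sector 16380 comes
back as REQ_OVERRANGE, while a request for 256 sectors starting at
sector 0 comes back as REQ_OVERSIZE.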

Anyway, I need a bit more data AND I'll have a look at it tomorrow.

You might try 2.4.26 instead of 2.4.26a (or vice versa). Any difference
would give me a lead.


Peter