[ENBD] nbd 2.4.20 successful but with some problems...

Peter T. Breuer ptb@it.uc3m.es
Fri, 9 Mar 2001 10:45:57 +0100 (MET)


"A month of sundays ago Kai Chen wrote:"

> Thank you for your quick reply.  Yes, I was indeed using nbd-2.2.29.
> I just switched to nbd-2.4.20 (BTW, your nbd-2.4.21.tgz or nbd-current.tgz 
> has problems and cannot be unzipped correctly...  Could you please check?), 

There was a bad tar there yesterday because the file system had filled. I 
fixed it yesterday (before talking to you) so perhaps you picked it up
beforehand.

> and it worked.  However, I did encounter other problems.
> 
> On the server side, when serving large partitions (e.g., 5 GB or more), the 
> system failed (during client-side's mkfs) with:
> 
> file: Can not seek locally to offset 2149299200!
> nbd-client-server: writenet exits FAIL

Well, it means what it says. You compiled without large file support,
or your system does not have large file support. Did you run the
configure script ("make config")?
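
The failing offset, 2149299200, is just past 2^31 (2147483648), which is
where a 32-bit off_t runs out. As a rough sketch of what large file
support buys you, assuming a glibc-style system where defining
_FILE_OFFSET_BITS before the system headers widens off_t (the helper
names max_off_t, offset_fits and seek_to are made up for illustration
and are not taken from the nbd sources):

```c
/* Must come before any system header so that off_t is 64 bits wide
   on 32-bit glibc systems. This macro is an assumption about the
   build environment, not something taken from the nbd Makefile. */
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Largest value a signed off_t can hold, computed from its width. */
static long long max_off_t(void)
{
    if (sizeof(off_t) >= sizeof(long long))
        return LLONG_MAX;
    return (1LL << (sizeof(off_t) * CHAR_BIT - 1)) - 1;
}

/* Return 1 if the offset can be represented in off_t, else 0. */
static int offset_fits(long long offset)
{
    return offset >= 0 && offset <= max_off_t();
}

/* Seek fd to an absolute offset; return 0 on success, -1 on failure. */
static int seek_to(int fd, long long offset)
{
    assert(fd >= 0);
    if (!offset_fits(offset))
        return -1;              /* would overflow a 32-bit off_t */
    return lseek(fd, (off_t)offset, SEEK_SET) == (off_t)-1 ? -1 : 0;
}
```

Without the define on a 32-bit system, offset_fits(2149299200LL) would
be false and the seek would be refused, which is roughly the situation
the server message describes.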

I'm also interested to hear that the 2.4.20 code is working on the
2.2.18 kernel .. I haven't had a chance to do the regression testing on
2.2.18 since completing the port to 2.4.0 in nbd 2.4.20.

Which kernel are you actually using?

> On the client side, the system just spit out some error messages then froze 
> up completely.  The only way to fix it was to restart/reboot the machine 
> using the reset button.

This should not happen - if it receives out of range requests then they
should be rejected quite early on. But that said, I do not yet know all
the possible kernel interactions in the 2.4 kernels, if you are using
that, or perhaps I accidentally got rid of the range check in the
do_nbd_request loop. I'll look and see if the range check is still active.
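
For illustration only, a range check of the kind meant here might look
like the sketch below. This is not the code from do_nbd_request; the
function name request_in_range and its parameters are hypothetical, and
it assumes offsets and sizes are counted in bytes:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return 1 if the byte range
   [offset, offset + length) lies entirely inside a device of
   device_size bytes, else 0. The comparison is arranged so that
   offset + length is never computed and cannot wrap around. */
static int request_in_range(uint64_t offset, uint64_t length,
                            uint64_t device_size)
{
    if (offset > device_size)
        return 0;
    return length <= device_size - offset;
}
```

A request failing such a check would be errored out straight away
instead of being queued for the daemons.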

When you say "froze", are you sure that it was not just the process you
were using to write to the client device that froze? That would be
normal and natural. It should block. The rest of the system would be OK
(unless you tried a "sync", of course). You should be able to error out
the remaining requests with an "echo -n 0 > /proc/nbdinfo". Or send
USR1 to the daemons.

> When the server is serving small partitions (e.g., less than 3GB), it seemed 
> to survive during mkfs.  After making the file system, however, the client 
> side again froze up completely during a cp operation (I tried to copy the 
> entire /usr tree onto /mnt on which /dev/nda was mounted).

Are you sure that the file system is mounted sync? It really sounds as
though you are hitting VFS deadlocks. That also suggests a localhost
mount, but I suspect that one can still get a VFS deadlock occasionally
even against a remote server when the fs is not mounted sync ... this
is because the client daemon needs to write in order to free up
buffers in the kernel. That's close to a deadlock even though
everybody assures me that it's really VFS/TCP contention, which
is safe in kernel 2.2.18. If you can reproduce the deadlock I
would like to know .. and if it goes away when you use a sync mount (or
run "while sync; do : ; done" in the background!) then you know the
problem.

> And the server side gave:
> 
> nbd-server: Read returned 0 after select predicted read!
> nbd-server: Read returned 0 after select predicted read!


The client dies. The server notices.
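
The general mechanism: on a stream socket, select reports the
descriptor readable when the peer has closed, and the following read
returns 0 (end of file). A minimal sketch of detecting that, not taken
from nbd-server (the name peer_has_closed is hypothetical, and it peeks
with recv so that no data is consumed):

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_sec for fd to become
   readable. Return 1 if the peer has closed the connection,
   0 if data is pending or the wait timed out, -1 on error. */
static int peer_has_closed(int fd, int timeout_sec)
{
    fd_set readable;
    struct timeval tv;
    char byte;
    ssize_t n;
    int ready;

    FD_ZERO(&readable);
    FD_SET(fd, &readable);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    ready = select(fd + 1, &readable, NULL, NULL, &tv);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;               /* timed out, nothing to read */

    /* Readable: peek one byte without removing it from the queue. */
    n = recv(fd, &byte, 1, MSG_PEEK);
    if (n < 0)
        return -1;
    return n == 0;              /* 0 bytes after "readable" means EOF */
}
```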


Peter