[ENBD] can't write to nb device - nbd_get_req blocking req

Peter T. Breuer enbd@lists.community.tummy.com
Fri, 25 Apr 2003 01:38:30 +0200 (MET DST)


"A month of sundays ago repass@cymphony.net wrote:"
> When I originally tried to build the package using make config all I got

"make config" is not the thing to do .. you should edit the Makefile as
indicated in the README, in order to set the kernel source, then run
make (it'll do a config along the way).

> errors, but I noticed the debian directory and dpkg-buildpackage -b -uc

That does nothing other than run make.

> seemed to build both of the .deb's correctly so I installed those, and
> then rebuilt the kernel using make-kpkg after applying your patch.
> here are the build errors for make config all:

Did you edit the makefile first?  You must tell it where
the kernel source is.

> export CONFIG_SITE=/usr/local/build/nbd-2.4.31/conf/config.Linux; \
> cd /tmp;  ./configure --srcdir=/usr/local/build/nbd-2.4.31/nbd; \
> make VPATH=/usr/local/build/nbd-2.4.31/nbd \
>                 CFLAGS="-D_LARGEFILE64_SOURCE=1 -D_LARGEFILE_SOURCE=1
>                 -D_GNU_SOURCE=1 -D_XOPEN_SOURCE=1 -D_FILE_OFFSET_BITS=64
>                 -Wall -O2 \                      -I/tmp \
>                       -I/usr/local/build/nbd-2.4.31/kernel/linux/include \
>                        -DDEBUG=0" \
>                 EXTRA_LIBS="  " \
>                 config
> make[1]: Entering directory `/tmp'
> gcc -D_LARGEFILE64_SOURCE=1 -D_LARGEFILE_SOURCE=1 -D_GNU_SOURCE=1
> -D_XOPEN_SOURCE=1 -D_FILE_OFFSET_BITS=64 -Wall -O2  -I/tmp 
> -I/usr/local/build/nbd-2.4.31/kernel/linux/include  -DDEBUG=0   
> /usr/local/build/nbd-2.4.31/nbd/config.c   -o config /usr/lib/crt1.o: In

Ah!  You have a dirty /tmp.  Remove your config* and Make* stuff first!
The line above is nonsensical!  "config" is not a target file!  Clear
out /tmp before building and all will be fine.  The debian build is no
different - it simply moves into a clean tmp directory!

> here are some warnings from the kernel compilation:
> 
> enbd.c:789: warning: function declaration isn't a prototype
> enbd.c:794: warning: function declaration isn't a prototype
> enbd.c: At top level:
> enbd.c:3377: warning: function declaration isn't a prototype
> enbd.c: In function `add_blockmap':
> enbd.c:6702: warning: function declaration isn't a prototype
> enbd.c: In function `del_blockmap':
> enbd.c:6731: warning: function declaration isn't a prototype
> enbd.c: In function `nbd_set_enabled':
> enbd.c:6806: warning: function declaration isn't a prototype
> enbd.c: In function `nbd_zero_counters':
> enbd.c:6838: warning: function declaration isn't a prototype
> enbd.c: In function `nbd_proc_hotadd':
> enbd.c:6868: warning: function declaration isn't a prototype
> enbd.c: In function `nbd_proc_hotremove':
> enbd.c:6901: warning: function declaration isn't a prototype
> enbd.c: In function `nbd_proc_setfaulty':
> enbd.c:6933: warning: function declaration isn't a prototype
> enbd.c: In function `getarg':
> enbd.c:7914: warning: function declaration isn't a prototype

Not important. Changing () to (void) in the internal declarations
silences them. The only puzzle is why this doesn't match my current
code, which does not have those declarations, and hasn't had them for
a long time.
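
For what it's worth, here is a minimal sketch of what that warning is
about. With -Wstrict-prototypes, gcc complains about a declaration with
empty parentheses, because (before C23) that says nothing about the
arguments. The function names are made up for illustration and are not
taken from enbd.c.

```c
/* Old style: empty parentheses leave the arguments unspecified, so
 * gcc -Wstrict-prototypes reports
 * "function declaration isn't a prototype".
 * (hypothetical name, not from enbd.c) */
int count_old();

/* Fixed: (void) makes this a real prototype for a function that
 * takes no arguments, and the warning goes away. */
int count_new(void);

int count_old() { return 1; }
int count_new(void) { return 2; }
```

Both forms compile and behave the same when called with no arguments,
which is why the warnings are harmless here.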


> gcc -D__KERNEL__ -I/usr/src/linux-2.4.20/include -Wall -Wstrict-prototypes
> -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer
> -pipe -mpreferred-stack-boundary=2 -march=i686 -DMODULE -DMODVERSIONS
> -include /usr/src/linux-2.4.20/include/linux/modversions.h  -nostdinc
> -iwithprefix include -DKBUILD_BASENAME=enbd_ioctl  -c -o enbd_ioctl.o
> enbd_ioctl.cgcc -D__KERNEL__ -I/usr/src/linux-2.4.20/include -Wall -Wstrict-prototypes

That line is also strange ... oh - you are missing a '\n'!

> -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer
> -pipe -mpreferred-stack-boundary=2 -march=i686 -DMODULE -DMODVERSIONS
> -include /usr/src/linux-2.4.20/include/linux/modversions.h  -nostdinc
> -iwithprefix include -DKBUILD_BASENAME=enbd_bufferwr  -c -o
> enbd_bufferwr.o enbd_bufferwr.c
> -------------
> 
> and finally, here is the output from make test.  I copied the executables
> and modules to /tmp after installing the .deb's.  Everything appears to

Well done - but it would have been simpler to just clean /tmp and then
do the make!

> have worked correctly.
> -----------------

Looks fine.

> sh -c  " /tmp/enbd-test /dev/ndb -s 1M -i 1:2:3:4:5"
> /dev/ndb has 1048576 bytes in 1024 blocks of 1024 bytes each
> flushing buffers..done
> writing....5%....10%....15%....20%....25%....30%....35%....40%....45%....50%....55%....60%....65%....70%....75%....80%....85%....90%....95%....donetest 1 succeeded:  0 incorrect blocks
> flushing buffers..done
> reading....5%....10%....15%....20%....25%....30%....35%....40%....45%....50%....55%....60%....65%....70%....75%....80%....85%....90%....95%....donetest 2 succeeded:  0 incorrect blocks
> flushing buffers..done
> writing....5%....10%....15%....20%....25%....30%....35%....40%....45%....50%....55%....60%....65%....70%....75%....80%....85%....90%....95%....donetest 3 succeeded:  0 incorrect blocks
> flushing buffers..done
> reading....5%....10%....15%....20%....25%....30%....35%....40%....45%....50%....55%....60%....65%....70%....75%....80%....85%....90%....95%....donetest 4 succeeded:  0 incorrect blocks
> ioctl 1 succeeded
> ioctl 2 succeeded
> ioctl 3 succeeded
> ioctl 4 succeeded
> ioctl 5 succeeded
> echo done
> done

So if I were you I would run with -n 4!


> I was encouraged by the successful test and tried:

You should be.

> insmod enbd
> enbd-server -a 1111 /dev/hda8
> enbd-client localhost:1111 -n 4 /dev/nda

You shouldn't run the client against localhost!  Mind you, it could
work - but you would have to be very careful!

> enbd-test /dev/nda
> 
> --
> /dev/nda has 435907584 bytes in 425691 blocks of 1024 bytes each
> flushing buffers..done
> writing....5%....10%....15%....20%....25%....30%....35%....40%....45%....50%....55%....60%....65%....70%....75%....80%....85%....90%....95%....donetest 1 succeeded:  0 incorrect blocks
> flushing buffers..
> --
> which is as far as it got.  I've tried this with the client and server on
> different machines with similar results.

That's flat impossible.


> here is some debugging output:
> 
> -------
> 
> enbd-client   278: client (-1) childminder launched pid 312 (2)
> enbd-client   278: client (-1) manager launched daemon 0.3 (313) for
> localhost:1112enbd-client   278: client (-1) childminder launched pid 313 (3)
> enbd-client   313: client (3) found device /dev/nda4 ok
> enbd-server   306: server (2) opened port 1112 (socket 4) for client
> 127.0.0.1enbd-server   306: server (2) sent hello ok
> enbd-server   306: server (2) sent passwd ok
> enbd-server   306: server (2) got cliserv magic ok
> enbd-server   306: server (2) sent sig ok
> enbd-server   306: server (2) set new signal handlers for slave server 306
> enbd-client   313: client (3) opened socket 5 to localhost:1112
> enbd-client   313: client (3) read passwd ok from localhost:1112
> enbd-client   313: client (3) got cliserv magic ok from localhost:1112
> enbd-client   313: client (3) got a signature ok from localhost:1112
> enbd-client   313: client (3) begins main loop
> nbd-shmem   306: <# 278> unlock_req nobody locked req 26, erroring request!
> enbd-server   306: <#1141> do_srv_write errored request!

Oh - OK. It runs afoul of some experimental locking code. Weird though.

> nbd-shmem   317: <# 278> unlock_req nobody locked req 29, erroring request!
> enbd-server   317: <#1141> do_srv_write errored request!

Yes - well, perhaps that should be a softer error. I've never
encountered it though .. it's trying to unlock a request that nobody
owns. Maybe you could just make it complain ...

Change the lines around line 278 of nbd/shmem.c to just moan a bit ...


      if (!hdata.dptr) {
                // didn't have an entry! Error it!
                data->lock.up(&data->lock);
-               PERR("nobody locked req %d, erroring request!\n", seqno);
-               return -EINVAL;
+               PERR("nobody locked req %d!\n", seqno);
+               return 0;
      }

and hope the request leaves the cache in a bit. This really needs a bit
more fixing, but I don't see how it got in that state anyway! That's
the cache of past requests, and we're being told that we're releasing a
request that nobody has locked!
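
To show the intended behaviour in isolation, here is a rough sketch.
It assumes a plain in-memory table; the real code in nbd/shmem.c uses
its own shared-memory structures, and lock_table, sketch_lock_req and
sketch_unlock_req are names invented for this example only.

```c
#include <stdio.h>

#define MAX_LOCKED 16   /* hypothetical table size */

/* Hypothetical stand-in for the record of locked requests. */
struct lock_table {
    int seqno[MAX_LOCKED];
    int used[MAX_LOCKED];
};

/* Record that request seqno is locked.
 * Returns 0 on success, -1 if the table is full. */
static int sketch_lock_req(struct lock_table *t, int seqno)
{
    for (int i = 0; i < MAX_LOCKED; i++) {
        if (!t->used[i]) {
            t->used[i] = 1;
            t->seqno[i] = seqno;
            return 0;
        }
    }
    return -1;
}

/* Release request seqno. A missing entry is only reported, not
 * returned as an error: the softer behaviour of the patch above. */
static int sketch_unlock_req(struct lock_table *t, int seqno)
{
    for (int i = 0; i < MAX_LOCKED; i++) {
        if (t->used[i] && t->seqno[i] == seqno) {
            t->used[i] = 0;
            return 0;
        }
    }
    fprintf(stderr, "nobody locked req %d!\n", seqno);
    return 0;
}
```

Unlocking a request that was never locked now just prints the
complaint and returns 0, so the caller carries on with the request
instead of erroring it.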

> nbd-shmem   315: <# 278> unlock_req nobody locked req 31, erroring request!
> nbd-shmem   314: <# 278> unlock_req nobody locked req 34, erroring request!
> nbd-shmem   324: <# 278> unlock_req nobody locked req 37, erroring request!

Hmm .. every three!

> nbd-shmem   323: <# 278> unlock_req nobody locked req 30, erroring request!
> nbd-shmem   322: <# 278> unlock_req nobody locked req 39, erroring request!
> nbd-shmem   325: <# 278> unlock_req nobody locked req 42, erroring request!
> nbd-shmem   333: <# 278> unlock_req nobody locked req 45, erroring request!
> nbd-shmem   332: <# 278> unlock_req nobody locked req 38, erroring request!
> nbd-shmem   331: <# 278> unlock_req nobody locked req 47, erroring request!
> nbd-shmem   330: <# 278> unlock_req nobody locked req 50, erroring request!
> nbd-shmem   340: <# 278> unlock_req nobody locked req 53, erroring request!
> nbd-shmem   339: <# 278> unlock_req nobody locked req 46, erroring request!

etc.

This is the development snapshot code. I recently added a cache of
recently treated requests, so that if a request is resubmitted we can
guarantee that it is treated consistently, without race conditions and
without violating write orderings. But maybe after a timeout things can
go a little wonky. What puzzles me is the low numbers.  You wrote
thousands of requests in the test before! Why stop at 30 now?
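
The idea of that cache, very roughly, looks like the sketch below. It
is not the snapshot code: it assumes a small fixed window indexed by
sequence number and has no locking at all, and every name in it
(req_cache, cache_find, cache_treat, sample_work) is invented for the
example.

```c
#include <stddef.h>

#define CACHE_SLOTS 8   /* hypothetical size of the recent-request window */

struct cached_req {
    int valid;
    unsigned long seqno;
    int result;
};

struct req_cache {
    struct cached_req slot[CACHE_SLOTS];
    size_t next;        /* slot that will be overwritten next */
};

/* Look up a request by sequence number; NULL when it is not cached. */
static struct cached_req *cache_find(struct req_cache *c, unsigned long seqno)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        if (c->slot[i].valid && c->slot[i].seqno == seqno)
            return &c->slot[i];
    return NULL;
}

/* Treat a request. A resubmitted request gets the result recorded the
 * first time round, and do_work is not called again for it. */
static int cache_treat(struct req_cache *c, unsigned long seqno,
                       int (*do_work)(unsigned long))
{
    struct cached_req *hit = cache_find(c, seqno);
    if (hit)
        return hit->result;

    int result = do_work(seqno);
    struct cached_req *e = &c->slot[c->next];
    e->valid = 1;
    e->seqno = seqno;
    e->result = result;
    c->next = (c->next + 1) % CACHE_SLOTS;
    return result;
}

/* Example worker that counts how often it actually runs. */
static int sample_calls;
static int sample_work(unsigned long seqno)
{
    sample_calls++;
    return (int)seqno * 2;
}
```

Calling cache_treat() twice with the same sequence number runs
sample_work() once and returns the same result both times, which is
the consistency the cache is there to provide.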

Anyway - just carry on normally.


Peter