[ENBD] nbd 2.4.27 recommended kernel

Peter T. Breuer enbd@lists.community.tummy.com
Wed, 13 Mar 2002 02:54:46 +0100 (MET)


"A month of sundays ago Cicero Mota wrote:"
> Here is what I did, hope you can give some help
> 
> dd if=/dev/zero of=/tmp/delme bs=1024 count=5000000

5GB?
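
For reference, the size that dd line asks for works out as below. This is only a quick sketch of the arithmetic; the two numbers come from the quoted command and the variable names are mine.

```python
# Values taken from the quoted dd command line.
block_size = 1024      # bs=1024 (bytes per block)
count = 5_000_000      # count=5000000 (number of blocks)

total_bytes = block_size * count
print(total_bytes)                      # 5120000000
print(round(total_bytes / 10**9, 2))    # 5.12 (decimal GB)
print(round(total_bytes / 2**30, 2))    # 4.77 (GiB)
```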

> # /tmp/nbd-server 1025 /tmp/delme
> fourier:/tmp # nbd-server  2597: main server (-2) cannot find nbd-cstatd in /etc/services

You might want to do something about that. Or not.

>  # insmod /tmp/nbd.o
>  # /tmp/nbd-client localhost:1025 -n 1 /dev/nda
> 
> server output was ok, I think:

Looks fine.

> nbd-server  2608: server (-1) sent/negotiated pulse interval 10 ok
> nbd-server  2608: server (-1) agreed 1 channels ok
> nbd-server  2608: server (-1) selected free port at 1026
> nbd-server  2608: server (-1) posted port 1026 ok
> nbd-server  2608: server (-1) manager started new process group 2608
> nbd-server  2610: server (0) set default signal handlers for slave server 2610
> nbd-server  2610: server (0) opened port 1026 (socket 6) for client 127.0.0.1
> nbd-server  2610: server (0) set new signal handlers for slave server 2610
> nbd-server  2610: newproto net errored on packet. Breaking off.

Except here we get a basic communication problem.

That's possible with mismatched server/client. Did you compile both of
them on the same machine? They must match the kernel module code
and each other.

> nbd-server  2608: server (-1) set new signal handlers for session server 2608
> nbd-server  2610: slavesighandler server (0) activates slave sighandler for signal 15
> nbd-server  2610: server (0) sighandler terminates slave 2610 safely
> nbd-server  2608: server (-1) relaunches child after SIGCHLD
> nbd-server  2608: server (-1) slave pid 2610 is down, launching new
> nbd-server  2611: server (0) set default signal handlers for slave server 2611
> nbd-server  2608: server (-1) launched slave pid 2611

Well, doesn't look good.

> client output also was ok: 

> nbd-client  2606: client (-1) manager opened NBD device /dev/nda (2b00)
> nbd-client  2606: client (-1) set kernel bdflush sync boundary to 80% from 60%
> nbd-client  2606: client (-1) set kernel bdflush async boundary to 25% from 30%
> fourier:~ # nbd-client  2607: client (-1) starts introduction sequence on port 1025
> nbd-client  2607: client (-1) got session port 1026 ok
> nbd-client  2607: client (-1) introduction sequence ends ok
> nbd-client  2609: ok
> nbd-client  2609: client (0) opened socket 5 to port 1026
> nbd-client  2609: client (0) read passwd ok from port 1026
> nbd-client  2609: client (0) got cliserv magic ok from port 1026
> nbd-client  2609: client (0) got a signature ok from port 1026
> nbd-client  2609: client (0) begins main loop

Well, the other end wasn't that happy with it. Doesn't it die?

> #mke2fs /dev/nda

But what does /proc/nbdinfo say? It's pointless proceeding until the
device has been set up.

> mke2fs 1.24a (02-Sep-2001)
> Filesystem label=
> OS type: Linux
> Block size=1024 (log=0)
> Fragment size=1024 (log=0)
> 123464 inodes, 493764 blocks
> 24688 blocks (5.00%) reserved for the super user
> First data block=1
> 61 block groups
> 8192 blocks per group, 8192 fragments per group
> 2024 inodes per group
> Superblock backups stored on blocks:
>         8193, 24577, 40961, 57345, 73729, 204801, 221185, 401409
> 
> Writing inode tables: done
> Segmentation fault
> ^^^^^^^^^^^^^^^

Looks like mismatched protocols to me. But we're already at points of the
test where the results are invalid, because things never really got
set up properly. I suspect the client caused a kernel oops, and after
that all bets are off.

> Mar 12 19:41:34 fourier kernel: NBD #2765[0]: nbd_set_sock setting unsigned device nda! But harmless.
> Mar 12 19:41:34 fourier kernel: NBD #2824[0]: nbd_set_sock increased socket count to 1

That was the client registering. Note that that is the 0'th time
through that line. It never went through there again. That means that
only one client registered. So how come your server died once and
reconnected?
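
If you want to pull that counter out of a longer log, something like the following would do. It is a rough sketch: the line layout is inferred only from the two messages quoted above, the bracketed number is taken to be the times-through counter, and the regular expression and the function name parse_nbd_line are made up for illustration.

```python
import re

# Assumed layout, inferred from the two quoted messages:
#   "NBD #<tag>[<count>]: <function> <message>"
# <count> is treated as the times-through counter mentioned above.
# What <tag> stands for is not stated in this thread.
NBD_LINE = re.compile(r"NBD #(\d+)\[(\d+)\]: (\w+) (.*)")

def parse_nbd_line(line):
    """Return (tag, count, function, message), or None if the line does not match."""
    m = NBD_LINE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3), m.group(4)

sample = ("Mar 12 19:41:34 fourier kernel: NBD #2824[0]: "
          "nbd_set_sock increased socket count to 1")
print(parse_nbd_line(sample))
```

A second registration would show up as the same message with a higher count in the brackets.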

> Mar 12 19:41:34 fourier kernel: Unable to handle kernel NULL pointer 

You'd have to decode the oops for me to be sure where it's from. But
it looks to me as though this happened at registration time. Please try
to confirm that the oops happens before any packets are exchanged. If
you find it happens on exchanging packets, see if you can tell me if
it's on read or write.
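
One way to see what the module had logged before the oops is to walk the kernel log in order. The sketch below assumes syslog-style lines like the ones quoted in this mail, with the wrapped lines joined back together; the function name and the marker strings are mine. It only shows log order, so it cannot by itself tell a read from a write.

```python
OOPS_MARKER = "Unable to handle kernel"

def messages_before_oops(log_lines, module_tag="NBD #"):
    """Return the module's messages logged before the first oops.

    Returns None if the log contains no oops at all.
    """
    seen = []
    for line in log_lines:
        if OOPS_MARKER in line:
            return seen
        if module_tag in line:
            # Keep only the part after the syslog "kernel: " prefix.
            seen.append(line.split("kernel: ", 1)[-1])
    return None

# Lines taken from the log quoted in this mail, unwrapped.
log = [
    "Mar 12 19:41:34 fourier kernel: NBD #2765[0]: nbd_set_sock setting "
    "unsigned device nda! But harmless.",
    "Mar 12 19:41:34 fourier kernel: NBD #2824[0]: nbd_set_sock increased "
    "socket count to 1",
    "Mar 12 19:41:34 fourier kernel: Unable to handle kernel NULL pointer "
    "dereference at virtual address 00000a33",
]

before = messages_before_oops(log)
for message in before:
    print(message)
```

On the log as posted, only the two nbd_set_sock messages precede the oops.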

> dereference at virtual address 00000a33

0 plus 2600 or so. I don't have any structure that big.
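
The exact figure, as a quick sketch. The address comes from the oops line and the eax value from the register dump quoted further down; the comparison between the two is just arithmetic on the posted numbers, and whether it means anything is a guess until the trace is decoded.

```python
# Faulting address from the oops message.
fault_address = int("00000a33", 16)
print(fault_address)         # 2611

# eax as shown in the register dump in the same oops.
eax = int("00000a2f", 16)
print(eax)                   # 2607
print(fault_address - eax)   # 4
```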

> Mar 12 19:41:34 fourier kernel:  printing eip:
> Mar 12 19:41:34 fourier kernel: c0114528
> Mar 12 19:41:34 fourier kernel: *pde = 00000000
> Mar 12 19:41:34 fourier kernel: Oops: 0002
> Mar 12 19:41:34 fourier kernel: CPU:    0
> Mar 12 19:41:34 fourier kernel: EIP:    0010:[interruptible_sleep_on_timeout+44/104]

Could be anywhere.

> Mar 12 19:41:34 fourier kernel: EFLAGS: 00210086
> Mar 12 19:41:34 fourier kernel: eax: 00000a2f   ebx: 00200286   ecx: c4ccac28   edx: c5779f00
> Mar 12 19:41:34 fourier kernel: esi: 000003e8   edi: c4cca3b0   ebp: c5779f08   esp: c5779ef0


It's the trace part that you haven't shown that I need .. decoded.

So, in summary ... confirm that you compiled module, server and client
on the same machine, from the same package. Then check when the oops
happens .. on registration or on read or on write. That'll tell me
more.

Peter