[ENBD] 20G dd hang up!

Peter T. Breuer enbd@lists.community.tummy.com
Wed, 9 Jan 2002 16:29:18 +0100 (MET)


"A month of sundays ago Kuniyasu SUZAKI wrote:"
> I see. I attach the LOG file of "make test" which is done with remote
> machine. I used ENBD 2.4.26 and it was success.

OK. 2.4.26 or 2.4.26a (actually, the difference is only in the kernel
driver)?

> 
>   Server machine: Dell Precision 220 
>                   Pentium III 933Mhz
>                   Memory 512M
>                   100M Ether (NIC 3Com 3c905C )
>                   RedHat 6.2J
>                   Kernel 2.2.18
> 
>  Client Machine: IBM ThinkPAD X20
>                   Pentium III 600Mhz

So the server is faster than the client.  That sounds good.  Both are
single cpu.  I see you are using your laptop for testing!  Good plan.  I
do that too!  At least your home files stay safe on the server and you
can always mail for help.

I thought 3c905's tended to max out bandwidth at about 75Mb/s.

>                   Memory 320M
>                   100M Ether
>                   RedHat 6.2J
>                   Kernel 2.2.18
>                   DISK 20G

OK, well you are using kernel 2.2.18 which I haven't been testing on
lately.  That kernel (the whole 2.2 series?) was one on which I had
never been able to get some things such as request-merging working (by
the way).  There may still be issues.  I simply don't know right now.
All I can probably say about it is that I am no longer seriously testing
on it, but that I _did_ test enbd 2.4.16 (not 16a) on it.  I think I
would advise you to move up to the 2.4 series which I am beginning to
seriously like from the performance point of view.  At best your results
will be slow-ish on 2.2 kernels.

>  >>"hung up" is also not a precisely defined term! Do you mean the network
>  >>became sluggish and tcp began to time out? Or do you mean that the
>  >>client machine locked solid - i.e. deadlocked or became stuck in some
>  >>internal kernel loop?
> 
> "dd" worked about 10-20 seconds. At the time the hard disk of the
> server machine also drove.  After that the console of client machine
> was frozen. I could not use keyboard and console. The "ping" form

This kind of thing is a kernel oops. There is no other cause for it.
That is a gross error that should be tracked down and beaten.

I am pretty certain that there is no possible oops with the enbd
kernel driver and the 2.4 series kernels, but for lack of testing I
don't know precisely what the state is with respect to the 2.2 kernels.
I "believe" that it compiles! Try and get some logs. A useful trick
is to use the magic SysRq facility to put all kernel messages on the
console. It's either SysRq-0 or SysRq-9. I forget which.

> other machine couldn't get any answer.

Your kernel died. Dead. Goodbye. Gone. It should say something about
it.

May I ask what compiler you compiled with? gcc 2.7.3 2.81 2.91 2.95 2.96
3.03..?

And did you compile for SMP or not? (It doesn't matter, I just want to
know in case the answer tells me something I didn't expect.)

> On nbd-client
> # ./nbd-client macineA:5058 -n 4 -b 1024 -t 120   /dev/nda
> # dd if=/dev/zero of=/dev/nda bs=1048576 count=19539
> 
> The nbd-client and nbd-server told the following messages when the
> client machine was frozen.
> 
>   On the nbd-client
> NBD #968[0]: nbd_rollback rollback req c0272260 from slot 0!
> NBD #968[1]: nbd_rollback rollback req c02727d8 from slot 1!
> NBD #968[2]: nbd_rollback rollback req c0272378 from slot 2!
> NBD #968[3]: nbd_rollback rollback req c0272308 from slot 3!

This indicates the kernel is alive. The client was late in picking up 
its work.

> nbd-server: mainloop [RANGE! (+14858796828755968)]

Oh, that's interesting. In fact the clients sent out garbage just
before they died or stuck. They asked for nonsense. I think networking
died. 

> nbd-server: mainloop [RANGE! (+17110596642441216)]
> nbd-server: mainloop [RANGE! (+15984696735598592)]
> nbd-server: mainloop [RANGE! (+40799591842744320)]
> nbd-server: server (-1) relaunches child after SIGCHLD 
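
For what it's worth, a rough sanity check shows how far off those
RANGE! numbers are. This is only a sketch: it assumes the figure in each
RANGE! message is a byte offset, and that the exported device is about
as large as the dd target (19539 blocks of 1048576 bytes). Neither
assumption is confirmed by the logs; the variable names are made up for
the illustration.

```shell
# Assumed device size: the amount dd was asked to write (19539 x 1 MiB).
dev_bytes=$((19539 * 1048576))

out_of_range=0
# Offsets copied from the nbd-server RANGE! messages quoted above.
for off in 14858796828755968 17110596642441216 \
           15984696735598592 40799591842744320; do
    if [ "$off" -gt "$dev_bytes" ]; then
        out_of_range=$((out_of_range + 1))
        echo "$off is about $((off / dev_bytes)) times the device size"
    fi
done
echo "$out_of_range of 4 offsets lie beyond the device"
```

Under those assumptions every requested offset lies far beyond the end
of the device, which fits the reading that the requests were garbage.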

> Are there anybody who can transfer the Giga byte data with "dd"
> command via ENBD?


Well, in fact you have performed a very interesting experiment. I have
not stress tested the 2.4.16 code on kernel 2.2.18. I think you should
tell me what your test is.

Can you pick up the "nbd-test" utility from the nbd-2.4.27pre1 code
(it's a slightly more tuned version of dd) and tell me the result of:

   writing memory in 1024-byte blocks for twice the size of ram with it
   
that should be something like

   nbd-test -b 1024 -s blah_in_bytes -t 1:2 /dev/nda

this is the equivalent of writing with dd bs=1024 count=blah_in_kb,
and then reading it back after flushing caches. The reason I ask is
merely in order to control the size associated with each request. 
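
As a worked example of the sizes involved, here is a small sketch that
fills in blah_in_bytes and blah_in_kb. It assumes "twice the size of
ram" refers to the 320M client machine and that 320M means 320 * 1024 *
1024 bytes; the variable names are only for illustration. It prints the
command lines rather than running them.

```shell
ram_mb=320                                # client RAM from the report above
size_bytes=$((2 * ram_mb * 1024 * 1024))  # twice RAM, in bytes
count_kb=$((size_bytes / 1024))           # the same amount as a dd count at bs=1024

# Print the commands instead of executing them.
echo "nbd-test -b 1024 -s $size_bytes -t 1:2 /dev/nda"
echo "dd if=/dev/zero of=/dev/nda bs=1024 count=$count_kb"
```

With those assumptions the size comes to 671088640 bytes, or a dd count
of 655360.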

Alternatively, run dd with bs=1024, and make sure it reads from a slow
device. You may even like to generate its data at a controlled rate.
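
One possible way to generate data at a controlled rate is sketched
below. The throttled_zeros function is a made-up helper, not part of
enbd: it emits zeros in bursts of at most rate_kb kilobytes and sleeps
between bursts, so the average rate is only approximate.

```shell
# throttled_zeros TOTAL_KB RATE_KB [PAUSE_SECONDS]
# Write TOTAL_KB kilobytes of zeros to stdout, at most RATE_KB per burst,
# sleeping PAUSE_SECONDS (default 1) after each burst.
# RATE_KB must be positive, or the loop never finishes.
throttled_zeros() {
    total_kb=$1
    rate_kb=$2
    pause=${3:-1}
    sent=0
    while [ "$sent" -lt "$total_kb" ]; do
        chunk=$rate_kb
        left=$((total_kb - sent))
        if [ "$chunk" -gt "$left" ]; then
            chunk=$left
        fi
        dd if=/dev/zero bs=1024 count="$chunk" 2>/dev/null
        sent=$((sent + chunk))
        sleep "$pause"
    done
}

# Example (not run here): feed about 1MB/s into the device, 1024 bytes
# per write, for 655360 KB (twice 320M of RAM):
#   throttled_zeros 655360 1024 | dd of=/dev/nda bs=1024
```

Reading from a pipe, dd may see short reads, so some writes can be
smaller than 1024 bytes. For this test that should not matter much.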

Personally, I think you're stuck with some kind of unknowable memory
deadlock under 2.2.18 (between tcp and other resources). I am most
surprised ... tell me, can you slow down the processor on your TP?

This is my theory...

I rather suspect that what happens is that the processor throws stuff
at the network faster than it can handle, causing VM buffers to back up
and fill available memory, until they collide with expanding tcp
stacks from the network bottleneck at 10MB/s. At that point, to relieve
pressure the VM has to push those buffers out through nbd to the
network, which backs up tcp buffers. Bang!!!!!


Peter