[ENBD] 20G dd hang up!

Peter T. Breuer enbd@lists.community.tummy.com
Fri, 11 Jan 2002 21:54:35 +0100 (MET)


A bit more ...


"ptb wrote:"
> "Kuniyasu SUZAKI wrote:"
> OK. 2.4.26 or 2.4.26a (actually, the difference is only in the kernel
> driver)?

Yes.

> Personally, I think you're stuck with some kind of unknowable memory
> deadlock under 2.2.18 (between tcp and other resources). I am most

Or some i/o deadlock.

> surprised ... tell me, can you slow down the processor on your TP?
> 
> This is my theory...
> 
> I rather suspect that what happens is that the processor throws stuff
> at the network faster than it can handle, causing VM buffers to back up
> and fill available memory, until they collide with expanding tcp
> stacks from the network bottleneck at 10MB/s. At that point, to relieve
> pressure the VM has to push those buffers out through nbd to the
> network, which backs up tcp buffers. Bang!!!!!

The deadlock I described isn't quite right, because people who see this
deadlock seem to report that buffers are not occupying all of the
memory.  Nevertheless, there is _some_ resource that is contended for
between VM buffers and tcp stacks in 2.2.18. 

We know that because it is networking (the client, some user space
stuff, ..) that dies, not the kernel driver.  The driver just sits there
wondering why the kernel isn't giving it anything to do and why the
client daemon isn't responding like it ought to.

What could the deadlocking resource be? The i/o spinlock? How? The 
evidence is that no new requests ever arrive at nbd to be treated.
What can the kernel be doing? I suspect it's running up and down lists
of buffers crazily looking for one it can get rid of because tcp
stacks want more memory. Why doesn't it push some buffer to swap? Swap
must be nearly empty in these cases, surely? Anyway, tcp wants
stack memory, and the kernel won't release buffers, I think. Why? It
works in 2.4!

In 2.2.17 a tcp/buffer contention was found, and the kernel was changed
so that tcp always wins. In theory this is the correct thing to do, but
it doesn't help when tcp is the second guest to the party. By then the
resource is already taken.

If this theory (vague contention between tcp and VM buffers) is
correct, then there are several things that one could do which should
in theory release the pressure.


  1) instead of rolling back requests that have gone untreated for 5s,
  the driver can error them. After all, some requests treated are
  better than none. Erroring is a valid tactic. The sender can retry
  if he sees a media error. To force this behaviour, I think all you
  have to do is set the "show_errs" flag in the module:

          // PTB error too old reqs if show_errs is set, else roll them back

          if (!(atomic_read(&lo->flags) & NBD_SHOW_ERRS) || lo->aslot > 0) {
            NBD_DEBUG (1, "rollback old on device nd%s\n", lo->devnam);
            nbd_rollback_old (lo);
          } else {
            NBD_DEBUG (1, "error old on device nd%s\n", lo->devnam);
            nbd_error_old(lo);
          }
 
  and you may want to remove that "|| lo->aslot > 0" !! In fact I may.
  It prevents requests being errored if any of the daemons are still
  alive because they may still do the work for us in a moment, but in
  this case they are alive but useless.  Take it away (in nbd.c).
  
  You can set the show_errs flag by

       echo show_errs=1 > /proc/nbdinfo

  (amongst other things). Tell me if this avoids the deadlock you
  experience. Then we can take it from there.

  2) you may want to avoid using VM buffers altogether. This is easy in
  the 2.4 kernels, I believe, because all you have to do is

     /sbin/raw /dev/raw1 /dev/nda

  (for example). At least I THINK that is all you have to do. If
  somebody can confirm, that would be good.

  In the 2.2 kernels, you may have to apply a patch to get raw i/o.
  Then you will have to figure out how to make it work! At least the
  Oracle people know how to do that. I think the patch exists, and it
  is probably worth searching for.

  No buffer use implies no memory contention.


  3) you may want to change the stream.c code in the nbd directory to
  use UDP instead of TCP, which would avoid TCP buffer growth.
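
To make the effect of the "|| lo->aslot > 0" clause in (1) concrete,
here is a small user-space model of the decision only. It is a sketch,
not driver code: the enum and the two function names are made up for
illustration, and I am assuming (as described in (1)) that aslot > 0
means at least one daemon is still alive.

```c
#include <stdbool.h>

/* Hypothetical model of the choice made for too-old requests.
 * Only "show_errs" and "aslot" come from the driver snippet in (1);
 * every other name here is illustrative. */
enum old_req_action { ROLLBACK_OLD, ERROR_OLD };

/* Behaviour of the quoted snippet: error old requests only when
 * show_errs is set AND no daemon is alive (aslot == 0). */
static enum old_req_action
decide_current(bool show_errs, int aslot)
{
    if (!show_errs || aslot > 0)
        return ROLLBACK_OLD;
    return ERROR_OLD;
}

/* Behaviour with "|| lo->aslot > 0" removed: show_errs alone
 * decides, so live-but-useless daemons no longer block erroring. */
static enum old_req_action
decide_proposed(bool show_errs, int aslot)
{
    (void) aslot;               /* deliberately ignored */
    if (!show_errs)
        return ROLLBACK_OLD;
    return ERROR_OLD;
}
```

With show_errs set and two daemons alive, decide_current(true, 2)
gives ROLLBACK_OLD while decide_proposed(true, 2) gives ERROR_OLD.
That is exactly the stuck case: the daemons are alive but useless,
and only the second form lets the old requests be errored.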
  
That's all I can think of for the moment. I would of course be grateful
if you could identify the cause of your lockup more precisely than I
have been able to indicate above. 

Personally, I'd try a 2.4 kernel and see if that fixes it. If it does,
that would confirm the hypothesis that the 2.2 kernel is the problem,
because the driver code would still be the same.


Peter