[ENBD] status report

Peter T. Breuer ptb@it.uc3m.es
Wed, 1 Nov 2000 16:51:21 +0100 (MET)


I'm tempted to issue a status report just to indicate that I'm not
doing nothing :-).

Following several reports of difficulties with writing large files, I
have been able to reproduce the problem. It appears to be a VFS/tcp
problem. VFS fills memory with dirty buffers, tcp needs memory for its
own buffers in order to empty them to the net, bang! Deadlock.

The problem appears on reasonably fast machines with a filesystem
mounted async on the NBD device. They can fill memory faster
from disk than they can empty it to the net.
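To make the mechanism concrete, here is a toy user-space model of the
race. It is only a sketch under my own assumptions, not kernel code:
the function name, the page counts and the "tcp_reserve" parameter
(free pages tcp is assumed to need before it can send anything) are all
invented for illustration.

```python
def simulate(total_pages, fill_rate, drain_rate, tcp_reserve, file_pages):
    """Toy model: return ("done", ticks) or ("deadlock", ticks).

    All parameters are hypothetical page counts per tick; nothing here
    is measured from a real kernel.
    """
    dirty = 0      # dirty buffers currently held in memory
    written = 0    # pages the writer has produced so far
    flushed = 0    # pages that have reached the net
    ticks = 0
    while flushed < file_pages:
        ticks += 1
        # The writer dirties pages as fast as free memory allows.
        free = total_pages - dirty
        produce = min(fill_rate, free, file_pages - written)
        dirty += produce
        written += produce
        # Assumption: tcp can only send if tcp_reserve pages are free.
        free = total_pages - dirty
        if free >= tcp_reserve:
            sent = min(drain_rate, dirty)
            dirty -= sent
            flushed += sent
        elif produce == 0:
            # Nothing can be written and nothing can be sent: stuck.
            return ("deadlock", ticks)
    return ("done", ticks)


if __name__ == "__main__":
    # Net faster than the writer: memory never fills.
    print(simulate(100, 10, 20, 5, 1000))
    # Writer faster than the net, file bigger than memory: stuck.
    print(simulate(100, 20, 5, 5, 1000))
    # Same rates, but the file fits in memory: it drains eventually.
    print(simulate(100, 20, 5, 5, 50))
```

In this model the outcome depends only on whether the writer can fill
memory before the file runs out, which is the behaviour described
above: a slow writer or a small file never gets stuck.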

I've been talking with several kernel hackers about this, and the
general opinion is that it is a tcp bug. There are lots of patches in
2.2.18-pre for this sort of thing. If someone with this problem
can try a 2.2.18-pre kernel I would be grateful to know if the symptoms
are ameliorated in any way.

That said, you can almost completely eliminate the problem by
mounting the file system -o sync and running "while sleep 1; do sync;
done" in the background for good luck.

I say "almost" because there is a flaw in VFS that will cause the same
symptom to reappear when two or more processes write to the file system
at the same time, particularly to the same file.  Then one process will
run to buffers while the other runs to disk (i.e., to nbd).  The process
that runs to buffers will fill them dirtily and quickly, which will lead
to the same problem as with an async file system, if the file is big
enough.

The file needs to be larger than RAM to evoke the symptoms.
Maybe RAM+swap. I'm not sure.

I've concentrated on making sure that the simple asynchronous case (1
writing thread) does not lead to any problems. I've investigated two
ways of syncing the VFS to the speed at which we can write to the net:

   1) "plugging" the device in the kernel while we are not ready to
      take on more work.
   2) blocking the do_request loop (that serves the kernel's call to handle
      a request list) to match the network speed.
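
The idea behind 2) can be sketched in user space with a bounded queue:
the side that submits requests blocks whenever too many are in flight,
so it can never run ahead of the sender. This is an analogy only, not
the driver code; "run_throttled", "max_in_flight" and "send_delay" are
names I have made up, and a sleep stands in for the network write.
Whether a kernel request loop is allowed to block like this is a
separate question.

```python
import queue
import threading
import time


def run_throttled(n_requests, max_in_flight, send_delay=0.001):
    """Submit n_requests through a bounded queue; return (peak, sent)."""
    pending = queue.Queue(maxsize=max_in_flight)
    sent = []

    def sender():
        # Stand-in for the slow network side.
        while True:
            req = pending.get()
            if req is None:          # sentinel: no more requests
                break
            time.sleep(send_delay)   # pretend to write to the net
            sent.append(req)

    worker = threading.Thread(target=sender)
    worker.start()

    peak = 0
    for req in range(n_requests):
        pending.put(req)             # blocks while the queue is full
        peak = max(peak, pending.qsize())
    pending.put(None)
    worker.join()
    return peak, sent


if __name__ == "__main__":
    peak, sent = run_throttled(50, 4)
    print("peak queue depth:", peak, "requests sent:", len(sent))
```

However fast the submitting loop runs, the number of queued requests
stays at or below max_in_flight, which is the matching of speeds that
2) is after.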

I've got 1) working just fine but it doesn't seem to avoid deadlock in
all cases.  Pavel Machek says his original comment in the kernel source
"cannot plug with loop and nbd" was an observation, not based on theory.
I don't see any deadlock mechanism wrt plugging.

2) works, but only on non-SMP machines. On SMP machines it locks
the machine tight. The problem seems to be any kind of delay in the
do_request loop, which I believe is executed in the "disk" task queue
context with interrupts off. I've tried semaphores and wait/signal
semantics and simple delays. All stop the machine dead.

I suppose the problem is the io_lock.  I've tried releasing it when
waiting, but, no, still deadlock.  It must be that interrupts are off
too.  OK ...  I'll try switching interrupts back on when waiting next.

Peter