[ENBD] stability issues, big red neon sign

Peter T. Breuer enbd@lists.community.tummy.com
Wed, 27 Mar 2002 20:33:15 +0100 (MET)


"A month of sundays ago Arne Wiebalck wrote:"
> On Wed, 27 Mar 2002, Peter T. Breuer wrote:
> with 4 channels 11% of the device has been written (this is shortly after
> reaching the point where the kernel starts flushing the buffers. I could
> change this point by changing the bdflush parameters)

What was the nature of the relationship between the parameters and the
time to lockup?

> I just did and tried to write to the device with nbd-test.
> First I see
> 
> > Mar 27 18:49:42 e001 kernel: NBD #2557[844]: do_nbd_request delaying NBD
> > request fn at jiffies 231815

This is OK. Did I leave the message as an NBD_ALERT and not an NBD_ERROR? The
latter only speaks three times. If so, change it to NBD_ERROR.
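
For illustration only, here is a userspace sketch of what "only speaks
three times" could look like. The names limited_error, struct limited_log
and ERROR_LIMIT are hypothetical and are not the driver's actual NBD_ERROR
macro; the assumption is simply a per-site counter that suppresses output
once a fixed limit is reached.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Assumed limit, matching "speaks three times"; hypothetical name. */
#define ERROR_LIMIT 3

/* One counter per message site (hypothetical structure). */
struct limited_log {
    int shown;   /* how many times this site has already printed */
};

/*
 * Print the message to stderr unless this site has already printed
 * ERROR_LIMIT times. Returns 1 if the message was printed, 0 if it
 * was suppressed.
 */
int limited_error(struct limited_log *site, const char *fmt, ...)
{
    va_list ap;

    assert(site != NULL && fmt != NULL);

    if (site->shown >= ERROR_LIMIT)
        return 0;            /* limit reached: stay silent */

    site->shown++;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    return 1;
}
```

A caller would keep one zero-initialized struct limited_log per message
and pass it on every call, so a message repeated on every delayed request
stops filling the log after the third occurrence.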

> > Mar 27 18:49:43 e001 kernel: NBD #2557[845]: do_nbd_request delaying NBD
> > request fn at jiffies 231877

It's OK. 

> and nbd-test is in state "D"

That's not OK. Didn't it make progress? It does here! But I'm not
testing on an SMP machine. I would imagine that it's progressing but
slowly.

> after echo "1" > /proc/nbdinfo we have
> 
> > Mar 27 18:51:56 e001 kernel: NBD #2364[17]: nbd_clr_queue unqueued 0
> > reqs

That's kind of strange. I would have imagined there were some requests
queued!

> > error (0) writing block 119402 on fd 3: Input/output error
> > nbd-test  2237: dowrite

Well, that's OK too. The device was set to error out requests.

> so the device is unusable after the very first accesses and simply

Well, I doubt that it's stuck; more likely it is simply slowed. Try adding
the patch below, so that the kernel thread can come back in more often ..

Dealing with SMP is difficult. I am careful to make sure that the
processes time out if they wait too long for the semaphore, so in
theory we are sure that the modifications can't cause deadlock. But
they cause a huge slowdown :-).
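
To make the "time out if they wait too long for the semaphore" idea
concrete, here is a minimal userspace sketch using POSIX semaphores. It is
an analogue of the idea, not the driver code: the function name
acquire_with_timeout, its parameters, and the choice of -ETIMEDOUT as the
failure value are all assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <time.h>

/* Milliseconds elapsed on the monotonic clock since *since. */
static long elapsed_ms(const struct timespec *since)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - since->tv_sec) * 1000L
         + (now.tv_nsec - since->tv_nsec) / 1000000L;
}

/*
 * Hypothetical helper: poll for the semaphore without blocking on it,
 * sleeping poll_ms between attempts, and give up after timeout_ms.
 * Returns 0 once the semaphore is held, -ETIMEDOUT if the deadline
 * passes first, or another negative errno value on unexpected failure.
 */
int acquire_with_timeout(sem_t *sem, long timeout_ms, long poll_ms)
{
    struct timespec start, nap;

    assert(sem != NULL && poll_ms > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);

    while (sem_trywait(sem) != 0) {
        if (errno != EAGAIN)
            return -errno;          /* not mere contention */
        if (elapsed_ms(&start) >= timeout_ms)
            return -ETIMEDOUT;      /* waited too long: bail out */

        /* Short sleep so the current holder can make progress. */
        nap.tv_sec = poll_ms / 1000;
        nap.tv_nsec = (poll_ms % 1000) * 1000000L;
        nanosleep(&nap, NULL);
    }
    return 0;
}
```

Because every waiter either gets the semaphore or gives up after a
bounded time, no waiter can block forever. The price is the polling
itself: each failed attempt costs at least one sleep interval, which is
the kind of slowdown described above.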

I should have tried this on my SMP machine. I've been testing on my
laptop, which is running an SMP kernel, but only has one processor, of
course.

I have a better idea about how to serialize using the existing locks.
I'll try it if this doesn't allow you to make progress either.

Peter



--- nbd-2.4.28/linux/drivers/block/nbd.c	Mon Mar 25 00:22:52 2002
+++ nbd-2.4.29/linux/drivers/block/nbd.c	Wed Mar 27 18:51:33 2002
@@ -1895,10 +1917,24 @@
 	atomic_inc (&lo->cwaiters);
 	slot->flags |= NBD_SLOT_WAITING;
 
+        // PTB release DEBUG smp sem while sleeping
+        up(&lo->sem);
 	interruptible_sleep_on_timeout (&lo->wq,
 					start_time + timeout - jiffies);
 	NBD_DEBUG (3, "(%d): I wake up %d\n", islot, atomic_read (&count));
 
+        // PTB now we have to try and regain the DEBUG smp semaphore
+        while (down_trylock(&lo->sem) != 0) {
+            // PTB serializing for DEBUG smp
+	    interruptible_sleep_on_timeout (&lo->wq, 1);
+            if (time_after(jiffies, start_time + timeout)) {
+	        slot->flags &= ~NBD_SLOT_WAITING;
+	        atomic_dec (&lo->cwaiters);
+	        result = -ETIME;
+                goto error_out;
+            }
+        }
+
 	slot->flags &= ~NBD_SLOT_WAITING;
 	atomic_dec (&lo->cwaiters);
 	atomic_inc (&count);