[ENBD] stability issues, big red neon sign
Peter T. Breuer
enbd@lists.community.tummy.com
Wed, 27 Mar 2002 20:33:15 +0100 (MET)
"A month of sundays ago Arne Wiebalck wrote:"
> On Wed, 27 Mar 2002, Peter T. Breuer wrote:
> with 4 channels 11% of the device has been written (this is shortly after
> reaching the point where the kernel starts flushing the buffers. I could
> change this point by changing the bdflush parameters)
What was the nature of the relationship between the parameters and the
time to lockup?
> I just did and tried to write to the device with nbd-test.
> First I see
>
> > Mar 27 18:49:42 e001 kernel: NBD #2557[844]: do_nbd_request delaying NBD
> > request fn at jiffies 231815
This is OK. Did I leave the message as an NBD_ALERT rather than an
NBD_ERROR? The latter only prints three times. Change it to NBD_ERROR.
> > Mar 27 18:49:43 e001 kernel: NBD #2557[845]: do_nbd_request delaying NBD
> > request fn at jiffies 231877
It's OK.
> and nbd-test is in state "D"
That's not OK. Didn't it make progress? It does here! But I'm not
testing on an SMP machine. I would imagine that it's progressing but
slowly.
> after echo "1" > /proc/nbdinfo we have
>
> > Mar 27 18:51:56 e001 kernel: NBD #2364[17]: nbd_clr_queue unqueued 0
> > reqs
That's kind of strange. I would have imagined there were some requests
queued!
> > error (0) writing block 119402 on fd 3: Input/output error
> > nbd-test 2237: dowrite
Well, that's OK too. The device was set to error out requests.
> so the device is unusable after the very first accesses and simply
Well, I doubt that it's stuck; more likely it's simply slowed. Try adding
the patch below, so that the kernel thread can come back in more often.
Dealing with SMP is difficult. I am careful to make sure that the
processes time out if they wait too long for the semaphore, so in
theory we are sure that the modifications can't cause deadlock. But
they make a huge slowdown :-).
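
For anyone who wants to see the "time out if they wait too long for the
semaphore" idea outside the kernel, here is a rough user-space sketch. It is
not the driver code: it assumes a pthread mutex in place of the driver's
semaphore and a monotonic millisecond clock in place of jiffies, and the
names now_ms and lock_with_deadline are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Milliseconds on a monotonic clock; stands in for the kernel's jiffies. */
static long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/*
 * Hypothetical helper: poll for `lock`, napping briefly between attempts,
 * and give up once `timeout_ms` has elapsed.
 * Returns 0 with the lock held, or -ETIME with the lock not held.
 */
static int lock_with_deadline(pthread_mutex_t *lock, long timeout_ms)
{
    const struct timespec nap = { 0, 1000000L }; /* 1 ms between attempts */
    long start = now_ms();

    assert(lock != NULL);
    while (pthread_mutex_trylock(lock) != 0) {
        /* Comparing elapsed time avoids trouble if the clock value wraps. */
        if (now_ms() - start > timeout_ms)
            return -ETIME;
        nanosleep(&nap, NULL);
    }
    return 0;
}
```

A caller that gets -ETIME back does not hold the lock and has to undo its own
bookkeeping before bailing out. Because a waiter can only ever give up, never
block forever, this cannot deadlock, but every contended attempt costs at
least one nap, which is where a slowdown would come from.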
I should have tried this on my SMP machine. I've been testing on my
laptop, which is running an SMP kernel, but only has one processor, of
course.
I have a better idea about how to serialize using the existing locks.
I'll try it if this doesn't allow you to make progress either.
Peter
--- nbd-2.4.28/linux/drivers/block/nbd.c Mon Mar 25 00:22:52 2002
+++ nbd-2.4.29/linux/drivers/block/nbd.c Wed Mar 27 18:51:33 2002
@@ -1895,10 +1917,24 @@
atomic_inc (&lo->cwaiters);
slot->flags |= NBD_SLOT_WAITING;
+ // PTB release DEBUG smp sem while sleeping
+ up(&lo->sem);
interruptible_sleep_on_timeout (&lo->wq,
start_time + timeout - jiffies);
NBD_DEBUG (3, "(%d): I wake up %d\n", islot, atomic_read (&count));
+ // PTB now we have to try and regain the DEBUG smp semaphore
+ while (down_trylock(&lo->sem) != 0) {
+ // PTB serializing for DEBUG smp
+ interruptible_sleep_on_timeout (&lo->wq, 1);
+ if (jiffies > start_time + timeout) {
+ slot->flags &= ~NBD_SLOT_WAITING;
+ atomic_dec (&lo->cwaiters);
+ result = -ETIME;
+ goto error_out;
+ }
+ }
+
slot->flags &= ~NBD_SLOT_WAITING;
atomic_dec (&lo->cwaiters);
atomic_inc (&count);