[ENBD] kernel message "non existant block device"
Peter T. Breuer
ptb@it.uc3m.es
Fri, 27 Oct 2000 20:44:34 +0200 (MET DST)
Would people who are having problems with NBD dying when they
do a large file copy to a file system mounted over NBD please tell
me the result of the following changes when they try them:
1) mount the file system with the sync option. I.e.
mount -o sync /dev/nda /mnt
or mount -o remount,sync /dev/nda
2) run "while sleep 1; do sync; done &" in the background
Then do your test. In theory you need to build up pressure on the
virtual file system of about ram+swap, so writing a file
of 2*(ram+swap) to the NBD system should be a good test.
For example:
cimbalo:/mnt/tmp% dd if=/dev/zero of=zeros bs=4096 count=102400 &
[1] 2602
cimbalo:/mnt/tmp% 102400+0 records in
102400+0 records out
[1] Done dd if=/dev/zero of=zeros bs=4096 count=102400
cimbalo:/mnt/tmp% df .
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/nda               4069096   3847348    201748  95% /mnt
cimbalo:/mnt/tmp% free
             total       used       free     shared    buffers     cached
Mem:        257264     253844       3420      10284     225572       6768
-/+ buffers/cache:      21504     235760
Swap:       265032       9876     255156
cimbalo:/mnt/tmp% ls -l zeros
-rw-r--r-- 1 ptb 419430400 Oct 27 20:00 zeros
Do this test with only one daemon running. I did it to localhost to
double the pressure. Try (1) alone, (2) alone, and (1)+(2) together, and
tell me what results you get. I believe (1)+(2) should always work on a
machine that takes longer than 1s to read more than the total of ram plus
swap from disk into its own memory, and (1) should be enough if you write
many small files rather than one large one.
(1) on its own would be enough in all cases if only ext2fs were
synchronous at the block level rather than at the file level.
What I believe is happening is that the VFS makes the tcp network stack
explode when the stack needs more buffers to send with and VFS is full
with dirty buffers that need to be sent out via the net. Throttling
the VFS so that it doesn't fill buffers so fast when they're destined
for the network seems the best bet.
I am trying to throttle the VFS but there is no feedback mechanism in
linux that allows me to do it easily. What worked on my laptop seems
not to be good enough for my more powerful workstations, which are
fast enough overall to fill memory with dirty buffers before I can
react, and if I react then they start dying from interrupt starvation
while I hold off the kernel block subsystems.
One should observe the same problems in NFS over tcp, by the way (and
swapping over tcp :-). If not there, then in a loopback mount of a file
that is on an NFS over TCP partition.
I'll explain what I've tried after a small repeat of the explanation
below:
"A month of sundays ago Peter T. Breuer wrote:"
> "A month of sundays ago pbmonday@imation.com wrote:"
> A lock up cannot happen (if it does, it's a "kernel problem").
>
> Consider: IF you run a single daemon then all accesses to the device
> are serialized through it. The sequence goes: send ack send ack send
> ack. This happens under all circumstances. Changing the size of the
> file does not change what the driver does. It just gets a request from
> the kernel for a SINGLE BLOCK, places it on the internal queue, sends it
> to the client daemon, waits for an ack, tells the kernel that buffer is
> now free. Repeat. That's all.
> If I guess rightly as to what happens on your machine, your problem
> will disappear if you run "while sleep 1; do sync; done" in the
> background. I surmise that your machine locks when all of available ram
> is filled with dirty buffers that haven't yet been transferred to the net,
> and that when one of them is transferred by NBD, more kernel memory
> is used for the task, causing the lockup. You will also avoid the
> problem by mounting the FS -o sync. And of course you will avoid it by
> mounting it readonly.
OK I've looked at kernel "plugging" and I'll look at it again. It
appears it is a mechanism implemented deep in the block layers and
I don't get much chance at it in 2.2. In kernel 2.4 it is more
accessible. But in both kernels, loop and NBD are specifically excepted
from the plugging mechanism, with a kernel note that it'll make them
deadlock. I don't see the deadlock mechanism, but I agree that in general
an unsynchronized attempt to control a deadlock only leads to the deadlock
moving or happening even earlier.
Let me explain: there are three zones of memory involved:
a) kernel/nbd general request queue. The kernel fills this queue
under pressure from the VFS and then calls nbd to treat it.
(you see it marked under KThreads in the nbdinfo output).
b) nbd private queue. When the kernel tells NBD it has something on
queue (a) we swipe it as fast as possible and put it on this
queue and tell the kernel we're done.
c) zone in which the daemons pick up requests. They take them
off queue (b) and put them here. They stay here until they're
acked, when we tell the kernel that the associated buffer is now
free and destroy the request. If not acked they are "rolled back"
to queue (b).
I have tried doing three things this morning with respect to queue (a):
i) nothing. Ignore the kernel's call when we think queue (b) is too
full and we've got enough work to be going on with
ii) block the kernel until we're ready to treat the new request
(i.e. queue (b) shrinks).
iii) take the request
iii) is what happens usually. On a fast cpu the VFS buildup is too
great to sustain continuously. Solution: speed up the net, slow down
the cpu, buy more memory, fix the VFS/net deadlock.
ii) seems to work fine on my laptop but drops my workstation dead in
its tracks. Maybe it's an SMP thing.
i) causes the kernel to lose the request, very happily. No error and
everyone happy, but no result. Apparently we are supposed to react
fast and take the request. I don't know what happens to it. The code
pathways are obscure. Either it's a memory leak or the kernel notices
we missed it and eats it.
I am experimenting with ii) by reacting at about the rate that the
net is taking the requests, which should have the synchronizing effect
I want. It works fine on my laptop. If anyone wants to try it, add
to nbd_do_req after blk_dev_dequeue ..; NBD_spin_unlock ... in the
normal sequence within the loop:
    if (throttle) {
        /* PTB try throttling here! (A bit late, but better than
         * never) */
        int retry_count = 0;
        while ((req->cmd & 0x03) == WRITE
               && lo->count >= throttle + lo->aslot) {
            if (retry_count++ > throttle * HZ) {
                retry_count = 0;
                NBD_FAIL ("nbd reaction timeout in throttle control");
            }
            interruptible_sleep_on_timeout(&lo->wq, 1);
        }
    }
and then after every occurrence elsewhere in the code of lo->count--
(the number of requests on queue (b)), add
wake_up_interruptible(& lo->wq);
and set "throttle" to some positive value. It will delay the kernel
every time queue (b) grows by the set amount above the number of
available daemons. There is in any case a limit of 42 imposed by the
kernel through another mechanism .. it may be worth cutting it further
- it's the following line in ll_rw_blk.c make_request:
    if ((major == LOOP_MAJOR) || (major == NBD_MAJOR))
        max_req >>= 1;
Change the 1 to 2, which quarters max_req instead of halving it. I can't
access the symbol from the driver itself.
They should have made it settable per device.
Peter