[ENBD] stability issues
Peter T. Breuer
enbd@lists.community.tummy.com
Fri, 22 Mar 2002 13:25:56 +0100 (MET)
"A month of sundays ago Arne Wiebalck wrote:"
> did anyone ever try to test an nbd device, with nbd-test for instance,
> for 24 hours with _permanent_ writing and reading, e.g. something similar
Not me, at least! That kind of punishment is enormous.
> to
>
> while true; do nbd-test /dev/nda -b 1024 (-y) -t 1:2:3:4; done ?
>
> I tried to do this with 2.4.26a and 2.4.27 in the last days with an nbd
> device of 1 GB in several configs (sync, -y option, different number of
> channels, ...) and up to now none of my tests survived more than 9-10
> hours.
Can you identify the conditions of failure? As you know, what the
driver does is merely "the same thing, again and again", so it's hard
to imagine bugs that don't manifest _every_ time you do something.
Nevertheless, they can exist - their being hard to imagine is one reason
why they might exist! In particular I notice that the kernel has on
occasion supplied requests to the driver with missing buffers (req->bh =
NULL) when under heavy memory pressure. Does widening the distance
between the 1st and 7th entries in /proc/sys/vm/bdflush help? The 1st
entry marks the onset of buffer transfers to disk in terms of % memory
pressure. The 7th entry marks the switch to synchronous buffer
treatment, which is bad. I don't see why those shouldn't be 0% and 100%
for nbd, but 25% and 95% is probably very safe.
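A minimal sketch of that bdflush change, for anyone who wants to try it.
The helper name retune_bdflush is made up for illustration, and the
number of fields in /proc/sys/vm/bdflush varies between kernel versions,
so the function only touches fields 1 and 7 and passes the rest through
unchanged. Look at what your kernel reports before writing anything back.

```shell
# Hypothetical helper: rewrite fields 1 and 7 of a bdflush parameter line,
# leaving every other field as it was.
# Usage: retune_bdflush "<current line>" <first> <seventh>
# Fails (non-zero status, no output) if the line has fewer than 7 fields.
retune_bdflush() {
    echo "$1" | awk -v first="$2" -v seventh="$3" '
        NF < 7 { exit 1 }
        { $1 = first; $7 = seventh; print }'
}

# Intended use on a 2.4 kernel, as root (not executed here):
#   cat /proc/sys/vm/bdflush                                    # note the old values
#   new=$(retune_bdflush "$(cat /proc/sys/vm/bdflush)" 25 95) &&
#       echo "$new" > /proc/sys/vm/bdflush
```

Keeping the old line around makes it easy to restore the defaults after
the test run.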
> so, if anyone succeeded in doing such a test with success, please let me
> know your configuration and what tests you did, so that I can blame my
> hardware or the vm system or anything else ...
Umm ... I doubt that I own a SCSI system that would survive such a test
for 24 hours without a hardware timeout.
Can you identify commonalities in the breakdowns? Such as "always on
read" or "always on write". Or always on SMP and never on UP.
Concentrate on 1 channel.
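To help pin down when and how a run dies, here is a rough sketch of a
wrapper that stops at the first failing pass and timestamps every pass.
The function name run_until_failure is made up, and it assumes that
nbd-test returns a non-zero exit status when a test fails, which I have
not checked. It will not catch a hang, only a failing exit.

```shell
# Hypothetical helper: run a command repeatedly until it fails or until
# max_runs passes have completed. Prints one timestamped line per pass and
# returns the exit status of the failing pass (0 if none failed).
# Usage: run_until_failure <max_runs> <command> [args...]
run_until_failure() {
    max_runs=$1
    shift
    run=0
    while [ "$run" -lt "$max_runs" ]; do
        run=$((run + 1))
        "$@"
        status=$?
        echo "$(date '+%Y-%m-%d %H:%M:%S') run=$run status=$status"
        [ "$status" -eq 0 ] || return "$status"
    done
    return 0
}

# Intended use, one channel at a time (not executed here):
#   run_until_failure 100000 nbd-test /dev/nda -b 1024 -t 1:2:3:4 | tee nbd-test.log
```

Comparing the last few lines of such a log across several failed runs
should show whether the breakdowns share a test number or an elapsed time.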
BTW, if you run the test with -y, every write is followed by a sync on
the device. That makes it impossible for the kernel to hold more than
one outstanding write for the device at a time, which makes imagining a
bug even harder!
> btw, I also see instabilities when using SCI, at least with more than 1
> channel. so I think I can't blame tcp/ip in this case ...
If it doesn't occur with one channel and tcp/ip, and it does
with one channel and sci, then it's dubious.
What I would like you to do is to eliminate the VM buffer system as
a possible cause of problems. Can you do
    /sbin/raw /dev/raw1 /dev/hda
(for example)
and then perform the nbd-test suite against /dev/raw1? This is
a character device, but it understands read and write. You may have to
adapt nbd-test slightly - I'll help (or do it).
The "raw" utility comes from util-linux (2.11b in my case).
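Roughly, the sequence would look like the sketch below. It only prints
the commands unless you set RUN to the empty string, because the test
writes to the target and the block device should be one whose contents
you can afford to lose. The variable names and the function name
bind_and_test are made up; the nbd-test flags are copied from the loop
quoted above, and nbd-test itself may still need the adaptation
mentioned above before it works on a raw device.

```shell
# RUN defaults to "echo", so the commands are printed rather than executed.
# Set RUN= (empty) and run as root to execute them for real.
RUN=${RUN-echo}
RAWDEV=${RAWDEV:-/dev/raw1}   # raw character device to bind
BLKDEV=${BLKDEV:-/dev/hda}    # block device from the example; pick a scratch one

# Hypothetical helper: bind the raw device, confirm the binding, run the test.
bind_and_test() {
    $RUN /sbin/raw "$RAWDEV" "$BLKDEV" || return 1    # create the binding
    $RUN /sbin/raw -q "$RAWDEV" || return 1            # query it back
    $RUN nbd-test "$RAWDEV" -b 1024 -t 1:2:3:4        # same flags as the quoted loop
}

# Dry run:   bind_and_test
# Real run:  RUN= BLKDEV=/dev/hda sh this-script   (then call bind_and_test)
```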
Peter