[ENBD] (no subject)

Peter T. Breuer enbd@lists.community.tummy.com
Thu, 11 Apr 2002 19:00:15 +0200 (MET DST)


"A month of sundays ago HUDZIA Bertrand wrote:"
> On Thu, 11 Apr 2002 13:01:26 +0200 (MET DST)
> "Peter T. Breuer" <ptb@it.uc3m.es> wrote:
> Ok , i made twin clean install on both pc.
> 
> They both compiled well enbd ;)

OK. The universe has returned to sanity.

> I got Spectrum. But as i may have told u, i got it also at the beginning of the other.

Something is wrong if things start disappearing. Deeply wrong. It's
a simple kprintf!

> > 25/80 is the important thing. Change to 10/90 and see if it helps.
> > (makes buffers start to flush earlier, and makes the point later at
> > which the system starts to flush synchronously instead of asynchronously).
> 
> The client :
> 	[root@bd2 root]# cat /proc/sys/vm/bdflush 
> 	10      0       0       0       500     3000    90      0       0
> 
> With this setting and the merge_resuest=30 , the crash test ran beautifully ;)
> Doesn't crashed ;)

Well, what I am doing with that request is asking you to change the
kernel memory management balance so as to remove buffering.  It is my
opinion that something is slightly wrong with VM in 2.4.18, and that the
less VM interference one has the better. Unfortunately, I cannot find
a way to turn it off altogether, and I cannot find a way to tune it
_only_ for the nbd device.
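
For reference, the 25/80 to 10/90 change discussed above touches the first and seventh fields of /proc/sys/vm/bdflush (called nfract and nfract_sync in the 2.4 sysctl documentation, if memory serves). A minimal sketch of making that change while leaving the other fields alone follows. The helper name set_bdflush_fracts and its optional file argument are made up for illustration, and the file only exists on 2.4-era kernels.

```shell
#!/bin/sh
# Sketch only: rewrite fields 1 and 7 of a bdflush-style parameter line.
# set_bdflush_fracts is a hypothetical helper name, not part of enbd.
# The default path is the 2.4-era file shown in the thread.
set_bdflush_fracts() {
    async_pct=$1      # field 1: dirty percentage at which background flushing starts
    sync_pct=$2       # field 7: dirty percentage at which flushing turns synchronous
    file=${3:-/proc/sys/vm/bdflush}

    # Read the current line, swap in the two percentages, keep every other field.
    new=$(awk -v a="$async_pct" -v s="$sync_pct" \
        'NR == 1 { $1 = a; $7 = s; print }' "$file") || return 1

    # Refuse to write an empty line if the file could not be parsed.
    [ -n "$new" ] || return 1
    printf '%s\n' "$new" > "$file"
}
```

Run as root on the client, `set_bdflush_fracts 10 90` would apply the 10/90 setting, and `cat /proc/sys/vm/bdflush` should then show 10 in the first column and 90 in the seventh.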

I am also asking for fewer and bigger requests to be sent. This will
reduce the number of times that the code is exercised for a given load,
and thus reduce the probability of hitting an existing bug!

> I'll make the iozone test and the nbd-test (fullsize) tonight.
> 
> > > The traffic was not very huge, it was jumping from 0 to 800K. I don't
> > > have the Spectrum value here, dunno why.
> > 
> > It's very peculiar. It always used to appear here! There is definitely
> > something wrong, because things ought not to be like that, but I don't
> > know what .. what is your compiler?
> > 
> 
> First try at enbd was compiled with :
> 	[root@bd3 root]# gcc -v
> 	Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs
> 	gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)

I was afraid of that. I do not trust that compiler one inch. If you
could use gcc 2.95 I would be much happier, immediately. At least
make sure you are compiling with at most -O2.

> Second try (working with option , passing test tonight :=) :
> 	[root@bd1 root]# gcc -v
> 	Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs
> 	gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)
> 
> Well , they are same ;)
> 
> > I've never tried iozone. I'd prefer it if you could reproduce the
> > problem with nbd-test first. If it turns out to be peculiar to iozone,
> > then I'd begin to suspect an ioctl.
> 
> The crash test i made was first with iozone, but i made it crash faster with a
> kernel detar // a make of the kernel -j 10.

Oh ..  untarring large tars is known to hit a kernel VM bug that is
not connected with nbd.  AFAIR it makes the memory become very
unresponsive as some internal routine desperately searches for free
pages.  The resultant slowness probably made the server give up on the
client.  Then the client didn't have anywhere to send the requests ..
this situation is OK, but normally the client blocks the device in those
circumstances, because it expects reconnects to happen, and I suspect
with the VM balance thrown to hell, the kernel cannot get enough memory
to make a new tcp connect, etc.  ...

.. you might want to use the module option "show_errs=1". Or echo that
to /proc/nbdinfo. This will make the driver start erroring back to the
kernel when things go wrong, instead of sitting there blocking. This
should release enough requests to get you out of jail.

It will ruin your test, however. But I think your test is ruined by VM
behaviour, not by the device itself.
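
The runtime form of the show_errs switch mentioned above (echoing the setting to /proc/nbdinfo) might look like the sketch below. The helper name enable_show_errs and its optional path argument are hypothetical; only the "show_errs=1" string and the /proc/nbdinfo path come from this message.

```shell
#!/bin/sh
# Sketch only: ask the driver to error requests back to the kernel
# instead of blocking. enable_show_errs is a hypothetical helper name.
enable_show_errs() {
    ctl=${1:-/proc/nbdinfo}

    # Bail out early if the control file is missing or not writable
    # (module not loaded, or not running as root).
    if [ ! -w "$ctl" ]; then
        echo "cannot write to $ctl" >&2
        return 1
    fi
    echo "show_errs=1" > "$ctl"
}
```

Calling `enable_show_errs` with no argument writes to /proc/nbdinfo. The same setting given as a module option at load time avoids the need to do this by hand.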

> Ok, so i'll wait for the result tomorrow morning and tell u ;)

I'm also surprised by your kernel compile test (I've compiled kernels
on nbd devices!). Much of this tends to convince me that you do not have
very much memory, or else are exerting huge memory pressure on your
kernel by other means, and it is responding badly. This affects
nbd (and probably other networking and other things too!), because
it has a userspace client. Yes, by all means push the client priority
up (but it won't help, because it'll be blocked in i/o, which already
has the highest priority) and lock the client in memory with "-s".

Other things you could try ..

  1) apply the -aa memory management patches to the kernel
  2) use a pre-2.4.10 kernel (I like 2.4.9)
  3) try a RH kernel, which may be different enough to be, well,
     different.
  4) run with "-a" on the client. This is experimental. It means
     that it trusts the medium not to fail, and will ack the kernel
     before it gets an ack from the server..

Peter