[ENBD] stability issues, big red neon sign

Peter T. Breuer enbd@lists.community.tummy.com
Wed, 27 Mar 2002 11:45:16 +0100 (MET)


"A month of sundays ago Arne Wiebalck wrote:"
> I am now very sure that the instabilities I saw are due to an smp problem
> somewhere:

OK.

>  nbd over TCP/IP with 4 channels on a UP kernel has been stable under a
>  24-hour-"nbd-test".
> 
>  nbd over TCP/IP with 1 channel on an smp machine with maxcpus=1 as the
>  boot option also seems stable under the same test conditions as above

Just to be sure ... "maxcpus=1" should mean that the ID of the second
cpu can be at most "1", which means that the kernel should still be
running SMP under normal conditions, since the cpus are numbered "0"
and "1". (Some motherboards number the cpus strangely, e.g. "0" and
"15", in which case maxcpus=1 would indeed be effective in causing the
kernel to go UP.)

I believe that "maxcpus=0" (not "maxcpus=1") is equivalent to "nosmp".
But the documentation may be out of date!

So, can you resolve my confusion? Are you running with two cpus active
or not? /proc/cpuinfo should tell you.
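
A rough way to make that check, as a sketch only: count the "processor"
entries in /proc/cpuinfo. This assumes the usual x86 layout with one
"processor : N" line per active cpu; the function name count_active_cpus
is made up for illustration and is not part of the nbd tools.

```python
import os


def count_active_cpus(cpuinfo_text):
    """Count 'processor : N' entries in /proc/cpuinfo-style text.

    Assumes the x86 layout, where each active cpu gets its own
    'processor' line. Other architectures may format this differently.
    """
    count = 0
    for line in cpuinfo_text.splitlines():
        # Split on the first colon only; the key is left of it.
        key, sep, _value = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


if __name__ == "__main__":
    path = "/proc/cpuinfo"
    if os.path.exists(path):
        with open(path) as f:
            n = count_active_cpus(f.read())
        # More than one entry means a second cpu is really active.
        print(n, "cpu(s) active:", "SMP" if n > 1 else "effectively UP")
```

If this reports 2 while booted with maxcpus=1, then the second cpu is
still active and the test was not a UP test after all.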

>  (nbd over SCI with 4 channels on a UP kernel also survives - it
>   did not on an smp machine - but 3 out of 4 connections are lost and
>   couldn't be reconnected. so the last connection transferred the

This looks like segfaults are causing the deaths, and the driver then
can't recover in the normal way, due to the lack of reconnection under
SCI.

You may want to look at the "lives" info in the State line in
/proc/nbdinfo.  This should tell you the number of successful reconnects
over time.  If you see lots of reconnects, then that is an indication
that something is wrong and it is being masked by the recovery
mechanisms.

>   data alone, I think I have to work on my reconnection mechanism ...)
> 
> currently I am testing an smp kernel with maxcpus=1 and 4 channels over
> TCP/IP. but I am quite sure this will also be stable.

Again, I'm not sure if you mean maxcpus=0.

> peter, if you provide code diffs to serialize the driver with semaphores I
> would be glad to apply these patches to my setup and see whether we can
> determine what causes the instabilities.

I'll do that as soon as you resolve my minor confusion above!

Peter