[ENBD] 2.4.26a crashes
Peter T. Breuer
enbd@lists.community.tummy.com
Mon, 7 Jan 2002 18:56:34 +0100 (MET)
"A month of sundays ago Jerome Petazzoni wrote:"
>
> > I've had two crashes like this in the last week. [...]
> > [...] the machine locks. I can't login,
> > although I can still ping the network interfaces and I can hit enter on
> > the screen and it scrolls.
>
> that might mean a deadlock or a similar error condition inside the kernel.
> when I was playing with ATM, I had the same problem on my SMP box : the
> machine pings, the console echoes my keystrokes, but all processes are
> completely screwed. If I try to telnet to an open port (which was
This is slightly different .. his console is OK (-ish), but not his
login via ssh. I suspect on console he couldn't _start_ new processes,
but could run existing ones, like his shell. That, if true, hints
strongly at the generic kernel i/o lock being taken. Or at least some
generic kernel lock.
Enbd does _not_ take the i/o lock in any code that can sleep. I am
absolutely positive on that (you can check by scanning the driver code
and seeing what there is between successive down and up calls on
io_request_lock!). And it does _not_ take any lock apart from the i/o
lock and its own unique lock.
So my opinion is that if there is a generic lock held, it is not in the
enbd driver.
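The audit described above (look at what sits between taking and
releasing the i/o lock, and make sure nothing there can sleep) can also
be done by instrumentation. What follows is only a userspace sketch:
io_lock(), io_unlock(), might_sleep_here() and the two example paths
are hypothetical names invented for illustration, not enbd or kernel
code, and a plain flag stands in for io_request_lock.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical bookkeeping: a flag standing in for io_request_lock. */
static int io_lock_held;      /* 1 while the stand-in i/o lock is held */
static int sleep_violations;  /* sleeps attempted while holding it */

void io_lock(void)   { assert(!io_lock_held); io_lock_held = 1; }
void io_unlock(void) { assert(io_lock_held);  io_lock_held = 0; }

/* Call at the top of anything that may block (network send, alloc, ...). */
void might_sleep_here(const char *where)
{
    if (io_lock_held) {
        sleep_violations++;
        fprintf(stderr, "sleep under i/o lock at %s\n", where);
    }
}

/* Pattern claimed for the driver: only short queue work under the lock. */
void good_path(void)
{
    io_lock();
    /* non-blocking queue manipulation only */
    io_unlock();
    might_sleep_here("good_path: network send");
}

/* The pattern being ruled out: a blocking call inside the locked region. */
void bad_path(void)
{
    io_lock();
    might_sleep_here("bad_path: network send");
    io_unlock();
}
```

Calling good_path() leaves sleep_violations at zero; calling bad_path()
bumps it and prints where the offending call sits.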
> running a service before the crash), the TCP handshake starts but never
> completes. In the ATM case, it was a mutual exclusion problem.
Exactly. This can foul up networking no end.
> If you are running an SMP kernel, try running a UP kernel (recompile it,
> so spinlocks and so on are completely disabled). Otherwise, the eepro100
This is also a good diagnostic. But _I_ am running enbd on a dual
processor machine, with the eepro100 driver! Nevertheless, I have seen
eepro's (particularly the "A" revision) lock up frequently on
uniprocessor under 2.2 kernels, and earlier. I have machines where there are scripts to
unload and reload the eepro100 module if we lose contact with the
router, and to do it every hour anyway, for good luck. You can get
messages like "Tx ring full" at such times. The card just stops because
of a timing issue (I believe).
> driver might be buggy, but I suspect you would have seen other effects...
> my wild, random guess : deadlock somewhere within enbd, perhaps when it
> tries to "reschedule" an I/O operation to another channel ? but it's
It doesn't put the request back on the kernel i/o queue but on an
internal queue (under a spinlock for the queue ops).
All I can think of concerning enbd itself is that _if_ the network
locks up then it _may_ make things worse when it times out, because
it will, after trying everything else, try and comb the generic
kernel i/o queue for requests directed at it, and vamoosh those too.
For that it takes the i/o lock .. it's the nbd_clr_kernel_queue()
function. It takes the io spinlock, turns off irqs, and spins on the
head of the queue, releasing the lock and scheduling itself after every
op, until the queue is empty.
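The shape of that loop can be sketched in userspace. This is not the
nbd_clr_kernel_queue() source: struct req, struct queue, clear_one()
and clear_queue() are hypothetical names, a pthread mutex stands in for
the io spinlock, sched_yield() stands in for scheduling, and turning
off irqs has no userspace equivalent so it is left out. The point it
shows is that the lock is held for one head-of-queue operation only and
is never held across the yield.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical request node; the real driver works on kernel requests. */
struct req {
    struct req *next;
    int errored;            /* set when the request is failed */
};

struct queue {
    pthread_mutex_t lock;   /* stand-in for the io spinlock */
    struct req *head;
};

/* Fail and unlink the request at the head. Returns 0 if the queue was empty. */
int clear_one(struct queue *q)
{
    struct req *r;

    pthread_mutex_lock(&q->lock);
    r = q->head;
    if (r != NULL) {
        q->head = r->next;
        r->next = NULL;
        r->errored = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return r != NULL;
}

/* Drain the queue one op at a time, yielding between ops with the lock
 * released. Returns the number of requests cleared. */
int clear_queue(struct queue *q)
{
    int n = 0;

    while (clear_one(q)) {
        n++;
        sched_yield();      /* the lock is not held here */
    }
    return n;
}
```

Because the lock is dropped before every yield, other users of the
queue can get in between ops, which is why a loop of this shape should
not by itself hold a generic lock for long.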
The only other code that has anything to do with the io spinlock is the
generic enbd do_nbd_request function called by the kernel when it wants
to dump the queue on us, which runs _under_ the io spinlock, which is
held by the kernel at that time, and the enbd code doesn't release it.
I would be very doubtful that the driver code can lock. But there might
be more subtle deadlocks, between tcp buffers and system memory needs, for
example.
> just a very wild guess.
It's a viable guess. But first one has to find out exactly what kind
of a locking condition he has, and that needs some research into what
is working and what is not. Can he launch processes? Run df? Use
tcp? Udp?
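Those four questions can be turned into small probes. A minimal sketch,
assuming a POSIX system and IPv4; the function names probe_fork(),
probe_fs(), probe_tcp() and probe_udp() are made up for illustration,
and the address and port to test against are left to the caller. Each
returns 1 on success and 0 on failure. On a wedged machine a probe may
simply hang instead of returning, and which one hangs is itself the
answer.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/statvfs.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Can a new process still be created and reaped? */
int probe_fork(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return 0;
    if (pid == 0)
        _exit(0);                       /* child: exit at once */
    if (waitpid(pid, &status, 0) != pid)
        return 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* Roughly what df asks of each mounted filesystem. */
int probe_fs(const char *path)
{
    struct statvfs sv;

    return statvfs(path, &sv) == 0;
}

/* Fill an IPv4 socket address; returns 0 if ip does not parse. */
static int fill_addr(struct sockaddr_in *sa, const char *ip, unsigned short port)
{
    memset(sa, 0, sizeof *sa);
    sa->sin_family = AF_INET;
    sa->sin_port = htons(port);
    return inet_pton(AF_INET, ip, &sa->sin_addr) == 1;
}

/* Does a TCP handshake to ip:port complete? */
int probe_tcp(const char *ip, unsigned short port)
{
    struct sockaddr_in sa;
    int fd, ok;

    if (!fill_addr(&sa, ip, port))
        return 0;
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;
    ok = connect(fd, (struct sockaddr *)&sa, sizeof sa) == 0;
    close(fd);
    return ok;
}

/* Can a one-byte UDP datagram be handed to the stack for ip:port? */
int probe_udp(const char *ip, unsigned short port)
{
    struct sockaddr_in sa;
    ssize_t n;
    int fd;

    if (!fill_addr(&sa, ip, port))
        return 0;
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return 0;
    n = sendto(fd, "x", 1, 0, (struct sockaddr *)&sa, sizeof sa);
    close(fd);
    return n == 1;
}
```

If new processes cannot be started at all, the probes would have to be
running already (from a shell that was open before the lockup) to be of
any use; failing to launch them is then the first data point.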
> regards,
> Jerome Petazzoni <skaya at enix dot org>
Peter