[ENBD] 2.4.26a crashes

Edward Muller enbd@lists.community.tummy.com
07 Jan 2002 18:23:20 -0500


On Mon, 2002-01-07 at 12:56, Peter T. Breuer wrote:
> "A month of sundays ago Jerome Petazzoni wrote:"
> > 
> > > I've had two crashes like this in the last week. [...]
> > > [...] the machine locks. I can't log in,
> > > although I can still ping the network interfaces and I can hit enter on
> > > the screen and it scrolls.
> > 
> > that might mean a deadlock or a similar error condition inside the kernel.
> > when I was playing with ATM, I had the same problem on my SMP box : the
> > machine pings, the console echoes my keystrokes, but all processes are
> > completely screwed. If I try to telnet to an open port (which was
> 
> This is slightly different ..  his console is OK (-ish), but not his
> login via ssh.  I suspect on console he couldn't _start_ new processes,
> but could run existing ones, like his shell.  That, if true, hints
> strongly at the generic kernel i/o lock being taken.  Or at least some
> generic kernel lock.

Nope, kernel is NOT OK (-ish). All I can do is echo characters. That's
it. No prompt, no nothing.

> 
> Enbd does _not_ take the i/o lock in any code that can sleep.  I am
> absolutely positive on that (you can check by scanning the driver code
> and seeing what there is between successive down and up calls on
> io_request_lock!).  And it does _not_ take any lock apart from the i/o
> lock and its own unique lock.
> 
> So my opinion is that if there is a generic lock held, it is not in the
> enbd driver.
> 
> > running a service before the crash), the TCP handshakes starts but does
> > not go to an end. In the ATM case, it was a mutual exclusion problem.
> 
> Exactly. This can foul up networking no end.
> 
> > If you are running an SMP kernel, try running a UP kernel (recompile it,
> > so spinlocks and so on are completely disabled). else, the eepro100
> 
> This is also a good diagnostic.  But _I_ am running enbd on dual
> processor, with eepro100 driver!  Nevertheless, I have seen eepro's
> (particularly "A" revision) lock up frequently on uniprocessor under
> 2.2 kernels, and earlier.  I have machines where there are scripts to
> unload and reload the eepro100 module if we lose contact with the
> router, and to do it every hour anyway, for good luck.  You can get
> messages like "Tx ring full" at such times.  The card just stops because
> of a timing issue (I believe).
> 

UP kernel, UP machine. Athlon 1.3 GHz if anyone cares.

> > driver might be buggy, but I suspect you would have seen other effects...
> > my wild, random guess : deadlock somewhere within enbd, perhaps when it
> > tries to "reschedule" an I/O operation to another channel ? but it's
> 
> It doesn't put the request back on the kernel i/o queue but on an
> internal queue (under a spinlock for the queue ops).
> 
> All I can think of concerning enbd itself is that _if_ the network
> locks up then it _may_ make things worse when it times out, because
> it will, after trying everything else, try and comb the generic
> kernel i/o queue for requests directed at it, and vamoosh those too.
> 
> For that it takes the i/o lock .. it's the nbd_clr_kernel_queue()
> function. It takes the io spinlock, turns off irqs, and spins on the
> head of queue, releasing the lock and scheduling itself after every
> op, until the queue is empty.
> 
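The drain loop described above can be sketched in user space. This is only
an illustration of the pattern (take the lock, do one operation on the head
of the queue, release the lock, yield, repeat until empty) and not the enbd
driver code: clear_queue is a hypothetical name, threading.Lock stands in
for the io spinlock, and time.sleep(0) stands in for schedule().

```python
import threading
import time
from collections import deque


def clear_queue(queue, lock):
    """Drain `queue` one request at a time, releasing `lock` between operations.

    Hypothetical user-space stand-in for the pattern described for
    nbd_clr_kernel_queue(); it is not taken from the driver source.
    """
    removed = []
    while True:
        with lock:  # stands in for taking the io spinlock with irqs off
            if not queue:
                break  # queue is empty; the lock is released on exit
            removed.append(queue.popleft())  # one operation on the head of the queue
        time.sleep(0)  # stands in for schedule(): give other threads a chance to run
    return removed


if __name__ == "__main__":
    pending = deque(["req-1", "req-2", "req-3"])
    print(clear_queue(pending, threading.Lock()))
```

Because the lock is dropped after every operation, the loop never holds it
while yielding, which is the property claimed for the driver above.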
> The only other code that has anything to do with the io spinlock is the
> generic enbd do_nbd_request function called by the kernel when it wants
> to dump the queue on us, which runs _under_ the io spinlock, which is
> held by the kernel at that time, and the enbd code doesn't release it.
> 
> I would be very doubtful that the driver code can lock. But there might
> be more subtle deadlocks, between tcp buffers and system memory needs, for
> example.
> 
> > just a very wild guess.
> 
> It's a viable guess. But first one has to find out exactly what kind
> of a locking condition he has, and that needs some research into what
> is working and what is not. Can he launch processes? Run df? Use
> TCP? UDP?

Can't launch processes, can't run df, can't use TCP ... can ping though.
That's it. Can't even log in locally to test anything. Hit power switch
and reboot.
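For next time, the "what is working and what is not" question can be
narrowed from a second machine. The following is a minimal sketch, assuming
Python is available on that machine; probe_tcp and its result strings are
made up for illustration. It separates a handshake that never completes
from a connection that opens but where the service never sends anything.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify how far a TCP connection to host:port gets (hypothetical helper)."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "no-handshake"  # SYN sent, handshake never completed
    except ConnectionRefusedError:
        return "refused"  # remote stack answered, nothing is listening
    except OSError:
        return "unreachable"  # routing or other network-level failure
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(64)  # wait for a banner, e.g. from sshd
        except socket.timeout:
            return "connected-but-silent"  # handshake done, service never spoke
        except OSError:
            return "reset"
        return "responsive" if data else "closed"


if __name__ == "__main__":
    # The address is a placeholder; substitute the locked-up host.
    print(probe_tcp("192.0.2.10", 22))
```

Note that "connected-but-silent" is only meaningful for services that speak
first, such as ssh. A service that waits for the client to send something
will look the same even on a healthy machine.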


> 
> > regards,
> > Jerome Petazzoni <skaya at enix dot org>
> 
> 
> Peter
-- 
-------------------------------
Edward Muller
Director of IS

973-715-0230 (cell)
212-487-9064 x115 (NYC)

http://www.learningpatterns.com
-------------------------------