[ENBD] 2.4.26a crashes
Peter T. Breuer
enbd@lists.community.tummy.com
Tue, 8 Jan 2002 00:49:24 +0100 (MET)
"A month of sundays ago Edward Muller wrote:"
> On Mon, 2002-01-07 at 12:56, Peter T. Breuer wrote:
> > This is slightly different .. his console is OK (-ish), but not his
> > login via ssh. I suspect on console he couldn't _start_ new processes,
> > but could run existing ones, like his shell. That, if true, hints
> > strongly at the generic kernel i/o lock being taken. Or at least some
> > generic kernel lock.
>
> Nope, kernel is NOT OK (-ish). All I can do is echo characters. That's
> it. No prompt, no nothing.
Your existing running shell cannot proceed in its ordinary execution,
but the console driver does a local echo for you ... I don't know
the code well enough to understand that situation. It sounds like a
kernel oops just happened and you are in random-land.
> > This is also a good diagnostic. But _I_ am running enbd on dual
> > processor, with eepro100 driver! nevertheless, I have seen eepro's
> > (particularly "A" revision), lock up frequently on uniprocessor under
> > 2.2 kernels, and earlier. I have machines where there are scripts to
> > unload and reload the eepro100 module if we lose contact with the
> > router, and to do it every hour anyway, for good luck. You can get
> > messages like "Tx ring full" at such times. The card just stops because
> > of a timing issue (I believe).
>
> UP kernel, UP machine. Athlon 1.3 Ghz if anyone cares.
Well, it eliminates the deadlock theory. Looks like a kernel problem
all of your very own!
> > > just a very wild guess.
> >
> > It's a viable guess. But first one has to find out exactly what kind
> > of a locking condition he has, and that needs some research into what
> > is working and what is not. Can he launch processes? Run df? Use
> > tcp? Udp?
>
> Can't launch processes, Can't run df, Can't use TCP ... Can ping though.
Pinging indicates very little. Echo replies are generated in the kernel's
interrupt path without scheduling any process, so all it shows is that
interrupts are still being serviced. You have a dead kernel. This happens
to me on several machines when scsi goes belly up, by the way (improper
termination, probably). Cured by a kick.
> That's it. Can't even log in locally to test anything. Hit power switch
> and reboot.
Have you compiled with magic SysRq (CONFIG_MAGIC_SYSRQ)? A quick task
list (Alt-SysRq-T) and killall (Alt-SysRq-E, or -I) might be useful. You
can also do a clean reboot (sync, remount read-only, reboot:
Alt-SysRq-S, -U, -B).
I think you need magic SysRq.
If you are really interested in pinning it down, compiling with the
kernel debugger patch in (kdb, from sgi.com ...) might also be fun. Then you
just hit pause and start talking to the monitor and single stepping the
code. You can usually see when you're in a spinlock or something like
that and break out of it by forcibly setting some data to zero.
There's no other way if you don't get a snapshot from syslog. Make sure
that syslog is going synchronously, btw (is that minuses or lack of
minuses in syslog.conf? I forget. But make the logfile or the log
file system synchronous).
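
On the minus question: in a sysklogd-style syslog.conf, a leading minus
on the file name means the file is NOT synced after every message, so
for crash debugging the kernel log entry should have no minus. Below is
a minimal sketch in Python for spotting the buffered targets. The
function name, the sample selectors and the paths are made up for
illustration, and it assumes the classic sysklogd format only.

```python
def async_log_targets(conf_text):
    """Return log file paths that carry a leading '-' (no sync per message)."""
    targets = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # A rule is "selector <whitespace> action".
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        action = parts[1].strip()
        # "-/path" means: write to /path without syncing after each message.
        if action.startswith("-/"):
            targets.append(action[1:])
    return targets


# Hypothetical sample entries, not taken from any real machine.
SAMPLE = """\
# kernel messages, synced after every line
kern.*      /var/log/kern.log
# everything else, buffered
*.info      -/var/log/messages
"""

if __name__ == "__main__":
    for path in async_log_targets(SAMPLE):
        print("not synced:", path)
```

Anything it reports can lose its last few lines when the box locks up,
which is exactly the part you want to read afterwards.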
Peter