[ENBD] 2.4.26a crashes

Peter T. Breuer enbd@lists.community.tummy.com
Tue, 8 Jan 2002 01:02:03 +0100 (MET)


"A month of sundays ago Edward Muller wrote:"
> heh. But still, the box locked hard. Shouldn't nbd just error and
> continue? Shouldn't md (which is what is using the ENBD devices) just
> continue?

Your later mail makes it clear that your kernel is ex, you have a dead
kernel, etc.

> > network connections (or just spray a hub with random wrong mac addresses
> > and see what happens :-). Telnet should recover first, as it is very
> > lightweight. SSh has a huge overhead. It probably revalidates you after
> > losing a packet, and makes you wait into the bargain. A lighter weight
> > solution than ssh  would be an ip/ip tunnel over cipe.
> 
> Agreed. But after the first lockup I had to travel into NYC to reboot
> the box manually (the second time I had someone on site) and it took me
> about 2 hours to get there due to the time of day and the fact that I
> could not just drop everything and head in. I still (2 hours later)
> could not get into the box, even at the console. Although I could still
> ping the interfaces.

Your later mail shows that nothing can get you out except a remote
control mechanism with its finger on the reboot switch. Try and buy
one. There must be such things in NY. You cannot recover from a dead
kernel .... wait! The nmi watchdog. You can compile in the nmi
watchdog. That will reboot the machine when it detects that irqs have
been held off or ignored for 5s.
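
A rough way to confirm the watchdog is actually armed, once the kernel
has been built with it and booted with something like nmi_watchdog=1
(that boot option is my assumption for your kernel; check
Documentation/nmi_watchdog.txt in your tree), is to watch the NMI line
in /proc/interrupts and see whether the per-CPU counters advance. The
helper names below are made up for illustration, and on some setups
the counters may only tick under load, so treat this as a sketch:

```python
import time


def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters found in /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # Stop at a trailing description, if the kernel prints one.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []  # no NMI line at all


def watchdog_ticking(before, after):
    """True if every CPU's NMI counter advanced between the two samples."""
    return (bool(before)
            and len(before) == len(after)
            and all(b < a for b, a in zip(before, after)))


def check(path="/proc/interrupts", delay=2.0):
    """Sample the NMI counters twice, `delay` seconds apart (hypothetical helper)."""
    with open(path) as f:
        first = nmi_counts(f.read())
    time.sleep(delay)
    with open(path) as f:
        second = nmi_counts(f.read())
    return watchdog_ticking(first, second)
```

Calling check() on the client after boot should return True if NMIs
are arriving on every CPU. If the counters sit still, the watchdog is
probably not enabled and will not help with a hard lockup.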

> > > Which drivers are you using? eepro100 or e100 ?
> > 
> > eepro100. Various 2.4 kernels. I didn't know the other driver existed. I
> > believe I have "B" version nics.
> 
> I think the problem with the nics is dependent on which nics you have as
> the eepro100 driver (and its e100 Intel cousin) supports various nics. I
> am using the intel Pro/100 Server (S class) nics. I think the problem
> crops up with those nics. I have a 'B' version nic in my machine at home
> and haven't had problems, ever.

I'll have a look around the dept tomorrow to see if I can get some
stats. My P3 dual testbed is running

  00:0d.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 08)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  00:0e.0 SCSI storage controller: Adaptec 7892B (rev 02)
  00:10.0 VGA compatible controller: ATI Technologies Inc 3D Rage II+ 215GTB [Mach64 GTB] (rev 9a)
  00:11.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 08)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

and I think those are the most recent I have. It's been up too long to
have any logged info from bootup. My own machine says it has eepro100
rev 5.


> > > need to change > to >> in my init script. From what I remember by
> > > reading the server logs is that the server processes restarted for some
> > > reason. But that's all I remember.
> > 
> > The reason would be that they lost contact with the client (or at least
> > timed out on it).
> 
> Again, shouldn't things just continue on since /home => /dev/md0 which
> is made up of /dev/sda8 (drive 0) and /dev/nd/a/0 (drive 1) instead of
> locking up?

I don't follow .. the server will probably give up on the client if the
network connection dies, which it does (because the client dies). When
the client comes back up it'll make contact and everything will start
up again.

The only possible interpretation of your question is that you are
saying that the server is the side that dies, and the client is the
side that lives .. surely you can't mean that? Can you clarify, please?


> > > Again I can't do anything on the client machine.
> > 
> > I think you'll find it's a networking lockup. Try using cipe with
> > telnet, or ssl telnet maybe (never tried). SSh blocks easily, by

It's a dead kernel. This implies a kernel oops in a big way. That means
it must be a driver. I don't think it's mine, but it may be. Can you
tell me some more about this box? SCSI? Integrated or not? Etc.

Peter