[ENBD] 2.4.26a crashes
Edward Muller
enbd@lists.community.tummy.com
10 Jan 2002 11:17:47 -0500
On Mon, 2002-01-07 at 19:02, Peter T. Breuer wrote:
> "A month of sundays ago Edward Muller wrote:"
> > heh. But still, the box locked hard. Shouldn't nbd just error and
> > continue? Shouldn't md (which is what is using the ENBD devices) just
> > continue?
>
> Your later mail makes it clear that your kernel is ex, you have a dead
> kernel, etc.
That's what I figured.
>
> > > network connections (or just spray a hub with random wrong mac addresses
> > > and see what happens :-). Telnet should recover first, as it is very
> > > lightweight. SSh has a huge overhead. It probably revalidates you after
> > > losing a packet, and makes you wait into the bargain. A lighter weight
> > > solution than ssh would be an ip/ip tunnel over cipe.
> >
> > Agreed. But after the first lockup I had to travel into NYC to reboot
> > the box manually (the second time I had someone on site) and it took me
> > about 2 hours to get there due to the time of day and the fact that I
> > could not just drop everything and head in. I still (2 hours later)
> > could not get into the box, even at the console. Although I could still
> > ping the interfaces.
>
> Your later mail shows that nothing can get you out except a remote
> control mechanism with its finger on the reboot switch. Try and buy
> one. There must be such things in NY. You cannot recover from a dead
> kernel .... wait! The nmi watchdog. You can compile in the nmi
> watchdog. That will reboot the machine when it detects that irqs have
> been held off or ignored for 5s.
A remote serial console/power switch box is in my plans for later this
year. I'll look into the nmi watchdog as well.
>
> > > > Which drivers are you using? eepro100 or e100 ?
> > >
> > > eepro100. Various 2.4 kernels. I didn't know the other driver existed. I
> > > believe I have "B" version nics.
> >
> > I think the problem with the nics is dependent on which nics you have as
> > the eepro100 driver (and its e100 Intel cousin) support various nics. I
> > am using the intel Pro/100 Server (S class) nics. I think the problem
> > crops up with those nics. I have a 'B' version nic in my machine at home
> > and haven't had problems, ever.
>
> I'll have a look around the dept tomorrow to see if I can get some
> stats. My P3 dual testbed is running
>
> 00:0d.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 08)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 00:0e.0 SCSI storage controller: Adaptec 7892B (rev 02)
> 00:10.0 VGA compatible controller: ATI Technologies Inc 3D Rage II+ 215GTB [Mach64 GTB] (rev 9a)
> 00:11.0 Ethernet controller: Intel Corp. 82557 [Ethernet Pro 100] (rev 08)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
00:09.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 0c)
00:0a.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 0c)
> and I think those are the most recent I have. It's been up too long to
> have any logged info from bootup. My own machine says it has eepro100
> rev 5.
>
Going by the boxes these cards came in, they are the Intel Server boards,
the 100/S models. I don't have the boxes in front of me at the moment.
>
> > > > need to change > to >> in my init script. From what I remember by
> > > > reading the server logs is that the server processes restarted for some
> > > > reason. But that's all I remember.
> > >
> > > The reason would be that they lost contact with the client (or at least
> > > timed out on it).
> >
> > Again, shouldn't things just continue on since /home => /dev/md0 which
> > is made up of /dev/sda8 (drive 0) and /dev/nd/a/0 (drive 1) instead of
> > locking up?
>
> I don't follow .. the server will probably give up on the client if the
> network connection dies, which it does (because the client dies). When
> the client comes back up it'll make contact and everything will start
> up again.
>
> The only possible interpretation of your question is that you are
> saying that the server is the side that dies, and the client is the
> side that lives .. surely you can't mean that? Can you clarify, please?
>
No, all I was saying is that the server says it lost connection with the
client. The client most likely is the one that broke the connection
because the kernel on the client locked up. At least that's the way
things are pointing now.
>
> > > > Again I can't do anything on the client machine.
> > >
> > > I think you'll find it's a networking lockup. Try using cipe with
> > > telnet, or ssl telnet maybe (never tried). SSh blocks easily, by
>
> It's a dead kernel. This implies a kernel oops in a big way. That means
> it must be a driver. I don't think it's mine, but it may be. Can you
> tell me some more about this box? Scsi? Integrated or not? Etc..
>
Epox Motherboard
UP Athlon 1.3 GHz
512 MB RAM
3ware 7808 IDE RAID card (the hardware kind, not the cheap 68XX series)
3 IDE hard drives configured as RAID5, appearing as a 150GB /dev/sda
/dev/sda is broken up into several partitions and swap. The partitions are
running EXT3 with data=journal.
P.S. Don't lecture me about putting the journal and swap on the same
drives. I can't help it. :-)
lspci output
00:00.0 Host bridge: Advanced Micro Devices [AMD]: Unknown device 700e (rev 13)
00:01.0 PCI bridge: Advanced Micro Devices [AMD]: Unknown device 700f
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super] (rev 40)
00:07.1 IDE interface: VIA Technologies, Inc. VT82C586 IDE [Apollo] (rev 06)
00:07.4 SMBus: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40)
00:08.0 RAID bus controller: 3ware Inc: Unknown device 1001 (rev 01)
00:09.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 0c)
00:0a.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 0c)
01:05.0 VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 04)
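Since the question here is which 82557 revision each box has (rev 08 on the testbed, rev 0c on this machine), a small sketch for pulling the Ethernet controllers and their revisions out of pasted lspci output may help when comparing machines. It assumes the plain "slot class: description (rev xx)" layout shown above and copes with a mail client wrapping "(rev xx)" onto its own line. The function names are hypothetical.

```python
import re

# A device record starts with a PCI slot such as "00:09.0", then the class name.
_DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]) ([^:]+): (.*)$")
_REV_RE = re.compile(r"\(rev ([0-9a-f]+)\)\s*$")


def join_wrapped(lines):
    """Rejoin records whose tail was wrapped onto a following line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if _DEVICE_RE.match(line) or not records:
            records.append(line)          # start of a new device record
        else:
            records[-1] += " " + line     # continuation of the previous one
    return records


def ethernet_revisions(lspci_text):
    """Return (slot, revision) pairs for every Ethernet controller listed.

    The revision is None when the record carries no "(rev xx)" suffix.
    """
    result = []
    for record in join_wrapped(lspci_text.splitlines()):
        match = _DEVICE_RE.match(record)
        if match and match.group(2) == "Ethernet controller":
            rev = _REV_RE.search(record)
            result.append((match.group(1), rev.group(1) if rev else None))
    return result
```

Feeding it the listing above should yield one pair per Intel NIC, which makes it easy to line up revisions from several boxes side by side.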
Looking for anything else.
BTW: The machine has been up since Saturday, when I replaced the eepro100
driver with the e100 driver. I'm not saying the problem is fixed; it may
simply not have hit whatever caused the kernel to die in the first
place.
--
--------------------------------------------------
Edward Muller - Director of Information Services
LearningPatterns.com Inc.
Mobile: 973.715.0230
NYC: 212.487.9064 x115
Email/Jabber: emuller@learningpatterns.com
http://www.learningpatterns.com