[ENBD] 2.4.26a crashes

Edward Muller enbd@lists.community.tummy.com
07 Jan 2002 18:15:19 -0500


On Mon, 2002-01-07 at 11:09, Peter T. Breuer wrote:
> "A month of sundays ago Edward Muller wrote:"
> > On Mon, 2002-01-07 at 10:11, Peter T. Breuer wrote:
> > > "A month of sundays ago Edward Muller wrote:"
> > > Enter is significant. That indicates that tcp is still up. A
> > > momentary brownout would upset tcp but not udp (ping, etc.). I would
> > > have expected a login over telnet or ssh to hang for a few minutes if 
> > > the line was lossy enough, and you would not have been able to hit
> > > enter.
> > 
> > I can hit enter only at the console. My ssh sessions (of which there
> 
> OK, then your symptoms are consistent with a significant
> networking failure. Tcp went down for at least 5s (to cause the 
> nbd driver to notice).

Heh. But still, the box locked hard. Shouldn't nbd just error and
continue? Shouldn't md (which is what is using the ENBD devices) just
continue?

> 
> > were 5 just to my workstation) locked up with no response (IIRC). At the
> > console I could (well, someone else did the keypresses) hit enter and
> > watch the text scroll up the screen.
> 
> When the network loses or misorders or corrupts some tcp packets, these
> have to time out and resync all the tcp sequences.  If the network is
> sufficiently bad, the "lockup" can persist for several minutes as the
> mess is cleared up.  This happens quite often to me on ssh over iffy
> network connections (or just spray a hub with random wrong mac addresses
> and see what happens :-). Telnet should recover first, as it is very
> lightweight. Ssh has a huge overhead. It probably revalidates you after
> losing a packet, and makes you wait into the bargain. A lighter weight
> solution than ssh would be an ip/ip tunnel over cipe.

Agreed. But after the first lockup I had to travel into NYC to reboot
the box manually (the second time I had someone on site) and it took me
about 2 hours to get there due to the time of day and the fact that I
could not just drop everything and head in. I still (2 hours later)
could not get into the box, even at the console. Although I could still
ping the interfaces.
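
The symptom described here (ping still answers while logins do not) can be told apart from a dead host by probing at the tcp level rather than with ping. The following is only a minimal sketch, assuming Python 3; the function name tcp_reachable is hypothetical, and the 5 second default simply echoes the figure mentioned earlier in the thread rather than any ENBD setting.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a tcp connection to (host, port) completes in time.

    This is an illustrative probe, not part of ENBD. A host can answer
    ping and still fail this check if its tcp stack or the service is stuck.
    """
    try:
        # create_connection performs the full tcp handshake or raises.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, unreachable hosts and timeouts all land here.
        return False
```

Calling it against the ssh port of the affected box from another machine, alongside a plain ping, would show whether only tcp is affected. The host and port to use are left to the reader.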

> 
> Ssh is a sensitive indicator of network troubles.

Yep. 

> 
> > Which drivers are you using? eepro100 or e100 ?
> 
> eepro100. Various 2.4 kernels. I didn't know the other driver existed. I
> believe I have "B" version nics.

I think the problem with the nics is dependent on which nics you have, as
the eepro100 driver (and its e100 Intel cousin) supports various nics. I
am using the Intel Pro/100 Server (S class) nics. I think the problem
crops up with those nics. I have a 'B' version nic in my machine at home
and haven't had problems, ever.

> 
> > Once the problem occurs I can't do anything on the client machine. It's
> > locked. And sorry, I just looked and I don't have the server logs anymore. I
> > need to change > to >> in my init script. What I remember from
> > reading the server logs is that the server processes restarted for some
> > reason. But that's all I remember.
> 
> The reason would be that they lost contact with the client (or at least
> timed out on it).

Again, shouldn't things just continue on since /home => /dev/md0 which
is made up of /dev/sda8 (drive 0) and /dev/nd/a/0 (drive 1) instead of
locking up?
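
To make the expectation concrete, here is a toy model of a two-way mirror that drops a member when a write to it returns an error and carries on with the survivor. It is only a sketch of the behaviour being asked about, assuming Python 3; the Member and Mirror classes are hypothetical and are not how md or ENBD are actually implemented.

```python
class Member:
    """Toy mirror member; 'healthy' stands in for a working disk or link."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.blocks = {}

    def write(self, sector, data):
        if not self.healthy:
            # The member reports the failure promptly instead of hanging.
            raise IOError(f"{self.name}: write failed")
        self.blocks[sector] = data


class Mirror:
    """Toy mirror: a member that errors is marked faulty, the rest continue."""

    def __init__(self, members):
        self.active = list(members)
        self.faulty = []

    def write(self, sector, data):
        # Iterate over a copy so members can be removed during the loop.
        for member in list(self.active):
            try:
                member.write(sector, data)
            except IOError:
                self.active.remove(member)
                self.faulty.append(member)
        if not self.active:
            raise IOError("all mirror members failed")
```

With one member named /dev/sda8 and another named /dev/nd/a/0, setting healthy to False on the second and writing again leaves the data on the first and moves the second to the faulty list. Note that the model only covers a member that returns an error; in this toy code a member whose write never returned would stall the loop instead.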

> 
> > > That is the expected sequence. If it doesn't happen, you could try and
> > > accelerate it by killing daemons that you think are stuck via -9 (to be
> > > real sure that the network connection dies) on both sides, and see if new
> > > ones are born and reconnect properly. If that helps, then it is
> > > possible that networking sockets are becoming stuck.
> > 
> > Again I can't do anything on the client machine.
> 
> I think you'll find it's a networking lockup. Try using cipe with
> telnet, or ssl telnet maybe (never tried). Ssh blocks easily, by
> intent.
> 
> Peter

Edward