[ENBD] 2.4.26a crashes
Peter T. Breuer
enbd@lists.community.tummy.com
Mon, 7 Jan 2002 17:09:19 +0100 (MET)
"A month of sundays ago Edward Muller wrote:"
> On Mon, 2002-01-07 at 10:11, Peter T. Breuer wrote:
> > "A month of sundays ago Edward Muller wrote:"
> > Enter is significant. That indicates that tcp is still up. A
> > momentary brownout would upset tcp but not udp (ping, etc.). I would
> > have expected a login over telnet or ssh to hang for a few minutes if
> > the line was lossy enough, and you would not have been able to hit
> > enter.
>
> I can hit enter only at the console. My ssh sessions (of which there
OK, then your symptoms are consistent with a significant
networking failure. Tcp went down for at least 5s (to cause the
nbd driver to notice).
> were 5 just to my workstation) locked up with no response (IIRC). At the
> console I could (well someone else did the keypresses) hit enter and
> watch the text scroll up the screen.
When the network loses or misorders or corrupts some tcp packets, these
have to time out and resync all the tcp sequences. If the network is
sufficiently bad, the "lockup" can persist for several minutes as the
mess is cleared up. This happens quite often to me on ssh over iffy
network connections (or just spray a hub with random wrong mac addresses
and see what happens :-). Telnet should recover first, as it is very
lightweight. Ssh has a huge overhead. It probably revalidates you after
losing a packet, and makes you wait into the bargain. A lighter weight
solution than ssh would be an ip/ip tunnel over cipe.
Ssh is a sensitive indicator of network troubles.
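To put rough numbers on "several minutes": a minimal sketch, assuming a
retransmission timeout that doubles after every lost packet up to a cap.
The 1s start, 120s cap and 8 retries are illustrative values I picked for
the arithmetic, not figures read out of any particular kernel.

```python
def total_stall(initial_rto=1.0, max_rto=120.0, retries=8):
    """Total seconds spent waiting across `retries` back-to-back
    retransmissions, with the timeout doubling each time up to max_rto.
    All three defaults are assumed, illustrative values."""
    total = 0.0
    rto = initial_rto
    for _ in range(retries):
        total += rto                  # wait out the current timeout
        rto = min(rto * 2, max_rto)   # back off, but never past the cap
    return total


if __name__ == "__main__":
    # 1 + 2 + 4 + 8 + 16 + 32 + 64 + 120 seconds
    print(total_stall())
```

Under those assumptions a connection that loses eight retransmissions in a
row sits silent for about four minutes, which is the sort of "lockup" an
interactive session would show.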
> Which drivers are you using? eepro100 or e100 ?
eepro100. Various 2.4 kernels. I didn't know the other driver existed. I
believe I have "B" version nics.
> Once the problem occurs I can't do anything on the client machine. It's
> locked. And sorry, I just looked I don't have the server logs anymore. I
> need to change > to >> in my init script. What I remember from
> reading the server logs is that the server processes restarted for some
> reason. But that's all I remember.
The reason would be that they lost contact with the client (or at least
timed out on it).
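On the > versus >> point, the difference is truncate versus append. A
small sketch of the same thing in Python, where mode "w" behaves like >
and mode "a" like >>. The file name and log lines are made up for the
illustration and have nothing to do with the real server logs.

```python
import os
import tempfile


def write_log(path, line, append):
    """Write one line to path. append=False truncates first (like >),
    append=True keeps what is already there (like >>)."""
    mode = "a" if append else "w"
    with open(path, mode) as f:
        f.write(line + "\n")


def demo():
    """Return the file contents after truncating writes, then after an append."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "server.log")   # hypothetical log file
        write_log(path, "first run", append=False)
        write_log(path, "second run", append=False)  # wipes "first run"
        with open(path) as f:
            truncated = f.read().splitlines()
        write_log(path, "third run", append=True)    # keeps "second run"
        with open(path) as f:
            appended = f.read().splitlines()
    return truncated, appended


if __name__ == "__main__":
    print(demo())
```

With truncation each restart throws away the log of the run that failed,
which is exactly the run one wants to read afterwards.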
> > That is the expected sequence. If it doesn't happen, you could try and
> > accelerate it by killing daemons that you think are stuck via -9 (to be
> > real sure that the network connection dies) on both sides, and see if new
> > ones are born and reconnect properly. If that helps, then it is
> > possible that networking sockets are becoming stuck.
>
> Again I can't do anything on the client machine.
I think you'll find it's a networking lockup. Try using cipe with
telnet, or ssl telnet maybe (never tried). Ssh blocks easily, by
intent.
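One way to check for a tcp outage from a third machine, independently of
ssh: a minimal sketch that just tries to complete a tcp connect within a
timeout. The function name is mine, the 5s default only mirrors the
figure mentioned above, and the host and port in the usage lines are
placeholders, not the real setup.

```python
import socket


def tcp_alive(host, port, timeout=5.0):
    """Return True if a tcp connection to (host, port) completes within
    `timeout` seconds, False if it is refused, unreachable or times out."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:   # covers timeouts, refusals and unreachable hosts
        return False


if __name__ == "__main__":
    # Placeholder address and port; substitute the server being tested.
    print(tcp_alive("192.0.2.1", 22))
```

If ping still answers while this returns False for a port that should be
listening, that would point at tcp being down rather than the machine.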
Peter