[ENBD] 2.4.26a crashes

Peter T. Breuer enbd@lists.community.tummy.com
Mon, 7 Jan 2002 16:11:42 +0100 (MET)


"A month of sundays ago Edward Muller wrote:"
> I've had two crashes like this in the last week. I'm not 100% sure they
> are being caused by ENBD but it's one of two suspects.
> 
> Twice over the last week my primary machine in a two node cluster locked
> up with the following (or similar) messages:
> 
> nbd: #10333[0]: nbd_rollback Rollback req c19090c0 from slot 0!

Client is OK, server died. Well ... could be the client too. Could be the
network.  The kernel driver is OK: it noticed that a request had gone
unattended for 5s and pulled it back into the kernel for service by
another daemon.

In itself it's not dangerous, but it would be a symptom of other
problems.
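
To make the rollback idea concrete, here is a toy model in Python. It is
only an illustration of the behaviour described above, not the driver
code: the Slot class, the function name and the choice to put the
request back at the head of the queue are all my assumptions for the
sketch. Only the 5s timeout comes from the description.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

ROLLBACK_TIMEOUT = 5.0  # seconds a request may sit unattended (from the text)


@dataclass
class Slot:
    """Hypothetical per-daemon slot holding at most one in-flight request."""
    request: Optional[str] = None
    taken_at: float = 0.0


def rollback_stale(slots, queue, now, timeout=ROLLBACK_TIMEOUT):
    """Pull requests that went unattended longer than `timeout` out of
    their slots and put them back on the queue for another daemon."""
    rolled = []
    for index, slot in enumerate(slots):
        if slot.request is not None and now - slot.taken_at > timeout:
            # Assumption: the request goes back to the head of the queue.
            queue.appendleft(slot.request)
            rolled.append((index, slot.request))
            slot.request = None
    return rolled
```

In this model a request taken at t=0 and still sitting in slot 0 at t=6
gets rolled back, while one taken at t=3 is left alone.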

> (Someone read me this over the phone, so I'm sorry if it's not 100%)
> 
> I'm not swapping over ENBD. I have two ENBD devices /dev/nd/a/0 and
> /dev/nd/b/0. /dev/nd/a/0 is disk #2 in /dev/md0, which is a RAID1 array.
> /dev/nd/b/0 is disk #2 in /dev/md1, which is a RAID1 array. /dev/md0 is
> mounted as /home. /dev/md1 is mounted as /var.
> 
> When the above error message happens the machine locks. I can't login,

The error in itself cannot cause a lock. In fact, it's not an error but
an advisory. 

The worst that it can _cause_ is for operations on the nbd-mounted
filesystem to block. If you want them to error instead of blocking,
run with show_errors=1 as a module parameter, or write a 0 to
/proc/nbdinfo (when it happens).
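
For the record, writing the 0 can be scripted. A minimal sketch in
Python, assuming you run it as root on the client machine; the function
name is made up, and the path argument is only there so the helper can
be tried against an ordinary file first.

```python
def error_pending_requests(path="/proc/nbdinfo"):
    """Write a 0 to the driver's proc file, as described above.

    `path` defaults to /proc/nbdinfo; pass another file name to dry-run.
    """
    with open(path, "w") as f:
        f.write("0\n")  # same bytes that `echo 0 > /proc/nbdinfo` would send
```

Calling error_pending_requests() with no argument does the real thing,
so only do that when the device is actually blocked.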

But I suspect it's a symptom, not a cause.

> although I can still ping the network interfaces and I can hit enter on
> the screen and it scrolls.

Enter is significant. That indicates that tcp is still up. A
momentary brownout would upset tcp but not icmp (ping, etc.). I would
have expected a login over telnet or ssh to hang for a few minutes if 
the line was lossy enough, and you would not have been able to hit
enter.

> I am using eepro100 network cards (two of them, one of them dedicated to
> ENBD) on a 2.4.16 kernel. Until Saturday (the crashes happened before
> that) I was using the open source eepro100 driver, now I am using the
> e100 driver from Intel. My understanding is that the eepro100 OSS driver
> is buggy and may be causing my problems (this is the other possible

I am using those drivers and those nics, and have no problems of that
kind. The problem you report needs further analysis, and more data for
that. Try to find out the state of the nbd client (and server!)
processes when this happens, as that may show who is alive and
who is dead. Writing the "0" to /proc/nbdinfo will also give a clue by
the reaction it provokes.  It will error all current requests and keep
the device in an unready state for 5s - which will cause all further
requests to be errored for 5s.  This should be enough also for the
clients to sense that the kernel driver is ill, and maybe die off in
sympathy, which will cause the client watchdog to restart them.  They
will try and go through a semi-handshake with the server at the other
end.  Maybe the servers will also die off and restart.
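
One way to capture the process states is to read /proc/<pid>/stat for
each daemon. A rough sketch follows; the "enbd" name prefix is my
assumption about what the daemons are called on your system, so adjust
it to whatever ps shows. The state letter is the standard Linux one
(R running, S sleeping, D uninterruptible sleep, Z zombie).

```python
import os


def parse_stat(text):
    """Extract (command name, state letter) from /proc/<pid>/stat contents.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split on the first "(" and the last ")".
    """
    start = text.index("(")
    end = text.rindex(")")
    comm = text[start + 1:end]
    state = text[end + 1:].split()[0]
    return comm, state


def daemon_states(name_prefix="enbd", proc="/proc"):
    """Map pid -> (name, state) for processes whose name starts with
    name_prefix (an assumed prefix, not taken from the driver docs)."""
    found = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process went away while we were looking
        if comm.startswith(name_prefix):
            found[int(entry)] = (comm, state)
    return found
```

A daemon stuck in state D for a long time on either side would be a
good hint as to who is dead.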

That is the expected sequence. If it doesn't happen, you could try to
accelerate it by killing daemons that you think are stuck with kill -9 (to
be really sure that the network connection dies) on both sides, and see if new
ones are born and reconnect properly. If that helps, then it is
possible that networking sockets are becoming stuck.
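
If you want to script that experiment, something like the following
would do. It is a sketch only: both function names are invented, and
you have to supply the pid lists yourself (from ps, say) before and
after the kill.

```python
import os
import signal


def kill_hard(pids):
    """Send SIGKILL (kill -9) to each pid; return those actually signalled."""
    killed = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
            killed.append(pid)
        except ProcessLookupError:
            pass  # already gone
    return killed


def reborn(before, after):
    """Pids present after the kill that were not there before,
    i.e. the daemons that got restarted."""
    return sorted(set(after) - set(before))
```

An empty result from reborn() some seconds after the kill would mean
nothing restarted the daemons, which is worth knowing too.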

> problem)..


Peter