[ENBD] 2.4.26a crashes

Edward Muller enbd@lists.community.tummy.com
07 Jan 2002 10:54:36 -0500


On Mon, 2002-01-07 at 10:11, Peter T. Breuer wrote:
> "A month of sundays ago Edward Muller wrote:"
> > I've had two crashes like this in the last week. I'm not 100% sure they
> > are being caused by ENBD but it's one of two suspects.
> > 
> > Twice over the last week my primary machine in a two node cluster locked
> > up with the following (or similar) messages:
> > 
> > nbd: #10333[0]: nbd_rollback Rollback req c19090c0 from slot 0!
> 
> Client is OK, server died. Well ... could be client too. Could be the
> network.  The kernel driver is OK.  It noticed that a request went
> unattended for 5s and pulled it back into the kernel for service by
> another daemon.

Network is a crossover cable between the two machines. I have the logs
from the server at the time (I think). I'll have to dig them up.

> 
> In itself it's not dangerous, but it would be a symptom of other
> problems.
> 
> > (Someone read me this over the phone, so I'm sorry if it's not 100%)
> > 
> > I'm not swapping over ENBD. I have two ENBD devices /dev/nd/a/0 and
> > /dev/nd/b/0. /dev/nd/a/0 is disk #2 in /dev/md0, which is a RAID1 array.
> > /dev/nd/b/0 is disk #2 in /dev/md1, which is a RAID1 array. /dev/md0 is
> > mounted as /home. /dev/md1 is mounted as /var.
> > 
> > When the above error message happens the machine locks. I can't login,
> 
> The error in itself cannot cause a lock. In fact, it's not an error but
> an advisory. 

Hmm. Machine locked pretty hard. Couldn't do a thing to it. Except hit
enter to see the screen scroll up.

> 
> The worst that it can _cause_ is for operations on the nbd-mounted
> filesystem to block. If you want them to error instead of blocking,
> run with show_errors=1 as a module parameter, or write a 0 to
> /proc/nbdinfo (when it happens).

I use the following line in a script to load the module:

/sbin/modprobe nbd merge_requests=0 sync_intvl=1 \
               rahead=20 show_errs=1 plug=1
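
For the other option mentioned above, writing a 0 to /proc/nbdinfo when the
hang happens, here is a minimal sketch. The function name and the optional
path argument are hypothetical and exist only so the snippet can be tried
against an ordinary file; only the /proc/nbdinfo path and the "0" come from
this thread.

```shell
# Sketch only: write "0" to the ENBD control file so pending requests error out.
# nbd_error_pending and its optional path argument are hypothetical helpers.
nbd_error_pending() {
    ctl="${1:-/proc/nbdinfo}"
    if [ ! -w "$ctl" ]; then
        # Refuse quietly if the file is absent or not writable.
        echo "cannot write to $ctl (is the nbd module loaded?)" >&2
        return 1
    fi
    echo 0 > "$ctl"
}
```

Run as root with no argument, this would target /proc/nbdinfo itself. It
obviously only helps if the machine still accepts commands at that point.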

> 
> But I suspect it's a symptom, not a cause.
> 
> > although I can still ping the network interfaces and I can hit enter on
> > the screen and it scrolls.
> 
> Enter is significant. That indicates that tcp is still up. A
> momentary brownout would upset tcp but not udp (ping, etc.). I would
> have expected a login over telnet or ssh to hang for a few minutes if 
> the line was lossy enough, and you would not have been able to hit
> enter.

I can hit enter only at the console. My ssh sessions (of which there
were 5 just to my workstation) locked up with no response (IIRC). At the
console I could (well someone else did the keypresses) hit enter and
watch the text scroll up the screen.

> 
> > I am using eepro100 network cards (two of them, one of them dedicated to
> > ENBD) on a 2.4.16 kernel. Until saturday (the crashes happened before
> > that) I was using the open source eepro100 driver, now I am using the
> > e100 driver from Intel. My understanding is that the eepro100 OSS driver
> > is buggy and may be causing my problems (this is the other possible
> 
> I am using those drivers and those nics, and have no problems of that
> kind. The problem you report needs further analysis, and more data for
> that. Try and find out the state of the nbd client (and server!)
> processes when this happens, as this will maybe show who is alive and
> who is dead. Writing the "0" to /proc/nbdinfo will also give a clue by
> the reaction it provokes.  It will error all current requests and keep
> the device in an unready state for 5s - which will cause all further
> requests to be errored for 5s.  This should be enough also for the
> clients to sense that the kernel driver is ill, and maybe die off in
> sympathy, which will cause the client watchdog to restart them.  They
> will try and go through a semi-handshake with the server at the other
> end.  Maybe the servers will also die off and restart.
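
One way to capture the process state asked for above, on whichever side still
accepts commands, is to list the state column from ps. This is a sketch under
assumptions: the default pattern "enbd" is a guess, since the thread does not
name the daemon processes, and the function name is hypothetical. On Linux a
STAT value beginning with D means uninterruptible sleep, usually a process
blocked on I/O.

```shell
# List PID, state and command name for processes whose name matches a pattern.
# The default pattern "enbd" is an assumption; substitute the real daemon name.
show_daemon_state() {
    pattern="${1:-enbd}"
    # Keep the header row (NR == 1) plus any row whose command name matches.
    ps -eo pid,stat,comm | awk -v pat="$pattern" 'NR == 1 || $3 ~ pat'
}
```

Calling show_daemon_state on both client and server and comparing the STAT
columns should show which side is alive and which is stuck.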

Which drivers are you using? eepro100 or e100 ?

Once the problem occurs I can't do anything on the client machine. It's
locked. And sorry, I just looked; I don't have the server logs anymore. I
need to change > to >> in my init script. From what I remember of
reading the server logs, the server processes restarted for some
reason. But that's all I remember.
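
To illustrate the > versus >> difference, here is a small sketch using a
temporary file as a stand-in for the server log. The real init script and log
path are not shown in this thread, so the file and the messages are made up.

```shell
# A temporary file stands in for the server log (hypothetical; not the real path).
log="$(mktemp)"

echo "server start 1" > "$log"     # '>' truncates the file on every start
echo "server start 2" > "$log"     # the first entry is now gone
truncated_lines=$(grep -c '' "$log")

echo "server start 3" >> "$log"    # '>>' appends, so earlier entries survive
appended_lines=$(grep -c '' "$log")

echo "after '>': $truncated_lines line(s); after '>>': $appended_lines line(s)"
rm -f "$log"
```

With > the log from before a restart is overwritten each time the server
starts, which is how the earlier logs were lost.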

> 
> That is the expected sequence. If it doesn't happen, you could try and
> accelerate it by killing daemons that you think are stuck via -9 (to be
> real sure that the network connection dies) on both sides, and see if new
> ones are born and reconnect properly. If that helps, then it is
> possible that networking sockets are becoming stuck.

Again I can't do anything on the client machine.

> 
> > problem)..
> 
> 
> Peter
-- 
-------------------------------
Edward Muller
Director of IS

973-715-0230 (cell)
212-487-9064 x115 (NYC)

http://www.learningpatterns.com
-------------------------------