[ENBD] Strange behavior
Peter T. Breuer
ptb@it.uc3m.es
Tue, 10 Oct 2000 15:51:07 +0200 (MET DST)
"A month of sundays ago Adrian Turcu wrote:"
> > Turn the device off by sending it a "die" signal via a USR1 to the
> > daemons (that should be enough to frighten RAID, from what I have heard)
> > or echo -n 0 to /proc/nbdinfo. Either will cause all outstanding requests
> > to be errored out and the device to report itself as disabled for 5s.
>
>
> Thank you very much for reminding me about user space. Yes, I am using heartbeat
> on my Linux box, and when the network is down (heartbeat says so) I just put something
> like:
>
> killall -USR1 nbd-client ; killall -9 nbd-client ; rmmod nbd
The -9 is somewhat evil. It kills the clients without giving them
a chance to unregister from the kernel first, which leaves the
module "busy" and un-rmmodable (until you echo -n 1 >/proc/nbdinfo).
Use killall -TERM at most. That will have the effect you want.
Why do you want to kill the clients, BTW? Do you not want the device to
come up again when the network is reestablished?
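Something along these lines might do for the heartbeat hook. It is only
a sketch: the function name nbd_release and the RUN variable are made up
for illustration, the 1s pause is a guess, and the daemon and module
names are simply taken from the command quoted above. It sends USR1
first, then TERM in place of -9, and only then tries to unload.

```shell
#!/bin/sh
# Hypothetical helper for a heartbeat "network down" hook (illustrative only).
# Assumes the daemons are called nbd-client and the module nbd, as quoted above.

RUN=${RUN:-}    # set RUN=echo to print the commands instead of running them

nbd_release() {
    # 1. "die" signal: the driver errors out outstanding requests.
    $RUN killall -USR1 nbd-client
    # 2. TERM rather than -9, so the clients can unregister from the kernel.
    $RUN killall -TERM nbd-client
    # 3. Short pause (arbitrary), then try to unload; rmmod will refuse
    #    if the module is still busy.
    $RUN sleep 1
    $RUN rmmod nbd
}
```

Calling it as RUN=echo nbd_release prints the four commands without
touching anything. If you want the device to come back when the network
returns, steps 2 and 3 are the ones to leave out.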
> in my heartbeat service script and the RAID-1 is released from its wait
> (it declares the NBD partition down and continues to work with the remaining ones).
I'd rather you tried the modification to the driver I suggested (put a
nbd_soft/hard_reset call in when the aslot count goes through zero in
nbd_clr_sock - aslot is the number of known active slots, i.e. client
daemons). I'll add an option for this, but I'd like to know how it
behaves first!
I'd also like a vote on whether USR1 appears to do the right thing (or
even appears to do what I said it does). It's meant to be a fairly weak
signal. The trouble is that there are three levels of deactivation
that one might need, and only two USR signals available.
  1) all daemons temporarily out of order, rollback requests,
     disable driver, and wait for reenable
  2) all daemons completely whacked, error out anything we
     hold and turn off anything we can pretty permanently
  3) emergency recover from state we shouldn't have been in so
     that at least we can shut down cleanly - reset use count,
     search out and kill malformed buffers, etc etc.
USR1 currently clears the daemon slots of any requests, putting them
back on the device queue, then errors out all the requests on the device
queue, and goes on to search for requests on the incoming kernel queue
that it can legitimately error out. It holds the device state as
DISABLED throughout, so that the kernel can't sneak any more requests
in (the daemons will get pretty unhappy too, since they'll come up
empty-handed when they dive down for requests to treat). It keeps the
device that way for 5s, after which it sets things back to normal.
So USR1 is a cross between 1) and 2). It's also "per NBD device". I.e.
it only affects nda and not ndb, or only ndb and not nda, etc.
USR2 affects the whole driver. It does a USR1 on any device it is
handling, scavenges around a bit more on the kernel queue, then resets
the use counts and turns off the driver by force of arms, permanently.
If by any chance the driver was in use, you may get an oops when
you disengage whatever was holding one of its devices open, as its
kernel resources have now disappeared!
And hence USR2 is a 3). Or as close as makes no difference.
Peter