[ENBD] Kernel oops (nbd-2.4.31) or failed connections

Anders Blomdell anders.blomdell at control.lth.se
Mon May 24 06:51:38 MDT 2004


> There are always race conditions in removing a device.
Yes, but what worries me are the race conditions in open/close.

> Thank you very very much for the code below, by the way.
You are welcome.

> While I like this (it makes opens and closes atomic and prevents an open
> from following a close so closely that the open sees a device present
> just before the close kills it, and letting the open run on to use the
> device that has just been killed by the close ..), it doesn't obviously
> solve the problem globally - there are many other points in the server
> code that assume that once they have got a reference from enbd[nbd], it
> is going to stay valid while they execute and not disappear under them.
>
> Not all of these references are going to be preceded by an open.  For
> example, the request function can receive late-arriving requests, get
> the device reference, then have the device disappear under it ... yeah,
> bad example, that's under io_lock, but I imagine (without looking) that
> the lock is raised briefly while going round the loop, and anyway,
> the semaphore doesn't ask the io lock for permission to up itself :(.
Aren't all operations surrounded by open/close pairs (thereby implying
that handles won't disappear underneath them)? I obviously have something
to catch up on in my Linux kernel understanding.
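
For reference, the scheme under discussion (open, close and removal all
serialized on one lock, with the reference count only ever changed under
that lock) can be sketched in userspace C. This is only an illustration
under stated assumptions: struct slot, slot_open, slot_close and
slot_remove are hypothetical names, a pthread mutex stands in for the
kernel semaphore, and none of it is the actual enbd code.

```c
#include <pthread.h>

/* Hypothetical stand-in for one device slot; not the real enbd structure. */
struct slot {
    pthread_mutex_t lock;   /* serializes open, close and removal */
    int present;            /* nonzero while the device exists */
    int refcnt;             /* successful opens not yet closed */
};

static struct slot demo_slot = { PTHREAD_MUTEX_INITIALIZER, 1, 0 };

/* Returns 0 on success, -1 if the device has already been removed. */
static int slot_open(struct slot *s)
{
    int rc = -1;

    pthread_mutex_lock(&s->lock);
    if (s->present) {
        s->refcnt++;        /* taken under the lock, so removal sees it */
        rc = 0;
    }
    pthread_mutex_unlock(&s->lock);
    return rc;
}

static void slot_close(struct slot *s)
{
    pthread_mutex_lock(&s->lock);
    if (s->refcnt > 0)
        s->refcnt--;
    pthread_mutex_unlock(&s->lock);
}

/* Returns 0 if the device was removed, -1 if it is still held open. */
static int slot_remove(struct slot *s)
{
    int rc = -1;

    pthread_mutex_lock(&s->lock);
    if (s->refcnt == 0) {
        s->present = 0;
        rc = 0;
    }
    pthread_mutex_unlock(&s->lock);
    return rc;
}
```

With this shape an open either sees the device and pins it, or sees that
it is gone; it cannot observe the device just before a removal kills it.
It does nothing for code paths that pick up a reference without going
through an open, which is the concern raised in the quoted text.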

> I'm quite willing to add these checks, but experience tells me that if
> you have a semaphore on a common thing like an open, you really really
> really do not want to be on board ship when the semaphore somehow gets
> stuck. I try and make semaphores with timeouts! Defensive programming.
> Defense against kicking the machine to death ...
OK, but it's [almost?] as bad to be killed by a kernel OOPS, leaving the
module stuck in memory.

> I notice by the way that the open code already contains a semaphore
> (with timeout :-) (in 2.4.32, for kernel 2.4, at least):
In 2.4.31 as well.

> However, you have (correctly) made the refcnt get itself incremented
> under semaphore too. I agree with that, and I will do it, but it means
> adding a decrement if the open subsequently bombs out ..  but it
> doesn't.  After negotiating the above code in open, the open is doomed
> to succeed.  So OK - I move the refcnt increment up there from lower
> down in the function.
OK, confidence restored 8^)

> There is an argument for supposing that NULL/non-NULL can act like a
> lock, since at least the operation that writes the address is atomic,
> but I do not wish to pursue that.
Nope, an atomic pointer write does not imply a memory barrier on a
non-coherent cache machine (SPARCs come to mind).
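
To make the ordering point concrete, here is a minimal userspace sketch
using C11 atomics (not kernel code; struct dev, dev_slot, publish and
read_blksize are hypothetical names). The pointer store itself is
atomic, but it is the release/acquire pair that guarantees a reader who
sees a non-NULL pointer also sees the fields written before it.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical device record; stands in for whatever the pointer refers to. */
struct dev {
    int blksize;
};

static _Atomic(struct dev *) dev_slot = NULL;

/* Writer: fill in the record first, then publish the pointer. */
static void publish(struct dev *d, int blksize)
{
    d->blksize = blksize;
    /* Release: the write to blksize is ordered before the pointer store. */
    atomic_store_explicit(&dev_slot, d, memory_order_release);
}

/* Reader: returns the block size, or -1 if no device is published. */
static int read_blksize(void)
{
    /* Acquire: pairs with the release store in publish(). */
    struct dev *d = atomic_load_explicit(&dev_slot, memory_order_acquire);

    return d ? d->blksize : -1;
}
```

This only covers publication. It says nothing about when the record may
be freed, so it does not replace the reference count.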

> Is there any point to invalidating buffers when we kill the daemons?
> I suppose so. But why not whenever the open count falls below the daemon
> count, instead?
Sounds very reasonable.

Should I switch to 2.4.32 for testing?

Right now I have a
'enbd-server  2052: <#1467> newproto Not enough magic in packet. Breaking off'
that I need to get rid of (haven't seen that one before)...

Regards

Anders Blomdell


