[ENBD] Two ENBD issues
Peter T. Breuer
ptb@it.uc3m.es
Thu, 28 Sep 2000 13:18:02 +0200 (MET DST)
I'll take the opportunity (hurray ... break from exam committees!)
to comment on a couple of other issues.
> 100% CPU usage. I had the same kind of problem once with broken pipes,
I think that was a busy loop caused by a read that returned zero bytes
without error, after a select had reported the socket as ready to return
some bytes or an error. I know we covered that but I'd be grateful for
your comment. I've made a change in my current codebase which should
deal with it, one way or another. At least it will announce it.
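A minimal sketch of the zero-byte read case described above, written in
Python rather than taken from the ENBD sources. The function name
read_until_eof and its parameters are hypothetical. It shows that once
the peer has closed, select keeps reporting the socket as readable and
the read returns zero bytes without error, so the loop has to treat that
as end of stream and announce it instead of going round again.

```python
import select
import socket


def read_until_eof(sock, timeout=1.0, max_wakeups=1000):
    """Collect data from sock until the peer closes; return (data, wakeups)."""
    chunks = []
    wakeups = 0
    while wakeups < max_wakeups:
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            break  # timed out with nothing to read
        wakeups += 1
        data = sock.recv(4096)
        if data == b"":
            # Zero bytes and no error: the peer has closed. select() would
            # go on reporting this socket as readable, so stop watching it
            # rather than spinning at 100% CPU.
            print("peer closed the connection after %d wakeups" % wakeups)
            break
        chunks.append(data)
    return b"".join(chunks), wakeups


def demo():
    # A connected pair of local sockets stands in for client and server.
    a, b = socket.socketpair()
    a.sendall(b"hello")
    a.close()  # simulate the broken pipe
    try:
        return read_until_eof(b)
    finally:
        b.close()
```

Without the check for an empty read, the loop would only stop at the
max_wakeups cap, which is the busy loop in miniature.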
> Another strange thing is the "rollback" message I get when I am in an
> interactive shell. I will wait for some time and then execute a command
There are some race conditions on device removal - I can get the client
stuck .. but it's not stuck in my driver code. I think it's stuck in
an fsync_dev in nbd_release, which will be called through the block
device layers in the kernel as an open on the device is released. There
are no requests on the queue or anywhere in the devices area, and
according to the accounting it has dealt with every request it ever saw,
and the clients themselves aren't stuck. But when I replace the
blocking fsync_dev with a non-blocking sync_dev, well, I don't know ..
I think that leads to a painful death.
Also, I've tried adding a sync pulse in a kernel thread just to see if I
can avoid the bad publicity that comes from the fact that streaming
writes will eventually deadlock on a loopback connection (it's
inevitable given the kernel VFS architecture), but that seems to mess up
e2fsck again, probably due to that utility's habit of sending
out-of-range requests just to see if we are awake and listening.
In other words I know about some possible or even likely race
conditions, but it is very very hard and slow work to check them out. I
suspect that the kernel as a whole is full of them. I can even get the
server stuck in a network call, and that's a purely userland
application.
> Like if the client could not resync with the server. Does the rollback
> relaunch an nbd-server on the other end? If I never enter interactive
I forgot to answer this. No, it does not. A rollback is a probable
_symptom_ of a hard disconnect. The client will detect the disconnect.
It will ask the kernel to rollback any requests it has outstanding.
When the connect is made again, the client will return to asking the
kernel for requests to treat and will probably get the old requests back
again. You are probably seeing a failed reconnect and nothing more. I
don't know why it should fail, but timeouts can be long.
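The rollback sequence above can be pictured with a toy in-memory model.
This is only an illustration of the idea, not the ENBD driver: the class
ToyRequestQueue and its methods are hypothetical names, and the real
kernel side is of course more involved. Requests handed to the client
stay "outstanding" until acknowledged. On a disconnect they are rolled
back to the head of the queue, so the client gets the same requests again
after reconnecting.

```python
from collections import deque


class ToyRequestQueue:
    """Hypothetical stand-in for the kernel side of the device."""

    def __init__(self, requests):
        self.pending = deque(requests)  # not yet handed to the client
        self.outstanding = []           # handed out, not yet acknowledged

    def get_request(self):
        req = self.pending.popleft()
        self.outstanding.append(req)
        return req

    def ack(self, req):
        self.outstanding.remove(req)

    def rollback(self):
        # Put unacknowledged requests back at the head of the queue,
        # preserving their original order.
        while self.outstanding:
            self.pending.appendleft(self.outstanding.pop())


def demo():
    queue = ToyRequestQueue(["r1", "r2", "r3", "r4"])
    first = queue.get_request()
    queue.ack(first)        # r1 completed before the disconnect
    queue.get_request()     # r2 in flight
    queue.get_request()     # r3 in flight
    queue.rollback()        # hard disconnect detected by the client
    # After the reconnect the client asks for requests again.
    return [queue.get_request() for _ in range(3)]
```

Here demo() returns ["r2", "r3", "r4"]: the two requests that were in
flight come back first, followed by the one that had not been handed out.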
> happens near the 30s message in nbd-server, but it is sporadic,
> sometimes everything will be fine, sometimes after the 30s I will get a
> rollback on the next block request).
I'm pretty sure it's a timing thing, but I need a trace (you can get
one with strace -f -p ...).
> Another strange fact, if I boot several clients at the same time, the
> nbd-server won't serve more than 4 at any one time, so we have to boot
> machines by pairs of 3. More and one machine will not sync when the
I have no clue where this comes from. It's worth you looking at that,
or asking around your colleagues. It might be a restriction in the
bind, listen, accept semantics. There _are_ queues involved underneath,
but I don't know of (i.e. I am abysmally ignorant of) any limit that would
reduce you to 4 listen, accept cycles on the same primary port. As far
as I can see, the code is in a loop just doing listen, accept, fork and
should not care how many times it does it. The only possibility that
occurs to me is that some socket is not being closed and we are running
out of descriptors. Is that visible via netstat?
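For reference, a sketch of a listen, accept, fork loop of the kind
described above, in Python rather than the server's own C. The names
serve and demo are hypothetical and this is not the nbd-server code. It
marks the one place where a descriptor would leak if the parent forgot to
close its copy of the connected socket after forking. It assumes a Unix
system where os.fork is available and loopback connections are allowed.

```python
import os
import socket


def serve(listener, n_connections):
    """Accept n_connections clients, forking one child per client."""
    children = []
    for _ in range(n_connections):
        conn, _addr = listener.accept()
        pid = os.fork()
        if pid == 0:
            # Child: serve this one client, then exit without returning.
            try:
                listener.close()
                conn.sendall(b"ok")
                conn.close()
            finally:
                os._exit(0)
        # Parent: the child holds its own copy of conn. If the parent does
        # not close its copy here, one descriptor leaks per client served.
        conn.close()
        children.append(pid)
    for pid in children:
        os.waitpid(pid, 0)  # reap the children


def demo(n_clients=6):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(16)              # backlog larger than n_clients
    port = listener.getsockname()[1]
    # The connections complete in the listen queue before accept() runs.
    clients = [socket.create_connection(("127.0.0.1", port))
               for _ in range(n_clients)]
    serve(listener, n_clients)
    replies = [c.recv(16) for c in clients]
    for c in clients:
        c.close()
    listener.close()
    return replies
```

With six clients, more than the four reported above, every one should
get a reply from this loop, which is consistent with the loop itself not
caring how many times it runs.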
Peter