[ENBD] nbd with an SMP kernel?
Peter T. Breuer
ptb@it.uc3m.es
Wed, 18 Oct 2000 02:14:53 +0200 (MET DST)
"A month of sundays ago Peter T. Breuer wrote:"
> "A month of sundays ago leonid@hmdc-admin.fas.harvard.edu wrote:"
> > I actually used to run the server w/ -i "NBDabcdefNBD" (I stole this from
You may even be right! I just tried your experiment. I started the
connections, then killed all servers, then restarted the servers.
You should have mentioned that they don't restart. The main server just
sits there waiting for the clients to connect to it. And they don't:
|-nbd-client---4*[nbd-client]
|-nbd-server
Let's see what they're doing. I bet they're waiting 360s for a low
level net timeout. The servers complained about an interrupted system
call when I killed them! Must be a network send/recv:
barney:/usr/oboe/ptb/lang/c/nbd/nbd-2.4.15% sudo strace -p 14752
select(7, NULL, [6], [6], {122, 340000}
Yes. Very evidently a wait on a read. The clients don't know the
server is dead. They'd know it if only they'd listen to the alarm
signal that's been ringing in their ears from SIGALRM, but on linux
it seems that when you're in a select you don't hear SIGALRM, even
though you have a handler set. Oh good .. yes, the client
just timed out on the read, noticed the server death, and tries to
reconnect, when it gets "Connection refused" from the connect attempt.
That's clearly a confused socket state in the kernel. I see the
sequence of retries now continuing every 10s. That's fine, they
will succeed eventually, when the kernel cleans out the socket stuff.
If ever. Here's the client getting the refused message from the
connection:
rt_sigaction(0xd, 0xbfffe308, 0, 0x8, 0xd) = 0
write(2, "nbd-client: client (0) socket co"..., 64nbd-client: client
(0) socket connect failed Connection refused) = 64
close(6)
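
For illustration only, here is a minimal Python sketch of waiting on a socket with a short select timeout, so that a silent peer is noticed without relying on SIGALRM. It is not the nbd-client code: the helper name wait_readable and the 0.2s timeout are assumptions made for this example, and a local socket pair stands in for the client and server.

```python
import select
import socket

def wait_readable(sock, timeout_s):
    """Return True if sock becomes readable within timeout_s seconds."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return bool(readable)

# A local socket pair stands in for the client and server ends.
client_end, server_end = socket.socketpair()

# Peer is alive but silent: select gives up when the timeout expires.
silent = wait_readable(client_end, 0.2)

# Peer sends data: select returns as soon as the data is pending.
server_end.sendall(b"ping")
active = wait_readable(client_end, 0.2)
client_end.recv(4)  # drain the pending data

# Peer closes cleanly: the socket is readable and recv() returns b"" (EOF).
server_end.close()
closed = wait_readable(client_end, 0.2) and client_end.recv(1) == b""
client_end.close()
```

A peer that closes cleanly shows up as a readable socket with an empty read. A peer that simply goes quiet is only noticed once the timeout runs out, which is why the length of that timeout matters.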
It's doing the right thing. It's closing everything in sight and
retrying every 10s. But I believe that the kernel will hold the
socket open for a long time. Maybe I should kill the client
instead of retrying? Or kill it every three failed retries?
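
As a rough sketch of the "give up after a few failed retries" idea, the Python below opens a fresh socket for every attempt and stops after a fixed number of failures. The function name connect_with_retry and its parameters are hypothetical. The 30s default only mirrors the waiting time observed further down in this message, and the limit of three failures mirrors the question above. None of this is taken from the nbd-client source.

```python
import socket
import time

def connect_with_retry(host, port, interval_s=30.0, max_failures=3):
    """Try to connect, using a new socket for every attempt.

    Returns the connected socket, or None after max_failures failed attempts.
    """
    for attempt in range(1, max_failures + 1):
        # A socket whose connect() failed is discarded, never reused.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect((host, port))
            return sock
        except OSError:
            sock.close()
            # Sleep between attempts, but not after the final failure.
            if attempt < max_failures:
                time.sleep(interval_s)
    return None
```

A caller that gets None back can then decide whether to exit and let something else restart the client, or to start another round of attempts later.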
Yes, netstat shows lots of sockets on port 4018 in the closed state.
tcp 0 76 localhost:1081 localhost:4018 CLOSE
tcp 0 76 localhost:1080 localhost:4018 CLOSE
tcp 0 76 localhost:1079 localhost:4018 CLOSE
tcp 1 0 localhost:1077 localhost:4018 CLOSE
tcp 1 0 localhost:1076 localhost:4018 CLOSE
tcp 1 0 localhost:1075 localhost:4018 CLOSE
tcp 1 0 localhost:1074 localhost:4018 CLOSE
tcp 1 0 localhost:1067 localhost:4018 CLOSE
tcp 1 0 localhost:1066 localhost:4018 CLOSE
tcp 1 0 localhost:1065 localhost:4018 CLOSE
It's closed, and nobody's at the other end. Why doesn't the kernel just
vamoose it? Shall I help it along by killing the process? Why not.
No .. the socket stays around even when the process vanishes. Killing
all the client processes and restarting might help ... nope, the
sockets stay around without their owners. Ahhh. Killing the clients,
then waiting 30s, then restarting the clients does the reconnect
trick.
This looks like a systematic error of mine, but I don't know how to
cure it, because I don't know exactly what's up. I think the problem
is that communications through the elected port number are blocked
by the state of the connections at the client side. The server
clearly never gets any calls. But the client reports that connect()
fails. It's not blocked by the server, so it must be blocked by
something else. If you have any idea how to make all those unconnected
sockets go away, go ahead and let me know! It looks like I should
increase the retry interval to beyond about 30s anyway, or however
long the tcp sockets take to fall out of the close state. They're still
hanging around with no owners that I can locate:
tcp 0 0 localhost:4018 localhost:1693 ESTABLISHED
tcp 0 0 localhost:1694 localhost:4018 ESTABLISHED
tcp 0 0 localhost:1693 localhost:4018 ESTABLISHED
tcp 0 0 localhost:4018 localhost:1692 ESTABLISHED
tcp 0 0 localhost:1692 localhost:4018 ESTABLISHED
tcp 0 0 localhost:4018 localhost:1691 ESTABLISHED
tcp 0 0 localhost:1691 localhost:4018 ESTABLISHED
tcp 0 76 localhost:1409 localhost:4018 CLOSE
tcp 0 76 localhost:1408 localhost:4018 CLOSE
tcp 0 76 localhost:1407 localhost:4018 CLOSE
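
To watch when the stale entries finally drop out, one could tally listings like the one above by state. The Python below is only a sketch: the name tally_states is made up for this example, and it assumes the six-column layout shown here (proto, recv-q, send-q, local, remote, state), which other netstat invocations may not produce.

```python
from collections import Counter

def tally_states(netstat_text, port):
    """Count sockets per TCP state whose remote address ends in :port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed columns: proto recv-q send-q local remote state
        if len(fields) != 6 or fields[0] != "tcp":
            continue
        remote, state = fields[4], fields[5]
        if remote.endswith(":" + str(port)):
            counts[state] += 1
    return counts

# A few lines copied from the listing above.
sample = """\
tcp 0 0 localhost:4018 localhost:1693 ESTABLISHED
tcp 0 0 localhost:1694 localhost:4018 ESTABLISHED
tcp 0 0 localhost:1693 localhost:4018 ESTABLISHED
tcp 0 76 localhost:1409 localhost:4018 CLOSE
tcp 0 76 localhost:1408 localhost:4018 CLOSE
"""
print(tally_states(sample, 4018))
```

Running this every few seconds against fresh netstat output would show how long the CLOSE entries for port 4018 actually linger.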
The connect(2) man page says:
Generally, connection-based protocol sockets may success-
fully connect only once; connectionless pro....
But that's a pretty inaccurate statement. Well, I'll go and ask some
networking experts what's happening in this state, and see if they can
suggest how to get out of it, once they've told me what it is.
Feel free to kill the clients, wait 30s, then restart them.
Peter