[ENBD] nbd with an SMP kernel?

Peter T. Breuer ptb@it.uc3m.es
Wed, 18 Oct 2000 02:14:53 +0200 (MET DST)


"A month of sundays ago Peter T. Breuer wrote:"
> "A month of sundays ago leonid@hmdc-admin.fas.harvard.edu wrote:"
> > I actually used to run the server w/ -i "NBDabcdefNBD" (I stole this from

You may even be right! I just tried your experiment. I started the
connections, then killed all servers, then restarted the servers.
You should have mentioned that they don't restart. The main server just
sits there waiting for the clients to connect to it. And they don't:


            |-nbd-client---4*[nbd-client]
            |-nbd-server

Let's see what they're doing. I bet they're waiting 360s for a low
level net timeout. The servers complained about an interrupted system
call when I killed them! Must be a network send/recv:

   barney:/usr/oboe/ptb/lang/c/nbd/nbd-2.4.15% sudo strace -p 14752
   select(7, NULL, [6], [6], {122, 340000}

Yes. Very evidently a wait on a read. The clients don't know the
server is dead. They'd know it if only they'd listen to the alarm
signal that's been ringing in their ears from SIGALRM, but on linux
it seems that when you're in a select you don't hear SIGALRM, even
though you have a handler set. Oh good .. yes, the client
just timed out on the read, noticed the server death, and tries to
reconnect, when it gets "Connection refused" from the connect attempt.
That's clearly a confused socket state in the kernel. I see the
sequence of retries now continuing every 10s. That's fine, they
will succeed eventually, when the kernel cleans out the socket stuff.
If ever. Here's the client getting the refused message from the
connection:

  rt_sigaction(0xd, 0xbfffe308, 0, 0x8, 0xd) = 0
  write(2, "nbd-client: client (0) socket co"..., 64nbd-client: client
  (0) socket connect failed Connection refused) = 64
  close(6)             
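
For reference, a minimal sketch of a bounded wait on the socket that tells
a timeout apart from a signal interruption. select() is documented to fail
with EINTR when a signal handler interrupts it, so the caller can test for
that case explicitly. The helper name wait_fd and its return codes are
made up for illustration; this is not the code in nbd-client.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper, not taken from the nbd sources.
 * Waits until fd is readable (want_write == 0) or writable
 * (want_write != 0), for at most timeout_sec seconds.
 * Returns  1 if fd is ready,
 *          0 if the timeout expired,
 *         -1 on a real error,
 *         -2 if a signal (for example SIGALRM) interrupted the wait. */
static int wait_fd(int fd, int want_write, long timeout_sec)
{
    fd_set set;
    struct timeval tv;
    int rc;

    FD_ZERO(&set);
    FD_SET(fd, &set);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    rc = select(fd + 1,
                want_write ? NULL : &set,   /* read set  */
                want_write ? &set : NULL,   /* write set */
                NULL, &tv);
    if (rc < 0)
        return errno == EINTR ? -2 : -1;
    return rc > 0 ? 1 : 0;
}
```

A caller that gets 0 or -2 back can decide to treat the server as dead
without waiting for the full low-level network timeout.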

It's doing the right thing. It's closing everything in sight and
retrying every 10s. But I believe that the kernel will hold the
socket open for a long time. Maybe I should kill the client
instead of retrying? Or kill it every three failed retries?
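
The close-and-retry idea, with a limit on the number of attempts, might be
sketched like this. It opens a fresh socket for every attempt and gives up
after max_tries failures, leaving the caller to decide whether to exit.
The function name connect_fresh and its parameters are invented for this
example, and it assumes a plain IPv4 address; the real client may do this
differently.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not taken from the nbd sources.
 * Tries to connect to ip:port up to max_tries times, sleeping
 * delay_sec seconds between attempts. Every attempt uses a new
 * socket. Returns the connected descriptor, or -1 if the address
 * is invalid or all attempts failed. */
static int connect_fresh(const char *ip, unsigned short port,
                         int max_tries, unsigned delay_sec)
{
    struct sockaddr_in addr;
    int i;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;

    for (i = 0; i < max_tries; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0)
            return fd;
        close(fd);              /* do not reuse a socket after a failed connect() */
        if (i + 1 < max_tries)
            sleep(delay_sec);   /* wait before the next attempt */
    }
    return -1;
}
```

With the numbers from this thread, a call such as
connect_fresh("127.0.0.1", 4018, 3, 10) would make three attempts 10s
apart and then return -1.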

Yes, netstat shows lots of sockets on port 4018 in the closed state.

  tcp        0     76 localhost:1081          localhost:4018 CLOSE
  tcp        0     76 localhost:1080          localhost:4018 CLOSE
  tcp        0     76 localhost:1079          localhost:4018 CLOSE
  tcp        1      0 localhost:1077          localhost:4018 CLOSE
  tcp        1      0 localhost:1076          localhost:4018 CLOSE
  tcp        1      0 localhost:1075          localhost:4018 CLOSE
  tcp        1      0 localhost:1074          localhost:4018 CLOSE
  tcp        1      0 localhost:1067          localhost:4018 CLOSE
  tcp        1      0 localhost:1066          localhost:4018 CLOSE
  tcp        1      0 localhost:1065          localhost:4018 CLOSE

It's closed, and nobody's at the other end. Why doesn't the kernel just
vamoose it? Shall I help it along by killing the process? Why not.
No .. the socket stays around even when the process vanishes. Killing
all the client processes and restarting might help ... nope, the
sockets stay around without their owners. Ahhh. Killing the clients,
then waiting 30s, then restarting the clients does the reconnect
trick.

This looks like a systematic error of mine, but I don't know how to
cure it, because I don't know exactly what's up. I think the problem
is that communications through the elected port number are blocked
by the state of the connections at the client side. The server
clearly never gets any calls. But the client reports that connect()
fails. It's not blocked by the server, so it must be blocked by
something else. If you have any idea how to make all those unconnected
sockets go away, go ahead and let me know! It looks like I should
increase the retry interval to beyond about 30s anyway, or however
long the TCP sockets take to fall out of the CLOSE state. They're still
hanging around with no owners that I can locate:

  tcp        0      0 localhost:4018          localhost:1693 ESTABLISHED
  tcp        0      0 localhost:1694          localhost:4018 ESTABLISHED
  tcp        0      0 localhost:1693          localhost:4018 ESTABLISHED
  tcp        0      0 localhost:4018          localhost:1692 ESTABLISHED
  tcp        0      0 localhost:1692          localhost:4018 ESTABLISHED
  tcp        0      0 localhost:4018          localhost:1691 ESTABLISHED
  tcp        0      0 localhost:1691          localhost:4018 ESTABLISHED
  tcp        0     76 localhost:1409          localhost:4018 CLOSE
  tcp        0     76 localhost:1408          localhost:4018 CLOSE
  tcp        0     76 localhost:1407          localhost:4018 CLOSE       
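
One way to stretch the retry interval past the 30s mark is to let it grow
with each failure. A rough sketch follows; the starting value of 10s
matches the current retry interval, while the doubling and the 60s cap are
arbitrary choices made for this example, and retry_delay is a made-up
name.

```c
/* Hypothetical retry policy, not taken from the nbd sources.
 * Starts at 10 s, doubles after each failed attempt, and is
 * capped at 60 s, so from the third attempt on the wait is
 * longer than about 30 s. */
static unsigned retry_delay(unsigned failures)
{
    unsigned delay = 10;

    while (failures-- > 0 && delay < 60)
        delay *= 2;
    return delay > 60 ? 60 : delay;
}
```

The resulting waits are 10s, 20s, 40s, and then 60s for every later
attempt.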

The connect(2) man page says:
  
         Generally,  connection-based protocol sockets may success-
         fully connect only once; connectionless  pro....

But that's a pretty inaccurate statement. Well, I'll go and ask some
networking experts what's happening in this state, and see if they can
suggest how to get out of it, once they've told me what it is.

Feel free to kill the clients, wait 30s, then restart them.

Peter