[ENBD] can't restart the client after mkraid
Peter T. Breuer
enbd@lists.community.tummy.com
Tue, 17 Jun 2003 12:51:24 +0200 (MET DST)
"A month of sundays ago enbd@ttvenn.net wrote:"
> I'm using 2.4.20 kernels, nbd-2.4.20 and fr1-2.14. I've got a 120GB
> drive each of 2 machines that I want to combine in a raid 1 mirror.
>
> When the system is up, things are working OK. I'm currently testing
> what happens when things go wrong. In this case I am just killing
> the client after the mkraid. I get the same problem whether or
> not I startraid or make a filesystem on /dev/md0. I only don't get
> the problem if I don't run mkraid. Then I can kill and restart the
> client with wild and heady abandon.
What is the file system?
>
> Here's what happens when things are OK:
>
> enbd-client 17468: client (-1) got size 123522416640
> enbd-client 17468: client (-1) negotiated blksize 1024
> enbd-client 17468: client (-1) negotiated pulse_intvl 10
> enbd-client 17468: client (-1) got session port 1100 ok
> enbd-client 17468: client (-1) introduction sequence ends ok
> enbd-client 17468: client (-1) set show_errs flag on device
> enbd-client 17468: client (-1) sets session slots to 0-1
> enbd-client 17469: client (0) found device /dev/nda1 ok
> enbd-client 17470: client (1) found device /dev/nda2 ok
> enbd-client 17469: client (0) opened socket 5 to port 1100
> enbd-client 17469: client (0) read passwd ok from port 1100
> enbd-client 17469: client (0) got cliserv magic ok from port 1100
> enbd-client 17469: client (0) got a signature ok from port 1100
> enbd-client 17469: client (0) begins main loop
> enbd-client 17470: client (1) opened socket 5 to port 1100
> enbd-client 17470: client (1) read passwd ok from port 1100
> enbd-client 17470: client (1) got cliserv magic ok from port 1100
> enbd-client 17470: client (1) got a signature ok from port 1100
> enbd-client 17470: client (1) begins main loop
>
> then make the raid (otherwise everything is OK)
>
> $ mkraid --dangerous-no-resync /dev/md0
>
> break the raid by killing the enbd-client on the client:
>
> $ killall enbd-client
This is not exactly the expected mode of shutting down! The client is
part of the setup - it's not expecting to die. What is expected is that
the /connection/ to the server will die, but not the client.
You may be able to get away with it, and you may not. I'll have to
think about it.
> enbd-client 17514: sighandler terminates manager safely
> enbd-client 17514: client (-1) last error Invalid argument
>
> and this happens on the server:
>
> enbd-server 8137<#1322>: newproto net errored on packet. Breaking off.
> enbd-server 8138<#1322>: newproto net errored on packet. Breaking off.
> enbd-server 8137<# 819>: slavesighandler server (0) activates slave sighandler for signal 15
> enbd-server 8137: server (0) sighandler terminates slave 8137 safely
> enbd-server 8136: server (-1) relaunches child after SIGCHLD
> enbd-server 8136: server (-1) slave pid 8137 is down, launching new
> enbd-server 8136: server (-1) launched slave pid 8139
> enbd-server 8139: server (0) set default signal handlers for slave server 8139
> enbd-server 8138<# 819>: slavesighandler server (1) activates slave sighandler for signal 15
> enbd-server 8138: server (1) sighandler terminates slave 8138 safely
> enbd-server 8136: server (-1) relaunches child after SIGCHLD
> enbd-server 8136: server (-1) slave pid 8138 is down, launching new
> enbd-server 8140: server (1) set default signal handlers for slave server 8140
> enbd-server 8136: server (-1) launched slave pid 8140
>
> Then try to restart client:
>
> $ enbd-client garage:1099 -e -n 2 -i garage /dev/nda
OK.
> enbd-client 17530: client (-1) manager opened NBD device /dev/nda (2b00)
> enbd-client 17530: client (-1) left kernel bdflush sync boundary at 80%
> enbd-client 17530: client (-1) left kernel bdflush async boundary at 25%
> enbd-client 17531: client (-1) starts introduction sequence on port 1099
> enbd-client 17531: client (-1) got size 123522416640
> enbd-client 17531: client (-1) negotiated blksize 1024
> enbd-client 17531: client (-1) negotiated pulse_intvl 10
> enbd-client 17531: client (-1) got session port 1101 ok
> enbd-client 17531: client (-1) introduction sequence ends ok
> enbd-client 17531: client (-1) set show_errs flag on device
> enbd-client 17531: client (-1) sets session slots to 0-1
> enbd-client 17532: client (0) found device /dev/nda1 ok
> enbd-client 17533: client (1) found device /dev/nda2 ok
> enbd-client 17532: client (0) opened socket 5 to port 1101
> enbd-client 17532: client (0) read passwd ok from port 1101
> enbd-client 17532: client (0) got cliserv magic ok from port 1101
> enbd-client 17532: client (0) got a signature ok from port 1101
> enbd-client 17532: client (0) begins main loop
> enbd-client 17533: client (1) opened socket 5 to port 1101
> enbd-client 17533: client (1) read passwd ok from port 1101
> enbd-client 17533: client (1) got cliserv magic ok from port 1101
> enbd-client 17533: client (1) got a signature ok from port 1101
> enbd-client 17533: client (1) begins main loop
>
> then a pause, then
>
> newproto kernel errored 3 times when we asked for a new req: Bad file descriptor
Well, we are simply not being given anything by the kernel. I suspect
the kernel log will show what is being complained about (it's too hot
here to think!).
> The whole thing loops around the "newproto kernel errored..." until you kill the client.
Check the kernel emissions with "dmesg" (sorry for heat-induced brevity ..
there is a sort of heat haze between me and the keyboard - oh, it's my
eyes).
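A minimal sketch of that check, assuming the relevant kernel lines mention "nbd", "md<N>" or "raid". The exact wording of the driver's messages is not shown in this thread, so the pattern is a guess, and nbd_kernel_msgs is a made-up helper name:

```shell
# Hypothetical helper: keep only log lines that look related to
# (e)nbd or software raid. The pattern is an assumption, not the
# driver's documented message format.
nbd_kernel_msgs() {
    grep -i -E 'nbd|md[0-9]|raid'
}

# Show the most recent matching kernel messages.
dmesg 2>/dev/null | nbd_kernel_msgs | tail -n 40
```

Running it just after the "newproto kernel errored" loop begins should put whatever the kernel is complaining about near the bottom of the output.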
> And then after I kill the enbd-client:
>
> enbd-server 8147<#1322>: newproto net errored on packet. Breaking off.
Well, the logs all agree that the kernel refuses to play ball. I can dimly
imagine why. I'd need to see the kernel messages.
> $ raidstop /dev/md0
>
> I can then restart the client with no problems again, but this then
> means I lose the benefit of the intelligent resync.
It looks like the device is in use, and the new client won't be allowed
into the kernel while the reference count is high. Kernel messages
would show.
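One way to test the "device is in use" guess before restarting the client is to see whether the md driver still lists the device as an array member. This is only a sketch: md_holds is a hypothetical helper, the component name "nda" is taken from the /dev/nda in the client command above, and the line format assumed is the usual /proc/mdstat one ("md0 : active raid1 nda[1] hda3[0]").

```shell
# Hypothetical helper: reads mdstat-style text on stdin and succeeds
# if the named device appears as a member of any md array.
# Assumes member entries look like "name[N]" on an "mdN :" line.
md_holds() {
    grep -E "^md[0-9]+ :.* $1\[[0-9]+\]" >/dev/null
}

# "nda" is the assumed component name (from /dev/nda in this thread).
if [ -r /proc/mdstat ] && md_holds nda < /proc/mdstat; then
    echo "an md array still lists nda as a member"
fi
```

If the array still lists the device, that would fit the observation that the client only restarts cleanly after raidstop /dev/md0.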
It's really sort of a non-problem, no? Don't kill the client and you
get no problem!
Peter