[ENBD] Stress testing ENBD
Christopher Eveland
enbd@lists.community.tummy.com
Thu, 24 Jan 2002 16:47:14 -0500
Hi again-
I'm now at the point of trying to stress test ENBD. I can get everything
talking between multiple servers and clients and such just fine, but I'm
having some trouble when I put a connection through its paces.
At the moment I'm using the "new" 2.4.27pre1. If I do a "make test" which
sets up an 8MB set of files on the server, and hook that up to /dev/nda on
the client, then:
# mke2fs /dev/nda
# mount /dev/nda /mnt
# touch /mnt/foo
# dd if=/dev/zero of=/mnt/foo bs=1024 count=7000
# sync
things seem to work just fine, and I can watch the network traffic spike up
when I do the sync. The problem comes when I try to get more ambitious:
# for ((i=1;i<=10;i++)); do dd if=/dev/zero of=/mnt/foo bs=1024 count=7000; sync; done
will crash the system (well, apparently stop the transfers at any rate)
after a few iterations. I see the net traffic spike up, then I get a string
of errors, at which point I see the following string of messages on the
console (client and server messages are interleaved), and the net traffic
drops off to only 4kb/sec, which I assume is the attempts to re-establish
connections:
nbd-server 4473: slavesighandler server (2) activates slave sighandler for
signal 11
nbd-server 4473: server (2) sighandler terminates slave 4473 safely
nbd-netserver 1543: net_sum recv_reply CKSUM failed (-22)
nbd-netserver 1543: net_sum sum exits FAIL (-22)
nbd-client-server 1543: sumgen sumgen exits FAIL (-22)
nbd-server 4470: server (-1) relaunches child after SIGCHLD
nbd-netserver 1543: net_write net_write exits FAIL -22
nbd-client-server 1543: writegen write first server fails sector 4152 len
34816
nbd-client-server 1543: writegen exits FAIL
nbd-client 1543: handle_server_cmd_err Read_stat (3) failed Timer expired
port 3034 so clr sock
nbd-server 4470: server (-1) slave pid 4473 is down, launching new
nbd-server 4475: server (2) set default signal handlers for slave server
4475
nbd-server 4470: server (-1) launched slave pid 4475
nbd-server 4474: slavesighandler server (3) activates slave sighandler for
signal 11
nbd-server 4474: server (3) sighandler terminates slave 4474 safely
nbd-netserver 1540: net_sum recv_reply CKSUM failed (-22)
nbd-netserver 1540: net_sum sum exits FAIL (-22)
nbd-client-server 1540: sumgen sumgen exits FAIL (-22)
nbd-netserver 1540: net_write net_write exits FAIL -22
nbd-client-server 1540: writegen write first server fails sector 4220 len
34816
nbd-client-server 1540: writegen exits FAIL
nbd-client 1540: handle_server_cmd_err Read_stat (0) failed Timer expired
port 3034 so clr sock
nbd-server 4470: server (-1) relaunches child after SIGCHLD
nbd-server 4470: server (-1) slave pid 4474 is down, launching new
nbd-server 4476: server (3) set default signal handlers for slave server
4476
nbd-server 4471: slavesighandler server (0) activates slave sighandler for
signal 11
nbd-server 4471: server (0) sighandler terminates slave 4471 safely
nbd-netserver 1541: net_sum recv_reply CKSUM failed (-22)
nbd-netserver 1541: net_sum sum exits FAIL (-22)
nbd-client-server 1541: sumgen sumgen exits FAIL (-22)
nbd-server 4470: server (-1) launched slave pid 4476
nbd-server 4470: server (-1) slave pid 4471 is down, launching new
nbd-server 4472: slavesighandler server (1) activates slave sighandler for
signal 11
nbd-server 4472: server (1) sighandler terminates slave 4472 safely
nbd-server 4477: server (0) set default signal handlers for slave server
4477
nbd-server 4470: server (-1) launched slave pid 4477
nbd-server 4470: server (-1) slave pid 4472 is down, launching new
nbd-server 4478: server (1) set default signal handlers for slave server
4478
nbd-server 4470: server (-1) launched slave pid 4478
nbd-server 4470: server (-1) relaunches child after SIGCHLD
nbd-netserver 1542: net_write net_write exits FAIL -22
nbd-client-server 1542: writegen write first server fails sector 2 len 1024
nbd-client-server 1542: writegen exits FAIL
nbd-client 1542: handle_server_cmd_err Read_stat (2) failed Connection
reset by peer port 3034 so clr sock
nbd-client 1543: client (3) last error Timer expired
nbd-client 1540: client (0) last error Timer expired
nbd-client 1536: sighandler relaunches child from manager
nbd-client 1536: client (-1) childminder verifies pid 1543 (3) is down
nbd-client 1536: client (-1) childminder launched pid 1593 (3)
nbd-client 1593: ok
nbd-client 1536: client (-1) childminder verifies pid 1540 (0) is down
nbd-client 1594: ok
nbd-client 1536: client (-1) childminder launched pid 1594 (0)
nbd-client 1593: client (3) opened socket 5 to port 3034
nbd-server 4477:nbd-client 1594: client (0) opened socket 5 to port 3034
nbd-server 4475: server (2) opened port 3034 (socket 6) for client
192.168.4.1
nbd-client 1594: client (0) read passwd ok from port 3034
nbd-client 1594: client (0) got cliserv magic ok from port 3034
nbd-client 1594: client (0) got a signature ok from port 3034
nbd-client 1594: client (0) begins main loop
nbd-server 4475: server (2) set new signal handlers for slave server 4475
server (0) opened port 3034 (socket 6) for client 192.168.4.1
nbd-client 1593: client (3) read passwd ok from port 3034
nbd-client 1593: client (3) got cliserv magic ok from port 3034
nbd-client 1593: client (3) got a signature ok from port 3034
nbd-client 1593: client (3) begins main loop
nbd-server 4477: server (0) set new signal handlers for slave server 4477
nbd-server 4477: slavesighandler server (0) activates slave sighandler for
signal 11
nbd-server 4477: server (0) sighandler terminates slave 4477 safely
nbd-server 4470: server (-1) relaunches child after SIGCHLD
nbd-server 4470: server (-1) slave pid 4477 is down, launching new
nbd-server 4479: server (0) set default signal handlers for slave server
4479
nbd-server 4470: server (-1) launched slave pid 4479
nbd-client 1542: client (2) last error Connection reset by peer
nbd-client 1536: sighandler relaunches child from manager
nbd-client 1536: client (-1) childminder verifies pid 1542 (2) is down
nbd-client 1536: client (-1) childminder launched pid 1595 (2)
nbd-client 1595: ok
nbd-client 1595: client (2) opened socket 5 to port 3034
nbd-server 4479: server (0) opened port 3034 (socket 6) for client
192.168.4.1
nbd-client 1595: client (2) read passwd ok from port 3034
nbd-client 1595: client (2) got cliserv magic ok from port 3034
nbd-client 1595: client (2) got a signature ok from port 3034
nbd-client 1595: client (2) begins main loop
nbd-server 4479: server (0) set new signal handlers for slave server 4479
nbd-netserver 1593: net_write net_write exits FAIL -22
nbd-client-server 1593: writegen write first server fails sector 2 len 1024
nbd-client-server 1593: writegen exits FAIL
nbd-client 1593: handle_server_cmd_err Read_stat (3) failed Connection
reset by peer port 3034 so clr sock
nbd-server 4475: slavesighandler server (2) activates slave sighandler for
signal 11
nbd-server 4475: server (2) sighandler terminates slave 4475 safely
nbd-server 4470: server (-1) relaunches child after SIGCHLD
nbd-server 4470: server (-1) slave pid 4475 is down, launching new
nbd-server 4480: server (2) set default signal handlers for slave server
4480
nbd-server 4470: server (-1) launched slave pid 4480
nbd-netserver 1594: net_write net_write exits FAIL -22
nbd-client-server 1594: writegen write first server fails sector 2 len 1024
nbd-client-server 1594: writegen exits FAIL
nbd-client 1594: handle_server_cmd_err Read_stat (0) failed Connection
reset by peer port 3034 so clr sock
nbd-server 4479: slavesighandler server (0) activates slave sighandler for
signal 11
nbd-server 4479: server (0) sighandler terminates slave 4479 safely
nbd-server 4470: server (-1) relaunches child after SIGCHLD
nbd-server 4470: server (-1) slave pid 4479 is down, launching new
nbd-server 4481: server (0) set default signal handlers for slave server
4481
nbd-server 4470: server (-1) launched slave pid 4481
nbd-netserver 1595: net_write net_write exits FAIL -22
nbd-client-server 1595: writegen write first server fails sector 2 len 1024
nbd-client-server 1595: writegen exits FAIL
nbd-client 1595: handle_server_cmd_err Read_stat (2) failed Connection
reset by peer port 3034 so clr sock
nbd-client 1593: client (3) last error Connection reset by peer
nbd-client 1536: sighandler relaunches child from manager
nbd-client 1536: client (-1) childminder verifies pid 1593 (3) is down
nbd-client 1596: ok
nbd-client 1536: client (-1) childminder launched pid 1596 (3)
nbd-client 1596: client (3) opened socket 5 to port 3034
nbd-server 4476: server (3) opened port 3034 (socket 6) for client
192.168.4.1
nbd-client 1596: client (3) read passwd ok from port 3034
nbd-client 1596: client (3) got cliserv magic ok from port 3034
nbd-client 1596: client (3) got a signature ok from port 3034
nbd-client 1596: client (3) begins main loop
nbd-server 4476: server (3) set new signal handlers for slave server 4476
nbd-server 4476: slavesighandler server (3) activates slave sighandler for
signal 11
nbd-server 4476: server (3) sighandler terminates slave 4476 safely
nbd-server 4470: server (-1) relaunches child after SIGCHLD
nbd-server 4470: server (-1) slave pid 4476 is down, launching new
nbd-client 1594: client (0) last error Connection reset by peer
nbd-server 4482: server (3) set default signal handlers for slave server
4482
nbd-server 4470: server (-1) launched slave pid 4482
nbd-client 1536: sighandler relaunches child from manager
nbd-netserver 1596: net_write net_write exits FAIL -22
nbd-client 1536: client (-1) childminder verifies pid 1594 (0) is down
nbd-client-server 1596: writegen write first server fails sector 2 len 1024
nbd-client-server 1596: writegen exits FAIL
nbd-client 1597:nbd-client 1596: handle_server_cmd_err Read_stat (3)
failed Connection reset by peer port 3034 so clr sock
ok
nbd-client 1536: client (-1) childminder launched pid 1597 (0)
nbd-client 1597: client (0) opened socket 5 to port 3034
nbd-server 4480: server (2) opened port 3034 (socket 6) for client
192.168.4.1
nbd-client 1597: client (0) read passwd ok from port 3034
nbd-client 1597: client (0) got cliserv magic ok from port 3034
nbd-client 1597: client (0) got a signature ok from port 3034
nbd-client 1597: client (0) begins main loop
nbd-server 4480: server (2) set new signal handlers for slave server 4480
nbd-server 4480: slavesighandler server (2) activates slave sighandler for
signal 11
nbd-server 4480: server (2) sighandler terminates slave 4480 safely
nbd-client 1595: client (2) last error Connection reset by peer
nbd-server 4470: server (-1) relaunches child after SIGCHLD
nbd-server 4470: server (-1) slave pid 4480 is down, launching new
nbd-server 4483: server (2) set default signal handlers for slave server
4483
nbd-server 4470: server (-1) launched slave pid 4483
nbd-client 1536: sighandler relaunches child from manager
nbd-client 1536: client (-1) childminder verifies pid 1595 (2) is down
nbd-netserver 1597: net_write net_write exits FAIL -22
nbd-client-server 1597: writegen write first server fails sector 12 len
1024
nbd-client-server 1597: writegen exits FAIL
nbd-client 1598: ok
nbd-client 1597: handle_server_cmd_err Read_stat (0) failed Connection
reset by peer port 3034 so clr sock
nbd-client 1536: client (-1) childminder launched pid 1598 (2)
nbd-client 1598: client (2) opened socket 5 to port 3034
nbd-server 4483: server (2) opened port 3034 (socket 6) for client
192.168.4.1
nbd-client 1598: client (2) read passwd ok from port 3034
nbd-client 1598: client (2) got cliserv magic ok from port 3034
nbd-client 1598: client (2) got a signature ok from port 3034
nbd-client 1598: client (2) begins main loop
nbd-server 4483: server (2) set new signal handlers for slave server 4483
nbd-server 4483: slavesighandler server (2) activates slave sighandler for
signal 11
And it goes on. Anyway, if anyone has any suggestions for what to look
into, I'd appreciate it (again), and if there are any diagnostics that might
help, let me know.
Thanks,
-Chris
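
A console dump like the one above is easier to compare between runs once it is tallied. The following is a minimal sketch, assuming the message formats are exactly those shown in the excerpt; the regular expressions and the function name summarize are illustrative and are not taken from the ENBD sources. It counts slave signal handler activations per server slot and lists the sector and length of each failed write.

```python
import re
from collections import Counter

# Both patterns are assumptions derived from the console excerpt in this
# message, not from the ENBD sources. \s+ tolerates the mail line wrapping.
SLAVE_SIGNAL = re.compile(
    r"slavesighandler server \((\d+)\) activates slave sighandler for\s+signal (\d+)"
)
WRITE_FAIL = re.compile(
    r"writegen write first server fails sector (\d+)\s+len\s+(\d+)"
)


def summarize(log_text):
    """Tally (slot, signal) pairs and list (sector, length) of failed writes."""
    signals = Counter(
        (int(m.group(1)), int(m.group(2)))
        for m in SLAVE_SIGNAL.finditer(log_text)
    )
    failed_writes = [
        (int(m.group(1)), int(m.group(2)))
        for m in WRITE_FAIL.finditer(log_text)
    ]
    return signals, failed_writes


# A few lines copied from the excerpt, including the wrapped ones.
sample = """\
nbd-server 4473: slavesighandler server (2) activates slave sighandler for
signal 11
nbd-client-server 1543: writegen write first server fails sector 4152 len
34816nbd-client-server 1543: writegen exits FAIL
nbd-server 4477: slavesighandler server (0) activates slave sighandler for
signal 11
nbd-client-server 1542: writegen write first server fails sector 2 len 1024
"""

signals, failed_writes = summarize(sample)
for (slot, signo), count in sorted(signals.items()):
    print(f"server slot {slot}: signal {signo} x{count}")
for sector, length in failed_writes:
    print(f"failed write: sector {sector}, len {length}")
```

Run over the full console output, this would show at a glance whether every server slot is dying on signal 11 and which writes failed first.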
PS Here are the options I have set in the makefile (mostly as it came):
###################### run options #########################
#
# server machine (runs nbd-server)
SERVER = beowulf2
# client machine (runs nbd-client and nbd.o)
CLIENT = beowulf1
# client whole device name
DEVICE = /dev/nda
# server builds these resources for export
EXPORT = /tmp/core0 /tmp/core1
# each of this many blocks
SRVSIZ = 4096
# blocksize on client and server 1024, 2048, 4096
BLKSIZ = 1024
# control port
PORT = 3033
# how many parallel channels (otherwise use server addr repeats)
OPTSOCK = -n 4
#OPTSOCK := $(SERVER) $(SERVER) $(SERVER) $(SERVER)
# session signature - default = random
#OPTSIG = -i "barbarbarbar"
# blocksize - default = 1024B or highest offer
OPTBLK = -b $(BLKSIZ)
# use raid linear or raid 0 or raid 1 on server - default = linear
OPTRAID = -0
# timeout for first negotiate - default = 60s
OPTTIME = -t 120
# readonly on server side
OPTRO = #-r
# server maintains write order until this many ms delay - default = 0ms
OPTORDER = -w 10000
# client uses checksumming protocol at startup
OPTMD5SUM = #-m
# journalling
OPTJRNL = #-j /tmp/nbd.log
# module readahead in blocks
RAHEAD = 20
# module-based request aggregation (default 0)
MERGE_REQS = 32
# module timeout seconds on daemon service
OPT_REQ_TIMEO= -p 5
# module force sync every n seconds
SYNC_INTVL = 0
# apply throttle at n KB per second (default 0 = don't throttle)
SPEED_LIM = 0
# make client asynchronous (default, no). client acks before net replies.
#OPTASYNC = -a
#
############################################################
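
As a quick sanity check on the sizes involved, the arithmetic below uses only the EXPORT, SRVSIZ and BLKSIZ values from the options above and the dd parameters from the test loop. It is a sketch of the raw numbers only; it assumes the two exported files add up to the device size and does not account for ext2 metadata overhead.

```python
# Values copied from the makefile run options in this message.
EXPORT = ["/tmp/core0", "/tmp/core1"]
SRVSIZ = 4096  # blocks per exported file
BLKSIZ = 1024  # bytes per block

# Assumption: the device size is the sum of the exported files.
device_bytes = len(EXPORT) * SRVSIZ * BLKSIZ

# dd if=/dev/zero of=/mnt/foo bs=1024 count=7000
dd_bytes = 1024 * 7000

print("device size:", device_bytes, "bytes =", device_bytes // (1024 * 1024), "MB")
print("dd writes:  ", dd_bytes, "bytes")
print("headroom:   ", device_bytes - dd_bytes, "bytes before filesystem overhead")
```

This matches the 8MB figure mentioned earlier and shows that each dd pass writes most of the raw device capacity.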