[ENBD] Weird enbd failure while updating md superblock
Peter T. Breuer
ptb at inv.it.uc3m.es
Tue Jun 13 10:02:49 MDT 2006
"Also sprach Bas van Schaik:"
> Lately, one of my RAID5-arrays on top of ENBD ran out of sync, causing
> it to run in degraded mode (2 of 3 disks available). No problem, of
> course, I just hotadded the third device (ndc) again and a resync was
> initiated. However, this resync failed while updating the md superblock
> on one of the up-and-running devices. From the clientside syslog:
>
> > Jun 13 04:00:12 localhost kernel: ENBD #1163[0]: enbd_rollback (0):
> > rollback req c3bc7b8c!
> > Jun 13 04:00:12 localhost kernel: ENBD #1163[1]: enbd_rollback (1):
> > rollback req c41f65ec!
> > Jun 13 04:00:12 localhost kernel: ENBD #1163[2]: enbd_rollback (2):
> > rollback req ced2404c!
> > Jun 13 04:00:12 localhost kernel: ENBD #1163[3]: enbd_rollback (3):
> > rollback req c419f04c!
Timeout (I'm a little surprised the device was in that mode, rather
than erroring on timeouts - it indicates you're not using the fr1 patch
part that alters md to talk to the enbd driver about what mode they
think each other should be in; it works for raid5 too).
> > Jun 13 04:00:31 localhost enbd-client: enbd-netserver 3127: # 161
> > recv_reply: stream read of 131072 bytes returned -110
> > Jun 13 04:00:31 localhost enbd-client: enbd-netserver 3127: # 514
> > net_read: net_read (1) exits FAIL
Well, that's
#define ETIMEDOUT 110 /* Connection timed out */
(asm/errno.h) from the socket. The client seems to be fine, but the
server or the network itself has failed.
> > Jun 13 04:00:31 localhost enbd-client: enbd-client 3127: #1654
> > handle_server_cmd_err: (1) cmd failed Input/output error on
> > 10.1.0.2:1102 so clear socket
Fair enough ... deliberate suicides at the client end. They'll try and
reconnect from scratch later on. Basically the socket has gone bad. I
have no idea why. ETIMEDOUT is not an error I know too much about. It
could be an internally generated error, mind you, indicating an alarm
timeout. I'll have to check ... yes. The stream layer reports that
error if there is no activity on the socket when there should be:
result = microselect(sock+1,NULL,wfds,efds,1000000*timeout);
if (result <= 0) {
result = -ETIMEDOUT;
and again for a socket send or receive that takes too long.
> > Jun 13 04:01:50 localhost enbd-server: enbd-server 4334: #1962
> > newproto: (ERROR) net errored (header 128 got 0) on req. Breaking off.
Well, the server sees a network error also, and it breaks off, which
might lead to what the client reports about not getting a reply back.
But this is 80s later! So the client saw a problem first. The server
sees the problem later.
> > Jun 13 04:01:51 localhost enbd-server: enbd-server 4335: #1962
> > newproto: (ERROR) net errored (header 128 got 0) on req. Breaking off.
> > Jun 13 04:01:51 localhost enbd-server: enbd-server 4334: #2236
> > mainloop: (ERROR) Server time out waiting 30s in mainloop. Breaking off
> > Jun 13 04:01:51 localhost enbd-server: enbd-server 4335: #2236
> > mainloop: (ERROR) Server time out waiting 30s in mainloop. Breaking off
> > Jun 13 04:01:51 localhost enbd-server: enbd-server 4333: #1962
> > newproto: (ERROR) net errored (header 128 got 0) on req. Breaking off.
> > Jun 13 04:01:51 localhost enbd-server: nbd/alarm 4336: # 150
> > unchain_current_ualarm: signalling ALRM for late (0.000000s) alarm
> > 0xbffff9e0 of 30.000000s
More interesting. The server definitely sees timouits for a service
cycle. It looks a bit as though the network has gone down between them.
> > Jun 13 04:01:51 localhost enbd-server: enbd-server 4336: #2236
> > mainloop: (ERROR) Server time out waiting 30s in mainloop. Breaking off
> > Jun 13 04:01:51 localhost enbd-server: enbd-server 4333: #2236
> > mainloop: (ERROR) Server time out waiting 30s in mainloop. Breaking off
> > Jun 13 04:01:59 localhost enbd-server: enbd-server 4334: #1190
> > slavesighandler: (WARNING) server (1) activates slave sighandler for
> > signal 15
> > Jun 13 04:02:00 localhost enbd-server: enbd-server: server (1)
> > sighandler terminates slave 4334 safely
Well, good. Now they'll try and restart later.
> > Jun 13 04:02:01 localhost enbd-server: enbd-server: server (-1)
> > session relaunches child after SIGCHLD
> > Jun 13 04:02:01 localhost enbd-server: enbd-server: server (-1) slave
> > pid 4336 is down, launching new
> > Jun 13 04:02:01 localhost enbd-server: enbd-server: server (3) set
> > default signal handlers
> > Jun 13 04:02:01 localhost enbd-server: enbd-server: server (-1)
> > launched slave pid 4364
> > Jun 13 04:03:59 localhost enbd-server: enbd-server 4361: #3131
> > negotiate: (ERROR) Server timed out in negotiate
>
> Time currently seems to differ about 1m10s.
Are you saying these are simultaneous, despite the apparent time
difference in the logs.
> Currently I'm running a new resync (hdc and ndb are in-sync, ndc
> out-of-sync), I'll report any problems back to the list. Does anyone
> have any idea what could be the cause of this? Currently, this setup is
> just a "backup of a backup", but it should be stable... By the way: I'm
> using the semi-latest 2.4.33pre, which is stable on another "cluster".
It looks as though one side or the other lost the network connection.
There's no actual "error" logged here, just reactions to network
outages. Without knowing the exact time difference (is it exactly 70s)
it's hard to say what happened first. I suppose the client saw the net
disappear first in the middle of some transaction and shut down. The
server also saw the same transaction fail and shut down on timeout.
If you were doing a rsync, I would suspect that it filled memory
somehow, leaving nothing for the tcp buffers, leading to timeouts (I
run rsync every night on many machines for many gigabytes of backups,
and it can get stuck on its own, so I do rsync on each subdir on its
own, and set a timeout - the "stuckness" is cured by a STOP CONT signal
pair, but I can't be bothered to do anything about programming it). If
you were doing a raid5 resync on its own, well, I dunno.
Peter
More information about the ENBD
mailing list