[Linux-HA] Re: deadtime, warntime, and drbd
Jason Joines
support at bus.okstate.edu
Mon Mar 7 13:03:47 MST 2005
Lars Marowsky-Bree wrote:
>It literally rebooted itself? Are you using the watchdog timer?
>
>Please provide the log messages of the node directly prior to the
>reboot.
>
>
>Sincerely,
> Lars Marowsky-Brée <lmb at suse.de>
>
########### nodea log start ###########
Feb 25 12:30:51 nodea kernel: drbd0: PingAck did not arrive in time.
Feb 25 12:30:51 nodea kernel: drbd0: drbd0_asender [7323]: cstate
Connected --> NetworkFailure
Feb 25 12:30:51 nodea kernel: drbd0: asender terminated
Feb 25 12:30:51 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
NetworkFailure --> BrokenPipe
Feb 25 12:30:51 nodea kernel: drbd0: short read expecting header on
sock: r=-512
Feb 25 12:30:51 nodea kernel: drbd0: worker terminated
Feb 25 12:30:51 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
BrokenPipe --> Unconnected
Feb 25 12:30:51 nodea kernel: drbd0: Connection lost.
Feb 25 12:30:51 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
Unconnected --> WFConnection
Feb 25 12:30:51 nodea kernel: drbd1: PingAck did not arrive in time.
Feb 25 12:30:51 nodea kernel: drbd1: drbd1_asender [7436]: cstate
Connected --> NetworkFailure
Feb 25 12:30:51 nodea kernel: drbd1: asender terminated
Feb 25 12:30:51 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
NetworkFailure --> BrokenPipe
Feb 25 12:30:51 nodea kernel: drbd1: short read receiving data block:
read 3900 expected 4096
Feb 25 12:30:51 nodea kernel: drbd1: error receiving Data, l: 4112!
Feb 25 12:30:51 nodea kernel: drbd1: worker terminated
Feb 25 12:30:51 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
BrokenPipe --> Unconnected
Feb 25 12:30:51 nodea kernel: drbd1: Connection lost.
Feb 25 12:30:51 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
Unconnected --> WFConnection
Feb 25 12:31:03 nodea heartbeat[878]: WARN: node nodeb: is dead
Feb 25 12:31:03 nodea heartbeat[878]: info: Dead node nodeb gave up
resources.
Feb 25 12:31:03 nodea heartbeat[878]: info: Link nodeb:eth1 dead.
Feb 25 12:31:04 nodea ipfail[887]: info: Link Status update: Link
nodeb/eth1 now has status dead
Feb 25 12:31:04 nodea ipfail[887]: debug: Found ping node 172.18.91.254!
Feb 25 12:31:04 nodea ipfail[887]: info: Asking other side for ping node
count.
Feb 25 12:31:04 nodea ipfail[887]: debug: Message [num_ping] sent.
Feb 25 12:31:04 nodea ipfail[887]: info: Checking remote count of ping
nodes.
Feb 25 12:38:00 nodea heartbeat[878]: WARN: Cluster node nodeb returning
after partition.
Feb 25 12:38:00 nodea heartbeat[878]: WARN: Deadtime value may be too small.
Feb 25 12:38:00 nodea heartbeat[878]: info: See documentation for
information on tuning deadtime.
Feb 25 12:38:00 nodea heartbeat[878]: WARN: 436 lost packet(s) for
[nodeb] [358470:358907]
Feb 25 12:38:00 nodea heartbeat[878]: ERROR: lost a lot of packets!
Feb 25 12:38:00 nodea heartbeat[878]: info: Link nodeb:eth1 up.
Feb 25 12:38:00 nodea heartbeat[878]: WARN: Late heartbeat: Node nodeb:
interval 433430 ms
Feb 25 12:38:00 nodea heartbeat[878]: info: Status update for node
nodeb: status active
Feb 25 12:38:00 nodea ipfail[887]: info: Link Status update: Link
nodeb/eth1 now has status up
Feb 25 12:38:00 nodea ipfail[887]: info: Status update: Node nodeb now
has status active
Feb 25 12:38:00 nodea heartbeat[16016]: debug: notify_world: setting
SIGCHLD Handler to SIG_DFL
Feb 25 12:38:00 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
Feb 25 12:38:01 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
WFConnection --> WFReportParams
Feb 25 12:38:01 nodea kernel: drbd0: Handshake successful: DRBD Network
Protocol version 74
Feb 25 12:38:01 nodea kernel: drbd0: incompatible states (both Primary!)
Feb 25 12:38:01 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
WFReportParams --> StandAlone
Feb 25 12:38:01 nodea kernel: drbd0: error receiving ReportParams, l: 72!
Feb 25 12:38:01 nodea kernel: drbd0: worker terminated
Feb 25 12:38:01 nodea kernel: drbd0: asender terminated
Feb 25 12:38:01 nodea kernel: drbd0: drbd0_receiver [7322]: cstate
StandAlone --> StandAlone
Feb 25 12:38:01 nodea kernel: drbd0: Connection lost.
Feb 25 12:38:01 nodea kernel: drbd0: receiver terminated
Feb 25 12:38:02 nodea heartbeat[878]: WARN: Shutdown delayed until
current resource activity finishes.
Feb 25 12:38:02 nodea ipfail[887]: debug: Other side is unstable.
Feb 25 12:38:04 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
WFConnection --> WFReportParams
Feb 25 12:38:04 nodea kernel: drbd1: Handshake successful: DRBD Network
Protocol version 74
Feb 25 12:38:04 nodea kernel: drbd1: Connection established.
Feb 25 12:38:04 nodea kernel: drbd1: I am(S):
1:00000002:00000001:0000002e:00000006:01
Feb 25 12:38:04 nodea kernel: drbd1: Peer(P):
1:00000002:00000001:0000002f:00000006:10
Feb 25 12:38:04 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
WFReportParams --> WFBitMapT
Feb 25 12:38:04 nodea kernel: drbd1: Secondary/Unknown --> Secondary/Primary
Feb 25 12:38:05 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
WFBitMapT --> SyncTarget
Feb 25 12:38:05 nodea kernel: drbd1: Resync started as SyncTarget (need
to sync 60488 KB [15122 bits set]).
Feb 25 12:38:07 nodea kernel: drbd1: Resync done (total 2 sec; paused 0
sec; 30244 K/sec)
Feb 25 12:38:07 nodea kernel: drbd1: drbd1_worker [12364]: cstate
SyncTarget --> Connected
Feb 25 12:38:32 nodea heartbeat[878]: WARN: node nodeb: is dead
Feb 25 12:38:32 nodea heartbeat[878]: info: Dead node nodeb gave up
resources.
Feb 25 12:38:32 nodea heartbeat[878]: info: Link nodeb:eth1 dead.
Feb 25 12:38:33 nodea kernel: drbd1: PingAck did not arrive in time.
Feb 25 12:38:33 nodea kernel: drbd1: drbd1_asender [16022]: cstate
Connected --> NetworkFailure
Feb 25 12:38:33 nodea kernel: drbd1: asender terminated
Feb 25 12:38:33 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
NetworkFailure --> BrokenPipe
Feb 25 12:38:33 nodea kernel: drbd1: short read expecting header on
sock: r=-512
Feb 25 12:38:33 nodea kernel: drbd1: worker terminated
Feb 25 12:38:33 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
BrokenPipe --> Unconnected
Feb 25 12:38:33 nodea kernel: drbd1: Connection lost.
Feb 25 12:38:33 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
Unconnected --> WFConnection
Feb 25 12:38:33 nodea ipfail[887]: info: Link Status update: Link
nodeb/eth1 now has status dead
Feb 25 12:38:33 nodea ipfail[887]: debug: Found ping node 172.18.91.254!
Feb 25 12:38:33 nodea ipfail[887]: info: Asking other side for ping node
count.
Feb 25 12:38:33 nodea ipfail[887]: debug: Message [num_ping] sent.
Feb 25 12:38:33 nodea ipfail[887]: info: Checking remote count of ping
nodes.
Feb 25 12:41:09 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
WFConnection --> WFReportParams
Feb 25 12:41:09 nodea kernel: drbd1: Handshake successful: DRBD Network
Protocol version 74
Feb 25 12:41:09 nodea kernel: drbd1: Connection established.
Feb 25 12:41:09 nodea kernel: drbd1: I am(S):
1:00000002:00000001:0000002f:00000006:01
Feb 25 12:41:09 nodea kernel: drbd1: Peer(S):
1:00000002:00000001:0000002f:00000006:11
Feb 25 12:41:09 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
WFReportParams --> WFBitMapT
Feb 25 12:41:09 nodea kernel: drbd1: Secondary/Unknown -->
Secondary/Secondary
Feb 25 12:41:09 nodea kernel: drbd1: drbd1_receiver [7435]: cstate
WFBitMapT --> SyncTarget
Feb 25 12:41:09 nodea kernel: drbd1: Resync started as SyncTarget (need
to sync 520192 KB [130048 bits set]).
Feb 25 12:41:23 nodea kernel: drbd1: Resync done (total 14 sec; paused 0
sec; 37156 K/sec)
Feb 25 12:41:23 nodea kernel: drbd1: drbd1_worker [16024]: cstate
SyncTarget --> Connected
Feb 25 12:42:15 nodea heartbeat[878]: info: Heartbeat restart on node nodeb
Feb 25 12:42:15 nodea heartbeat[878]: info: Link nodeb:eth1 up.
Feb 25 12:42:15 nodea heartbeat[878]: info: Status update for node
nodeb: status up
Feb 25 12:42:15 nodea ipfail[887]: info: Link Status update: Link
nodeb/eth1 now has status up
Feb 25 12:42:15 nodea ipfail[887]: info: Status update: Node nodeb now
has status up
Feb 25 12:42:15 nodea heartbeat[16027]: debug: notify_world: setting
SIGCHLD Handler to SIG_DFL
Feb 25 12:42:15 nodea heartbeat[878]: info: Status update for node
nodeb: status active
Feb 25 12:42:15 nodea heartbeat[878]: debug: StartNextRemoteRscReq():
child count 1
Feb 25 12:42:15 nodea ipfail[887]: info: Status update: Node nodeb now
has status active
Feb 25 12:42:15 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
Feb 25 12:42:15 nodea ipfail[887]: debug: Other side is unstable.
Feb 25 12:42:15 nodea heartbeat[878]: info: remote resource transition
completed.
Feb 25 12:42:15 nodea ipfail[887]: debug: Other side is now stable.
Feb 25 12:42:15 nodea heartbeat[16031]: debug: notify_world: setting
SIGCHLD Handler to SIG_DFL
Feb 25 12:42:15 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
Feb 25 12:42:15 nodea heartbeat[878]: info: Heartbeat shutdown in
progress. (878)
Feb 25 12:42:15 nodea heartbeat[16035]: info: Giving up all HA resources.
Feb 25 12:42:15 nodea ipfail[887]: debug: Got join message from another
ipfail client. (nodeb)
Feb 25 12:42:15 nodea ipfail[887]: debug: Found ping node 172.18.91.254!
Feb 25 12:42:15 nodea ipfail[887]: info: Asking other side for ping node
count.
Feb 25 12:42:15 nodea ipfail[887]: debug: Message [num_ping] sent.
Feb 25 12:42:15 nodea heartbeat: info: Releasing resource group: nodea
drbddisk::drbd0
Filesystem::/dev/drbd0::/home::xfs::rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota
172.18.88.102 smb
Feb 25 12:42:15 nodea heartbeat: info: Running /etc/init.d/smb stop
Feb 25 12:42:15 nodea heartbeat: debug: Starting /etc/init.d/smb stop
Feb 25 12:42:15 nodea heartbeat: debug: /etc/init.d/smb stop done. RC=0
Feb 25 12:42:15 nodea heartbeat: info: Running
/etc/ha.d/resource.d/IPaddr 172.18.88.102 stop
Feb 25 12:42:15 nodea heartbeat: debug: Starting
/etc/ha.d/resource.d/IPaddr 172.18.88.102 stop
Feb 25 12:42:16 nodea heartbeat: info: /sbin/route -n del -host
172.18.88.102
Feb 25 12:42:16 nodea heartbeat: info: /sbin/ifconfig eth0:0 down
Feb 25 12:42:16 nodea heartbeat: info: IP Address 172.18.88.102 released
Feb 25 12:42:16 nodea heartbeat: debug: /etc/ha.d/resource.d/IPaddr
172.18.88.102 stop done. RC=0
Feb 25 12:42:16 nodea heartbeat: info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
Feb 25 12:42:16 nodea heartbeat: debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
Feb 25 12:42:16 nodea heartbeat: ERROR: Couldn't unmount /home
Feb 25 12:42:16 nodea heartbeat: debug: /etc/ha.d/resource.d/Filesystem
/dev/drbd0 /home xfs
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop done. RC=1
Feb 25 12:42:16 nodea heartbeat: ERROR: Return code 1 from
/etc/ha.d/resource.d/Filesystem
Feb 25 12:42:16 nodea ipfail[887]: info: No giveup timer to abort.
Feb 25 12:42:17 nodea heartbeat: info: Retrying failed stop operation
[Filesystem::/dev/drbd0::/home::xfs::rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota]
Feb 25 12:42:17 nodea heartbeat: info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
Feb 25 12:42:17 nodea heartbeat: debug: Starting
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /home xfs
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop
Feb 25 12:42:17 nodea heartbeat: debug: /etc/ha.d/resource.d/Filesystem
/dev/drbd0 /home xfs
rw,suid,dev,exec,noauto,nouser,async,noatime,quota,grpquota stop done. RC=0
Feb 25 12:42:17 nodea heartbeat: info: Running
/etc/ha.d/resource.d/drbddisk drbd0 stop
Feb 25 12:42:17 nodea heartbeat: debug: Starting
/etc/ha.d/resource.d/drbddisk drbd0 stop
Feb 25 12:42:17 nodea kernel: drbd0: Primary/Unknown --> Secondary/Unknown
Feb 25 12:42:18 nodea heartbeat: debug: /etc/ha.d/resource.d/drbddisk
drbd0 stop done. RC=0
Feb 25 12:42:18 nodea heartbeat: info: Releasing resource group: nodeb
drbddisk::drbd1
Feb 25 12:42:18 nodea heartbeat: info: Running
/etc/ha.d/resource.d/drbddisk drbd1 stop
Feb 25 12:42:18 nodea heartbeat: debug: Starting
/etc/ha.d/resource.d/drbddisk drbd1 stop
Feb 25 12:42:18 nodea heartbeat: debug: /etc/ha.d/resource.d/drbddisk
drbd1 stop done. RC=0
Feb 25 12:42:18 nodea heartbeat[16035]: info: killing
/usr/lib/heartbeat/ipfail process group 887 with signal 15
Feb 25 12:42:18 nodea heartbeat[16035]: info: killing heartbeat resource
child process group 16031 with signal 9
Feb 25 12:42:18 nodea heartbeat[16035]: info: All HA resources relinquished.
Feb 25 12:42:18 nodea heartbeat[878]: info: killing
/usr/lib/heartbeat/ipfail process group 887 with signal 15
Feb 25 12:42:18 nodea heartbeat[878]: info: EOF from client pid 887
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBFIFO process 880
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBWRITE process 881
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBREAD process 882
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBWRITE process 883
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBREAD process 884
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBWRITE process 885
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: killing HBREAD process 886
with signal 15
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 880 exited. 7
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 881 exited. 6
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 882 exited. 5
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 883 exited. 4
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 884 exited. 3
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 885 exited. 2
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Core process 886 exited. 1
remaining
Feb 25 12:42:19 nodea heartbeat[878]: info: Heartbeat shutdown complete.
Feb 25 12:42:19 nodea heartbeat[878]: info: Heartbeat restart triggered.
Feb 25 12:42:19 nodea heartbeat[878]: info: Restarting heartbeat.
Feb 25 12:42:19 nodea heartbeat[878]: info: Performing heartbeat restart
exec.
Feb 25 12:42:20 nodea kernel: drbd1: Secondary/Secondary -->
Secondary/Primary
Feb 25 12:42:36 nodea heartbeat[878]: info: **************************
Feb 25 12:42:36 nodea heartbeat[878]: info: Configuration validated.
Starting heartbeat 1.2.3
Feb 25 12:42:36 nodea heartbeat[16280]: info: heartbeat: version 1.2.3
Feb 25 12:42:36 nodea heartbeat[16280]: info: Heartbeat generation: 22
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound send socket
to device: eth0
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound receive
socket to device: eth0
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: started on port 694
interface eth0 to 172.18.88.93
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound send socket
to device: eth1
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: bound receive
socket to device: eth1
Feb 25 12:42:36 nodea heartbeat[16280]: info: ucast: started on port 694
interface eth1 to 172.17.1.1
Feb 25 12:42:36 nodea heartbeat[16280]: info: ping heartbeat started.
Feb 25 12:42:36 nodea heartbeat[16280]: info: pid 16280 locked in memory.
Feb 25 12:42:36 nodea heartbeat[16280]: info: Local status now set to: 'up'
Feb 25 12:42:37 nodea heartbeat[16282]: info: pid 16282 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16286]: info: pid 16286 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16283]: info: pid 16283 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16280]: info: Link nodeb:eth1 up.
Feb 25 12:42:37 nodea heartbeat[16280]: info: Status update for node
nodeb: status active
Feb 25 12:42:37 nodea heartbeat[16289]: debug: notify_world: setting
SIGCHLD Handler to SIG_DFL
Feb 25 12:42:37 nodea heartbeat[16284]: info: pid 16284 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16287]: info: pid 16287 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16288]: info: pid 16288 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16280]: info: Link
172.18.91.254:172.18.91.254 up.
Feb 25 12:42:37 nodea heartbeat[16280]: info: Status update for node
172.18.91.254: status ping
Feb 25 12:42:37 nodea heartbeat[16280]: info: Local status now set to:
'active'
Feb 25 12:42:37 nodea heartbeat[16280]: info: Starting child client
"/usr/lib/heartbeat/ipfail" (90,90)
Feb 25 12:42:37 nodea heartbeat: info: Running /etc/ha.d/rc.d/status status
Feb 25 12:42:37 nodea heartbeat[16285]: info: pid 16285 locked in memory.
Feb 25 12:42:37 nodea heartbeat[16290]: info: Starting
"/usr/lib/heartbeat/ipfail" as uid 90 gid 90 (pid 16290)
Feb 25 12:42:37 nodea heartbeat[16280]: info: remote resource transition
completed.
Feb 25 12:42:37 nodea heartbeat[16280]: info: remote resource transition
completed.
Feb 25 12:42:37 nodea heartbeat[16280]: info: Local Resource acquisition
completed. (none)
Feb 25 12:42:37 nodea heartbeat[16280]: info: Initial resource
acquisition complete (T_RESOURCES(them))
Feb 25 12:42:37 nodea ipfail[16290]: debug: PID=16290
Feb 25 12:42:37 nodea ipfail[16290]: debug: Signing in with heartbeat
Feb 25 12:42:37 nodea ipfail[16290]: debug: [We are nodea]
Feb 25 12:42:37 nodea ipfail[16290]: debug: auto_failback -> 0 (off)
Feb 25 12:42:37 nodea ipfail[16290]: debug: Setting message filter mode
Feb 25 12:42:37 nodea ipfail[16290]: debug: Starting node walk
Feb 25 12:42:37 nodea ipfail[16290]: debug: Cluster node: 172.18.91.254:
status: ping
Feb 25 12:42:37 nodea ipfail[16290]: debug: Cluster node: nodeb: status:
active
Feb 25 12:42:37 nodea ipfail[16290]: debug: [They are nodeb]
Feb 25 12:42:37 nodea ipfail[16290]: debug: Cluster node: nodea: status:
active
Feb 25 12:42:37 nodea ipfail[16290]: debug: Setting message signal
Feb 25 12:42:37 nodea ipfail[16290]: debug: Waiting for messages...
Feb 25 12:42:38 nodea ipfail[16290]: debug: Got asked for num_ping.
Feb 25 12:42:38 nodea ipfail[16290]: debug: Found ping node 172.18.91.254!
Feb 25 12:42:38 nodea ipfail[16290]: info: Ping node count is balanced.
Feb 25 12:42:38 nodea ipfail[16290]: debug: Abort message sent.
########### nodea log stop ###########
More information about the Linux-HA
mailing list