[Linux-HA] 2.0.1 says ping node dead, heartbeat exits

Dave Dykstra dwdha at drdykstra.us
Thu Sep 22 07:31:16 MDT 2005


On Wed, Sep 21, 2005 at 03:56:24PM -0600, Alan Robertson wrote:
> Subject: Re: [Linux-HA] ipfail says ping node is dead, but it isn't
> Juan Carlos Castro y Castro wrote:
> >Hello, I'm new to heartbeat and am trying to set up a simple two node V1
> >style cluster, as per http://www.linux-ha.org/GettingStarted -- I'm
> >running Fedora Core 3 but compiled everything from source (2.0.1 stable)
> >using --prefix=/usr/local/heartbeat --localstatedir=/var/heartbeat --
> >Below you'll see my config files and the logs. I apologize in advance if
> >I failed to see an obvious solution lying on a dimly-lit corner of some
> >faraway FAQ page. I searched the list archives too. Please don't hit me,
> >OK? ;)
> 
> No, this sounds like a legit problem.  I've just never seen it before...
> We made some changes in the ping code, maybe we broke something :-(.
> I'll go reread the code.
> 
> Sep 21 14:32:10 ws06 ipfail: [5476]: info: Ping node count is balanced.
> 
> This message indicates that the problem is the same viewed from both 
> machines.
> 
> >What happens is when I pull the eth0 cable from the master node (while
> >keeping the heartbeat interface, through crossover cable, in place), the
> >IP never gets relinquished, services are never stopped on one side, nor
> >started on the other. Can it have something to do with the "WARN: node
> >192.168.0.200: is dead" message?
> >
> >ws06.instant.intranet has IP 192.168.0.106 on eth0 and 10.10.10.1 on eth1.
> >t8b.instant.intranet has IP 192.168.0.198 on eth0 and 10.10.10.2 on eth1.
> >Cluster address is 192.168.0.171.
> >
> >I created the hacluster user with uid 171 and gid 100. I chowned the
> >/var/heartbeat/lib/heartbeat/cores/hacluster directory to hacluster to
> >get rid of a "permission denied" message in a previous attempt.
> >
> >Here's my config files (identical on the two sides):
> 
> 
> Those all looked good...
> 
> I suspected the ping code, but I think it looks right, I'm afraid... 
> The changes were simpler than I remembered them being...
> 
> Can you reproduce this again, but attach a tcpdump of some of the 
> returned ping packets along with the output of ps axww | grep heartbeat 
> ?  Please dump all the data in the ping packets.



I am having what may be a related problem.   My first try at 2.0.1,
recompiled from source from Horms' debian package, started on my standby
node in my test cluster (2.0.0 still on the active one), said the ping
node could not be found and soon after heartbeat just dies.  It works fine
with 2.0.0 on both nodes.  I enabled debug level of 1, below's the logs.

- Dave


Sep 21 17:09:13 swx2 heartbeat: [29985]: info: respawn directive: hacluster /usr/lib/heartbeat/ipfail 
Sep 21 17:09:13 swx2 heartbeat: [29985]: info: AUTH: i=1: key = 0x814a610, auth=0xb7dac94c, authname=sha1
Sep 21 17:09:13 swx2 heartbeat: [29985]: WARN: Core dumps could be lost if multiple dumps occur
Sep 21 17:09:13 swx2 heartbeat: [29985]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
Sep 21 17:09:13 swx2 heartbeat: [29985]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Sep 21 17:09:13 swx2 heartbeat: [29985]: info: **************************
Sep 21 17:09:13 swx2 heartbeat: [29985]: info: Configuration validated. Starting heartbeat 2.0.1
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: heartbeat: version 2.0.1
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: Heartbeat generation: 299
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: Removing /var/run/heartbeat/rsctmp failed, recreating.
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: glib: ucast: bound send socket to device: eth0
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: glib: ucast: bound receive socket to device: eth0
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: glib: ucast: started on port 694 interface eth0 to 172.18.30.23
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: glib: ucast: bound send socket to device: eth0
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: glib: ucast: bound receive socket to device: eth0
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: glib: ucast: started on port 694 interface eth0 to 172.18.30.24
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: glib: ping heartbeat started.
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: pid 29986 locked in memory.
Sep 21 17:09:13 swx2 heartbeat: [29986]: info: Local status now set to: 'up'
Sep 21 17:09:14 swx2 heartbeat: [29993]: info: pid 29993 locked in memory.
Sep 21 17:09:14 swx2 heartbeat: [29999]: info: pid 29999 locked in memory.
Sep 21 17:09:14 swx2 heartbeat: [29994]: info: pid 29994 locked in memory.
Sep 21 17:09:14 swx2 heartbeat: [29998]: info: pid 29998 locked in memory.
Sep 21 17:09:14 swx2 heartbeat: [29995]: info: pid 29995 locked in memory.
Sep 21 17:09:14 swx2 heartbeat: [29996]: info: pid 29996 locked in memory.
Sep 21 17:09:14 swx2 heartbeat: [29997]: info: pid 29997 locked in memory.
Sep 21 17:09:14 swx2 heartbeat: [29986]: info: Link swx1:eth0 up.
Sep 21 17:09:14 swx2 heartbeat: [29986]: info: Status update for node swx1: status active
Sep 21 17:09:14 swx2 harc[30000]: info: Running /etc/ha.d/rc.d/status status
Sep 21 17:09:14 swx2 heartbeat: [29986]: info: Exiting status process 30000 returned rc 0.
Sep 21 17:09:33 swx2 heartbeat: [29986]: WARN: node 172.18.1.1: is dead
Sep 21 17:09:33 swx2 heartbeat: [29986]: info: Local status now set to: 'active'
Sep 21 17:09:33 swx2 harc[30012]: info: Running /etc/ha.d/rc.d/status status


heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(keepalive,1)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(warntime,2)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(deadtime,10)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(node,swx1)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(node,swx2)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(ping,172.18.1.1)
heartbeat[29985]: 2005/09/21_17:09:13 debug: uid=hacluster, gid=<null>
heartbeat[29985]: 2005/09/21_17:09:13 info: respawn directive: hacluster /usr/lib/heartbeat/ipfail 
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(auto_failback,off)
heartbeat[29985]: 2005/09/21_17:09:13 debug: uid=hacluster, gid=<null>
heartbeat[29985]: 2005/09/21_17:09:13 debug: uid=<null>, gid=haclient
heartbeat[29985]: 2005/09/21_17:09:13 debug: uid=root, gid=<null>
heartbeat[29985]: 2005/09/21_17:09:13 debug: uid=<null>, gid=haclient
heartbeat[29985]: 2005/09/21_17:09:13 debug: Beginning authentication parsing
heartbeat[29985]: 2005/09/21_17:09:13 debug: 16 max authentication methods
heartbeat[29985]: 2005/09/21_17:09:13 debug: Keyfile opened
heartbeat[29985]: 2005/09/21_17:09:13 debug: Keyfile perms OK
heartbeat[29985]: 2005/09/21_17:09:13 debug: 16 max authentication methods
heartbeat[29985]: 2005/09/21_17:09:13 debug: Found authentication method [sha1]
heartbeat[29985]: 2005/09/21_17:09:13 info: AUTH: i=1: key = 0x814a610, auth=0xb7dac94c, authname=sha1
heartbeat[29985]: 2005/09/21_17:09:13 debug: Outbound signing method is 1
heartbeat[29985]: 2005/09/21_17:09:13 debug: Authentication parsing complete [1]
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(hopfudge,1)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(baud,19200)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(hbgenmethod,file)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(realtime,true)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(normalpoll,true)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(msgfmt,classic)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(conn_logd_time,60)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(log_badpack,true)
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(coredumps,true)
heartbeat[29985]: 2005/09/21_17:09:13 WARN: Core dumps could be lost if multiple dumps occur
heartbeat[29985]: 2005/09/21_17:09:13 WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
heartbeat[29985]: 2005/09/21_17:09:13 debug: add_option(crm,false)
heartbeat[29985]: 2005/09/21_17:09:13 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[29985]: 2005/09/21_17:09:13 info: **************************
heartbeat[29985]: 2005/09/21_17:09:13 info: Configuration validated. Starting heartbeat 2.0.1
heartbeat[29985]: 2005/09/21_17:09:13 debug: HA configuration OK.  Heartbeat starting.
heartbeat[29986]: 2005/09/21_17:09:13 info: heartbeat: version 2.0.1
heartbeat[29986]: 2005/09/21_17:09:13 info: Heartbeat generation: 299
heartbeat[29986]: 2005/09/21_17:09:13 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
heartbeat[29986]: 2005/09/21_17:09:13 debug: opening ucast eth0 (UDP/IP unicast)
heartbeat[29986]: 2005/09/21_17:09:13 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
heartbeat[29986]: 2005/09/21_17:09:13 info: glib: ucast: bound send socket to device: eth0
heartbeat[29986]: 2005/09/21_17:09:13 info: glib: ucast: bound receive socket to device: eth0
heartbeat[29986]: 2005/09/21_17:09:13 info: glib: ucast: started on port 694 interface eth0 to 172.18.30.23
heartbeat[29986]: 2005/09/21_17:09:13 debug: ucast channel eth0 now open...
heartbeat[29986]: 2005/09/21_17:09:13 debug: opening ucast eth0 (UDP/IP unicast)
heartbeat[29986]: 2005/09/21_17:09:13 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
heartbeat[29986]: 2005/09/21_17:09:13 info: glib: ucast: bound send socket to device: eth0
heartbeat[29986]: 2005/09/21_17:09:13 info: glib: ucast: bound receive socket to device: eth0
heartbeat[29986]: 2005/09/21_17:09:13 info: glib: ucast: started on port 694 interface eth0 to 172.18.30.24
heartbeat[29986]: 2005/09/21_17:09:13 debug: ucast channel eth0 now open...
heartbeat[29986]: 2005/09/21_17:09:13 debug: opening ping 172.18.1.1 (ping membership)
heartbeat[29986]: 2005/09/21_17:09:13 info: glib: ping heartbeat started.
heartbeat[29986]: 2005/09/21_17:09:13 debug: ping channel 172.18.1.1 now open...
heartbeat[29986]: 2005/09/21_17:09:13 debug: FIFO process pid: 29993
heartbeat[29986]: 2005/09/21_17:09:13 debug: write process pid: 29994
heartbeat[29986]: 2005/09/21_17:09:13 debug: read child process pid: 29995
heartbeat[29986]: 2005/09/21_17:09:13 debug: write process pid: 29996
heartbeat[29986]: 2005/09/21_17:09:13 debug: read child process pid: 29997
heartbeat[29986]: 2005/09/21_17:09:13 debug: write process pid: 29998
heartbeat[29986]: 2005/09/21_17:09:13 debug: read child process pid: 29999
heartbeat[29986]: 2005/09/21_17:09:13 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[29986]: 2005/09/21_17:09:13 debug: Limiting CPU: 30 CPU seconds every 60000 milliseconds
heartbeat[29986]: 2005/09/21_17:09:13 info: pid 29986 locked in memory.
heartbeat[29986]: 2005/09/21_17:09:13 debug: Waiting for child processes to start
heartbeat[29986]: 2005/09/21_17:09:13 info: Local status now set to: 'up'
heartbeat[29986]: 2005/09/21_17:09:13 debug: All your child process are belong to us
heartbeat[29986]: 2005/09/21_17:09:13 debug: Starting local status message @ 1000 ms intervals
heartbeat[29993]: 2005/09/21_17:09:14 info: pid 29993 locked in memory.
heartbeat[29993]: 2005/09/21_17:09:14 debug: Limiting CPU: 6 CPU seconds every 60000 milliseconds
heartbeat[29994]: 2005/09/21_17:09:14 info: pid 29994 locked in memory.
heartbeat[29999]: 2005/09/21_17:09:14 info: pid 29999 locked in memory.
heartbeat[29994]: 2005/09/21_17:09:14 debug: Limiting CPU: 24 CPU seconds every 60000 milliseconds
heartbeat[29999]: 2005/09/21_17:09:14 debug: Limiting CPU: 6 CPU seconds every 60000 milliseconds
heartbeat[29998]: 2005/09/21_17:09:14 info: pid 29998 locked in memory.
heartbeat[29998]: 2005/09/21_17:09:14 debug: Limiting CPU: 24 CPU seconds every 60000 milliseconds
heartbeat[29995]: 2005/09/21_17:09:14 info: pid 29995 locked in memory.
heartbeat[29995]: 2005/09/21_17:09:14 debug: Limiting CPU: 6 CPU seconds every 60000 milliseconds
heartbeat[29996]: 2005/09/21_17:09:14 info: pid 29996 locked in memory.
heartbeat[29996]: 2005/09/21_17:09:14 debug: Limiting CPU: 24 CPU seconds every 60000 milliseconds
heartbeat[29997]: 2005/09/21_17:09:14 info: pid 29997 locked in memory.
heartbeat[29997]: 2005/09/21_17:09:14 debug: Limiting CPU: 6 CPU seconds every 60000 milliseconds
heartbeat[29986]: 2005/09/21_17:09:14 info: Link swx1:eth0 up.
heartbeat[29986]: 2005/09/21_17:09:14 debug: CreateInitialFilter: ip-request-resp
heartbeat[29986]: 2005/09/21_17:09:14 debug: CreateInitialFilter: ask_resources
heartbeat[29986]: 2005/09/21_17:09:14 debug: CreateInitialFilter: status
heartbeat[29986]: 2005/09/21_17:09:14 debug: CreateInitialFilter: hb_takeover
heartbeat[29986]: 2005/09/21_17:09:14 debug: CreateInitialFilter: 1
heartbeat[29986]: 2005/09/21_17:09:14 debug: CreateInitialFilter: status.dpkg-old
heartbeat[29986]: 2005/09/21_17:09:14 debug: CreateInitialFilter: ip-request
heartbeat[29986]: 2005/09/21_17:09:14 info: Status update for node swx1: status active
heartbeat[29986]: 2005/09/21_17:09:14 debug: Status seqno: 4057293 msgtime: 1127340553
heartbeat[29986]: 2005/09/21_17:09:14 debug: StartNextRemoteRscReq() - calling hook
heartbeat[29986]: 2005/09/21_17:09:14 debug: notify_world: invoking harc: OLD status: up
heartbeat[29986]: 2005/09/21_17:09:14 debug: Process [status] started pid 30000
heartbeat[29986]: 2005/09/21_17:09:14 debug: Starting notify process [status]
heartbeat[30000]: 2005/09/21_17:09:14 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
heartbeat[30000]: 2005/09/21_17:09:14 debug: notify_world: Running harc status
harc[30000]:	2005/09/21_17:09:14 info: Running /etc/ha.d/rc.d/status status
heartbeat[29986]: 2005/09/21_17:09:14 info: Exiting status process 30000 returned rc 0.
heartbeat[29986]: 2005/09/21_17:09:14 debug: RscMgmtProc 'status' exited code 0
heartbeat[29986]: 2005/09/21_17:09:33 WARN: node 172.18.1.1: is dead
heartbeat[29986]: 2005/09/21_17:09:33 debug: StartNextRemoteRscReq() - calling hook
heartbeat[29986]: 2005/09/21_17:09:33 debug: notify_world: invoking harc: OLD status: up
heartbeat[29986]: 2005/09/21_17:09:33 debug: Process [status] started pid 30012
heartbeat[29986]: 2005/09/21_17:09:33 debug: Starting notify process [status]
heartbeat[29986]: 2005/09/21_17:09:33 debug: Comm_now_up(): updating status to active
heartbeat[29986]: 2005/09/21_17:09:33 info: Local status now set to: 'active'
heartbeat[29986]: 2005/09/21_17:09:33 debug: Sending local starting msg: resourcestate = 0
heartbeat[29986]: 2005/09/21_17:09:33 debug: hb_rsc_isstable: ResourceMgmt_child_count: 1, other_is_stable: 0, takeover_in_progress: 0, going_standby: 0, standby running(ms): 0, resourcestate: 0
heartbeat[29986]: 2005/09/21_17:09:33 ERROR: socket_wait_conn_new: trying to create in /var/run/heartbeat/register bind:: No such file or directory
heartbeat[29986]: 2005/09/21_17:09:33 debug: Cannot open registration socket at /var/run/heartbeat/register
heartbeat[29986]: 2005/09/21_17:09:33 debug: Exiting from pid 29986 [rc=4]
heartbeat[30012]: 2005/09/21_17:09:33 debug: notify_world: setting SIGCHLD Handler to SIG_DFL
heartbeat[30012]: 2005/09/21_17:09:33 debug: notify_world: Running harc status
harc[30012]:	2005/09/21_17:09:33 info: Running /etc/ha.d/rc.d/status status
heartbeat[29993]: 2005/09/21_17:09:34 CRIT: Emergency Shutdown: Master Control process died.
heartbeat[29993]: 2005/09/21_17:09:34 CRIT: Killing pid 29986 with SIGTERM
heartbeat[29993]: 2005/09/21_17:09:34 CRIT: Killing pid 29994 with SIGTERM
heartbeat[29993]: 2005/09/21_17:09:34 CRIT: Killing pid 29995 with SIGTERM
heartbeat[29993]: 2005/09/21_17:09:34 CRIT: Killing pid 29996 with SIGTERM
heartbeat[29993]: 2005/09/21_17:09:34 CRIT: Killing pid 29997 with SIGTERM
heartbeat[29993]: 2005/09/21_17:09:34 CRIT: Killing pid 29998 with SIGTERM
heartbeat[29997]: 2005/09/21_17:09:34 debug: Process 29997 processing SIGTERM
heartbeat[29997]: 2005/09/21_17:09:34 debug: Exiting from pid 29997 [rc=15]
heartbeat[29993]: 2005/09/21_17:09:34 CRIT: Killing pid 29999 with SIGTERM
heartbeat[29993]: 2005/09/21_17:09:34 CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
heartbeat[29994]: 2005/09/21_17:09:34 debug: Process 29994 processing SIGTERM
heartbeat[29994]: 2005/09/21_17:09:34 debug: Exiting from pid 29994 [rc=15]
heartbeat[29995]: 2005/09/21_17:09:34 debug: Process 29995 processing SIGTERM
heartbeat[29995]: 2005/09/21_17:09:34 debug: Exiting from pid 29995 [rc=15]
heartbeat[29999]: 2005/09/21_17:09:34 debug: Process 29999 processing SIGTERM
heartbeat[29999]: 2005/09/21_17:09:34 debug: Exiting from pid 29999 [rc=15]
heartbeat[29996]: 2005/09/21_17:09:34 debug: Process 29996 processing SIGTERM
heartbeat[29996]: 2005/09/21_17:09:34 debug: Exiting from pid 29996 [rc=15]
heartbeat[29993]: 2005/09/21_17:09:34 debug: Process 29993 processing SIGTERM
heartbeat[29993]: 2005/09/21_17:09:34 debug: Exiting from pid 29993 [rc=15]
heartbeat[29998]: 2005/09/21_17:09:34 debug: Process 29998 processing SIGTERM
heartbeat[29998]: 2005/09/21_17:09:34 debug: Exiting from pid 29998 [rc=15]


More information about the Linux-HA mailing list