[Linux-HA] Heartbeat 1.2.0 random segfaults on node restart

Umberto Nicoletti unicoletti at prometeo.it
Thu Mar 25 01:23:55 MST 2004


Hi Alan,
thanks for the prompt reply and sorry for cross-posting with Mike, but I
thought that maybe the two issues were related, since they both dealt
with tty errors and I hoped that it was only a problem with version
1.2.0, something that can be quickly fixed with a downgrade.

comments inline.

On Wed, 2004-03-24 at 17:36, Alan Robertson wrote:
> Umberto Nicoletti wrote:
> > Hi all,
> > thanks for the fine piece software.
> > 
> > I have setup two servers (fujitsu-siemens PRIMERGY, double XEON 1GB RAM,
> > two SCSI RAID 1 disk drives) with heartbeat to provide available file
> > serving and authentication via samba to a windows network.
> > They run SuSE 8.0 with the latest online update. I have upgraded
> > heartbeat shipped with suse (ancient 0.4.9) with 1.2.0 from source rpm
> > found on linux-ha site.
> > The servers have been running heartbeat 1.2.0 for only a week and now I
> > was going to put them in production, but this morning when I logged in I
> > found that a failover occurred overnight.
> > 
> > Yesterday evening I started rsync to mirror 50GB of data from a third
> > server to the master . The rsync ended at about 00.20 (4 hours on 100MB
> > network for those interested) and at about 4 am it happened that one of
> > the ping_group failed, so the master failed and the slave became the new
> > active node. This is fine, but when the master restarted the slave
> > crashed, giving the message reported below. Please note that this
> > happened also the day before (in fact the log refers to that occasion)
> > but I did not notice.
> > 
> > This morning I tried to restart heartbeat on the slave and it happened
> > again. So I STFW and set ulimit -c unlimited on heartbeat rc.d scripts.
> > 
> > In accord with Murphy's law the error did not show up again.
> > So I commented the ulimit -c unlimited line, rebooted and everything now
> > runs smooth, no more segfaults!
> > 
> > Now I am running a big/heavy rsync against the same server as this
> > night, trying to reproduce a high load situation (I have already
> > transferred 15GB and everythin is ok).
> > 
> > I attach my ha.cf file.
> > 
> > Any help would be very much appreciated. I'll post my findings if I have
> > any to this list.
> > Thanks for your attention,
> > Umberto
> > 
> > /log snippet
> > 
> > heartbeat: 2004/03/23_16:39:18 info: remote resource transition
> > completed.
> > heartbeat: 2004/03/23_16:39:18 info: remote resource transition
> > completed.
> > heartbeat: 2004/03/23_16:39:18 info: Local Resource acquisition
> > completed. (none)
> > heartbeat: 2004/03/23_16:39:18 info: Initial resource acquisition
> > complete (T_RESOURCES(them))
> > heartbeat: 2004/03/24_02:46:47 ERROR: EOF in ttygets [/dev/ttyS0]:
> > Success [0]
> > heartbeat: 2004/03/24_02:47:05 WARN: ha_msg_add_nv: line doesn't contain
> > '='
> > heartbeat: 2004/03/24_02:47:05 info:
> > heartbeat: 2004/03/24_02:47:05 ERROR: Exiting HBREAD process 3425 killed
> > by signal 11.
> > heartbeat: 2004/03/24_02:47:05 ERROR: Core heartbeat process died!
> > Restarting.
> > heartbeat: 2004/03/24_02:47:05 info: Heartbeat shutdown in progress.
> > (3421)
> > heartbeat: 2004/03/24_02:47:05 info: Giving up all HA resources.
> 
> This is certainly nasty...
> 
> But, it is quite different from Mike Dahmus' problem.  You are having 
> troubles with read processes, and he's having troubles with write processes...
> 

My apologies again....

> Let me make sure I understand...
> 
>  > The rsync ended at about 00.20 (4 hours on 100MB
>  > network for those interested) and at about 4 am it happened that one of
>  > the ping_group failed, so the master failed and the slave became the new
>  > active node. This is fine, but when the master restarted the slave
>  > crashed, giving the message reported below.
> 
> About 4 AM, the master lost its connectivity to the network, but the slave 
> retained it, so ipfail decided to move the data from one side to the other. 
>   Then you say something about "the master restarted".  What do you mean 
> the master restarted?
> 

Here's the full recap with logs...

At 2:47am robin (currently slave) restarted because of segfault as
showed in this log:

heartbeat: 2004/03/24_02:46:47 ERROR: EOF in ttygets [/dev/ttyS0]:
Success [0]
heartbeat: 2004/03/24_02:47:05 WARN: ha_msg_add_nv: line doesn't contain
'='
heartbeat: 2004/03/24_02:47:05 info:
heartbeat: 2004/03/24_02:47:05 ERROR: Exi
ting HBREAD process 3425 killed by signal 11.
heartbeat: 2004/03/24_02:47:05 ERROR: Core heartbeat process died!
Restarting.
heartbeat: 2004/03/24_02:47:05 info: Heartbeat shutdown in progress.
(3421)
heartbeat: 2004/03/24_02:47:05 info: Giving up all HA resources

The log on batman shows this:

heartbeat: 2004/03/24_02:47:00 info: Received shutdown notice from
'robin'.
heartbeat: 2004/03/24_02:47:00 info: Resources being acquired from
robin.
heartbeat: 2004/03/24_02:47:00 info: Running /etc/ha.d/rc.d/status
status
heartbeat: 2004/03/24_02:47:00 info: /usr/lib/heartbeat/mach_down:
nice_failback: foreign resources acquired
heartbeat: 2004/03/24_02:47:00 info: mach_down takeover complete.
heartbeat: 2004/03/24_02:47:00 info: mach_down takeover complete for
node robin.
heartbeat: 2004/03/24_02:47:00 info: Local Resource acquisition
completed.
heartbeat: 2004/03/24_02:47:11 WARN: node robin: is dead
heartbeat: 2004/03/24_02:47:11 info: Dead node robin held no resources.
heartbeat: 2004/03/24_02:47:11 info: Resources being acquired from
robin.
heartbeat: 2004/03/24_02:47:11 info: Running /etc/ha.d/rc.d/status
status
heartbeat: 2004/03/24_02:47:11 info: /usr/lib/heartbeat/mach_down:
nice_failback: foreign resources acquired
heartbeat: 2004/03/24_02:47:11 info: mach_down takeover complete.
heartbeat: 2004/03/24_02:47:11 info: mach_down takeover complete for
node robin.
heartbeat: 2004/03/24_02:47:11 info: Local Resource acquisition
completed.
heartbeat: 2004/03/24_02:47:14 info: Heartbeat restart on node robin
heartbeat: 2004/03/24_02:47:14 info: Status update for node robin:
status up
heartbeat: 2004/03/24_02:47:14 info: Running /etc/ha.d/rc.d/status
status
heartbeat: 2004/03/24_02:47:14 info: Status update for node robin:
status active
heartbeat: 2004/03/24_02:47:14 info: remote resource transition
completed.
heartbeat: 2004/03/24_02:47:14 info: Running /etc/ha.d/rc.d/status
status

At about 4 the batman loses connectivity with switches (ping group) and
fails over to robin:

heartbeat: 2004/03/24_04:17:16 WARN: node switches: is dead
heartbeat: 2004/03/24_04:17:16 info: Link switches:switches dead.
heartbeat: 2004/03/24_04:17:16 info: Running /etc/ha.d/rc.d/status
status
heartbeat: 2004/03/24_04:17:20 info: batman wants to go standby [all]
heartbeat: 2004/03/24_04:17:20 info: standby: robin can take our all
resources
heartbeat: 2004/03/24_04:17:20 info: give up all HA resources (standby).
heartbeat: 2004/03/24_04:17:20 info: Releasing resource group: batman
192.168.10.20 smb ldap::restart rsyncd
heartbeat: 2004/03/24_04:17:20 info: Running
/etc/ha.d/resource.d/rsyncd  stop
heartbeat: 2004/03/24_04:17:20 info: Running /etc/ha.d/resource.d/ldap
restart stop
heartbeat: 2004/03/24_04:17:23 info: Running /etc/ha.d/resource.d/smb 
stop
heartbeat: 2004/03/24_04:17:24 info: Running /etc/ha.d/resource.d/IPaddr
192.168.10.20 stop
heartbeat: 2004/03/24_04:17:24 info: /sbin/route -n del -host
192.168.10.20
heartbeat: 2004/03/24_04:17:24 info: /sbin/ifconfig eth0:0 down
heartbeat: 2004/03/24_04:17:24 info: IP Address 192.168.10.20 released
heartbeat: 2004/03/24_04:17:24 info: all HA resource release completed
(standby).
heartbeat: 2004/03/24_04:17:24 info: Local standby process completed
[all].
heartbeat: 2004/03/24_04:17:31 WARN: 1 lost packet(s) for [robin]
[6285:6287]
heartbeat: 2004/03/24_04:17:31 info: remote resource transition
completed.
heartbeat: 2004/03/24_04:17:31 info: No pkts missing from robin!
heartbeat: 2004/03/24_04:17:31 info: Other node completed standby
takeover of all resources.

At 10:00 in the morning I try to restart batman 
by issuing
/etc/init.d/heartbeat restart

When I restart batman, robin is holding all resources and batman is
slave.
I get the following error into the logs of robin:

heartbeat: 2004/03/24_10:07:34 info: Received shutdown notice from
'batman'.
heartbeat: 2004/03/24_10:07:34 info: Resources being acquired from
batman.
heartbeat: 2004/03/24_10:07:34 info: Running /etc/ha.d/rc.d/status
status
heartbeat: 2004/03/24_10:07:34 info: No local resources
[/usr/lib/heartbeat/ResourceManager listkeys robin] to acquire.
heartbeat: 2004/03/24_10:07:34 info: Taking over resource group
192.168.10.20
heartbeat: 2004/03/24_10:07:34 info: Acquiring resource group: batman
192.168.10.20 smb ldap::restart rsyncd
heartbeat: 2004/03/24_10:07:37 info: Running /etc/ha.d/resource.d/ldap
restart start
heartbeat: 2004/03/24_10:07:40 info: /usr/lib/heartbeat/mach_down:
nice_failback: foreign resources acquired
heartbeat: 2004/03/24_10:07:40 info: mach_down takeover complete.
heartbeat: 2004/03/24_10:07:40 info: mach_down takeover complete for
node batman.
heartbeat: 2004/03/24_10:07:47 WARN: node batman: is dead
heartbeat: 2004/03/24_10:07:47 info: Dead node batman held no resources.
heartbeat: 2004/03/24_10:07:47 info: Resources being acquired from
batman.
heartbeat: 2004/03/24_10:07:47 info: Running /etc/ha.d/rc.d/status
status
heartbeat: 2004/03/24_10:07:47 info: No local resources
[/usr/lib/heartbeat/ResourceManager listkeys robin] to acquire.
heartbeat: 2004/03/24_10:07:47 info: Taking over resource group
192.168.10.20
heartbeat: 2004/03/24_10:07:47 info: Acquiring resource group: batman
192.168.10.20 smb ldap::restart rsyncd
heartbeat: 2004/03/24_10:07:50 info: Running /etc/ha.d/resource.d/ldap
restart start
heartbeat: 2004/03/24_10:07:53 info: /usr/lib/heartbeat/mach_down:
nice_failback: foreign resources acquired
heartbeat: 2004/03/24_10:07:53 info: mach_down takeover complete.
heartbeat: 2004/03/24_10:07:53 info: mach_down takeover complete for
node batman.
heartbeat: 2004/03/24_10:07:57 info: Heartbeat restart on node batman
heartbeat: 2004/03/24_10:07:57 info: Status update for node batman:
status up
heartbeat: 2004/03/24_10:07:57 info: Running /etc/ha.d/rc.d/status
status
heartbeat: 2004/03/24_10:07:57 info: Status update for node batman:
status active
heartbeat: 2004/03/24_10:07:57 info: remote resource transition
completed.
heartbeat: 2004/03/24_10:07:57 info: Running /etc/ha.d/rc.d/status
status
heartbeat: 2004/03/24_10:07:57 ERROR: Exiting HBREAD process 4469 killed
by signal 11.
heartbeat: 2004/03/24_10:07:57 ERROR: Core heartbeat process died!
Restarting.
heartbeat: 2004/03/24_10:07:57 info: Heartbeat shutdown in progress.
(4465)
heartbeat: 2004/03/24_10:07:57 info: Giving up all HA resources.
heartbeat: 2004/03/24_10:07:57 info: Releasing resource group: batman
192.168.10.20 smb ldap::restart rsyncd
heartbeat: 2004/03/24_10:07:57 info: Running
/etc/ha.d/resource.d/rsyncd  stop
heartbeat: 2004/03/24_10:07:57 info: Running /etc/ha.d/resource.d/ldap
restart stop
heartbeat: 2004/03/24_10:08:01 info: Running /etc/ha.d/resource.d/smb 
stop
heartbeat: 2004/03/24_10:08:01 info: Running /etc/ha.d/resource.d/IPaddr
192.168.10.20 stop
heartbeat: 2004/03/24_10:08:01 info: /sbin/route -n del -host
192.168.10.20
heartbeat: 2004/03/24_10:08:01 info: /sbin/ifconfig eth0:0 down
heartbeat: 2004/03/24_10:08:01 info: IP Address 192.168.10.20 released
heartbeat: 2004/03/24_10:08:01 info: killing /usr/lib/heartbeat/ipfail
process group 4476 with signal 15
heartbeat: 2004/03/24_10:08:01 info: All HA resources relinquished.

At this point I set ulimit -c unlimited in the startup scripts, reboot
the cluster, but I cannot reproduce the error anymore.

> And, why aren't any of these events in the log you attached?  And why don't 
> the timestamps match your times?

I probably made a mistake while pasting, but the relevant line is there
and it's the one you noticed already.
I can send the full log from both nodes if you want.

> All I see in the log anwhere near this time is:
>  > heartbeat: 2004/03/24_02:46:47 ERROR: EOF in ttygets [/dev/ttyS0]:
>  > Success [0]
>  > heartbeat: 2004/03/24_02:47:05 WARN: ha_msg_add_nv: line doesn't contain
>  > '='
>  > heartbeat: 2004/03/24_02:47:05 info:
>  > heartbeat: 2004/03/24_02:47:05 ERROR: Exiting HBREAD process 3425 killed
>  > by signal 11.
> 
> So, it got EOF in ttygets (which it shouldn't get unless you send it a 
> signal for some reason), then a few seconds later it got an illegal line - 
> probably due to data corruption on the serial line - then it crashes with a 
> SEGFAULT.
> 
> Is it possible you have other processes trying to use these lines for 
> something?

I checked that already and I am sure heartbeat is the only process
accessing the serial line:

what I did to check this was:
robin:/var/log # fuser -v /dev/ttyS0
 
                     USER        PID ACCESS COMMAND
/dev/ttyS0           root        736 f....  heartbeat
                     root        839 f....  heartbeat
                     root        840 f....  heartbeat
                     root        841 f....  heartbeat
                     root        842 f....  heartbeat
                     root        843 f....  heartbeat

checked that no gpm or getty was running.

I tried to cat /dev/ttyS0 for some time, but the only output I got was
related to hertbeat.

It should not be cable problem , as the cluster ran heartbeat 0.4.9 for
almost a month and never had problems.

Many thanks,
Umberto



More information about the Linux-HA mailing list