Heartbeat won't fail over (v0.4.9)
Salvatore Tepedino
sal@netlineis.com
20 Apr 2001 10:34:14 -0400
(My apologies to Juri as I accidentally responded just to him, and not
to all, so he'll get this twice)
On 20 Apr 2001 01:22:20 +0200, Juri Haberland wrote:
> Salvatore Tepedino wrote:
> >
> > Ok... I've tried everything I can think of, checked the list
> > archives, docs, logs... I'm dumbfounded...
> > This all worked at one time... I think it broke about the time I
> > updated openssl, but downgrading that did not help, so I don't think
> > that's the problem.
>
> I don't think that your problem is related to that.
>
> > My problem is that when the main director fails (stop heartbeat) the
> > backup doesn't start ldirectord or even mark the primary as dead.
> > ldirectord works fine by itself. If I make each machine the resource
> > owner and remove the other from ha.cf, they work fine (by themselves
> > of course) starting ldirectord etc.
> > I installed heartbeat via RPM (Even tried installing the stonith RPM
> > in case that was the trouble), and I also tried replacing heartbeat
> > on the secondary with the tar.gz version, just in case there were any
> > differences.. same result.
> > When primary fails, the secondary logs only this line... no matter
> > how long I wait. The primary does the same when the secondary fails...
> > just this line, no declaration of death...
>
> Hmm, actually the logs look good from what I can see though they seem
> to describe another case. You talked about:
>
> - primary fails, secondary does not take over
>
Yes. The secondary just logs the shutdone line.
> Your logs provide:
>
> - primary comes back online and takes the resources, but the secondary
>   doesn't release it (or maybe you didn't send this part of the logs)
The secondary has nothing to release... it never does anything when the
primary fails, save for the shutdone line.
> - secondary comes back online and sees that primary is working thus
>   staying inactive
I tested this by shutting down the primary, waiting a bit (9 seconds
according to the logs, which is longer than both the dead time and the
initial dead time), and then restarting the secondary to see if it would
take over once it came back up and saw that the primary didn't have the
resources... but it didn't... it just sat there until I restarted the
primary (which it then logged). Here are the logs from that test:
-+=Primary shuts down, all seems normal=+-
Apr 20 09:50:03 heckle heartbeat[31121]: info: Heartbeat shutdown in progress.
Apr 20 09:50:03 heckle heartbeat[31121]: ERROR: controlfifo2msg: cannot create message
Apr 20 09:50:03 heckle heartbeat[31121]: ERROR: control_process: NULL message
Apr 20 09:50:03 heckle heartbeat[21594]: info: Giving up all HA resources.
Apr 20 09:50:04 heckle heartbeat: info: Releasing resource group: heckle.netlineis.com IPaddr::207.176.8.52 ldirectord::ldirectord.cf
Apr 20 09:50:04 heckle heartbeat: info: Running /etc/ha.d/resource.d/ldirectord ldirectord.cf stop
Apr 20 09:50:05 heckle heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 207.176.8.52 stop
Apr 20 09:50:05 heckle heartbeat: info: IP Address 207.176.8.52 released
Apr 20 09:50:05 heckle heartbeat[21594]: info: All HA resources relinquished.
Apr 20 09:50:06 heckle heartbeat[31121]: info: Heartbeat shutdown complete.
-+=Secondary logs just this=+-
Apr 20 09:50:08 jeckle heartbeat: info: Running /etc/ha.d/rc.d/shutdone shutdone
-+=So I restart the secondary to see if it comes back up, sees the
primary is offline and takes over. It doesn't. Logs this=+-
Apr 20 09:50:17 jeckle heartbeat[20427]: info: Heartbeat shutdown in progress.
Apr 20 09:50:17 jeckle heartbeat[20427]: ERROR: controlfifo2msg: cannot create message
Apr 20 09:50:17 jeckle heartbeat[20427]: ERROR: control_process: NULL message
Apr 20 09:50:17 jeckle heartbeat[15225]: info: Giving up all HA resources.
Apr 20 09:50:18 jeckle heartbeat[15225]: info: All HA resources relinquished.
Apr 20 09:50:19 jeckle heartbeat[20427]: info: Heartbeat shutdown complete.
Apr 20 09:50:33 jeckle heartbeat[15300]: info: **************************
Apr 20 09:50:33 jeckle heartbeat[15300]: info: Configuration validated. Starting heartbeat 0.4.9
Apr 20 09:50:33 jeckle heartbeat[15301]: info: heartbeat: version 0.4.9
Apr 20 09:50:33 jeckle heartbeat[15301]: info: Heartbeat generation: 45
Apr 20 09:50:33 jeckle heartbeat[15301]: notice: Starting serial heartbeat on tty /dev/ttyS0
Apr 20 09:50:33 jeckle heartbeat[15301]: notice: UDP heartbeat started on port 694 interface eth0
Apr 20 09:50:33 jeckle heartbeat[15306]: info: Local status now set to: 'up'
Apr 20 09:50:34 jeckle heartbeat[15306]: info: Heartbeat restart on node jeckle.netlineis.com
Apr 20 09:50:34 jeckle heartbeat[15306]: info: Link jeckle.netlineis.com:eth0 up.
Apr 20 09:50:34 jeckle heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
-+=So I wait longer than I should to see if anything happens (remember,
production server). Finally give up and restart the primary (which logs
normally). Secondary logs this during the primary restart=+-
Apr 20 09:51:00 jeckle heartbeat[15306]: info: Local status now set to: 'active'
Apr 20 09:51:00 jeckle heartbeat[15306]: info: Heartbeat restart on node heckle.netlineis.com
Apr 20 09:51:00 jeckle heartbeat[15306]: info: Link heckle.netlineis.com:eth0 up.
Apr 20 09:51:00 jeckle heartbeat[15306]: WARN: Late heartbeat: Node heckle.netlineis.com: interval 26310 ms
Apr 20 09:51:00 jeckle heartbeat[15306]: info: Node heckle.netlineis.com: status up
Apr 20 09:51:00 jeckle heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
Apr 20 09:51:00 jeckle heartbeat: info: Running /etc/ha.d/rc.d/status status
Apr 20 09:51:00 jeckle heartbeat[15306]: info: Link heckle.netlineis.com:/dev/ttyS0 up.
Apr 20 09:51:00 jeckle heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
Apr 20 09:51:00 jeckle heartbeat[15331]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys jeckle.netlineis.com]
Apr 20 09:51:00 jeckle heartbeat[15331]: info: Resource acquisition completed.
Apr 20 09:51:01 jeckle heartbeat[15306]: info: Node heckle.netlineis.com: status active
Apr 20 09:51:01 jeckle heartbeat: info: Running /etc/ha.d/rc.d/status status
Apr 20 09:51:02 jeckle heartbeat: info: Running /etc/ha.d/rc.d/ip-request ip-request
Apr 20 09:51:02 jeckle heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 207.176.8.52 status
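
For anyone who wants to double-check the timing, here is a rough Python
sketch (not part of heartbeat) that measures how long the primary was
silent, using two lines copied from the logs above with any mail line
wrapping undone. The helper names are made up for illustration, the
deadtime value is only a placeholder since my ha.cf is not pasted here,
and the result assumes the clocks on heckle and jeckle roughly agree.

```python
from datetime import datetime

def syslog_time(line, year=2001):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; syslog omits the year."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def silence_seconds(last_line, next_line):
    """Seconds elapsed between the timestamps of two syslog lines."""
    return (syslog_time(next_line) - syslog_time(last_line)).total_seconds()

# Last line the primary logged, and the first time the secondary heard it again.
primary_down = "Apr 20 09:50:06 heckle heartbeat[31121]: info: Heartbeat shutdown complete."
primary_back = "Apr 20 09:51:00 jeckle heartbeat[15306]: info: Heartbeat restart on node heckle.netlineis.com"

# Placeholder only: substitute the deadtime value from your own ha.cf.
ASSUMED_DEADTIME_SECONDS = 10

gap = silence_seconds(primary_down, primary_back)
print(f"primary silent for {gap:.0f}s; assumed deadtime {ASSUMED_DEADTIME_SECONDS}s")
print("death declaration expected:", gap > ASSUMED_DEADTIME_SECONDS)
```

With these two lines the gap comes out to 54 seconds, so the primary was
quiet far longer than any dead time I would have configured, yet nothing
declaring it dead shows up in the secondary's log.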
I even tried giving the secondary the AudibleBeep resource and then
failing it to see if the primary would take that over, but the same
thing happened, only in reverse, so the primary has the same behavior.
> Btw, the config files are looking correct.
Good to hear from another pair of eyes. Thanks :)
>
> Cheers,
> Juri
>
--
Salvatore D. Tepedino
Linux System Administrator
Netline Internet Solutions
(516) 832-8200
PGP Key: http://www.netlineis.com/pgp/sal.asc