Heartbeat won't fail over (v0.4.9)
Salvatore Tepedino
sal@netlineis.com
19 Apr 2001 14:54:28 -0400
Ok... I've tried everything I can think of and checked the list archives,
docs, and logs... I'm dumbfounded...
This all worked at one time... I think it broke about the time I updated
openssl, but downgrading that did not help, so I don't think that's the
problem.
My problem is that when the main director fails (I stop heartbeat on it),
the backup doesn't start ldirectord or even mark the primary as dead.
ldirectord works fine by itself. If I make each machine the resource
owner and remove the other from ha.cf, they work fine (by themselves of
course) starting ldirectord etc.
I installed heartbeat via RPM (I even tried installing the stonith RPM in
case that was the trouble), and I also tried replacing heartbeat on the
secondary with the tar.gz version, just in case there were any
differences... same result.
When the primary fails, the secondary logs only this line... no matter how
long I wait. The primary does the same when the secondary fails... just
this line, no declaration of death...
Mar 27 14:03:05 jeckle heartbeat: info: Running /etc/ha.d/rc.d/shutdone shutdone
Unfortunately for me, this is a production setup, and I've probably
experimented more than I should have ;) No one's complained (too
loudly) yet ;)
The configs are the same on both machines...
I'd like to thank everyone in advance for your help.
Here are my IPs; the logs and configs are at the bottom (sorry for the log
spam... I don't want to leave any clues out). Let me know if you need any
more information, or want me to experiment with anything.
Primary's IP: 207.176.8.50
Secondary's IP: 207.176.8.51
Resource's (VIP): 207.176.8.52
(Primary's startup log w/ secondary already running)
Mar 27 14:03:17 heckle heartbeat[3922]: info: Configuration validated. Starting heartbeat 0.4.9
Mar 27 14:03:17 heckle heartbeat[3923]: info: heartbeat: version 0.4.9
Mar 27 14:03:17 heckle heartbeat[3923]: info: Heartbeat generation: 33
Mar 27 14:03:17 heckle heartbeat[3923]: notice: Starting serial heartbeat on tty /dev/ttyS0
Mar 27 14:03:17 heckle heartbeat[3923]: notice: UDP heartbeat started on port 694 interface eth0
Mar 27 14:03:17 heckle heartbeat[3928]: info: Local status now set to: 'up'
Mar 27 14:03:19 heckle heartbeat[3928]: info: Heartbeat restart on node heckle.netlineis.com
Mar 27 14:03:19 heckle heartbeat[3928]: info: Local status now set to: 'active'
Mar 27 14:03:19 heckle heartbeat[3928]: info: Heartbeat restart on node jeckle.netlineis.com
Mar 27 14:03:19 heckle heartbeat[3928]: info: Link jeckle.netlineis.com:/dev/ttyS0 up.
Mar 27 14:03:19 heckle heartbeat[3928]: info: Node jeckle.netlineis.com: status active
Mar 27 14:03:19 heckle heartbeat[3928]: info: Link heckle.netlineis.com:eth0 up.
Mar 27 14:03:19 heckle heartbeat[3928]: info: Link jeckle.netlineis.com:eth0 up.
Mar 27 14:03:19 heckle heartbeat[3928]: info: Node heckle.netlineis.com: status up
Mar 27 14:03:19 heckle heartbeat[3928]: info: Node heckle.netlineis.com: status active
Mar 27 14:03:19 heckle heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 27 14:03:19 heckle heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
Mar 27 14:03:19 heckle heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 27 14:03:19 heckle heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
Mar 27 14:03:19 heckle heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 27 14:03:19 heckle heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
Mar 27 14:03:20 heckle heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 207.176.8.52 status
Mar 27 14:03:20 heckle heartbeat: info: Running /etc/ha.d/rc.d/ip-request ip-request
Mar 27 14:03:20 heckle heartbeat[3930]: info: Resource acquisition completed.
Mar 27 14:03:30 heckle heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 207.176.8.52 status
Mar 27 14:03:30 heckle heartbeat: info: Acquiring resource group: heckle.netlineis.com IPaddr::207.176.8.52 ldirectord::ldirectord.cf
Mar 27 14:03:30 heckle heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 207.176.8.52 start
Mar 27 14:03:30 heckle heartbeat: info: ifconfig eth0:0 207.176.8.52 netmask 255.255.255.0^Ibroadcast 207.176.8.255
Mar 27 14:03:30 heckle heartbeat: info: Sending Gratuitous Arp for 207.176.8.52 on eth0:0 [eth0]
Mar 27 14:03:30 heckle heartbeat: info: Running /etc/ha.d/resource.d/ldirectord ldirectord.cf start
(Secondary logs this as primary comes back up)
Mar 27 14:03:20 jeckle heartbeat[25816]: info: Heartbeat restart on node heckle.netlineis.com
Mar 27 14:03:20 jeckle heartbeat[25816]: WARN: Late heartbeat: Node heckle.netlineis.com: interval 15230 ms
Mar 27 14:03:20 jeckle heartbeat[25816]: info: Node heckle.netlineis.com: status up
Mar 27 14:03:20 jeckle heartbeat[25813]: ERROR: ha_msg_add_nv: line doesn't contain '='
Mar 27 14:03:20 jeckle heartbeat[25813]: ERROR: s>>>
Mar 27 14:03:20 jeckle heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 27 14:03:22 jeckle heartbeat[25816]: info: Node heckle.netlineis.com: status active
Mar 27 14:03:22 jeckle heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 27 14:03:22 jeckle heartbeat: info: Running /etc/ha.d/rc.d/ip-request ip-request
Mar 27 14:03:23 jeckle heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 207.176.8.52 status
I've read that the error lines are normal, caused by the serial
heartbeat and the restart... I've seen the late-heartbeat warning report
some large intervals without a failover, and without anything being added
to the logs besides the shutdone line (until the primary starts again).
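To make the mismatch concrete: deadtime in ha.cf (further down) is 3
seconds, yet the warning above reports a 15230 ms gap with no death
declared. Below is a rough Python sketch of that comparison. The function
name late_heartbeats and the regular expression are illustrative
assumptions based only on the log format shown in this post, not anything
shipped with heartbeat, and the sketch expects each log entry on a single
unwrapped line.

```python
import re

# Pattern derived from the "Late heartbeat" warning format in this post (assumption).
LATE_RE = re.compile(r"WARN: Late heartbeat: Node (\S+): interval (\d+) ms")

def late_heartbeats(log_lines, deadtime_s):
    """Return (node, interval_ms) for each late heartbeat longer than deadtime."""
    over = []
    for line in log_lines:
        m = LATE_RE.search(line)
        if m and int(m.group(2)) > deadtime_s * 1000:
            over.append((m.group(1), int(m.group(2))))
    return over

sample = [
    "Mar 27 14:03:20 jeckle heartbeat[25816]: WARN: Late heartbeat: "
    "Node heckle.netlineis.com: interval 15230 ms",
]
# deadtime 3 comes from the ha.cf listed later in this message.
print(late_heartbeats(sample, deadtime_s=3))
```

Any entry this returns is a gap longer than deadtime, which is where I
would have expected a "dead" status to be logged.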
(And finally, the secondary's startup log)
Mar 27 14:22:45 jeckle heartbeat[27014]: info: Configuration validated. Starting heartbeat 0.4.9
Mar 27 14:22:45 jeckle heartbeat[27015]: info: heartbeat: version 0.4.9
Mar 27 14:22:45 jeckle heartbeat[27015]: info: Heartbeat generation: 43
Mar 27 14:22:45 jeckle heartbeat[27015]: notice: Starting serial heartbeat on tty /dev/ttyS0
Mar 27 14:22:45 jeckle heartbeat[27015]: notice: UDP heartbeat started on port 694 interface eth0
Mar 27 14:22:45 jeckle heartbeat[27021]: info: Local status now set to: 'up'
Mar 27 14:22:46 jeckle heartbeat[27021]: info: Heartbeat restart on node jeckle.netlineis.com
Mar 27 14:22:46 jeckle heartbeat[27021]: info: Link jeckle.netlineis.com:eth0 up.
Mar 27 14:22:46 jeckle heartbeat[27021]: info: Local status now set to: 'active'
Mar 27 14:22:46 jeckle heartbeat[27021]: info: Heartbeat restart on node heckle.netlineis.com
Mar 27 14:22:46 jeckle heartbeat[27021]: info: Link heckle.netlineis.com:/dev/ttyS0 up.
Mar 27 14:22:46 jeckle heartbeat[27021]: info: Node heckle.netlineis.com: status active
Mar 27 14:22:46 jeckle heartbeat[27021]: info: Link heckle.netlineis.com:eth0 up.
Mar 27 14:22:46 jeckle heartbeat[27021]: info: Node jeckle.netlineis.com: status up
Mar 27 14:22:46 jeckle heartbeat[27021]: info: Node jeckle.netlineis.com: status active
Mar 27 14:22:46 jeckle heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 27 14:22:46 jeckle heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
Mar 27 14:22:46 jeckle heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
Mar 27 14:22:46 jeckle heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 27 14:22:46 jeckle heartbeat: info: Running /etc/ha.d/rc.d/ifstat ifstat
Mar 27 14:22:46 jeckle heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 27 14:22:46 jeckle heartbeat[27024]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys jeckle.netlineis.com]
Mar 27 14:22:46 jeckle heartbeat[27024]: info: Resource acquisition completed.
*** The Config Files ***
haresources:
(Nodenames match uname -n.)
heckle.netlineis.com IPaddr::207.176.8.52 ldirectord::ldirectord.cf
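For reference, this is how I read that line, going by the "Running
/etc/ha.d/resource.d/..." entries in the primary's log: the first field is
the owning node, and each following field is a script name with "::"
separating its arguments. A small Python sketch of that reading follows.
parse_haresources_line is a made-up helper for illustration, it only
handles the form used in this post, and it is not heartbeat's own parser.

```python
def parse_haresources_line(line):
    """Split a haresources entry into (owner_node, [(script, [args...]), ...])."""
    fields = line.split()
    node, resources = fields[0], []
    for field in fields[1:]:
        # "IPaddr::207.176.8.52" -> script "IPaddr", args ["207.176.8.52"]
        script, *args = field.split("::")
        resources.append((script, args))
    return node, resources

node, resources = parse_haresources_line(
    "heckle.netlineis.com IPaddr::207.176.8.52 ldirectord::ldirectord.cf"
)
for script, args in resources:
    # Mirrors the "Running /etc/ha.d/resource.d/<script> <args> start" log entries.
    print("/etc/ha.d/resource.d/" + script, *args, "start")
```

The two printed commands line up with the two "start" entries in the
primary's startup log, so the resource line itself looks sane to me.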
ha.cf:
[root@heckle ha.d]# grep -v \# ha.cf
debugfile /var/log/ha-debug
logfacility local0
keepalive 1
deadtime 3
initdead 7
serial /dev/ttyS0
udp eth0
nice_failback off
node heckle.netlineis.com
node jeckle.netlineis.com
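In case it helps anyone double-check the numbers, here is a rough Python
sketch that reads the directives above and applies two sanity checks:
keepalive shorter than deadtime, and initdead no shorter than deadtime.
Both checks and the helper names (parse_ha_cf, timing_problems) are my own
assumptions for illustration; they are not validation rules taken from
heartbeat itself.

```python
def parse_ha_cf(text):
    """Collect ha.cf directives; 'node' can repeat, so values are kept in lists."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        conf.setdefault(parts[0], []).append(value)
    return conf

def timing_problems(conf):
    """Hypothetical checks only: keepalive < deadtime <= initdead."""
    keepalive = float(conf["keepalive"][0])
    deadtime = float(conf["deadtime"][0])
    initdead = float(conf["initdead"][0])
    problems = []
    if keepalive >= deadtime:
        problems.append("keepalive is not shorter than deadtime")
    if initdead < deadtime:
        problems.append("initdead is shorter than deadtime")
    return problems

HA_CF = """\
debugfile /var/log/ha-debug
logfacility local0
keepalive 1
deadtime 3
initdead 7
serial /dev/ttyS0
udp eth0
nice_failback off
node heckle.netlineis.com
node jeckle.netlineis.com
"""

conf = parse_ha_cf(HA_CF)
print(conf["node"], timing_problems(conf))
```

With the values above neither check trips, so the timers at least look
mutually consistent.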
Authkeys are the same on both directors.
--
Salvatore D. Tepedino
Linux System Administrator
Netline Internet Solutions
(516) 832-8200
PGP Key: http://www.netlineis.com/pgp/sal.asc