[Linux-HA] Heartbeat starts services twice, why?

"rueh hänä" rueh at gmx.net
Wed Mar 29 09:35:21 MST 2006


Hi folks

I ran into a problem with my heartbeat cluster, which runs:

- FC4 Kernel 2.6.11-1.1369_FC4smp
- heartbeat-2.0.0-1
- drbd-0.7.15
- tomcat5-5.0.30-5jpp_6fc
- httpd-2.0.54-10.2
- mon-0.99.2-1.rhfc1.dag

The problem shows up with mon, but I don't think mon itself is the cause.
It just happens to be a service that cannot be started twice, because its
TCP port can only be bound once. I was testing a scenario where I kill
heartbeat on the master so that the slave takes over. After that, I started
the master again and wanted it to take the resources back from the slave,
so I ran a simple /etc/init.d/heartbeat stop on the slave.
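
To illustrate why a second start has to fail for a daemon like mon, here is
a minimal Python sketch (not mon's actual code). It assumes a daemon that
binds a listening TCP socket without SO_REUSEADDR. The helper names try_bind
and demo_double_start are made up for this example, and it uses an
OS-assigned port rather than mon's port 2583.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind a listening TCP socket.

    Returns (socket, None) on success or (None, errno) on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


def demo_double_start():
    """Simulate starting the same daemon twice on one port."""
    # Port 0 asks the OS for any free port (hypothetical stand-in for 2583).
    first, first_err = try_bind(0)
    port = first.getsockname()[1]

    # The "second start" tries to bind the very same port.
    second, second_err = try_bind(port)
    if second is not None:
        second.close()
    first.close()
    return first_err, second_err


if __name__ == "__main__":
    first_err, second_err = demo_double_start()
    print("first start failed: ", first_err is not None)
    print("second start got EADDRINUSE:", second_err == errno.EADDRINUSE)
```

The second bind is rejected by the kernel while the first socket is still
open, which is the same kind of failure as the "could not bind TCP server
port" message in the log.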

Here is the log:

Mar 29 18:03:57 linuxweb21 heartbeat: [23013]: info: acquire all HA
resources (standby).
Mar 29 18:03:57 linuxweb21 ResourceManager[23033]: [23047]: info: Acquiring
resource group: linuxweb21.ten.ch 192.168.0.89 drbddisk::drbd
Filesystem::/dev
Mar 29 18:03:58 linuxweb21 heartbeat: [23014]: info: Local Resource
acquisition completed.
Mar 29 18:03:58 linuxweb21 ResourceManager[23033]: [23109]: info: Running
/etc/ha.d/resource.d/IPaddr 192.168.0.89 start
Mar 29 18:03:58 linuxweb21 IPaddr[23111]: [23160]: info: /sbin/ifconfig
eth0:0 192.168.0.89 netmask 255.255.255.0       broadcast 192.168.0.255
Mar 29 18:03:58 linuxweb21 IPaddr[23111]: [23165]: info: Sending Gratuitous
Arp for 192.168.0.89 on eth0:0 [eth0]
Mar 29 18:03:58 linuxweb21 IPaddr[23111]: [23166]:
/usr/lib64/heartbeat/send_arp -i 500 -r 10 -p
/var/lib/heartbeat/rsctmp/send_arp/send_arp-192.168.0.89
Mar 29 18:03:58 linuxweb21 ResourceManager[23033]: [23199]: info: Running
/etc/ha.d/resource.d/drbddisk drbd start
Mar 29 18:03:58 linuxweb21 kernel: drbd0: Secondary/Secondary -->
Primary/Secondary
Mar 29 18:03:58 linuxweb21 ResourceManager[23033]: [23240]: info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /drbd xfs start
Mar 29 18:03:59 linuxweb21 kernel: XFS mounting filesystem drbd0
Mar 29 18:03:59 linuxweb21 ResourceManager[23033]: [23280]: info: Running
/etc/ha.d/resource.d/ipalias  start
Mar 29 18:03:59 linuxweb21 ResourceManager[23033]: [23315]: info: Running
/etc/ha.d/resource.d/httpd  start
Mar 29 18:04:01 linuxweb21 ResourceManager[23033]: [23342]: info: Running
/etc/ha.d/resource.d/tomcat5  start
Mar 29 18:04:03 linuxweb21 ResourceManager[23033]: [24070]: info: Running
/etc/ha.d/resource.d/mysqld  start
Mar 29 18:04:05 linuxweb21 ResourceManager[23033]: [24198]: info: Running
/etc/ha.d/resource.d/mon  start
Mar 29 18:04:06 linuxweb21 mon[24205]: running as daemon
Mar 29 18:04:06 linuxweb21 mon[24205]: mon server started
Mar 29 18:04:06 linuxweb21 ResourceManager[23033]: [24223]: info: Running
/etc/ha.d/resource.d/moncheck  start
Mar 29 18:04:06 linuxweb21 heartbeat: [23013]: info: all HA resource
acquisition completed (standby).
Mar 29 18:04:06 linuxweb21 heartbeat: [22996]: info: Standby resource
acquisition done [all].
Mar 29 18:04:06 linuxweb21 harc[24236]: [24239]: info: Running
/etc/ha.d/rc.d/status status

#### Normally the log messages end somewhere around here, once all services
#### have been started. This time, however, more entries follow, and it seems
#### that heartbeat starts the services again.

Mar 29 18:04:07 linuxweb21 harc[24261]: [24264]: info: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
Mar 29 18:04:07 linuxweb21 ip-request-resp[24261]: [24267]: received
ip-request-resp 192.168.0.89 OK yes
Mar 29 18:04:07 linuxweb21 ResourceManager[24268]: [24276]: info: Acquiring
resource group: linuxweb21.ten.ch 192.168.0.89 drbddisk::drbd
Filesystem::/dev
Mar 29 18:04:07 linuxweb21 ResourceManager[24268]: [24352]: info: Running
/etc/ha.d/resource.d/ipalias  start
Mar 29 18:04:07 linuxweb21 ResourceManager[24268]: [24386]: info: Running
/etc/ha.d/resource.d/httpd  start
Mar 29 18:04:08 linuxweb21 ResourceManager[24268]: [24411]: info: Running
/etc/ha.d/resource.d/tomcat5  start
Mar 29 18:04:08 linuxweb21 ResourceManager[24268]: [24484]: info: Running
/etc/ha.d/resource.d/mysqld  start
Mar 29 18:04:09 linuxweb21 ResourceManager[24268]: [24600]: info: Running
/etc/ha.d/resource.d/mon  start

## And here is the problem: mon cannot be started a second time because its
## port is already in use. Because of this failure, heartbeat shuts down all
## the services and carries on without them. I am left with a cluster that
## runs no services.
## (The German error text in the next log entry means "Address already in
## use".)

Mar 29 18:04:09 linuxweb21 mon[24605]: fatal, could not bind TCP server port
2583: Die Adresse wird bereits verwendet
Mar 29 18:04:09 linuxweb21 ResourceManager[24268]: [24609]: ERROR: Return
code 1 from /etc/ha.d/resource.d/mon
Mar 29 18:04:09 linuxweb21 ResourceManager[24268]: [24610]: CRIT: Giving up
resources due to failure of mon

Mar 29 18:04:09 linuxweb21 ResourceManager[24268]: [24611]: info: Releasing
resource group: linuxweb21.ten.ch 192.168.0.89 drbddisk::drbd
Filesystem::/dev
Mar 29 18:04:10 linuxweb21 ResourceManager[24268]: [24621]: info: Running
/etc/ha.d/resource.d/moncheck  stop
Mar 29 18:04:10 linuxweb21 ResourceManager[24268]: [24649]: info: Running
/etc/ha.d/resource.d/mon  stop
Mar 29 18:04:10 linuxweb21 mon[24205]: caught TERM signal, exiting
Mar 29 18:04:10 linuxweb21 ResourceManager[24268]: [24668]: info: Running
/etc/ha.d/resource.d/mysqld  stop
Mar 29 18:04:13 linuxweb21 ResourceManager[24268]: [24754]: info: Running
/etc/ha.d/resource.d/tomcat5  stop
Mar 29 18:04:18 linuxweb21 ResourceManager[24268]: [24890]: info: Running
/etc/ha.d/resource.d/httpd  stop
Mar 29 18:04:22 linuxweb21 ResourceManager[24268]: [24909]: info: Running
/etc/ha.d/resource.d/ipalias  stop
Mar 29 18:04:22 linuxweb21 ResourceManager[24268]: [24937]: info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /drbd xfs stop
Mar 29 18:04:23 linuxweb21 ResourceManager[24268]: [24965]: info: Running
/etc/ha.d/resource.d/drbddisk drbd stop
Mar 29 18:04:23 linuxweb21 kernel: drbd0: Primary/Secondary -->
Secondary/Secondary
Mar 29 18:04:23 linuxweb21 ResourceManager[24268]: [24991]: info: Running
/etc/ha.d/resource.d/IPaddr 192.168.0.89 stop
Mar 29 18:04:24 linuxweb21 IPaddr[24993]: [25004]: info: /sbin/route -n del
-host 192.168.0.89
Mar 29 18:04:24 linuxweb21 IPaddr[24993]: [25006]: info: /sbin/ifconfig
eth0:0 down
Mar 29 18:04:24 linuxweb21 IPaddr[24993]: [25009]: info: IP Address
192.168.0.89 released
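
The two acquisition passes in the log above can be made visible with a small
Python sketch that groups the "Running ... start" entries by ResourceManager
PID. This is only an illustration: the regular expression is my own guess at
the line layout based on the entries shown here, the function names are
hypothetical, and the sample lines are a few entries from the log with the
mail line wrapping undone.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the log entries in this mail:
#   ... ResourceManager[PID]: [PID2]: info: Running <script path> [args] <action>
RUN_RE = re.compile(
    r"ResourceManager\[(?P<pid>\d+)\]: \[\d+\]: info: Running "
    r"/etc/ha\.d/resource\.d/(?P<script>\S+)\s+(?:.*\s)?(?P<action>start|stop)$"
)


def starts_by_manager(lines):
    """Map each ResourceManager PID to the scripts it ran with 'start'."""
    started = defaultdict(list)
    for line in lines:
        match = RUN_RE.search(line.rstrip())
        if match and match.group("action") == "start":
            started[match.group("pid")].append(match.group("script"))
    return dict(started)


def started_more_than_once(lines):
    """Return the scripts that more than one ResourceManager started."""
    seen = defaultdict(set)
    for pid, scripts in starts_by_manager(lines).items():
        for script in scripts:
            seen[script].add(pid)
    return sorted(s for s, pids in seen.items() if len(pids) > 1)


SAMPLE = [
    "Mar 29 18:03:58 linuxweb21 ResourceManager[23033]: [23199]: info: Running /etc/ha.d/resource.d/drbddisk drbd start",
    "Mar 29 18:03:59 linuxweb21 ResourceManager[23033]: [23315]: info: Running /etc/ha.d/resource.d/httpd  start",
    "Mar 29 18:04:05 linuxweb21 ResourceManager[23033]: [24198]: info: Running /etc/ha.d/resource.d/mon  start",
    "Mar 29 18:04:07 linuxweb21 ResourceManager[24268]: [24386]: info: Running /etc/ha.d/resource.d/httpd  start",
    "Mar 29 18:04:09 linuxweb21 ResourceManager[24268]: [24600]: info: Running /etc/ha.d/resource.d/mon  start",
    "Mar 29 18:04:10 linuxweb21 ResourceManager[24268]: [24649]: info: Running /etc/ha.d/resource.d/mon  stop",
]

if __name__ == "__main__":
    print(starts_by_manager(SAMPLE))
    print(started_more_than_once(SAMPLE))
```

In the sample, both ResourceManager[23033] and ResourceManager[24268] run
httpd and mon with start, while drbddisk appears only in the first pass,
which matches what the full log shows.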


My question: is this normal? If not, what could be the cause? Is mon even
the source of the problem, or is it something else?

Here is my ha.cf (I diffed the slave's and the master's copies beforehand):

keepalive 2
deadtime 60
warntime 30
udpport 694
bcast   eth1            # Linux
auto_failback off
watchdog /dev/watchdog
node    linuxweb21
node    linuxweb22
debug 0
use_logd yes
conn_logd_time 60

Here is my haresources:

linuxweb21 192.168.0.89 drbddisk::drbd Filesystem::/dev/drbd0::/drbd::xfs
ipalias httpd tomcat5 mysqld mon moncheck
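
For readers who want to map this entry onto the log: the first field is the
preferred node, the remaining fields are resource scripts, and "::"
separates a script name from its arguments. The log shows the bare IP
address being handled by the IPaddr script. Below is a rough Python sketch
of that reading. It is not heartbeat's own parser, the function name is
hypothetical, and treating any field that starts with a digit as an IP
address is a simplification of mine.

```python
def parse_haresources_line(line):
    """Split one haresources entry into (node, [(script, [args]), ...])."""
    fields = line.split()
    node = fields[0]
    resources = []
    for field in fields[1:]:
        script, *args = field.split("::")
        # Simplification: a field starting with a digit is taken to be a
        # bare IP address, which is run through the IPaddr script.
        if script[0].isdigit():
            script, args = "IPaddr", [script] + args
        resources.append((script, args))
    return node, resources


# The entry from this mail, with the mail line wrapping undone.
ENTRY = (
    "linuxweb21 192.168.0.89 drbddisk::drbd "
    "Filesystem::/dev/drbd0::/drbd::xfs "
    "ipalias httpd tomcat5 mysqld mon moncheck"
)

if __name__ == "__main__":
    node, resources = parse_haresources_line(ENTRY)
    print("node:", node)
    for script, args in resources:
        # Mirrors the "Running /etc/ha.d/resource.d/<script> <args> start"
        # entries in the log.
        print("/etc/ha.d/resource.d/" + script, " ".join(args), "start")
```

The resulting order (IPaddr, drbddisk, Filesystem, ipalias, httpd, tomcat5,
mysqld, mon, moncheck) is the start order seen in the log; the stop entries
run through the same list in reverse.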

Permissions are OK, and the start scripts in resource.d / init.d are the
same on both nodes (I diffed them as well).
