IP-Address-Takeover

Alan Robertson alanr@unix.sh
Wed, 09 Jan 2002 07:41:55 -0700


Hi Andre,

In your original problem, you had trouble with the machine rebooting.  Did
you solve that?

If you're running a firewall on these machines, my guess is that either you
have a misconfiguration, or that packets are getting lost as a result of the
firewall.

There are a couple of kinds of misconfigurations which could be causing the
problem.

First, the haresources file should be IDENTICAL between the two machines, and
both machines should be running with nice_failback set the same.  The virtual
IP address should NOT be configured into the OS before heartbeat starts.
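
A rough way to check the first point is to copy both haresources files to
one place and compare them with comments and blank lines removed.  The
Python sketch below is only an illustration: the function names and the
sample contents are made up for this example and are not part of heartbeat.
It also treats runs of whitespace as equivalent, which is more lenient than
a byte-for-byte comparison.

```python
def normalize(text):
    """Return the meaningful lines of a haresources file.

    Assumption: lines starting with '#' are comments. Blank lines are
    dropped and runs of whitespace are collapsed to a single space.
    """
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        lines.append(" ".join(line.split()))
    return lines


def haresources_match(text_a, text_b):
    """True if both files contain the same resource lines in the same order."""
    return normalize(text_a) == normalize(text_b)


# Hypothetical contents, using the placeholder names from this thread.
node1_copy = "# copy from node1\nnode1.domain.tld    IPaddr::a.b.c.d/28/a.b.c.bc\n"
node2_copy = "node1.domain.tld IPaddr::a.b.c.d/28/a.b.c.bc\n"

if haresources_match(node1_copy, node2_copy):
    print("haresources files agree")
else:
    print("haresources files differ")
```

If the two copies name different nodes on the resource line, the check
reports a difference, which is the kind of mismatch described above.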

As is nearly always the case, the answers will be found in the logs.
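
To narrow a long syslog down to the lines that matter for a takeover, a
small filter can help.  The sketch below is an assumption-laden example, not
a heartbeat tool: the phrases are copied from the heartbeat 0.4.8 log lines
quoted further down in this thread, and other versions may word them
differently.

```python
# Phrases taken from the log excerpt quoted in this thread (heartbeat 0.4.8).
EVENT_PHRASES = [
    "Acquiring resource group",
    "Releasing resource group",
    "Giving up all HA resources",
    "Heartbeat shutdown in progress",
    "Received signal 15",
]


def takeover_events(log_text):
    """Return the log lines that mention resource movement or a shutdown."""
    return [
        line
        for line in log_text.splitlines()
        if any(phrase in line for phrase in EVENT_PHRASES)
    ]


# Sample lines copied from the quoted log.
sample = (
    "Nov 22 09:33:17 node2 sshd[619]: Received signal 15; terminating.\n"
    "Nov 22 09:33:17 node2 heartbeat[503]: info: Heartbeat shutdown in progress.\n"
    "Nov 22 09:34:57 node2 crond: crond startup succeeded\n"
    "Nov 22 09:34:58 node2 heartbeat: info: Acquiring resource group: "
    "node2.domain.de IPaddr::aaa.bbb.ccc.eee/28/aaa.bbb.ccc.1\n"
)

for line in takeover_events(sample):
    print(line)
```

Running the same filter over the logs of both nodes for the same time window
shows which side released or acquired the address and when.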

	-- Alan Robertson
	   alanr@unix.sh

Andre Krajnik wrote:
> 
> Hi Alan,
> 
> sorry for the lack of time.
> 
> I spent a while researching my problem, and after a few
> changes to my firewall I got new results:
> 
> If one node is down, the other one recognizes it, but it doesn't take
> over the cluster IP. But when I boot the _passive_ node, it takes
> over the cluster IP from the formerly active one.
> 
> I can't find out why. I read all the docs I could find several times,
> but I can't find a suitable hint.
> 
> /etc/ha.d/ha.cf
> 
> all standard except the nodes:
> 
> node node1.domain.tld
> node node2.domain.tld
> 
> /etc/ha.d/haresources
> all standard except:
> 
> node1.domain.tld    IPAddr::a.b.c.d/28/a.b.c.bc
> (or should I put the cluster-name here?)
> 
> a.b.c.d = cluster-address
> a.b.c.bc = broadcast-address
> 
> Alan Robertson wrote:
> 
>  > Andre Krajnik wrote:
>  >
>  >>Hi Alan,
>  >>
>  >>I have needed some time to collect data from my logfiles but here they
>  >>are. I hope you can tell me or give me a hint why my cluster switches
>  >>over. I can't see any reason.
>  >>
>  >
>  > OK.  Looks like someone issued a shutdown to the whole machine "node2".
>  >
>  > Notice, lots of processes got signal 15 (termination) and shut down.
>  >
>  >
>  >>Nov 22 09:33:16 node2 rc: Stopping kheader:  succeeded
>  >>Nov 22 09:33:16 node2 sshd[619]: Received signal 15; terminating.
>  >>Nov 22 09:33:17 node2 sshd: sshd shutdown succeeded
>  >>Nov 22 09:33:17 node2 heartbeat[503]: info: Heartbeat shutdown in progress.
>  >>Nov 22 09:33:17 node2 heartbeat[10993]: info: Giving up all HA resources.
>  >>Nov 22 09:33:17 node2 heartbeat: info: Releasing resource group:
>  >>node2.domain.de IPaddr::aaa.bbb.ccc.eee/28/aaa.bbb.ccc.1
>  >>Nov 22 09:33:17 node2 heartbeat: info: Running
>  >>/etc/ha.d/resource.d/IPaddr aaa.bbb.ccc.eee/28/aaa.bbb.ccc.zzz stop
>  >>Nov 22 09:33:17 node2 heartbeat: info: IP Address aaa.bbb.ccc.eee released
>  >>Nov 22 09:33:17 node2 heartbeat[10993]: info: All HA resources relinquished.
>  >>Nov 22 09:33:18 node2 heartbeat[503]: info: Heartbeat shutdown complete.
>  >>Nov 22 09:33:19 node2 xinetd: xinetd shutdown failed
>  >>
>  >
>  > Now the machine reboots...
>  >
>  >>...
>  >>Nov 22 09:34:56 node2 kernel:   Internal registers self-test: passed.
>  >>Nov 22 09:34:56 node2 kernel:   ROM checksum self-test: passed (0x04f4518b).
>  >>Nov 22 09:34:56 node2 heartbeat[493]: info: **************************
>  >>Nov 22 09:34:56 node2 heartbeat[493]: info: Configuration validated.
>  >>Starting heartbeat 0.4.8
>  >>Nov 22 09:34:56 node2 heartbeat[498]: info: heartbeat: version 0.4.8
>  >>Nov 22 09:34:56 node2 heartbeat[498]: info: Creating FIFO
>  >>/var/run/heartbeat-fifo.
>  >>Nov 22 09:34:56 node2 heartbeat[498]: notice: UDP heartbeat started on
>  >>port 1001 interface eth1
>  >>Nov 22 09:34:56 node2 heartbeat[501]: info: Local status now set to: 'up'
>  >>Nov 22 09:34:56 node2 heartbeat[501]: info: Link node2.domain.de:eth1:
>  >>status up
>  >>Nov 22 09:34:57 node2 atd: atd startup succeeded
>  >>Nov 22 09:34:57 node2 heartbeat[501]: info: Local status now set to:
>  >>'active'
>  >>Nov 22 09:34:57 node2 heartbeat[501]: info: Link
>  >>s0031175.domain.de:eth1: status up
>  >>Nov 22 09:34:57 node2 heartbeat[501]: info: Node s0031175.domain.de:
>  >>status active
>  >>Nov 22 09:34:39 node2 rc.sysinit: Mounting proc filesystem succeeded
>  >>Nov 22 09:34:39 node2 setsysfont: findacm: No such file or directory
>  >>Nov 22 09:34:39 node2 rc.sysinit: Setting default font succeeded
>  >>.....
>  >>Nov 22 09:34:55 node2 random: Initializing random number generator:
>  >>succeeded
>  >>Nov 22 09:34:55 node2 netfs: Mounting other filesystems:  succeeded
>  >>Nov 22 09:34:57 node2 heartbeat: info: Running /etc/ha.d/rc.d/status status
>  >>Nov 22 09:34:57 node2 crond[539]: (CRON) STARTUP (fork ok)
>  >>Nov 22 09:34:57 node2 heartbeat: info: Running
>  >>/etc/ha.d/resource.d/IPaddr aaa.bbb.ccc.eee/28/aaa.bbb.ccc.zzz status
>  >>Nov 22 09:34:57 node2 heartbeat: info: Running /etc/ha.d/rc.d/ip-request
>  >>ip-request
>  >>Nov 22 09:34:57 node2 crond: crond startup succeeded
>  >>Nov 22 09:34:57 node2 heartbeat: info: Running
>  >>/etc/ha.d/rc.d/ip-request-resp ip-request-resp
>  >>Nov 22 09:34:57 node2 heartbeat: received ip-request-resp
>  >>IPaddr::aaa.bbb.ccc.eee/28/aaa.bbb.ccc.zzz OK
>  >>Nov 22 09:34:58 node2 heartbeat: info: Acquiring resource group:
>  >>node2.domain.de IPaddr::aaa.bbb.ccc.eee/28/aaa.bbb.ccc.1
>  >>Nov 22 09:34:58 node2 heartbeat: info: Running
>  >>/etc/ha.d/resource.d/IPaddr aaa.bbb.ccc.eee/28/aaa.bbb.ccc.zzz start
>  >>Nov 22 09:34:58 node2 sshd: Starting sshd:
>  >>Nov 22 09:34:58 node2 sshd:  succeeded
>  >>Nov 22 09:34:58 node2 sshd[648]: Server listening on 0.0.0.0 port 22.
>  >>Nov 22 09:34:58 node2 sshd[648]: Generating 768 bit RSA key.
>  >>Nov 22 09:34:58 node2 sshd: ^[[60G
>  >>Nov 22 09:34:58 node2 sshd:
>  >>Nov 22 09:34:58 node2 rc: Starting sshd:  succeeded
>  >>Nov 22 09:34:58 node2 heartbeat: info: ifconfig eth0:0 aaa.bbb.ccc.eee
>  >>netmask 255.255.255.240^Ibroadcast aaa.bbb.ccc.zzz
>  >>Nov 22 09:34:58 node2 sshd[648]: RSA key generation complete.
>  >>Nov 22 09:34:58 node2 heartbeat: info: Sending Gratuitous Arp for
>  >>aaa.bbb.ccc.eee on eth0:0 [eth0]
>  >>Nov 22 09:34:58 node2 kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
>  >>Nov 22 09:34:59 node2 xinetd[692]: Error reading included directory:
>  >>/etc/xinetd.d [line=14]
>  >>Nov 22 09:34:59 node2 xinetd[692]: {init_services} no services. Exiting...
>  >>Nov 22 09:34:59 node2 xinetd: xinetd startup succeeded
>  >>
>  >>
>  >
>  >
>  > So, what you can see is that something on node2 requested a graceful
>  > reboot, and it was carried out.
>  >
>  > Heartbeat was only one of many subsystems affected by this, and offhand
>  > it doesn't appear to be anything associated with heartbeat.
>  >
>  > The clues to the cause are probably in parts of the logs you didn't send
>  > out.
>  >
>  >      -- Alan Robertson
>  >         alanr@unix.sh
>  >
>  >
>  > ------------------------------------------------------------------------------
>  > Linux HA Web Site:
>  >   http://linux-ha.org/
>  > Linux HA HOWTO:
>  >   http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
>  > ------------------------------------------------------------------------------
>  >
>  >
>  >
>  >
> 
> --
> mfg
> 
> Andre
> 