[Linux-HA] Re: Autofailback problem

Anetsberger Stephan.Anetsberger at Infoscreen.de
Tue Jun 8 00:22:30 MDT 2004


Hi Colin,

I once had the same problem with an earlier version of heartbeat.
I then commented out the 'auto_failback on' line in ha.cf, and after
that it worked fine.

Stephan

-----Original Message-----
From: linux-ha-request at lists.linux-ha.org
[mailto:linux-ha-request at lists.linux-ha.org]
Sent: None
To: linux-ha at lists.linux-ha.org
Subject: Linux-HA Digest, Vol 7, Issue 14


Today's Topics:

   1. Autofailback problem (Colin Bruce)
   2. Reboot if network or service fail (Nuno Tavares)
   3. Re: stonith apcmastersnmp relocation error FIXED (Andreas Huck)
   4. Forcing failover or failback (Alexis Layton)


----------------------------------------------------------------------

Message: 1
Date: Sun, 6 Jun 2004 19:33:35 +0100 (BST)
From: Colin Bruce <ccx004 at coventry.ac.uk>
Subject: [Linux-HA] Autofailback problem
To: <linux-ha at lists.linux-ha.org>
Message-ID:
	<Pine.LNX.4.33.0406061919440.8283-100000 at alfgar.coventry.ac.uk>
Content-Type: TEXT/PLAIN; charset=US-ASCII

Dear All,

I seem to have a similar problem to the one posted by Nicolas Schmitz on
the 2nd of June.

I have set up heartbeat version 1.2.2 on two test systems. I have one
service address (192.168.255.7). I have no services; all I am doing is
pinging that address for the moment. The master is called lnxtst2 and the
slave lnxtst3. If I ping the service address I get a response. If I then
shut down the master cleanly, the service address moves to the slave and the
pings continue. If I start the master again, the service address moves back
to the master as we would expect. However, it is still on the slave as well.
If I type "IPaddr 192.168.255.7 stop" on the slave it stops perfectly well,
so the question is: how come it works if typed by hand but doesn't work if
run from Heartbeat? Actually, the real question is what I might have done
wrong.

Here is a section from the logfile as the master tries to take back the
resource. This is taken from the slave. The log file on the master doesn't
show any errors.

heartbeat: 2004/06/06_18:49:50 info: Heartbeat restart on node lnxtst2
heartbeat: 2004/06/06_18:49:50 info: Link lnxtst2:eth0 up.
heartbeat: 2004/06/06_18:49:50 info: Status update for node lnxtst2: status up
heartbeat: 2004/06/06_18:49:50 info: Running /usr/local/packages/heartbeat-1.2.2/etc/ha.d/rc.d/status status
heartbeat: 2004/06/06_18:51:50 WARN: 1 lost packet(s) for [lnxtst2] [61:63]
heartbeat: 2004/06/06_18:51:50 info: Status update for node lnxtst2: status active
heartbeat: 2004/06/06_18:51:50 info: No pkts missing from lnxtst2!
heartbeat: 2004/06/06_18:51:50 info: Running /usr/local/packages/heartbeat-1.2.2/etc/ha.d/rc.d/status status
heartbeat: 2004/06/06_18:51:51 ERROR: Both machines own our resources!
heartbeat: 2004/06/06_18:51:51 info: remote resource transition completed.
heartbeat: 2004/06/06_18:51:51 ERROR: Both machines own our resources!
heartbeat: 2004/06/06_18:51:51 ERROR: Both machines own foreign resources!
heartbeat: 2004/06/06_18:51:51 info: lnxtst3 wants to go standby [foreign]
heartbeat: 2004/06/06_18:51:51 ERROR: Both machines own our resources!
heartbeat: 2004/06/06_18:51:51 ERROR: Both machines own foreign resources!
heartbeat: 2004/06/06_18:52:02 WARN: No reply to standby request.  Standby request cancelled.
heartbeat: 2004/06/06_18:52:02 ERROR: Both machines own our resources!
heartbeat: 2004/06/06_18:52:02 ERROR: Both machines own foreign resources!

The conf files are more or less as distributed. We have added entries to
ha.cf for the nodes and a line in haresources for the service address.

In ha.cf, auto_failback is set to on and the node lines are

node    lnxtst2
node    lnxtst3

In haresources the only line is

lnxtst2     192.168.255.7

Any thoughts would be appreciated.

Best wishes...
Colin Bruce



------------------------------

Message: 2
Date: Mon, 07 Jun 2004 08:39:17 +0100
From: Nuno Tavares <nunotavares at hotmail.com>
Subject: [Linux-HA] Reboot if network or service fail
To: linux-ha at muc.de
Message-ID: <pan.2004.06.07.07.39.17.361197 at hotmail.com>
Content-Type: text/plain; charset=ISO-8859-15

Greetings to all.

This is the situation: I usually manage systems remotely (as most of us
probably do), but I'm a really distracted guy and sometimes I start playing
around with the network and (guess what) I lose my connection. Then I
have to wait until I have physical access to the machines.

I'd like to set heartbeat to detect when it has lost contact with the
outside network, instead of a specific host, and reboot if confirmed. So
if I mess up the network parameters, it will reboot.

Moreover, another critical service that has to be monitored is the SSH
daemon. So if I do a "service sshd restart" and it doesn't restart the
daemon, I'd like heartbeat to fall back to a standard configuration.

So, my questions are:
1) how to monitor and react to network connectivity (instead of the peer's)
2) how to monitor services (may involve creating a heartbeat script to
roll back to a service "safe-mode")

-- 
Nuno Tavares
http://nthq.cjb.net/




------------------------------

Message: 3
Date: Mon, 7 Jun 2004 11:25:46 +0200
From: Andreas Huck <ha at huck.it>
Subject: Re: [Linux-HA] stonith apcmastersnmp relocation error FIXED
To: General Linux-HA mailing list <linux-ha at lists.linux-ha.org>
Message-ID: <200406071125.46392.ha at huck.it>
Content-Type: text/plain;  charset="iso-8859-15"

Hi,

On Saturday 05 June 2004 14:44, Lars Ellenberg wrote:
> / 2004-06-05 13:45:06 +0200
>
> \ Andreas Piesk:
> > Andreas Huck schrieb:
> > >Hi,
> > >
> > >On Friday 04 June 2004 22:23, Andreas Piesk wrote:
> > >[...]
> > >
> > >>>    // issue a warning if ident mismatches
> > >>>-    if (strcmp(ident, TESTED_IDENT) != 0) {
> > > >>>+    for(i=sizeof(APC_tested_ident)/sizeof(APC_tested_ident[0]) -1; i >=0 ; i--)
> > > >>>+      if (!strcmp(ident, APC_tested_ident[i])) break;
> > > >>>+    if (i<0) {
> > >>>       syslog(LOG_WARNING,
> > >>>              "%s: module not tested with this hardware '%s'",
> > >>>              __FUNCTION__, ident);
> > >>
> > >>hmm, i don't like the for-loop especially the division. how about
> > >>something like that:
> > >>
> > >>static const char* APC_tested_ident[] = {"AP9606", "AP7920",
> > >>                                         "AP_other_well_tested",
> > >>                                          NULL};
> > >>int i=0;
> > >>while( APC_tested_ident[i] != NULL &&
> > >>       strcmp(ident,APC_tested_ident[i])) i++;
> > >>
> > >>  if( APC_tested_ident[i] == NULL ) {
> > >>    // not tested
> > >
> > >hm, you exchange a single integer division by a dependency (NULL has to
> > >be the last entry). But OK, your code is easier to read, go ahead.
> >
> > you are right about the dependency. this problem can easily be solved by
> > using a macro to build the APC_tested_ident array:
> >
> > #define IDENTS(b ... ) {b,NULL}
> > char *APC_tested_idents[]=IDENTS("AP9606","AP7920","AP1234");
> >
> > about the integer division: i'm not concerned about the division itself.
> > the method size of array/number of elements depends on the size of the
> > elements. what if apc changes the ident to something like 'APC
> > SUPERDUPER 0815'?
>
> nothing.
> APC_tested_ident is an array of char* .
> sizeof(APC_tested_ident) has nothing to do with the length of the
> cstrings those char* point to.
>  #define NR_ARRAY_ELEMENTS(A) (sizeof(A)/sizeof(A[0]))

In linux-ha/portability.h, which is included in apcmastersnmp.c, we already
have
#define DIMOF(a)                ((int) (sizeof(a)/sizeof(a[0])) )

so this makes the for-loop easier to read:
+    for(i=DIMOF(APC_tested_ident) -1; i >=0 ; i--)
+      if (strcmp(ident, APC_tested_ident[i]) == 0) break;
+    if (i<0) {
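
For anyone following along, here is a minimal self-contained sketch of the two lookup styles discussed in this thread: the DIMOF-based reverse scan and the NULL-sentinel scan. It is not code from apcmastersnmp.c. The function names (is_tested_dimof, is_tested_sentinel, tested_count), the array names, and the list contents are assumptions made up for illustration; only the DIMOF definition mirrors the one quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the DIMOF macro quoted above from linux-ha/portability.h. */
#define DIMOF(a) ((int) (sizeof(a) / sizeof((a)[0])))

/* Hypothetical ident list; the entries are placeholders taken from the thread. */
static const char *tested_ident[] = { "AP9606", "AP7920", "AP1234" };

/* The same list with a trailing NULL sentinel, for the while-loop variant. */
static const char *tested_ident_nt[] = { "AP9606", "AP7920", "AP1234", NULL };

/* DIMOF-based reverse scan: returns 1 if ident is listed, 0 otherwise. */
static int is_tested_dimof(const char *ident)
{
    int i;

    assert(ident != NULL);
    for (i = DIMOF(tested_ident) - 1; i >= 0; i--)
        if (strcmp(ident, tested_ident[i]) == 0)
            break;
    return i >= 0;   /* i == -1 means the loop ran off the front: no match */
}

/* Sentinel-based forward scan: stops at the first match or at the NULL. */
static int is_tested_sentinel(const char *ident)
{
    int i = 0;

    assert(ident != NULL);
    while (tested_ident_nt[i] != NULL &&
           strcmp(ident, tested_ident_nt[i]) != 0)
        i++;
    return tested_ident_nt[i] != NULL;
}

/* Number of entries; this counts pointers, so string lengths do not matter. */
static int tested_count(void)
{
    return DIMOF(tested_ident);
}
```

Calling is_tested_dimof("APC SUPERDUPER 0815") or is_tested_sentinel("APC SUPERDUPER 0815") simply reports "not listed" in both variants, and tested_count() stays at the number of pointers in the array no matter how long the ident strings get, which is the point made about sizeof above.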

Anyway.
Regards,
Andreas
 



------------------------------

Message: 4
Date: Mon, 07 Jun 2004 11:46:45 -0400
From: "Alexis Layton" <alex at permabit.com>
Subject: [Linux-HA] Forcing failover or failback
To: linux-ha at lists.linux-ha.org
Message-ID: <opr88d37ik6xmxra at smtp.permabit.com>
Content-Type: text/plain; format=flowed; delsp=yes; charset=iso-8859-1

Hi folks,

Currently our apphb notification plugin calls ha_standby when our
application is no longer healthy.  This appears to always shut down the
current node and start the secondary, regardless of whether the secondary
is the current node.

What is the canonical way to programmatically fail over to the other node
in a Linux-HA 1.2 based configuration?

-- 
Alexis Layton
Permabit, Inc.


------------------------------

_______________________________________________
Linux-HA mailing list
Linux-HA at lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha


End of Linux-HA Digest, Vol 7, Issue 14
***************************************

