Network Availability

Juri Haberland juri@koschikode.com
Mon, 23 Apr 2001 15:45:33 +0200


Damn, this one should have gone to the list...
Sorry, Lionel.

Lionel Cottin wrote:
> 
> Thank you for your ideas about this Juri;
> 
> mon can be a good alternative; it will do the job expected, but I think it's
> better to let each node decide, via heartbeat, which node is best placed to
> deliver the service.
> 
> As in other cluster products, the heartbeat system has to monitor all the
> needed resources to run a particular service on a particular node. If it

The problem here is that heartbeat wasn't designed to do any monitoring
at all!

> detects a problem, it calls an application, whose role is to check each
> resource and try to recover it if necessary (in our case, it can
> "ifconfig up" another local interface, plugged into another switch, to recover
> the network connectivity); if this step fails, then the heartbeat system
> tells the other node to take over.

This can currently only be done by shutting down heartbeat on the active
node.

> With such an implementation, we can set up an escalation of recovery actions
> before moving the service onto another node (that can be a time-consuming
> operation with Oracle databases, for example)... But I also agree with you
> that in many cases the mon solution is a sufficient one...

All of this I did with mon. Mon was running on each node, monitoring only
the resources that _this_ node currently held. If heartbeat moved the
services to another node (i.e. stopped the running services), it also told
mon to stop monitoring the services, but not the network or heartbeat itself.
If a service failed, mon tried to restart it, and if that failed three
times it rebooted the node in the hope of cleaning things up (the windoze
approach ;-). I monitored network connectivity, disk access, nfs access
(it was a file server) and heartbeat itself. If, for instance,
the ext3 journald decided to lock up, mon did a "reboot -fn"; very rude
but the only reliable way in such a case.
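
To make the escalation concrete, here is a minimal sketch of the
"restart up to three times, then reboot" logic described above. It is
not mon's actual configuration or API; the function name and the three
callables (check, restart, reboot) are hypothetical stand-ins for
whatever monitor and alert scripts are really used.

```python
from typing import Callable

MAX_RESTARTS = 3  # the "three times" mentioned in the post


def handle_failure(
    check: Callable[[], bool],
    restart: Callable[[], None],
    reboot: Callable[[], None],
    max_restarts: int = MAX_RESTARTS,
) -> str:
    """Escalate from service restarts to a node reboot.

    check()   -> True if the service is healthy (hypothetical probe)
    restart() -> tries to restart the service (hypothetical action)
    reboot()  -> last resort, e.g. a wrapper around "reboot -fn"
    """
    for _ in range(max_restarts):
        restart()
        if check():
            return "recovered"
    # Every restart attempt failed: fall back to the rude approach.
    reboot()
    return "rebooted"
```

Passing the actions in as callables keeps the escalation policy separate
from the commands themselves, so it can be tried out without touching a
live node.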

> In heartbeat, I've seen that each time an interface becomes dead, the
> /etc/ha.d/rc.d/ifstat script is called; I think it can be a good place to
> add some tests and recovery actions...

If an interface becomes dead, it just means heartbeat does not get an
answer from its partner. It cannot distinguish whether your gateway or
hub or the other node is dead. What is really needed is a complete
framework for monitoring - I think Michael Moerz wanted to work on that,
but I'm not sure.
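
As a rough illustration of what such monitoring would have to decide,
the sketch below probes the gateway and the partner node separately and
classifies the result. This is not part of heartbeat or mon; the
function names, the labels and the use of "ping -c 1 -W 1" (Linux
iputils syntax) are all assumptions made for the example.

```python
import subprocess
from typing import Callable


def ping_once(host: str) -> bool:
    """Return True if one ICMP echo to host succeeds.

    Assumes a Linux iputils-style ping that accepts -c and -W.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify(gateway: str, peer: str,
             reachable: Callable[[str], bool] = ping_once) -> str:
    """Guess which component failed from two independent probes."""
    gw_ok = reachable(gateway)
    peer_ok = reachable(peer)
    if gw_ok and peer_ok:
        return "ok"
    if gw_ok:
        return "peer unreachable"
    if peer_ok:
        return "gateway unreachable"
    # Neither answers: most likely our own interface, cable or hub.
    return "local network problem"
```

A dead heartbeat link on its own corresponds only to "the peer does not
answer"; the extra gateway probe is what lets a node guess whether the
fault is on its own side.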

> This script, by default, just contains the true command; I've tried to
> replace it with false, but nothing happens....
> Any ideas about this?

Maybe just a leftover? No idea.

Juri