Heartbeat: broadcast or point to point

Steve Underwood steveu@netpage.com.hk
Wed, 31 Mar 1999 09:01:41 +0000


> Alan Robertson wrote:
> > > I think UDP broadcasts are sufficient for my implementation because I
> > > only act if five in sequence go missing, so it's not critical if a
> > > packet or two get lost.
> >
> > That may be sufficient for your particular application, but for some
> > applications, 5 seconds downtime without knowing it is a long time.  It
> > would be nice if we could do something which could be tuned to other
> > applications.

You need to be very careful about over-enthusiastic fail-overs. I've had trouble
before trying to get a really fast decision about a failed element in a system
when the elements are subject to variable loading. Servers twiddling their
thumbs waiting for their big moment have no loading issues, but the ones they
are monitoring may. You can get spurious fail-overs that occur only because the
live server suffered a short-term high load which delayed its heartbeats. This
is a key problem in HA and redundant systems: it can be easy to handle modules
that are clearly good or clearly bad, but very difficult to handle the quirky
and intermittent ones.
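
To make the trade-off concrete, here is a minimal sketch of a detector that
declares a peer dead only after a tunable number of consecutive heartbeat
intervals pass with nothing received. The class name, field names and the
choice to pass the clock in as an argument are all assumptions made for
illustration; this is not the code of any existing heartbeat implementation.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical detector: the peer is considered dead only after
    `max_missed` whole heartbeat intervals pass with nothing received."""

    interval: float        # expected seconds between heartbeats (tunable)
    max_missed: int = 5    # silent intervals tolerated before fail-over
    last_seen: float = 0.0 # timestamp of the most recent heartbeat

    def beat(self, now: float) -> None:
        """Record that a heartbeat arrived at time `now`."""
        self.last_seen = now

    def missed(self, now: float) -> int:
        """Whole intervals elapsed since the last heartbeat was seen."""
        return max(0, int((now - self.last_seen) / self.interval))

    def is_dead(self, now: float) -> bool:
        """True once the silent period reaches the configured threshold."""
        return self.missed(now) >= self.max_missed


# A live but heavily loaded server goes quiet for 3 seconds.
tolerant = HeartbeatMonitor(interval=1.0, max_missed=5)
hasty = HeartbeatMonitor(interval=1.0, max_missed=2)
tolerant.beat(10.0)
hasty.beat(10.0)

tolerant_verdict = tolerant.is_dead(13.0)  # rides out the load spike
hasty_verdict = hasty.is_dead(13.0)        # fails over although the peer is alive
```

Worst-case detection time is roughly interval * max_missed, so shrinking
either value buys a faster decision at the price of more false fail-overs
under load. In real use the timestamps would probably come from a monotonic
clock such as time.monotonic(); they are passed in here only to keep the
sketch deterministic.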

I agree that 5 seconds can be a long time. The first redundant system I had to
design had to fail over in 20ms. That was practical because the units were
performing sets of well-defined axis transformations every 20ms. The load was
totally predictable, and detecting a failure by some negotiation at the end of
each 20ms cycle did the trick. I doubt a generic solution to fast, stable
changeover is practical without custom hardware.

Steve