[LinuxFailSafe] FailSafe Activity - what's happening with FailSafe today?

Lars Marowsky-Bree lmb@suse.de
Tue, 28 Jan 2003 12:12:40 +0100


On 2003-01-22T19:12:15,
   Kashif Shaikh <kshaikh@consensys.com> said:

> Nope.  Failsafe needs a lot of improvements.  One of the most apparent
> problems I have found is failsafe complexity -- it's 300,000+ lines of
> code (not counting the GUI component) with 7 daemons * number of nodes
> trying to communicate with each other.  As you can imagine, the smallest
> problems quickly propagate to all the subsystems causing things to
> malfunction.

It also makes the code difficult to understand: even if you have some of the
design documents, you'll find a certain gap between the whiteboard design and
the actual code base. Refactoring the FailSafe code base is something I
personally deem nearly impossible without active collaboration with SGI, and,
well, that isn't happening for a variety of reasons ;)

Our - SuSE's - decision has thus been to jump into cold water and rescue the
best of FailSafe's ideas, of which there are indeed plenty - the cdbd is a
magnificent piece of code and functionality - and rebuild them on top of
heartbeat, which seems to have received more community attention in the past
and present.

I've in fact posted a design proposal for such a reworked Resource/Recovery
Manager to the linux-ha-dev list two weeks ago; people familiar with FailSafe
will notice a few similarities.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Principal Squirrel 
SuSE Labs - Research & Development, SuSE Linux AG
  
"If anything can go wrong, it will." "Chance favors the prepared (mind)."
  -- Capt. Edward A. Murphy            -- Louis Pasteur