[Linux-ha-dev] Quorum server and split-site

Lars Marowsky-Bree lmb at suse.de
Thu Feb 14 11:57:57 MST 2008


On 2008-02-14T19:39:52, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:

> > Basically, quorumd only "works" in scenarios where fencing is not
> > required and stop failures don't hurt; most of the time, DR scenarios
> > are requested for mission critical services though.
> 
> I must say that I don't see how this could work at all with what
> we currently have, unless the site with quorum assumes that the
> other site (which obviously doesn't have quorum) will clean
> itself up. This is a problem similar to the suicide plugin, i.e.
> the requesting party may have to trust that the other party will
> take care of the issue itself somehow.

Well, one way or the other that's exactly what has to be trusted
"sufficiently" here.

Alternatively, it can wait for an admin to confirm that the other site
is indeed cleanly down.  But even admins can be wrong.

You don't get 100%.

> > There's also the rather fundamental design issue that I don't
> > personally like the stretched cluster implementation as-is; running a
> > local cluster protocol over a WAN link is not exactly perfect. However,
> > that could be considered an inefficiency; the limitations above limit
> > the use cases of the quorum server to a very, very small subset.
> I'm not familiar with the protocol used by quorumd.

That's not what I said; quorumd uses SSL. But we need to have the full
M:M communication across the wire right now; not optimal.

> > I personally favor the cluster-of-clusters approach, but I concede that
> > I don't have working code for that either. ;-)
> I don't see this as a problem of quorumd.

It's a design issue with stretched clusters.

> All it should do is to pick a site with quorum. Other stuff is beyond
> it.

True, but that still means that currently, we cannot handle split-site.
Just picking a site with quorum isn't sufficient.

> The rest is probably something which has to be supported either by the
> CRM or by stonithd, i.e. sth like setting up stonithd so that it is
> going to return success, probably after a configurable delay, for
> every request which is for nodes belonging to the other site.

That's part of making split-site work via stretched clusters, yes. The
delay is already handled by quorumd, so no need to duplicate it. It'd be
somewhat cleaner if the CRM didn't even ask for those nodes to be
fenced; the quorumd client could set them to a clean shutdown state in
the CIB instead, for example.
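
To make the stonithd suggestion quoted above a bit more concrete, here is
a minimal sketch in Python. It is only an illustration under stated
assumptions: SplitSiteFencer, node_site, have_quorum and remote_delay are
hypothetical names, not part of stonithd, the CRM or quorumd. Local nodes
are fenced for real; nodes at the other site are reported as fenced only
if our site holds quorum, i.e. we trust the other site to clean itself up.
The delay defaults to zero, since quorumd already handles it.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SplitSiteFencer:
    """Hypothetical model of the idea; NOT the real stonithd interface."""
    local_site: str
    node_site: Dict[str, str]      # node name -> site name (assumed mapping)
    have_quorum: bool              # True if the quorum server picked our site
    remote_delay: float = 0.0      # optional extra wait, in seconds
    sleep: Callable[[float], None] = time.sleep

    def fence(self, node: str, real_fence: Callable[[str], bool]) -> bool:
        """Return True if the node may be treated as cleanly down."""
        if self.node_site[node] == self.local_site:
            # Local node: a real fencing device is reachable, so use it.
            return real_fence(node)
        if not self.have_quorum:
            # Without quorum we must never claim a remote node is down.
            return False
        # Remote node and we hold quorum: wait, then trust that the other
        # site has stopped its resources. This is trust, not verification.
        self.sleep(self.remote_delay)
        return True
```

For example, with node_site = {"a1": "A", "b1": "B"} and local_site "A",
fence("b1", ...) returns True only when have_quorum is set, and never
calls the real fencing device for b1.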

I'd personally prefer the sites to be more independent clusters in their
own right, with the DR component managing the fail-over more or less
independently. The DR component would also take care of differences in the
resource configuration when running on some site. But that's a question
of design philosophy, I guess.


> Another issue I see with quorumd is that it is itself a SPOF.

This, however, it is not ;-)


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


