[Linux-ha-dev] Quorum server and split-site
Dejan Muhamedagic
dejanmm at fastmail.fm
Thu Feb 14 11:39:52 MST 2008
Hi,
On Thu, Feb 14, 2008 at 07:22:08PM +0100, Lars Marowsky-Bree wrote:
> In a call today the question was raised why I considered the quorumd
> design, as it stands today, as not capable of supporting split-site
> configurations.
>
> The call was a bit tight, so I'd like to elaborate here:
>
> 1. Consider the easy case of 2 nodes at two sites, with a 3rd quorum
> server somewhere.
>
> 2. Consider loss of link between the two sites; possibly site A also
> loses connection to the quorum server.
>
> 3. After a timeout, the quorum server will grant quorum to site B; site
> A will have lost it.
>
> 4. Site A will try to stop resources as soon as it loses quorum. However
> it _cannot deal with stop failures_ at this stage.
>
> It cannot use fencing, as the current setup cannot differentiate between
> two sites, and doesn't know it has to consider the other site dead
> already. (It'd block eternally.) Further, if there's just one node, the
> suicide plugin wouldn't work - see the discussion on the linux-ha list.
The suicide plugin will have to be made to work again, in one form or
another.
> The same scenario occurs with >2 nodes, just more so - the system
> doesn't know the difference between site-local and site-remote fencing,
> and cannot cope.
>
> Basically, quorumd only "works" in scenarios where fencing is not
> required and stop failures don't hurt; most of the time, DR scenarios
> are requested for mission critical services though.
I must say that I don't see how this could work at all with what
we currently have, unless the site with quorum assumes that the
other site (which obviously doesn't have quorum) will clean
itself up. This is a problem similar to the suicide plugin, i.e.
the requesting party may have to trust that the other party will
take care of the issue itself somehow.
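To make the scenario in steps 1 to 4 quoted above concrete, here is a minimal sketch in Python. It is a toy model, not the quorumd protocol: the class `ToyQuorumServer`, the function `on_quorum_lost` and the timeout value are all hypothetical, and it only covers the case where site A also loses its connection to the quorum server.

```python
# Illustrative toy only; none of these names exist in quorumd or heartbeat.
class ToyQuorumServer:
    def __init__(self, sites, timeout):
        self.timeout = timeout
        self.last_seen = {site: 0.0 for site in sites}

    def heartbeat(self, site, now):
        # Record the last time the server heard from a site.
        self.last_seen[site] = now

    def quorum_holder(self, now):
        """Return the only site heard from within the timeout, else None.

        None covers both "all sites reachable" and "no site reachable";
        this toy does not model the link between the sites themselves.
        """
        alive = [s for s, t in self.last_seen.items() if now - t <= self.timeout]
        return alive[0] if len(alive) == 1 else None


def on_quorum_lost(stop_succeeded, can_fence):
    """Outcome at the site that lost quorum (step 4)."""
    if stop_succeeded:
        return "clean"
    if can_fence:
        return "fenced"
    # Stop failed and fencing is unusable: resource state stays unknown.
    return "blocked"


server = ToyQuorumServer(["A", "B"], timeout=10.0)
server.heartbeat("A", now=0.0)
server.heartbeat("B", now=0.0)
server.heartbeat("B", now=12.0)          # only B still reaches the server
holder = server.quorum_holder(now=15.0)  # A has been silent for 15s > timeout
outcome_a = on_quorum_lost(stop_succeeded=False, can_fence=False)
```

Under these assumptions site B ends up holding quorum while site A is stuck in the "blocked" state, which is exactly the case where B has to trust that A cleans up after itself.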
> There's also the rather fundamental design issue that I don't
> personally like the stretched cluster implementation as-is; running a
> local cluster protocol over a WAN link is not exactly perfect. However,
> that could be considered an inefficiency; the limitations above limit
> the use cases of the quorum server to a very, very small subset.
I'm not familiar with the protocol used by quorumd.
> I personally favor the cluster-of-clusters approach, but I concede that
> I don't have working code for that either. ;-)
I don't see this as a problem of quorumd. All it should do is
to pick a site with quorum. Other stuff is beyond it. The rest is
probably something which has to be supported either by the CRM or
by stonithd, i.e. something like setting up stonithd so that it is
going to return success, probably after a configurable delay, for
every request which is for nodes belonging to the other site.
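A rough sketch of that idea in Python, purely to illustrate the control flow. This is not stonithd's API: `fence`, `local_site_nodes`, `real_fence` and `remote_delay` are hypothetical names, and the fencing device is a stand-in function.

```python
import time


def fence(node, local_site_nodes, real_fence, remote_delay=0.0, sleep=time.sleep):
    """Fence `node`; for nodes of the other site, report success after a delay.

    Hypothetical helper, not part of stonithd.
    """
    if node in local_site_nodes:
        # Site-local node: use the actual fencing device.
        return real_fence(node)
    # Other-site node: wait so that site has time to stop its resources,
    # then report success. Nothing is actually fenced here.
    sleep(remote_delay)
    return True


calls = []


def fake_device(node):
    # Stand-in for a real fencing device; records which nodes it was asked to fence.
    calls.append(node)
    return True


local = {"b1", "b2"}
local_result = fence("b1", local, fake_device)
remote_result = fence("a1", local, fake_device, remote_delay=0.01)
```

Only `b1` reaches the device; the request for `a1` succeeds on trust alone, so the delay has to be longer than the time the other site needs to notice the loss of quorum and stop (or kill) its resources.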
Another issue I see with quorumd is that it is itself a SPOF.
Thanks,
Dejan
> Regards,
> Lars
>
> --
> Teamlead Kernel, SuSE Labs, Research and Development
> SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/