[Linux-ha-dev] Request For Comments: Proposed Consensus Cluster
membership manager for heartbeat
Alan Robertson
alanr@suse.com
Tue, 31 Oct 2000 10:14:57 -0700
Lars Marowsky-Bree wrote:
>
> On 2000-10-30T15:25:02,
> David Brower <dbrower@us.oracle.com> said:
>
> > It's hard for me to understand how serial links are of interest
> > with cluster sizes greater than a very few nodes, and certainly
> > not at all once we have 8 or 10.
>
> Yes.
>
> Serial links are mostly useful for 2-3 node clusters; beyond that, they not
> only become a bottleneck but also a cabling nightmare ;-)
>
> > I am skeptical of anything that relies exclusively on messaging;
> > since I believe in shared storage systems, I think the presence
> > of shared, persistent storage can be very useful at quickly
> > resolving connectivity, partition and quorum problems.
>
> Shared storage reachable yes/no is basically just an additional "message"
> about connectivity.
It's a message about connectivity which is sent on a completely independent
communication medium. An important medium without doubt, but just another
medium.
> I, as Alan, like the Kimberlite approach of using the shared storage as an
> additional full communication medium.
>
> However, I see that a shared storage system may not necessarily be connected
> to all nodes, and we'll end up having to forward packets between
> "subclusters" to have full connectivity in the main cluster. (Which is sort of
> the same with serial links, where you don't want to forward _everything_ over
> the serial link but just what has to be forwarded)
>
> Now, let me throw in two random ideas.
<snip>
> I suggest "OSPF - Anatomy of a routing protocol" by Moy for further reading.
I'll get a copy.
>
> For messaging, there are two issues. First is getting the message to all
> nodes, second is to acknowledge that the message was either sent to all nodes
> or none.
>
> The first one can be solved reasonably well by using a flood fill algorithm.
> This has been proven to work rather well with Usenet. Basically newsservers
> announce over all their links: "I have message-id foo", and the peers reply
> with "send message with message-id foo" if they haven't seen it yet and
> forward it on too. (There is a slight short-cut in the Path header - i.e. a
> message isn't offered to a host which is already in the path to prevent loops
> and messages don't have to be offered every time)
>
> The second one is basically solved as soon as you know that the message has
> been accepted by at least two peers: This will protect you against a single
> failure, because both of them will continue broadcasting the message.
My first reaction is that this is overkill for this problem. A cluster is
quite a different entity from the world wide internet with all of its
political, organizational and technical challenges. You can afford
redundant communications, and you should use them. This mitigates the need
to work perfectly when your communications are broken. Survive. Keep
working. Don't do anything stupid. Tell the admin that something is
broken. Get help.
For the upper layer cluster membership, this makes good sense. But for the
lower-layer membership, it sounds like a lot of complexity.
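
For concreteness, the offer/request flooding described in the quoted text
above can be sketched in a few lines of Python. This is only an illustrative
sketch, not heartbeat code: the Node class, its link/offer/deliver methods,
and the flood helper are all hypothetical names, and delivery is modeled as
direct in-process calls rather than real network links.

```python
class Node:
    """Hypothetical cluster node that floods messages to its direct peers."""

    def __init__(self, name):
        self.name = name
        self.peers = []   # nodes this node has a direct link to
        self.seen = {}    # message-id -> payload already accepted

    def link(self, other):
        """Create a bidirectional link between two nodes."""
        self.peers.append(other)
        other.peers.append(self)

    def offer(self, msg_id):
        """A peer announces 'I have message-id X'.

        Returns True ('send it') only if this node has not seen it yet.
        """
        return msg_id not in self.seen

    def deliver(self, msg_id, payload, path):
        """Accept a message, then offer it onward over every link."""
        if msg_id in self.seen:
            return
        self.seen[msg_id] = payload
        path = path + [self.name]
        for peer in self.peers:
            # Path short-cut: never offer to a node the message
            # has already travelled through.
            if peer.name in path:
                continue
            if peer.offer(msg_id):
                peer.deliver(msg_id, payload, path)


def flood(origin, msg_id, payload):
    """Inject a new message at 'origin' and let it flood the cluster."""
    origin.deliver(msg_id, payload, [])
```

Linking five such nodes in a ring and calling flood() on any one of them
leaves the message in every node's seen table, and it still reaches every
node if one ring link is missing, since the flood simply goes around the
other way.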
> Ensuring ordered delivery is a bit tricky, I haven't yet come up with a
> solution which doesn't involve sending a reply from every node to the cluster
> leader. However, if heartbeat automatically broke up the cluster into
> subclusters of let's say 16 nodes, this would be a hierarchical messaging
> approach and wouldn't be too bad.
It doesn't matter that much. If you have 16 clusters of 16 nodes, you still get
256 replies. This matters a lot if you have an n**2 communication
algorithm. But if it's linear, then it's linear, and cutting it into
subclusters doesn't help much; it just means that one machine doesn't have
to process all the replies. For 100 nodes that's probably not too bad (use
random reply delays). For 1000 nodes, it probably matters.
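
The arithmetic behind this can be made explicit with a small Python sketch.
It is a back-of-the-envelope model under stated assumptions, not a
measurement: every node is counted as sending one reply (so 16 x 16 nodes
gives 256), each subcluster leader is assumed to send one extra aggregated
reply upward, and reply_load and reply_delay are hypothetical helper names.

```python
import math
import random


def reply_load(n_nodes, subcluster_size=None):
    """Return (total_replies, peak_replies_at_one_machine).

    Flat case: every node replies to the single cluster leader.
    Hierarchical case (assumed model): every node replies to its
    subcluster leader, and each subcluster leader sends one
    aggregated reply to the top-level leader.
    """
    if subcluster_size is None:
        return n_nodes, n_nodes
    n_sub = math.ceil(n_nodes / subcluster_size)
    total = n_nodes + n_sub
    # The busiest machine is either a subcluster leader or the top leader.
    peak = max(min(subcluster_size, n_nodes), n_sub)
    return total, peak


def reply_delay(window_seconds, rng=random):
    """Pick a random delay so replies do not all arrive at once."""
    return rng.uniform(0.0, window_seconds)
```

Under these assumptions, reply_load(256) gives (256, 256) while
reply_load(256, 16) gives (272, 16): the total stays linear in the number
of nodes (slightly higher, in fact), and only the peak per machine drops.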
> But hell, right now we are just solving the membership issue and not the
> quorum issue, no? ;-)
Actually, this issue of replies comes up in the membership algorithm, and
ordered message delivery isn't the same as quorum either. But in either
case, you're right ;-)
For ordered message delivery, look at the Phoenix n-phase transactions.
They've done a pretty reasonable job.
> Let me add I do not believe that the networking concepts are 1:1 applicable,
> but that I nonetheless think that they do hold some truth and we shouldn't
> reinvent the wheel every time.
I understand. I've sent a query off to my favorite bookstore.
Certainly you raise some interesting thoughts. If you have hierarchical
cluster support, what would be the "natural" lower-level cluster size? One
answer is: Whatever a single switch supports. For example, 24 nodes, or 32
nodes or however many plug into a single switch. Another thought would be
"however many machines are in a rack". These are probably related ;-) For
Google, the latter answer would be 80.
I am still shooting for supporting ~ 100 nodes for the lower-layer
membership software.
Maybe it won't work. I think it's a good goal, though...
-- Alan Robertson
alanr@suse.com