[Linux-ha-dev] [RFC] Fencing, the next one

Andrew lists at beekhof.homeip.net
Sun Aug 15 11:15:44 MDT 2004


Mostly the purpose of this reply is to set the record straight on a 
couple of things (how it works now) and make sure I understand Alan's 
proposal properly.

For what it's worth, I do like the idea of a smart fence_d (options 
i,2,c from one of my previous emails in this thread) provided that:
- the CRM decides who is connecting to which fencing devices (to me 
this part is definitely a resource management problem)
- (for the moment at least) the CRM should decide when and who to fence 
(but not who should do it)

Likewise, a 2-node RM would also have to perform the same functions.

Whether the LRM or a fence_d monitors the controller connections 
doesn't make much difference to me, but for consistency perhaps the LRM 
is a good option.

Andrew

On Aug 13, 2004, at 10:40 PM, Alan Robertson wrote:

> Lars Marowsky-Bree wrote:
>
>
> Hi Lars,
>
> This is very well-written and I think communicates your ideas very 
> clearly.  Thanks for your efforts!
>
>> Hi all,
>
>> 1. Assumption (at least for the time being): In a CRM cluster, the CRM
>>    is in charge.
>> As we don't have a global cluster recovery policy manager yet, the CRM
>> is probably the closest we have to it, and I would argue that it may
>> also be possible to eventually extend it to meet all requirements. 
>> Maybe
>> it will also become subservient to the meta-policy manager. However 
>> that
>> may turn out, I propose to accept that the CRM is in charge for the 
>> time
>> being, which it is capable of for now, until we integrate with GFS.
>> We don't have a fencing layer yet, nor do we have any other users of
>> fencing yet. Or if they would appear, we'd have to figure out a way to
>> tell the CRM about their needs. But, see below, I think this proposal
>> also allows for an easier migration to a fencing subsystem eventually.
>> We'll need to revisit this after the beta in the next year.
>
> I understand this assumption, and I can accept it as long as we 
> don't have to integrate or support any cluster filesystems while 
> holding to this assumption.  I would take the phrase "until we 
> integrate with GFS" to mean that you agree with this same idea.
>
> Is that a correct understanding?
>
> However, I believe that a design which actively takes advantage of 
> this assumption  hurts us from several perspectives, which I will 
> explain later in this reply.
>
>> 2. We want to monitor the fencing daemon.
>
> Yes.
>
>> This means we need to pick one of the nodes from the list which can
>> reach a given particular STONITH Device, start the fenced there and
>> begin monitoring it.
>> Now, starting/stopping/choosing a node is really something the CRM does
>> well in combination with the LRM. So, I propose to have the 
>> fenced
>> started, stopped, monitored etc via an OCF Resource Agent. The fenced
>> tasks would be configured as (dependency free, self-fencing) regular
>> resources in the CIB/CRM.
>> If the fenced dies or a monitoring failure of some sort is received, 
>> the
>> CRM reallocates the resource somewhere else as always.
>> Note explicitly: That's all perfectly within the spec of the OCF RA. 
>> See
>> the next point.
>> 3. We do NOT overload the OCF RA API with any semantic changes or new
>>    actions.
>
> This assumption seems reasonable.  It does not come without cost
> however, since I'm afraid it adds dependencies between the
> STONITH resources, and the non-resource of the fencing operation.
> See below for more examples and explanations.
>
>> Instead, the fenced, after having started, connects to the STONITH
>> Device, signs in with the local CRMd and informs it about the list of
>> nodes it can fence via the STONITH device it controls, and also when
>> that list changes. Fencing requests from the CRMd to the fenced also 
>> go
>> via that connection.
>
>
> However, here is where I beg to differ with you.  I talk more about 
> why this is later...
>
>> For this channel, the IPC layer is used. This means that the normal 
>> code
>> path to fence a node is non-blocking.
>
> Good.
>
>> (Quick stray: _If_ we lose the node which we need to fence some other
>> node, yes indeed we have to reallocate the fenced for that device
>> somewhere else first and complete the sign-in, which means forking a 
>> few
>> processes etc. However, I shall claim for the sake of this discussion
>> that this does not matter; we are not in a critical write-out path for
>> GFS here, second we are not even yet integrated with a cluster
>> filesystem, and third if a node is so sick that it can't fork a RA +
>> fenced we better find out very quickly so we don't accidentally
>> allocate some other resources there. So I claim this is 'good 
>> enough'.)
>
> But, just because you're in charge of everything (which you have to be 
> in the non-GFS scenario), and I fully accept that assumption, does not 
> mean that this second assumption follows.  Perhaps it's implied rather 
> than stated here.  Let me see if I can tease out an assumption or two 
> which I believe may be implied here...
>
> 	A) That it is acceptable to have a completely different
> 		implementation for the GFS and non-GFS cases.
>
> 		This will mean writing, integrating, and debugging
> 		and [possibly even supporting indefinitely]
> 		two different implementations for the fencing
> 		code.  The main reason why I think it would be
> 		acceptable to do this would be if your proposal
> 		were significantly less effort (on the order of
> 		thousands of lines of code smaller).
>
> 	B) That this implementation is significantly shorter or
> 		less work than one which could be common.
>
> 		[See the tail end of (A) for reasoning why
> 		 one could argue that this is implied by
> 		 your statements].
> 	
> 	    An alternative statement of this is:
>
> 	    We don't have time or resources or whatever to
> 		do this "right" the first time.  This version
> 		is the classic "We don't have time to do it
> 		right, so let's make time to do it over".
>
> 		Sun Jiang Dong and Huang Zhen believe they
> 		have the time to do this in their schedules.
> 		I also think they can do this - and my belief
> 		is that they can do this in time.
>
> 		It is my strong suspicion that an implementation
> 		which has the smarts in the fencing subsystem
> 		is approximately the same size as one which
> 		is dumb.  It is my belief that the sizes
> 		differ by no more than 500 lines
> 		of code one way or the other.  It is entirely
> 		possible that the "right" approach actually
> 		has fewer lines of code in it - by the time
> 		you take care of all the timing and implicit
> 		dependencies of putting into the CRM and
> 		getting it to work flawlessly.
>
> 		As far as the work in the CRM itself, it's
> 		quite obvious that this proposal is more
> 		work in the CRM than a "smart" fencing
> 		subsystem would be.  Since the CRM is
> 		our main critical path item to release 2.0,
> 		it would seem that schedules would actually
> 		be improved by a smarter fencing subsystem
> 		being done by someone other than Andrew
> 		(our brilliant, wonderful, but critical-path
> 		CRM developer)
> 	
> 	C) The abstraction of the concept of fencing from
> 		a cluster-wide perspective should be
> 		incorporated into the CRM, rather than
> 		being a full-fledged subsystem with its
> 		own high-level abstractions.
>
> 	D) Adding non-functional complexity to the CRM is
> 		acceptable (low-cost or no-cost).
>
> 		The CRM is by far the most complex subsystem that
> 		we have.  It is my belief that anything
> 		which adds complexity to the CRM must have
> 		strongly compelling motivations to include.
> 		Since this does not (in my belief) shorten
> 		schedules or add any new capabilities, I don't
> 		know of any compelling reasons to choose
> 		this alternative.
>
> 	[Guess that's all that comes to mind for the moment]
>
>> Note: This means that inside the CIB, the fenced's are just really 
>> plain
>> out stupid resources and the CRM doesn't even know the difference 
>> (we'll
>> just have to tag them somehow so the GUI knows).  _If_ we had a
>> fencing system which managed all of that for us, that could sign in to
>> us and tell us about all that and we'd relay the fencing requests 
>> there
>> etc. And even if the CRM is managing the fenced topology as proposed 
>> here, we
>> can start multiple incarnations for those devices which support it, so
>> we can shorten case c) (see below) and have it all pre-allocated. I
>> think this is pretty cool.
>> I do not propose this as the final solution, but as one to work with 
>> for
>> the time being and one we can improve on. In particular 4) should
>> address the most loudly voiced concerns against overloading the OCF RA
>> API.
>
> To this comment, I note implicit assumption (B) above.
>
> I would assume you intend to mean me when you say "loudly voiced".  If 
> that's the case then I can only say "Mea culpa"!  I agree that this 
> was one of my most loudly voiced detailed design concerns. And, I agree 
> that this approach handles it nicely.
>
> However, I would not agree that this was my most loudly-voiced 
> complaint overall.
>
> I would say that my most loudly voiced high-level design concern is 
> that the implementation is not reusable.  I stand by that as my "most 
> loudly voiced" high-level design concern.
>
> From an architectural point of view, my most loudly-voiced complaint 
> would be that fencing is specialized and complex enough to require a 
> "smart" fencing subsystem.  The CRM will likely be in control of that 
> fencing subsystem, but I believe that architecturally, my most loudly 
> voiced concern would be that "fencing rates its own subsystem".  By 
> this I mean that it knows about fencing, and that its interfaces 
> are high-level, rather than requiring detailed 
> fencing-specific knowledge to use. The preceding comment is the 
> high-level design consequence of this architectural concern.  
> Fundamentally, I claim that fencing is its own knowledge domain - 
> separate from anything else in the cluster system.  No one else 
> particularly needs that knowledge, and it's complex enough to make 
> abstracting it a good choice.
>
> My other architecture-level comment to go with this is that all of 
> Linux-HA has the architectural principle of all components being 
> readily replaceable.  This means they have clean independent interfaces 
> at a high-enough level to be usable without an especially smart 
> program calling them.  This means that they have to be smart enough to 
> be usable in other contexts.  So, this architectural deficiency is the 
> result of the violation of an architectural principle.
>
> [[Architectural concerns are difficult to communicate adequately.  So, 
> I apologize in advance if I fail to make this as clear as I would 
> like.  If I've said something non-obvious or maybe even stupid, please 
> ask me to clarify.]]
>
> Here are my reasons for claiming this:
>
>  - We already have a low-level stupid fencing component.  We
> 	don't need another interface to approximately this same
> 	level of function.  If we're going to crawl up the food chain
> 	in terms of levels, then presenting a fencing
> 	abstraction that is easy to use is a better choice.
>
>  - Fencing topology is something which no one else has any use
> 	for, and having to know it complicates the use
> 	of the fencing subsystem, and hence the CRM.
> 	Adding complexity to the CRM will slow us down,
> 	and make the system less reliable and provide us
> 	nothing in return.
> 	["Complexity is the enemy of reliability"]
>
> And a personal reason for feeling strongly about this:
>
>  - The biggest single problem with heartbeat 1.x is that it
> 	didn't abstract out the cluster resource manager into
> 	a separate subsystem.  I mention this because of my
> 	obvious personal "blame" in this matter, but to also
> 	use it as an example of what happens when you try
> 	and combine things that shouldn't be combined.
>
>> However, this would allow us to leverage the functionality which the 
>> CRM
>> already implements (like managing the configuration, monitoring,
>> allocating based on node attributes etc).
>
> This is good.
>
>> To give three roughly sketched scenarios how this would work in
>> practice:
>> a) Regular operation.
>> CRM starts up and initially assigns all resources. As all nodes are
>> healthy, no fencing needs to occur, so the fenced gets started on some
>> node and begins monitoring, just like all other resources, and signs 
>> in
>> with the CRM, tells it the list of nodes it can fence (which we track
>> in the node status section in the CIB). This one is real easy ;-)
>
> I would contend that it's not so easy, because it requires that you 
> bring into your domain information which is not properly any of your 
> business - that is - fencing topology.  And, then you have to 
> coordinate and merge all this data in the CRM, and understand how to 
> manage that data.  And, the CRM has no business knowing or dealing 
> with this information, or knowing the rules for managing it.

Well if fencing is being done by the CRM then it *is* the CRM's 
business :)

Also, the information itself is pretty boring (basically a list of 
tuples where the first one can fence the second - or whatever) and 
doesn't really require any management.

But I take your point that you feel they should be totally separate.
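
As a rough illustration of how small that topology data is (a sketch
only, not taken from any actual CRM or heartbeat code; the node names
and function names are made up for the example), the list of tuples and
the two operations discussed in this thread could look like:

```python
# Hypothetical illustration only: node names and helpers are invented.
# Each tuple reads (fencer, target): "fencer" can fence "target".
topology = [
    ("node1", "node2"),
    ("node1", "node3"),
    ("node2", "node1"),
]


def who_can_fence(topology, target):
    """Return the nodes that have reported they can fence `target`."""
    return sorted({fencer for fencer, victim in topology if victim == target})


def drop_node(topology, dead):
    """Forget everything a dead node reported about what it can fence."""
    return [(fencer, victim) for fencer, victim in topology if fencer != dead]


print(who_can_fence(topology, "node3"))
print(who_can_fence(drop_node(topology, "node1"), "node3"))
```

The second call shows the cleanup step: once node1 is gone, nobody is
left who has claimed to be able to fence node3.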

>
> This is a consequence of having an insufficiently high-level 
> interface to fencing.  That is, the fencing subsystem isn't smart 
> enough.
>
>> b) Node failure.
>> A node fails now; let's for this scenario assume it's not the one 
>> hosting
>> the fenced.
>> The CRM on the DC notices that that node held some resources which,
>> prior to being allocated somewhere else, need to have node fencing
>> completed.
>> The Policy Engine goes through its list of fenced-node lists, figures
>> out which fenced will do that, and adds an appropriate operation to 
>> the
>> transition graph which needs to be executed prior to reallocating the
>> resources.
>> Then the Transitioner eventually sees this graph, sees the operation 
>> and
>> relays it to the CRMd on said node; now instead of going to the LRM, 
>> the
>> CRMd is smart! and switches the message to the fenced, which then
>> eventually returns a success code and we progress as normal.
>
>> (Note: The CRM already has the switching logic of relaying messages to
>> multiple targets for the admin client/lrm etc.)
>> c) Harder case: fenced node fails.
>> We lose a node which was hosting a fenced.
>> Same as before, we notice that we have some resources which require
>> fencing. However, different from before we notice we don't have a node
>> which says it could perform that fencing request. So, those resources
>> are 'blocked' for now.
>> But, the fenced itself has _no_ dependencies (and is self-fencing), so
>> the PE will figure that it can reallocate the fenced somewhere. So
>> eventually it will compute that, execute the Transitioner who'll go 
>> tell
>> the LRM who'll go start the RA etc pp, so a second later, magically,
>> somewhere in the cluster a new fenced signs in with a CRMd and tells 
>> us
>> about its powers.
>
> This implies specialized knowledge of fencing topology which you 
> otherwise have no particular use for.  This is a consequence of not 
> having a smart-enough fencing subsystem.

I don't believe there is specialized knowledge required here.

The fencing RA would be re-located if one of its operations fails.

The fencing RA is started because it meets 2 conditions: a) it's not 
running already, and b) all its pre-requisites (ie. none) are 
satisfied.

In terms of the timing of the actual fencing operation, it only makes 
sense that an operation (not just the fencing operation) can't occur 
until the resource is started somewhere.

Those are the basic building blocks that all RAs work on, not just 
fencing.
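
To make those building blocks concrete, here is a minimal sketch,
assuming a very simplified resource model. The `Resource` class and
both helper functions are hypothetical names for illustration, not the
CRM's real data structures:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """Hypothetical, simplified stand-in for a resource known to the CRM."""
    name: str
    running_on: Optional[str] = None          # node name, or None if stopped
    prerequisites: List[str] = field(default_factory=list)


def can_start(resource, resources):
    """Startable if (a) not running already and (b) all prerequisites run."""
    if resource.running_on is not None:
        return False
    return all(resources[p].running_on is not None
               for p in resource.prerequisites)


def can_run_operation(resource):
    """Any operation, fencing included, waits until the resource is started."""
    return resource.running_on is not None


# The fencing RA has no prerequisites, so it can always be (re)started.
fenced = Resource("fenced")
resources = {"fenced": fenced}
print(can_start(fenced, resources), can_run_operation(fenced))
fenced.running_on = "node2"
print(can_start(fenced, resources), can_run_operation(fenced))
```

Nothing in either function mentions fencing; the fencing RA just
happens to be the resource with an empty prerequisite list.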

>
>> Then, as that data is relayed to the Designated Coordinator CRM, the 
>> PE
>> notices that, woah cool, with that fenced started it _can_ satisfy 
>> those
>> fencing requirements and proceeds like in scenario b).
>
> Correct me if I'm wrong, but I think that this approach means that you 
> have to know that fencing operations (which aren't resources) have 
> implicit dependencies on the fenced Resources.   Since a fencing 
> operation isn't exactly a resource, you have to have special-case code 
> to wait until all fencing resources have been relocated before 
> performing any fencing.
>
> So, this is special-case code (and semi-ugly code I would guess) to 
> hack in the waiting for the right fenceds to be instantiated in order 
> to perform fencing.
>
> And case (d)
>   The DC fails...
>
> And case (e):  Both the DC and the needed fenced die:
>   The DC fails, and it was also the node which hosted the
>   fenced that would have been used to reset it.
>
> And, all but case (b) require a certain amount of special case code 
> unique to fencing.
>
> Let me reiterate that this is a very low-level consequence of not 
> having a smart-enough fencing subsystem (that is, of architectural 
> difficulties). You can move this around, or hide it in a different 
> place, and maybe even make it a little better.  [It is rare that 
> architectural problems can be made to go away by good design].  But, 
> my main complaints are at the architectural level.  This comment is a 
> side-effect of what I believe to be an architectural deficiency.
>
> Every time a node dies, you need to go find out the information you 
> had on what fenceds it was running, and clean out this information 
> when it dies. This isn't hard, but it's an example of a few dozen 
> lines of code that you simply don't have to write if you have a smart 
> fencing layer.
>
> And whenever you start a new DC, you need to go query all the various 
> fenceds and re-collect this information all over again.  Another 
> few-dozen lines of code...

My question (for Alan) is how this changes when you have a "smart" 
fencing system.  Would you not have the same problems if you replaced 
"DC" above with its fence_d equivalent?  (whether that's a "local 
fence_d" or an elected "super fence_d" I'm not sure it makes much 
difference).



Perhaps you could outline how some of these scenarios would play out 
with a "smart" fencing system.  If I understand the basic concept (and 
if it's still the same as last time I checked):

Node A requests its local fenced to fence Node B
If Node A can do it and it works, report done.
If Node A can't do it (and the request originated at Node A), it finds 
someone who can and forwards the request.
	If delegate succeeded, report that answer
	Else pick another delegate and try again
Report failure
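
The steps above can be sketched roughly as follows. This is only one
reading of them, not an existing fence_d API: the `fence` function, the
`do_fence` callback (standing in for actually driving a STONITH device)
and the set-of-tuples topology are all assumptions for the example. It
also assumes that a failed local attempt falls through to delegation,
and that the target is never picked as its own delegate.

```python
def fence(local, target, topology, do_fence, originated_here=True):
    """Try to fence `target` from `local`, delegating if necessary.

    topology: set of (fencer, target) tuples (hypothetical representation).
    do_fence: callable(fencer, target) -> bool, a stand-in for the device.
    """
    # If the local node can do it and it works, report done.
    if (local, target) in topology and do_fence(local, target):
        return True
    # Only the node where the request originated forwards it on;
    # a delegate that fails just reports failure back.
    if not originated_here:
        return False
    delegates = sorted(f for (f, t) in topology
                       if t == target and f not in (local, target))
    for delegate in delegates:
        # The delegate runs the same logic, but must not delegate again.
        if fence(delegate, target, topology, do_fence, originated_here=False):
            return True
    # Nobody could do it: report failure.
    return False


attempts = []


def fake_device(fencer, target):
    """Pretend only node4's device actually works."""
    attempts.append(fencer)
    return fencer == "node4"


topology = {("node2", "node3"), ("node4", "node3")}
print(fence("node1", "node3", topology, fake_device), attempts)
```

Here node1 cannot fence node3 itself, so it tries node2 (which fails)
and then node4 (which succeeds).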

Questions (mostly I want to double check with you what you have in 
mind):
	Where (and how) does Node A get its topology info from?
		- Broadcast of local changes and method for requesting current contents?

	What if the node that A chooses isn't there or needs fencing itself?

	What if no-one can perform the fence?
		- Fence_d would return failure and that branch of the transition would block

	Who is asking for the fence operation?
		- the CRM?

	Who (and how) is deciding which nodes are controlling which fencing devices?
		- the CRM?

	Who is in charge of monitoring?
		- the LRM?
		- the fence_d?

>
>
>> Some answers ahead of time:
>> - If the fenced signs in with us anyway, couldn't we monitor it via 
>> that
>>   link directly - ie, we notice it dying anyway (the connection 
>> closes)
>>   and it could use apphbd internally to make sure it itself is
>>   fail-fast. Do we really need the indirection via the LRM? My 
>> rationale
>>   for the path chosen is that it allows us to reuse existing
>>   functionality in the CRM/LRM better, and in the configuration, but 
>> one
>>   might disagree.
>
> I don't have much problem with this, as this is largely what I 
> proposed in this regard.
>
>> - Why does the fenced sign in to us and not we with the fenced?
>>   Rationale is that we start it, and thus we may be around while it
>>   isn't. However, it's a reasonably moot point to argue who is running
>>   the while(!connected) {signin()} loop, be it the fenced, the CRMd, 
>> or
>>   whether the fenced signs in with the LRM or the CRM etc. I don't
>>   really care nor think it matters much, and it's easily changed one 
>> way
>>   or the other later.
>
> For your set of assumptions (including all the implicit assumptions I 
> mentioned above), this is sensible, since the design you propose is 
> unusable in other contexts.  However, I believe that you really 
> want an independent "smart" fencing subsystem.  In that case, what 
> would make sense is to make the CRM a client of the fencing 
> subsystem, just like it's a "client" of the LRM.
>
>> Thanks for all insights; I do hope this addresses some concerns. I've
>> tried _really_ hard to listen (respectively, read) them.
>
> This proposal is an improvement in some ways over previous proposals, 
> and like most designs can be made to work.  However, I am still not in 
> favor of it - primarily for architectural reasons.
>
> I believe it will extend our schedules on our critical path, and add 
> non-functional complexity to our most complex subsystem, thereby 
> impacting system reliability.
>
> And, without addressing the architecture issues (wrong level of 
> interface to the fencing layer), these problems will be very difficult 
> to completely eliminate.  This is a common problem of choosing the 
> wrong API or interface for a subsystem.  Changing the implementation 
> cannot fix the interface.
>
> Hopefully I did a little better job of explaining these architectural 
> objections than I have in the past.
>
> Please feel free to ask for clarification.
>
> And, thank you again for your well-presented explanation of your 
> design.
>
> -- 
>     Alan Robertson <alanr at unix.sh>
>
> "Openness is the foundation and preservative of friendship...  Let me 
> claim from you at all times your undisguised opinions." - William 
> Wilberforce


