[Linux-ha-dev] Dynamic Modify the timeout values

Alan Robertson alanr at unix.sh
Fri Aug 10 08:58:23 MDT 2007


DAIKI MATSUDA wrote:
>>>>> Hi, All
>>>>>
>>>>> I added a new function to heartbeat-2.0.8 and attached its patch file.
>>>>>
>>>>> The function is to apply the new timeout parameters ( keepalive,
>>>>> deadtime, deadping, warntime ) without stopping the heartbeat services.
>>>>> Currently the heartbeat boot scripts supply a 'reload' and a 'forcereload'
>>>>> function, but both behave the same way: they stop the services, and the HA
>>>>> services are moved to the standby node, because the reload kills the forked
>>>>> heartbeat processes and clients ( crmd etc. ).
>>>>> So we want to apply the changed parameters to running nodes without
>>>>> suspending the services. The current feature works as follows.
>>>>> 1. change the 4 parameters in the ha.cf file
>>>>> 2. send the running parent heartbeat process the signal SIGRTMAX ( e.g. kill -s
>>>>> SIGRTMAX `cat /var/run/heartbeat.pid` ). (Why did I choose SIGRTMAX? I could
>>>>> not find a good unused signal.)
>>>>>
>>>>> As far as we have researched heartbeat, this should be safe. I would like
>>>>> to hear your comments on the patch and the function.
>>>> Sorry to be coming in so late on this, but I was working on the release
>>>> for many weeks now.  I really like the idea of dynamically modifying the
>>>> heartbeat configuration - but if you're going to go to the trouble to do
>>>> it, I'd like to see it done more generally.
>>>>
>>>> In other words, I'd like to be able to change nearly any  parameter in
>>>> ha.cf at run time without restarting heartbeat.
>>>>
>>>> This would require reworking (and improving) the way heartbeat starts
>>>> up.  This would be probably about twice or three times as much work as
>>>> what you've done, but it would be much more useful, and much more general.
>>>>
>>>> In the end, if done right, it could be groundwork for letting us
>>>> eventually receive config updates from the CIB.  [I know there's
>>>> a bootstrapping issue, but we can deal with that when we get to deciding
>>>> to do that work].
>>>>
>>>> I have thought about this and have some specific ideas on what kinds of
>>>> things need to be done to make this happen.
>>> Hi, Alan.
>>>
>>> I understood what you say and think it is a very good idea to treat all
>>> parameters in ha.cf. I thought my implementation is for testing and it
>>> is better that you, ha-dev team, make its feature.
>> I don't know quite what you meant by "it is better that you, ha-dev
>> team, make its feature".
> 
> I am sorry for poor English. It means that the feature you think to
> make is better than what I made.
> If possible, could you show the schedule

Not a problem.  This will all work out.

I don't have a particular schedule in mind.  I'm also not sure how long
it will take, and this kind of thing depends a lot on how well the
person doing the change knows the code.


Here is a suggested approach.  At each stage, please test the patch
some, submit it for review, then test it extensively, and resubmit it
for review if you find more bugs.  I suggest this order to keep you
from spending too much time testing a patch we then ask you to redo.
In fact, for the first stage, maybe get your data structures reviewed
first, because they will determine the code in the end.

Step 1 - Further categorize and modularize the configuration.
	There are at least 4 kinds of statements in the configuration
	and there may be more:
	  1. media statements - like ucast, bcast, etc. Things
		which load plugins and start read/write processes
	  2. global statements - which affect some or all of the
		media statements - things like port number, serial
		baud rate, etc.  Knowing which global statements
		affect which media statements may eventually be
		important.
	  3. Respawn statements - things which start child processes.
		This includes the implied respawn statements in things
		like 'crm on'.
	  4. Other statements.  For each of these, figure out which
		class of processes is affected by each change.
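
As a rough illustration of the four categories above, here is a minimal
sketch in Python (heartbeat itself is C; this only shows the idea).  The
directive sets are partial and are my assumption, and classify() is a
hypothetical name, not an existing heartbeat function.

```python
# Illustrative only: these sets are partial and assumed, not heartbeat's
# real tables.  Anything not listed falls into the "other" category.
MEDIA = {"ucast", "bcast", "mcast", "serial"}
GLOBAL = {"udpport", "baud"}
RESPAWN = {"respawn", "crm"}  # 'crm on' implies respawn statements


def classify(line):
    """Return 'media', 'global', 'respawn' or 'other' for one ha.cf line.

    Blank lines and comment lines return None.
    """
    words = line.split()
    if not words or words[0].startswith("#"):
        return None
    name = words[0].lower()
    if name in MEDIA:
        return "media"
    if name in GLOBAL:
        return "global"
    if name in RESPAWN:
        return "respawn"
    return "other"
```

With these assumed sets, "bcast eth0" classifies as media, "udpport 694"
as global, "crm on" as respawn, and "keepalive 2" as other.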

	Make it so that each media statement is processed by a single
	function call.  Right now, the processing for any given media
	statement is embedded in a loop.  This is just restructuring.

	If you store all the ha.cf statements in an array, then you can
	make a minor improvement even in this stage.  Make a pass
	through the array looking for global statements and execute them
	first.  This will fix some known annoying behaviors where these
	need to occur before they're used.
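
The "globals first" pass could look roughly like the sketch below.  It
assumes the statements are already held as a list of non-empty strings
and that the set of global directives is known; both the set contents
and the name globals_first() are hypothetical.

```python
# Assumed, partial set of global directives (port number, baud rate).
GLOBAL_DIRECTIVES = {"udpport", "baud"}


def globals_first(statements):
    """Return the statements with global ones moved to the front.

    sorted() is stable, so the relative order inside each group is
    preserved; only the global/non-global grouping changes.
    """
    def is_global(stmt):
        return stmt.split()[0] in GLOBAL_DIRECTIVES

    # False sorts before True, so globals (not is_global == False) lead.
    return sorted(statements, key=lambda stmt: not is_global(stmt))
```

Because the reorder is stable, media statements still run in the order
the administrator wrote them; only their position relative to the
global statements changes.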

	For media and respawn statements, you need to add an association
	between the statements and the child processes they created.
	That way, when we finally get around to processing changes, we
	can kill the children when their statements go away or change.
	We already have a special way to track processes.  Use that
	code, but create new associations.
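
To make the association concrete, here is a small stand-in sketch.  It
is not heartbeat's existing process tracking code; ChildRegistry and
its methods are hypothetical names that only show the two-way lookup
between a statement and the children it created.

```python
class ChildRegistry:
    """Two-way map between config statements and child process IDs."""

    def __init__(self):
        self._pids_by_stmt = {}   # statement text -> set of pids
        self._stmt_by_pid = {}    # pid -> statement text

    def register(self, stmt, pid):
        """Record that `pid` was started because of `stmt`."""
        self._pids_by_stmt.setdefault(stmt, set()).add(pid)
        self._stmt_by_pid[pid] = stmt

    def pids_for(self, stmt):
        """Return a copy of the pids created by `stmt` (maybe empty)."""
        return set(self._pids_by_stmt.get(stmt, ()))

    def statement_for(self, pid):
        """Return the statement that created `pid`, or None."""
        return self._stmt_by_pid.get(pid)

    def forget(self, pid):
        """Drop a pid, for example after the child has exited."""
        stmt = self._stmt_by_pid.pop(pid, None)
        if stmt is not None:
            self._pids_by_stmt[stmt].discard(pid)
            if not self._pids_by_stmt[stmt]:
                del self._pids_by_stmt[stmt]
```

The lookup by statement is what a later "this statement went away" step
needs; the lookup by pid is what a "this child died" handler needs.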

	Note that this doesn't implement the feature we are talking
	about, it just lays the groundwork for it.  At this point
	the code won't be able to do anything new.  That happens
	in step 2.  Test this code in CTS, and test it manually.
	Have it reviewed, and repeat until people are happy.
	Then I'll commit it for you.

Step 2 - Add the code to deal with changes in the configuration, and
figure out when to kill things and when to start new ones.
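
One simple way to decide what to kill and what to start is to compare
the old and new statement lists.  The sketch below assumes statements
are compared as exact strings, so a changed statement shows up as "stop
the old one, start the new one"; plan_changes() is a hypothetical name.

```python
def plan_changes(old, new):
    """Compare two lists of statement strings.

    Returns (to_stop, to_start): statements whose children should be
    killed, and statements that need to be started.  Statements present
    in both lists are left alone.
    """
    old_set, new_set = set(old), set(new)
    to_stop = [stmt for stmt in old if stmt not in new_set]
    to_start = [stmt for stmt in new if stmt not in old_set]
    return to_stop, to_start
```

A real implementation would also have to take into account global
statements that affect media statements, which this sketch ignores.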

Step 3 - Create CTS tests which change the configuration, then change it
back, watching for the correct behavior in each case.  Run 1000
instances of this test alone in a CTS run.  After you have had the code
reviewed, and have run these tests, and everyone is happy, then we'll
commit this stage of the changes.

Suggested Enhancement - after doing this:
Since you now know how to restart anything in heartbeat, you should also
be able to restart a pair of read and write children if either should
die.   So, we should be able to then recover from them dying.  Add the
code to do this, and fix up the CTS test which is supposed to kill
random processes, to know how to kill any process in the system.  Turn
the test back on, and run 1000 instances of this test in CTS.  Similarly
for this stage, submit it for review, and when everyone is happy, we'll
commit it.
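
The recovery logic might be sketched as follows.  This assumes each
medium has exactly one read child and one write child, and that the
surviving half of the pair is stopped before both are started again;
recover_pair() and the kill and spawn callbacks are hypothetical and
are passed in so the logic can be exercised without real processes.

```python
def recover_pair(pairs, dead_pid, kill, spawn):
    """Restart the read/write pair that `dead_pid` belonged to.

    pairs: dict mapping a medium to its (read_pid, write_pid) tuple.
    kill:  callable taking a pid, used to stop the surviving child.
    spawn: callable taking a medium and returning a new
           (read_pid, write_pid) tuple.
    Returns the medium that was restarted, or None if the pid is not
    one of the tracked read/write children.
    """
    for medium, (read_pid, write_pid) in pairs.items():
        if dead_pid in (read_pid, write_pid):
            survivor = write_pid if dead_pid == read_pid else read_pid
            kill(survivor)                 # assumption: restart both halves
            pairs[medium] = spawn(medium)  # existing key, then return
            return medium
    return None
```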

And, in the end this will be a great improvement, and the system will
also be more robust (better able to recover from errors) than it has
ever been.

How does that sound for an outline of a plan?


-- 
    Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce

