[Linux-ha-dev] Dynamic Modify the timeout values
DAIKI MATSUDA
d.matuda at gmail.com
Thu Aug 23 18:00:36 MDT 2007
> DAIKI MATSUDA wrote:
> >>>>> Hi, All
> >>>>>
> >>>>> I added a new function to heartbeat-2.0.8 and attached the patch file.
> >>>>>
> >>>>> The function is to apply the new timeout parameters ( keepalive,
> >>>>> deadtime, deadping, warntime ) without stopping the heartbeat services.
> >>>>> Currently the heartbeat boot scripts supply 'reload' and 'forcereload'
> >>>>> functions, but both behave the same: they stop the services, and the HA
> >>>>> services move to the standby node, because the process kills the forked
> >>>>> heartbeat processes and clients ( crmd etc. ).
> >>>>> So, we want to apply the changed parameters to running nodes without
> >>>>> suspending the services. The current procedure is as follows.
> >>>>> 1. change the 4 parameters in the ha.cf file
> >>>>> 2. send the running parent heartbeat process the signal SIGRTMAX ( e.g. kill -s
> >>>>> SIGRTMAX `cat /var/run/heartbeat.pid` ) (Why did I choose SIGRTMAX? I could
> >>>>> not find a good unused signal.)
> >>>>>
> >>>>> As far as we have researched heartbeat, this should be safe. I would like
> >>>>> to hear your comments on the patch and the function.
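The two-step procedure above (edit ha.cf, then signal the parent process) can be sketched roughly as follows. This is a minimal Python illustration, not the attached patch: the names parse_timeouts, TimeoutReloader and install are hypothetical, and the parser assumes simple "directive value" lines with "#" comments.

```python
import signal

# The four timeout directives named in the proposal.
TIMEOUT_KEYS = ("keepalive", "deadtime", "deadping", "warntime")


def parse_timeouts(text):
    """Return {directive: value} for the timeout directives found in text."""
    found = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in TIMEOUT_KEYS:
            found[parts[0]] = parts[1].strip()
    return found


class TimeoutReloader:
    """Hypothetical helper: re-read the timeouts when a signal arrives."""

    def __init__(self, path):
        self.path = path
        self.timeouts = {}
        self.pending = False

    def request_reload(self, signum=None, frame=None):
        # A signal handler should do very little: only record the request.
        self.pending = True

    def apply_pending(self):
        # Called from the main loop, outside signal context.
        if not self.pending:
            return False
        self.pending = False
        with open(self.path) as handle:
            self.timeouts.update(parse_timeouts(handle.read()))
        return True


def install(reloader):
    # SIGRTMAX exists only on platforms with real-time signals (e.g. Linux).
    signal.signal(signal.SIGRTMAX, reloader.request_reload)
```

Only the four timeout values are touched; every other directive in the file is ignored, which matches the limited scope of the proposal.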
> >>>> Sorry to be coming in so late on this, but I was working on the release
> >>>> for many weeks now. I really like the idea of dynamically modifying the
> >>>> heartbeat configuration - but if you're going to go to the trouble to do
> >>>> it, I'd like to see it done more generally.
> >>>>
> >>>> In other words, I'd like to be able to change nearly any parameter in
> >>>> ha.cf at run time without restarting heartbeat.
> >>>>
> >>>> This would require reworking (and improving) the way heartbeat starts
> >>>> up. This would be probably about twice or three times as much work as
> >>>> what you've done, but it would be much more useful, and much more general.
> >>>>
> >>>> In the end, if done right, it could be groundwork for letting us
> >>>> eventually receive config updates from the CIB. [I know there's
> >>>> a bootstrapping issue, but we can deal with that when we get to deciding
> >>>> to do that work].
> >>>>
> >>>> I have thought about this and have some specific ideas on what kinds of
> >>>> things need to be done to make this happen.
> >>> Hi, Alan.
> >>>
> >>> I understood what you say and think it is a very good idea to treat all
> >>> parameters in ha.cf. I thought my implementation is for testing and it
> >>> is better that you, ha-dev team, make its feature.
> >> I don't know quite what you meant by "it is better that you, ha-dev
> >> team, make it's feature".
> >
> > I am sorry for my poor English. It means that the feature you plan to
> > make is better than what I made.
> > If possible, could you show the schedule?
>
> Not a problem. This will all work out.
>
> I don't have a particular schedule in mind. I'm also not sure how long
> it will take, and this kind of thing depends a lot on how well the
> person doing the change knows the code.
>
>
> Here is a suggested approach. At each stage, please test the patch
> some, submit the patch for review and then test it extensively, and
> submit it for re-review if you found more bugs. I would suggest in this
> order - to keep you from spending too much time testing a patch we ask
> you to do over. In fact, on the first stage maybe review your data
> structures first, because that will determine the code in the end.
>
> Step 1 - Further categorize and modularize the configuration.
> There are at least 4 kinds of statements in the configuration
> and there may be more:
> 1. media statements - like ucast, bcast, etc. Things
> which load plugins and start read/write processes
> 2. global statements - which affect some or all of the
> media statements - things like port number, serial
> baud rate, etc. Knowing which global statements
> affect which media statements, may eventually be
> important.
> 3. Respawn statements - things which start child processes
> this includes the implied respawn statements in things
> like 'crm on'.
> 4. Other statements. For each of these, figure out which
> class of processes are affected by each change.
>
> Make it so that each media statement is processed by a single
> function call. Right now, the processing for any given media
> statement is embedded in a loop. This is just restructuring.
>
> If you store all the ha.cf statements in an array, then you can
> make a minor improvement even in this stage. Make a pass
> through the array looking for global statements and execute them
> first. This will fix some known annoying behaviors where these
> need to occur before they're used.
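A rough sketch of the categorization and the "globals first" pass described above. The directive sets below are illustrative assumptions, not heartbeat's real tables, and classify, parse_statements and execution_order are hypothetical names.

```python
# Illustrative mapping only: an assumption, not heartbeat's actual directive list.
MEDIA = {"ucast", "bcast", "mcast", "serial"}
GLOBAL = {"udpport", "baud"}
RESPAWN = {"respawn", "crm"}


def classify(directive):
    """Return the statement class for one directive name."""
    if directive in MEDIA:
        return "media"
    if directive in GLOBAL:
        return "global"
    if directive in RESPAWN:
        return "respawn"
    return "other"


def parse_statements(text):
    """Return a list of (directive, argument string) tuples in file order."""
    statements = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)
        args = parts[1].strip() if len(parts) > 1 else ""
        statements.append((parts[0], args))
    return statements


def execution_order(statements):
    """Globals first, then the rest; file order is kept within each group."""
    # sorted() is stable, so statements with equal keys keep their order.
    return sorted(statements, key=lambda s: classify(s[0]) != "global")
```

With the statements held in one array, a global statement such as a port number takes effect before any media statement that uses it, regardless of where it appears in the file.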
>
> For media and respawn statements, you need to add an association
> between the statements and the child processes they created.
> That way, when we finally get around to processing changes, we
> can kill them when they go away or change. We already have
> a special way to track processes. Use that code, but create
> new associations.
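The association between statements and the child processes they created might look something like this. It is a minimal sketch that does not use heartbeat's existing process-tracking code; ChildRegistry and its methods are hypothetical.

```python
from collections import defaultdict


class ChildRegistry:
    """Hypothetical tracker: which statement started which child pids."""

    def __init__(self):
        self._by_statement = defaultdict(set)
        self._by_pid = {}

    def register(self, statement, pid):
        """Record that `statement` (any hashable value) started child `pid`."""
        self._by_statement[statement].add(pid)
        self._by_pid[pid] = statement

    def child_exited(self, pid):
        """Forget a pid; return the statement that owned it, or None."""
        statement = self._by_pid.pop(pid, None)
        if statement is not None:
            self._by_statement[statement].discard(pid)
            if not self._by_statement[statement]:
                del self._by_statement[statement]
        return statement

    def pids_for(self, statement):
        """Return a copy of the pids currently associated with a statement."""
        return set(self._by_statement.get(statement, ()))
```

Looking up pids by statement answers "what must be killed when this statement goes away", and looking up the statement by pid answers "which statement lost a child".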
>
> Note that this doesn't implement the feature we are talking
> about, it just lays the groundwork for it. At this point
> the code won't be able to do anything new. That happens
> in step 2. Test this code in CTS, and test it manually.
> Have it reviewed, and repeat until people are happy.
> Then I'll commit it for you.
>
> Step 2 - add the code to deal with changes in the configuration, and
> figure out when to kill things, when to start new ones.
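One possible way to decide what to kill and what to start is to compare the old and new statement lists. This is only a sketch under the assumption that each statement is a hashable (directive, arguments) tuple; plan_changes is a hypothetical name.

```python
def plan_changes(old_statements, new_statements):
    """Return (to_stop, to_start) for two lists of hashable statements.

    A statement whose arguments changed shows up in both lists: the old
    form is stopped and the new form is started.
    """
    old, new = set(old_statements), set(new_statements)
    to_stop = [s for s in old_statements if s not in new]
    to_start = [s for s in new_statements if s not in old]
    return to_stop, to_start
```

Statements that are identical in both configurations appear in neither list, so their child processes are left running.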
>
> Step 3 - Create CTS tests which change the configuration, then change it
> back, watching for the correct behavior in each case. Run 1000
> instances of this test alone in a CTS run. After you have had the code
> reviewed, and have run these tests, and everyone is happy, then we'll
> commit this stage of the changes.
>
> Suggested Enhancement - after doing this:
> Since you now know how to restart anything in heartbeat, you should also
> be able to restart a pair of read and write children if either should
> die. So, we should be able to then recover from them dying. Add the
> code to do this, and fix up the CTS test which is supposed to kill
> random processes, to know how to kill any process in the system. Turn
> the test back on, and run 1000 instances of this test in CTS. Similarly
> for this stage, submit it for review, and when everyone is happy, we'll
> commit it.
>
> And, in the end this will be a great improvement, and the system will
> also be more robust (better able to recover from errors) than it has
> ever been.
>
> How does that sound for an outline of a plan?
>
>
> --
> Alan Robertson <alanr at unix.sh>
Hi, Alan-san.
I am sorry for the delay. We asked our sponsor, and he agreed to let us
research what you suggest. I looked into the parameters for ha.cf:
there are over 50 of them, and I think that most of them do not need
to be modified dynamically, e.g. crm, use_logd, baud, etc. So, your
proposal is the ideal one, but realizing it would cost a great deal,
and it is not practical.
Regards
MATSUDA, Daiki