[Linux-ha-dev] Dynamic Modify the timeout values

DAIKI MATSUDA d.matuda at gmail.com
Thu Aug 23 18:00:36 MDT 2007


> DAIKI MATSUDA wrote:
> >>>>> Hi, All
> >>>>>
> >>>>> I have added a new function to heartbeat-2.0.8 and attached the patch file.
> >>>>>
> >>>>> The function applies new timeout parameters (keepalive, deadtime,
> >>>>> deadping, warntime) without stopping the heartbeat services.
> >>>>> Currently the heartbeat boot scripts provide 'reload' and
> >>>>> 'forcereload' functions, but they are identical: both stop the
> >>>>> services and move the HA services to the standby node, because they
> >>>>> kill the forked heartbeat processes and clients (crmd etc.).
> >>>>> So we want to apply changed parameters to running nodes without
> >>>>> suspending the services. The current procedure is as follows.
> >>>>> 1. Change the 4 parameters in the ha.cf file.
> >>>>> 2. Send the running parent heartbeat process the signal SIGRTMAX
> >>>>> (e.g. kill -s SIGRTMAX `cat /var/run/heartbeat.pid`). (Why did I
> >>>>> choose SIGRTMAX? I could not find a good unused signal.)
> >>>>>
> >>>>> As far as we have investigated heartbeat, this appears to be safe.
> >>>>> I would like to hear your comments on the patch and the function.
> >>>> Sorry to be coming in so late on this, but I was working on the release
> >>>> for many weeks now.  I really like the idea of dynamically modifying the
> >>>> heartbeat configuration - but if you're going to go to the trouble to do
> >>>> it, I'd like to see it done more generally.
> >>>>
> >>>> In other words, I'd like to be able to change nearly any  parameter in
> >>>> ha.cf at run time without restarting heartbeat.
> >>>>
> >>>> This would require reworking (and improving) the way heartbeat starts
> >>>> up.  This would be probably about twice or three times as much work as
> >>>> what you've done, but it would be much more useful, and much more general.
> >>>>
> >>>> In the end, if done right, it could be the groundwork for letting us
> >>>> eventually receive config updates from the CIB.  [I know there's
> >>>> a bootstrapping issue, but we can deal with that when we get to deciding
> >>>> to do that work].
> >>>>
> >>>> I have thought about this and have some specific ideas on what kinds of
> >>>> things need to be done to make this happen.
> >>> Hi, Alan.
> >>>
> >>> I understand what you are saying and think it is a very good idea to treat all
> >>> parameters in ha.cf. I thought my implementation is for testing and it
> >>> is better that you, ha-dev team, make its feature.
> >> I don't know quite what you meant by "it is better that you, ha-dev
> >> team, make it's feature".
> >
> > I am sorry for my poor English. I meant that the feature you are
> > proposing is better than what I made.
> > If possible, could you show the schedule?
>
> Not a problem.  This will all work out.
>
> I don't have a particular schedule in mind.  I'm also not sure how long
> it will take, and this kind of thing depends a lot on how well the
> person doing the change knows the code.
>
>
> Here is a suggested approach.  At each stage, please test the patch
> lightly, submit it for review, then test it extensively, and
> submit it for re-review if you find more bugs.  I suggest this
> order to keep you from spending too much time testing a patch we then
> ask you to redo.  In fact, for the first stage, maybe have your data
> structures reviewed first, because they will determine the code in the end.
>
> Step 1 - Further categorize and modularize the configuration.
>         There are at least 4 kinds of statements in the configuration
>         and there may be more:
>           1. media statements - like ucast, bcast, etc. Things
>                 which load plugins and start read/write processes
>           2. global statements - which affect some or all of the
>                 media statements - things like port number, serial
>                 baud rate, etc.  Knowing which global statements
>                 affect which media statements, may eventually be
>                 important.
>           3. Respawn statements - things which start child processes.
>                 This includes the implied respawn statements in things
>                 like 'crm on'.
>           4. Other statements.  For each of these, figure out which
>                 class of processes is affected by each change.
>
>         Make it so that each media statement is processed by a single
>         function call.  Right now, the processing for any given media
>         statement is embedded in a loop.  This is just restructuring.
>
>         If you store all the ha.cf statements in an array, then you can
>         make a minor improvement even in this stage.  Make a pass
>         through the array looking for global statements and execute them
>         first.  This will fix some known annoying behaviors where these
>         need to occur before they're used.
>
>         For media and respawn statements, you need to add an association
>         between the statements and the child processes they created.
>         That way, when we finally get around to processing changes, we
>         can kill them when they go away or change.  We already have
>         a special way to track processes.  Use that code, but create
>         new associations.
>
>         Note that this doesn't implement the feature we are talking
>         about, it just lays the groundwork for it.  At this point
>         the code won't be able to do anything new.  That happens
>         in step 2.  Test this code in CTS, and test it manually.
>         Have it reviewed, and repeat until people are happy.
>         Then I'll commit it for you.
>
> Step 2 - Add the code to deal with changes in the configuration, and
> figure out when to kill things and when to start new ones.
>
> Step 3 - Create CTS tests which change the configuration, then change it
> back, watching for the correct behavior in each case.  Run 1000
> instances of this test alone in a CTS run.  After you have had the code
> reviewed, and have run these tests, and everyone is happy, then we'll
> commit this stage of the changes.
>
> Suggested Enhancement - after doing this:
> Since you now know how to restart anything in heartbeat, you should also
> be able to restart a pair of read and write children if either should
> die.   So, we should be able to then recover from them dying.  Add the
> code to do this, and fix up the CTS test which is supposed to kill
> random processes, to know how to kill any process in the system.  Turn
> the test back on, and run 1000 instances of this test in CTS.  Similarly
> for this stage, submit it for review, and when everyone is happy, we'll
> commit it.
>
> And, in the end this will be a great improvement, and the system will
> also be more robust (better able to recover from errors) than it has
> ever been.
>
> How does that sound for an outline of a plan?
>
>
> --
>     Alan Robertson <alanr at unix.sh>
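
For illustration only, Steps 1 and 2 of the plan quoted above can be sketched in a few lines of Python. This is not code from heartbeat or from the patch. The directive sets below are small illustrative subsets, and the names Statement, classify, parse, globals_first and diff are hypothetical. The sketch stores every ha.cf statement in an array, orders global statements first, and compares two configurations to find which media and respawn statements disappeared or appeared.

```python
from collections import namedtuple

# One parsed ha.cf statement (hypothetical structure, for illustration).
Statement = namedtuple("Statement", "kind directive args")

# Illustrative subsets only; the real ha.cf directive set is much larger.
MEDIA = {"ucast", "bcast", "mcast", "serial"}
GLOBAL = {"udpport", "baud"}
RESPAWN = {"respawn", "crm"}   # 'crm on' implies respawned children


def classify(directive):
    """Map a directive name to one of the four statement kinds."""
    if directive in MEDIA:
        return "media"
    if directive in GLOBAL:
        return "global"
    if directive in RESPAWN:
        return "respawn"
    return "other"


def parse(text):
    """Store every non-empty, non-comment statement in a list."""
    statements = []
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if fields:
            kind = classify(fields[0])
            statements.append(Statement(kind, fields[0], tuple(fields[1:])))
    return statements


def globals_first(statements):
    """Order global statements first; sorted() is stable, so file
    order is preserved within each group."""
    return sorted(statements, key=lambda s: s.kind != "global")


def diff(old, new):
    """Return (removed, added) media and respawn statements.

    'removed' are candidates for killing their child processes,
    'added' are candidates for starting new ones."""
    tracked = ("media", "respawn")
    removed = [s for s in old if s.kind in tracked and s not in new]
    added = [s for s in new if s.kind in tracked and s not in old]
    return removed, added
```

In a real implementation each tracked statement would also carry a reference to the child processes it created, so that the "removed" list could be turned directly into processes to stop.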

Hi, Alan-san.

I am sorry for the delay. We asked our sponsor, and he agreed to let us
investigate what you suggested. However, when I examined the parameters
in ha.cf, there were more than 50, and I think that most of them do not
need to be modified dynamically, e.g. crm, use_logd, baud, etc. So
your proposal is ideal in principle, but realizing it would cost a great
deal, and it is not practical.
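
A minimal sketch of the narrower approach, reloading only the four timing directives when a signal arrives, might look like the following. It is written in Python for brevity and is not taken from the heartbeat sources or from the patch. The names parse_timeouts, reload_timeouts and install_reload_handler are hypothetical, and letting the last occurrence of a directive win is an assumption.

```python
import signal

# The four timing directives named in the original proposal.
TIMEOUT_KEYS = ("keepalive", "deadtime", "deadping", "warntime")


def parse_timeouts(text):
    """Return {directive: value} for the four timing directives only."""
    found = {}
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()   # drop comments and blanks
        if len(fields) >= 2 and fields[0] in TIMEOUT_KEYS:
            found[fields[0]] = fields[1]        # last occurrence wins (assumption)
    return found


def reload_timeouts(current, text):
    """Update `current` in place and return only the values that changed."""
    changed = {k: v for k, v in parse_timeouts(text).items()
               if current.get(k) != v}
    current.update(changed)
    return changed


def install_reload_handler(current, path, signum):
    """Re-read `path` into `current` whenever signal `signum` arrives."""
    def handler(_signum, _frame):
        with open(path) as fh:
            reload_timeouts(current, fh.read())
    signal.signal(signum, handler)
```

All other directives (crm, use_logd, baud and so on) are ignored on reload, so nothing that would require restarting child processes is touched. On Linux the handler could be installed with signum=signal.SIGRTMAX to match the kill command shown earlier in the thread.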

Regards
MATSUDA, Daiki

