AW: AW: [Linux-ha-dev] Solaris 10/i386: Heartbeat/Stonithd hangs on shutdown.

David Lee t.d.lee at durham.ac.uk
Fri May 11 05:14:02 MDT 2007


On Fri, 11 May 2007, Otte, Joerg wrote:

> The difference I can see between Bourne and Bash
> shows up in the following process trees:
> Bourne:
> 1125  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   1129  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   1130  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   1131  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   1132  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   1133  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   1135  sh -c /usr/sfw/lib/python2.3/heartbeat/ccm
>     1144  /usr/sfw/lib/python2.3/heartbeat/ccm
>   1136  sh -c /usr/sfw/lib/python2.3/heartbeat/cib
>     1147  /usr/sfw/lib/python2.3/heartbeat/cib
>   1137  sh -c /usr/sfw/lib/python2.3/heartbeat/lrmd -r
>     1143  /usr/sfw/lib/python2.3/heartbeat/lrmd -r
>   1138  sh -c /usr/sfw/lib/python2.3/heartbeat/stonithd
>     1145  /usr/sfw/lib/python2.3/heartbeat/stonithd
>   ..
>
> Bash:
> 2849  /usr/sfw/lib/python2.3/heartbeat/ha_logd -d
>   2858  /usr/sfw/lib/python2.3/heartbeat/ha_logd -d
> 2859  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   2863  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   2864  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   2865  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   2866  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   2867  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>   2870  /usr/sfw/lib/python2.3/heartbeat/ccm
>   2871  /usr/sfw/lib/python2.3/heartbeat/cib
>   2872  /usr/sfw/lib/python2.3/heartbeat/lrmd -r
>   2873  /usr/sfw/lib/python2.3/heartbeat/stonithd
>   ..
>
> When using the Bourne shell to start a child process I get
> 2 processes (in the case of stonithd: 1138 and 1145).
> When using Bash I get only one child process (2873).
>
> When I try to kill 1138 from command line nothing happens.
> When I kill 1145 from command line both processes (1138,1145)
> disappear. So I think heartbeat always tries to kill 1138
> which does not work.
>
> Just an idea: A solution may be to start child processes
> directly (not via the shell). This would work independently
> of the installed shell.


Many thanks, Joerg!  That's most helpful.

That reasoning seems OK, although I'm no expert on this.  What we now need
is advice from the main heartbeat developers (Alan, Lars, Andrew, ...).

In what follows, I refer to the "dev" tree as of now, 11/May/2007.

File "heartbeat/heartbeat.c", function "start_a_child_client()" does:
-----------------------------
        switch(pid=fork()) {

                case -1:        cl_log(LOG_ERR
                                ,       "start_a_child_client: Cannot fork.");
                                return;

                default:        /* Parent */
                                NewTrackedProc(pid, 1, PT_LOGVERBOSE
                                ,       centry, &ManagedChildTrackOps);
                                hb_pop_deadtime(NULL);
                                return;
-----------------------------

in which the 'default' clause (the parent) stores the child "pid".  The
child branch then proceeds:
-----------------------------
                (void)execl("/bin/sh", "sh", "-c", centry->command
                ,       (const char *)NULL);
-----------------------------

It might be that different implementations of "/bin/sh" do different
things.  For instance, some Bournes (e.g. Solaris) might (I speculate!)
spawn "centry->command" as a subprocess (a further fork/exec), whereas
Bash might (speculation continues) do a direct replacement "exec".

It seems that heartbeat requires the 'direct replacement "exec"' style of
operation (stored child pid is the final "centry->command"), which Bash
seems to give but which Bourne (child pid is intermediate "sh") doesn't.
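
One way to test that speculation on a given machine is sketched below.  This
is not heartbeat code: "probe_sh_c()" is a hypothetical helper that forks,
runs a command through '/bin/sh -c' in the same way as the execl() call
above, and reports both the pid returned by fork() and the pid that the
command prints for itself.  If the two agree, that "/bin/sh" replaced itself
with the command; if they differ, it spawned the command as a subprocess.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of heartbeat): fork, run "command" through
 * "/bin/sh -c" as start_a_child_client() does, and report both the pid
 * returned by fork() and the pid that the command writes to its stdout.
 * Returns 0 on success, -1 on failure.
 */
static int probe_sh_c(const char *command, pid_t *forked, pid_t *reported)
{
    int fds[2];
    char buf[64];
    ssize_t n;
    size_t len = 0;
    pid_t pid;

    if (pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: send stdout down the pipe, then exec the shell. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);     /* only reached if execl() failed */
    }

    /* Parent: collect whatever the command printed, then reap the child. */
    close(fds[1]);
    while (len < sizeof(buf) - 1 &&
           (n = read(fds[0], buf + len, sizeof(buf) - 1 - len)) > 0)
        len += (size_t)n;
    buf[len] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);

    *forked = pid;
    *reported = (pid_t)atol(buf);
    return (*reported > 0) ? 0 : -1;
}
```

Calling it with the command string "/bin/sh -c 'echo $$'" makes the inner
shell (an external command from the outer shell's point of view) print its
own pid, so comparing "*forked" with "*reported" shows which behaviour the
local "/bin/sh" has.  The result depends on the shell, so no particular
outcome is claimed here.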

Is there any reason why this 'execl("/bin/sh" ... centry->command ...)'
tries to go via a shell?  Why doesn't it simply go directly, and
unambiguously, to the 'centry->command'?
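
To make the question concrete, here is a minimal sketch of the direct route,
assuming the command has already been split into an argument vector.  The
names "spawn_direct()" and "terminate_and_reap()" are hypothetical and do not
exist in heartbeat; the real code would also need to keep its
NewTrackedProc() bookkeeping in the parent.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not heartbeat code): fork and exec argv[0] directly,
 * with no intermediate shell.  The pid returned to the parent is therefore
 * the pid of the command itself, whatever /bin/sh happens to be.
 * Returns the child pid, or -1 if fork() fails.
 */
static pid_t spawn_direct(char *const argv[])
{
    pid_t pid = fork();

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);     /* only reached if execvp() failed */
    }
    return pid;
}

/*
 * Send SIGTERM to a child started by spawn_direct() and reap it.
 * Returns 1 if the child was terminated by that signal, 0 otherwise.
 */
static int terminate_and_reap(pid_t pid)
{
    int status = 0;

    if (kill(pid, SIGTERM) == -1)
        return 0;
    if (waitpid(pid, &status, 0) != pid)
        return 0;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM;
}
```

With an argument vector such as { "sleep", "30", NULL }, the pid returned by
spawn_direct() can be signalled and reaped directly.  The cost of this
approach is that "centry->command" is a single string (e.g. with "lrmd -r"),
so it would have to be split into arguments first, and any shell features in
the configured command (redirection, variables) would no longer work.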

This is beginning to feel like a Bash-ism bug.



-- 

:  David Lee                                I.T. Service          :
:  Senior Systems Programmer                Computer Centre       :
:  UNIX Team Leader                         Durham University     :
:                                           South Road            :
:  http://www.dur.ac.uk/t.d.lee/            Durham DH1 3LE        :
:  Phone: +44 191 334 2752                  U.K.                  :

