AW: AW: [Linux-ha-dev] Solaris 10/i386: Heartbeat/Stonithd
hangs on shutdown.
David Lee
t.d.lee at durham.ac.uk
Fri May 11 05:14:02 MDT 2007
On Fri, 11 May 2007, Otte, Joerg wrote:
> As far as I can see, the difference between Bourne and Bash
> shows up in the following process trees:
> Bourne:
> 1125 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 1129 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 1130 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 1131 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 1132 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 1133 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 1135 sh -c /usr/sfw/lib/python2.3/heartbeat/ccm
> 1144 /usr/sfw/lib/python2.3/heartbeat/ccm
> 1136 sh -c /usr/sfw/lib/python2.3/heartbeat/cib
> 1147 /usr/sfw/lib/python2.3/heartbeat/cib
> 1137 sh -c /usr/sfw/lib/python2.3/heartbeat/lrmd -r
> 1143 /usr/sfw/lib/python2.3/heartbeat/lrmd -r
> 1138 sh -c /usr/sfw/lib/python2.3/heartbeat/stonithd
> 1145 /usr/sfw/lib/python2.3/heartbeat/stonithd
> ..
>
> Bash:
> 2849 /usr/sfw/lib/python2.3/heartbeat/ha_logd -d
> 2858 /usr/sfw/lib/python2.3/heartbeat/ha_logd -d
> 2859 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 2863 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 2864 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 2865 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 2866 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 2867 /usr/sfw/lib/python2.3/heartbeat/heartbeat
> 2870 /usr/sfw/lib/python2.3/heartbeat/ccm
> 2871 /usr/sfw/lib/python2.3/heartbeat/cib
> 2872 /usr/sfw/lib/python2.3/heartbeat/lrmd -r
> 2873 /usr/sfw/lib/python2.3/heartbeat/stonithd
> ..
>
> When using the Bourne shell to start a child process I get
> 2 processes (in the case of stonithd: 1138 and 1145).
> When using Bash I get only one child process (2873).
>
> When I try to kill 1138 from command line nothing happens.
> When I kill 1145 from command line both processes (1138,1145)
> disappear. So I think heartbeat always tries to kill 1138
> which does not work.
>
> Just an idea: A solution may be to start child processes
> directly (not via the shell). This would work independently
> of the installed shell.
Many thanks, Joerg! That's most helpful.
That reasoning seems OK, although I'm no expert on this. What we now need
is advice from the main heartbeat developers (Alan, Lars, Andrew, ...).
In what follows, I refer to the "dev" tree as of now, 11/May/2007.
File "heartbeat/heartbeat.c", function "start_a_child_client()" does:
-----------------------------
    switch(pid=fork()) {
        case -1:    cl_log(LOG_ERR
                    ,   "start_a_child_client: Cannot fork.");
                    return;
        default:    /* Parent */
                    NewTrackedProc(pid, 1, PT_LOGVERBOSE
                    ,   centry, &ManagedChildTrackOps);
                    hb_pop_deadtime(NULL);
                    return;
-----------------------------
in which the 'default' clause (the parent) stores the child "pid". The
child then proceeds to:
-----------------------------
(void)execl("/bin/sh", "sh", "-c", centry->command
, (const char *)NULL);
-----------------------------
It might be that different implementations of "/bin/sh" do different
things. For instance, some Bournes (e.g. Solaris) might (I speculate!)
spawn "centry->command" as a subprocess (a further fork/exec),
whereas Bash might (speculation continues) do a direct replacement "exec".
It seems that heartbeat requires the 'direct replacement "exec"' style of
operation (stored child pid is the final "centry->command"), which Bash
seems to give but which Bourne (child pid is intermediate "sh") doesn't.
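The two styles can be told apart, and the 'direct replacement' style forced, with something like the sketch below. It is only an illustration and not code from the heartbeat tree: the helper name run_via_shell() and its parameters are invented for this example. It runs a command through "sh -c", optionally prefixed with the shell built-in "exec" (which makes the shell replace itself with the command instead of forking), and returns the pid that fork() handed to the parent.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run "command" via /bin/sh -c, optionally prefixed
 * with "exec ", and capture what the command writes to stdout in "out".
 * Returns the pid that fork() gave the parent, or -1 on error.
 */
static pid_t run_via_shell(const char *command, int use_exec_prefix,
                           char *out, size_t outlen)
{
    int fds[2];
    char *full;
    size_t need, used = 0;
    ssize_t n;
    pid_t pid;

    assert(command != NULL && out != NULL && outlen > 0);

    /* Room for the optional "exec " prefix, the command and the NUL. */
    need = strlen(command) + sizeof("exec ");
    full = malloc(need);
    if (full == NULL)
        return -1;
    snprintf(full, need, "%s%s", use_exec_prefix ? "exec " : "", command);

    if (pipe(fds) == -1) {
        free(full);
        return -1;
    }

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        free(full);
        return -1;
    }
    if (pid == 0) {
        /* Child: send stdout into the pipe, then hand over to the shell. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", full, (char *)NULL);
        _exit(127);             /* only reached if execl() failed */
    }

    /* Parent: collect the output, then reap the child. */
    close(fds[1]);
    free(full);
    while (used + 1 < outlen &&
           (n = read(fds[0], out + used, outlen - 1 - used)) > 0)
        used += (size_t)n;
    out[used] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return pid;
}
```

For example, run_via_shell("/bin/sh -c 'echo $$'", 1, out, sizeof out) makes the command print its own pid, which with the "exec " prefix should equal the returned pid. With the prefix switched off, the two numbers differ on a shell that forks for "-c" commands. The prefix only suits a single simple command, so this is a workaround rather than a general fix.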
Is there any reason why this 'execl("/bin/sh" ... centry->command ...)'
tries to go via a shell? Why doesn't it simply go directly, and
unambiguously, to the 'centry->command'?
This is beginning to feel like a Bash-ism bug.
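To make the 'go directly' idea concrete, here is a minimal sketch of starting the command without any shell, so that the pid stored by the parent is the command itself. It is not a patch against heartbeat.c: the names split_command(), spawn_direct(), terminate_child() and MAX_ARGS are hypothetical, and the splitting is a deliberate simplification (spaces and tabs only, no quoting, globbing or redirection, all of which a shell would otherwise provide).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_ARGS 32     /* hypothetical limit, for illustration only */

/*
 * Split "buf" in place on spaces and tabs (no quoting support).
 * Returns the number of words, or -1 if they do not fit in argv.
 */
static int split_command(char *buf, char *argv[], int max_args)
{
    int argc = 0;
    char *save = NULL;
    char *tok = strtok_r(buf, " \t", &save);

    while (tok != NULL) {
        if (argc >= max_args - 1)
            return -1;          /* keep one slot for the NULL terminator */
        argv[argc++] = tok;
        tok = strtok_r(NULL, " \t", &save);
    }
    argv[argc] = NULL;
    return argc;
}

/*
 * Fork and exec the command directly, with no intermediate shell.
 * Returns the child's pid, or -1 on error.
 */
static pid_t spawn_direct(const char *command)
{
    char *argv[MAX_ARGS];
    char *buf;
    pid_t pid;

    assert(command != NULL);
    buf = strdup(command);
    if (buf == NULL)
        return -1;
    if (split_command(buf, argv, MAX_ARGS) <= 0) {
        free(buf);
        return -1;
    }

    pid = fork();
    if (pid == 0) {
        execv(argv[0], argv);   /* expects a full path, as in the listings */
        _exit(127);             /* only reached if execv() failed */
    }
    free(buf);                  /* the child has its own copy */
    return pid;
}

/* Send SIGTERM to the child and reap it; returns the wait status or -1. */
static int terminate_child(pid_t pid)
{
    int status;

    if (kill(pid, SIGTERM) == -1)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

With this shape, a signal sent to the stored pid reaches the command whichever "/bin/sh" is installed. The cost is that any command entry relying on shell syntax would stop working, which may be the reason the shell was used in the first place.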
--
: David Lee I.T. Service :
: Senior Systems Programmer Computer Centre :
: UNIX Team Leader Durham University :
: South Road :
: http://www.dur.ac.uk/t.d.lee/ Durham DH1 3LE :
: Phone: +44 191 334 2752 U.K. :