Re: [Linux-ha-dev] Hb-2.09: cib crashes on startup under solaris10/i386
Otte, Joerg
joerg.otte at nsn.com
Wed Jun 6 07:34:58 MDT 2007
Good news:
I had to source /etc/profile.local from heartbeat's startup script.
/etc/profile.local mainly sets LD_LIBRARY_PATH to point at some platform-specific
shared libraries:
--- ./heartbeat/init.d/heartbeat.in 2007-04-23 10:32:16.000000000 +0200
+++ ./heartbeat/init.d/heartbeat.in.patched 2007-06-05 15:56:06.000177000 +0200
@@ -43,6 +43,8 @@
# Default-Stop: 0 6
### END INIT INFO
+# define LD_LIBRARY_PATH
+test -s /etc/profile.local && . /etc/profile.local
HA_DIR=@sysconfdir@/ha.d; export HA_DIR
CONFIG=$HA_DIR/ha.cf
All the crashes I reported last week are gone now!
Thank you very much for your help.
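
For anyone who wants to try the guard from the patch outside the init script, here is a minimal sketch. The function name source_profile and its optional file argument are hypothetical and exist only for illustration; the one piece taken from the patch is the idiom of sourcing the file only when `test -s` says it exists and is non-empty.

```shell
#!/bin/sh
# Sketch only: source_profile is a hypothetical helper, not part of heartbeat.
# It sources a profile file if it exists and is non-empty (test -s), which is
# the same guard the patch adds to the init script, and reports what happened.
source_profile() {
    profile=${1:-/etc/profile.local}
    if test -s "$profile"; then
        # Source into the current shell so exported variables stay visible.
        . "$profile"
        echo "sourced $profile; LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
    else
        echo "skipped $profile (missing or empty)"
    fi
}
```

Calling `source_profile` with no argument checks /etc/profile.local itself and shows whether LD_LIBRARY_PATH ends up set in the calling shell.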
-----Original Message-----
From: linux-ha-dev-bounces at lists.linux-ha.org [mailto:linux-ha-dev-bounces at lists.linux-ha.org] On Behalf Of ext Andrew Beekhof
Sent: Tuesday, June 5, 2007 12:27
To: High-Availability Linux Development List
Subject: Re: [Linux-ha-dev] Hb-2.09: cib crashes on startup under solaris10/i386
On 6/5/07, Otte, Joerg <joerg.otte at nsn.com> wrote:
>
> > were there any errors before the crash?
> The only errors in the logs are several of these:
> heartbeat[536]: 2007/06/05_10:01:03 WARN: Exiting /usr/sfw/lib/python2.3/heartbeat/cib process 572 killed by signal 9.
signal 9? are you sending that signal?
> heartbeat[536]: 2007/06/05_10:01:03 ERROR: Respawning client "/usr/sfw/lib/python2.3/heartbeat/cib":
that's an odd place to put heartbeat :-)
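
To see at a glance which clients were killed and by which signal, log lines of the form quoted above can be reduced to three fields. This is only a sketch: killed_clients is a hypothetical helper name, and the pattern assumes the log lines look exactly like the "Exiting ... process ... killed by signal ..." line quoted in this thread.

```shell
# Sketch only: killed_clients is a hypothetical helper, not a heartbeat tool.
# Reads heartbeat log lines on stdin and prints "<client path> <pid> <signal>"
# for every "... Exiting <path> process <pid> killed by signal <n>." line.
# Lines that do not match (e.g. the "Respawning client" errors) are dropped.
killed_clients() {
    sed -n 's/^.*Exiting \([^ ]*\) process \([0-9][0-9]*\) killed by signal \([0-9][0-9]*\).*$/\1 \2 \3/p'
}
```

It reads from standard input, so a log file can be fed to it with a redirect; the location of the log depends on the local logging setup.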
>
>
> > in the meantime, you can probably just comment out that line as the
> > cib is seconds away from exiting and is "just" cleaning up.
> You mean removing the statement "hb_conn->llc_ops->delete(hb_conn);" in main.c?
right
> I already tried this. But now I get different cores.
> In the logs I see only cib crashing but the core is from crmd:
>
>
> bcm20-b:/ # pstack /var/ha/local/lib/heartbeat/cores/hacluster/core
> core '/var/ha/local/lib/heartbeat/cores/hacluster/core' of 576: /usr/sfw/lib/python2.3/heartbeat/crmd
> 080580e8 do_exit (40000000, 0, d, b, 17, 8091528) + ac
> 08055f1f do_fsa_action (40000000, 0, 805803c, 1, 600002a8, a) + c3
> 080561ea s_crmd_fsa_actions (8091528, 2000420, 0, 10000000, 4, 0) + 76
> 080579e1 s_crmd_fsa (d, 808ebf0, 8047738, 805fc47) + 225
> 0805fc69 crm_fsa_trigger (0, 0, 8047778, fef9380f) + 2d
> fef93857 G_TRIG_dispatch (808ebf0, 0, 0, 0) + a7
> fedbc77f g_main_context_dispatch (808b9f8, ffffff9c, 8089dd0, 0) + 1e7
> fedbe065 g_main_context_iterate (1, 808c6c0, 8047858, fedbe141, 80554d7, 1) + 41d
> fedbe2c0 g_main_loop_run (8089db8, 806f248, 806caa4, 806f21d, 0, 0) + 19c
> 080554d7 crmd_init (80478a0, 80553f1, 808900c, 808901c, 0, 806ca2e) + b3
> 08055748 main (1, 80478d0, 80478d8) + f0
> 080552cc _start (1, 8047a40, 0, 8047a66, 8047a78, 8047a87) + 80
> bcm20-b:/ # l /var/ha/local/lib/heartbeat/cores/hacluster/core
> -rw------- 1 hacluster haclient 5672714 Jun 5 10:01 /var/ha/local/lib/heartbeat/cores/hacluster/core
> bcm20-b:/ #
>
> GDB:
> Reading symbols from /lib/libavl.so.1...done.
> Loaded symbols for /lib/libavl.so.1
> #0 0x080580e8 in do_exit (action=Unhandled dwarf expression opcode 0x93
> ) at control.c:188
> 188 fsa_cluster_conn->llc_ops->delete(fsa_cluster_conn);
does the memory pointed to by fsa_cluster_conn seem sane?
if so then it's probably the same bug (and the same workaround would apply)
> (gdb) where
> #0 0x080580e8 in do_exit (action=Unhandled dwarf expression opcode 0x93
> ) at control.c:188
> #1 0x08055f1f in do_fsa_action (fsa_data=0x8091528, an_action=Unhandled dwarf expression opcode 0x93
> ) at fsa.c:177
> #2 0x080561ea in s_crmd_fsa_actions (fsa_data=0x8091528) at fsa.c:541
> #3 0x080579e1 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:314
> #4 0x0805fc69 in crm_fsa_trigger (user_data=0x0) at callbacks.c:661
> #5 0xfef93857 in G_TRIG_dispatch (source=0x808ebf0, callback=0, user_data=0x0) at GSource.c:1349
> #6 0xfedbc77f in g_main_context_dispatch () from /usr/local/lib/libglib-2.0.so.0
> #7 0xfedbe065 in g_main_context_iterate () from /usr/local/lib/libglib-2.0.so.0
> #8 0xfedbe2c0 in g_main_loop_run () from /usr/local/lib/libglib-2.0.so.0
> #9 0x080554d7 in crmd_init () at main.c:155
> #10 0x08055748 in main (argc=1, argv=0x80478d0) at main.c:122
>
>
>
> -----Original Message-----
> From: linux-ha-dev-bounces at lists.linux-ha.org [mailto:linux-ha-dev-bounces at lists.linux-ha.org] On Behalf Of ext Andrew Beekhof
> Sent: Monday, June 4, 2007 16:45
> To: High-Availability Linux Development List
> Subject: Re: [Linux-ha-dev] Hb-2.09: cib crashes on startup under solaris10/i386
>
> On 6/4/07, Otte, Joerg <joerg.otte at nsn.com> wrote:
> > OK, the patch works.
> > But now I stumbled across the next crash.
> > I am now trying the 2.09 version from SuSE.
> >
> > Cib crashes shortly after a reboot of the second node:
> >
> >
> > #0 0xfee4d1be in msg2ipcchan (m=0xffffffff, ch=0x8091498) at cl_msg.c:2028
> > 2028 if (ch->ops->send(ch, imsg) != IPC_OK) {
> > (gdb) where
> > #0 0xfee4d1be in msg2ipcchan (m=0xffffffff, ch=0x8091498) at cl_msg.c:2028
> > #1 0xfedc51cb in hb_api_signoff (cinfo=0xffffffff, need_destroy_chan=1) at client_lib.c:470
> > #2 0xfedc5351 in hb_api_delete (ci=0x80a0dc0) at client_lib.c:501
> > #3 0x0805f1a2 in main (argc=1, argv=0x8047770) at main.c:216
> >
> > (gdb) p *ch
> > $2 = {ch_status = 134814800, farside_pid = -1, ch_private = 0x807b9e0, ops = 0xffffffff, msgpad = 0,
> > bytes_remaining = 4294967295, should_send_block = 0, send_queue = 0xffffffff, recv_queue = 0xffffffff,
> > pool = 0xffffffff, high_flow_mark = -1, low_flow_mark = -1, high_flow_userdata = 0xffffffff,
> > low_flow_userdata = 0xffffffff, high_flow_callback = 0xffffffff, low_flow_callback = 0xffffffff, conntype = -1,
> > failreason = '' <repeats 128 times>}
> >
> > (gdb) p *ch.ops
> > Cannot access memory at address 0xffffffff
> > (gdb)
>
> ok, this is a little more serious.
> looks like there is a problem in the heartbeat api :-(
>
> can you create a bug for this please? If you use the "other"
> component it will get assigned to the right person (Alan).
>
> in the meantime, you can probably just comment out that line as the
> cib is seconds away from exiting and is "just" cleaning up.
>
> were there any errors before the crash?
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/