Re: [Linux-ha-dev] Hb-2.09: cib crashes on startup under solaris10/i386
Otte, Joerg
joerg.otte at nsn.com
Tue Jun 5 02:21:08 MDT 2007
> were there any errors before the crash?
The only errors in the logs are several occurrences of:
heartbeat[536]: 2007/06/05_10:01:03 WARN: Exiting /usr/sfw/lib/python2.3/heartbeat/cib process 572 killed by signal 9.
heartbeat[536]: 2007/06/05_10:01:03 ERROR: Respawning client "/usr/sfw/lib/python2.3/heartbeat/cib":
> in the meantime, you can probably just comment out that line as the
> cib is seconds away from exiting and is "just" cleaning up.
You mean removing the statement "hb_conn->llc_ops->delete(hb_conn);" in main.c?
I already tried this, but now I get different cores.
In the logs I see only cib crashing, but the core is from crmd:
bcm20-b:/ # pstack /var/ha/local/lib/heartbeat/cores/hacluster/core
core '/var/ha/local/lib/heartbeat/cores/hacluster/core' of 576: /usr/sfw/lib/python2.3/heartbeat/crmd
080580e8 do_exit (40000000, 0, d, b, 17, 8091528) + ac
08055f1f do_fsa_action (40000000, 0, 805803c, 1, 600002a8, a) + c3
080561ea s_crmd_fsa_actions (8091528, 2000420, 0, 10000000, 4, 0) + 76
080579e1 s_crmd_fsa (d, 808ebf0, 8047738, 805fc47) + 225
0805fc69 crm_fsa_trigger (0, 0, 8047778, fef9380f) + 2d
fef93857 G_TRIG_dispatch (808ebf0, 0, 0, 0) + a7
fedbc77f g_main_context_dispatch (808b9f8, ffffff9c, 8089dd0, 0) + 1e7
fedbe065 g_main_context_iterate (1, 808c6c0, 8047858, fedbe141, 80554d7, 1) + 41d
fedbe2c0 g_main_loop_run (8089db8, 806f248, 806caa4, 806f21d, 0, 0) + 19c
080554d7 crmd_init (80478a0, 80553f1, 808900c, 808901c, 0, 806ca2e) + b3
08055748 main (1, 80478d0, 80478d8) + f0
080552cc _start (1, 8047a40, 0, 8047a66, 8047a78, 8047a87) + 80
bcm20-b:/ # l /var/ha/local/lib/heartbeat/cores/hacluster/core
-rw------- 1 hacluster haclient 5672714 Jun 5 10:01 /var/ha/local/lib/heartbeat/cores/hacluster/core
bcm20-b:/ #
GDB:
Reading symbols from /lib/libavl.so.1...done.
Loaded symbols for /lib/libavl.so.1
#0 0x080580e8 in do_exit (action=Unhandled dwarf expression opcode 0x93
) at control.c:188
188 fsa_cluster_conn->llc_ops->delete(fsa_cluster_conn);
(gdb) where
#0 0x080580e8 in do_exit (action=Unhandled dwarf expression opcode 0x93
) at control.c:188
#1 0x08055f1f in do_fsa_action (fsa_data=0x8091528, an_action=Unhandled dwarf expression opcode 0x93
) at fsa.c:177
#2 0x080561ea in s_crmd_fsa_actions (fsa_data=0x8091528) at fsa.c:541
#3 0x080579e1 in s_crmd_fsa (cause=C_FSA_INTERNAL) at fsa.c:314
#4 0x0805fc69 in crm_fsa_trigger (user_data=0x0) at callbacks.c:661
#5 0xfef93857 in G_TRIG_dispatch (source=0x808ebf0, callback=0, user_data=0x0) at GSource.c:1349
#6 0xfedbc77f in g_main_context_dispatch () from /usr/local/lib/libglib-2.0.so.0
#7 0xfedbe065 in g_main_context_iterate () from /usr/local/lib/libglib-2.0.so.0
#8 0xfedbe2c0 in g_main_loop_run () from /usr/local/lib/libglib-2.0.so.0
#9 0x080554d7 in crmd_init () at main.c:155
#10 0x08055748 in main (argc=1, argv=0x80478d0) at main.c:122
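Both the cib crash and this crmd crash sit on the same kind of statement, a call through llc_ops->delete during shutdown. As a minimal sketch of the workaround discussed here (skipping that call while the process is exiting anyway), the following uses stand-in types. struct conn, struct conn_ops, fake_delete and shutdown_conn are hypothetical names for illustration and are not the heartbeat API. Note that a NULL check alone would not have caught the 0xffffffff pointer seen in the cib core, so the skip flag is the part that matters.

```c
#include <stddef.h>

/* Hypothetical stand-ins for heartbeat's connection type and its ops table. */
struct conn;

struct conn_ops {
    int (*delete)(struct conn *c);
};

struct conn {
    struct conn_ops *llc_ops;
};

/* Counts calls so the behaviour of shutdown_conn can be observed. */
static int delete_calls = 0;

/* Hypothetical delete callback; a real one would sign off and free the handle. */
static int fake_delete(struct conn *c)
{
    (void)c;
    delete_calls++;
    return 0;
}

/*
 * Tear down *connp during exit.
 * The caller's handle is always cleared so it cannot be used again.
 * When skip_delete is non-zero the delete callback is not invoked,
 * which mirrors commenting out the llc_ops->delete(...) line.
 * Returns 1 if the delete callback was invoked, 0 otherwise.
 */
static int shutdown_conn(struct conn **connp, int skip_delete)
{
    struct conn *c = *connp;

    *connp = NULL;

    if (c == NULL || skip_delete) {
        return 0;
    }
    if (c->llc_ops == NULL || c->llc_ops->delete == NULL) {
        return 0;
    }
    c->llc_ops->delete(c);
    return 1;
}
```

Calling shutdown_conn(&handle, 1) leaves the connection undeleted and relies on process exit to release it; calling it with 0 performs the normal delete exactly once, since a second call sees a NULL handle.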
-----Original Message-----
From: linux-ha-dev-bounces at lists.linux-ha.org [mailto:linux-ha-dev-bounces at lists.linux-ha.org] On behalf of ext Andrew Beekhof
Sent: Monday, June 4, 2007 16:45
To: High-Availability Linux Development List
Subject: Re: [Linux-ha-dev] Hb-2.09: cib crashes on startup under solaris10/i386
On 6/4/07, Otte, Joerg <joerg.otte at nsn.com> wrote:
> OK, the patch works.
> But now I stumbled across the next crash.
> I am now trying the 2.09 Version from SuSE.
>
> Cib crashes shortly after a reboot of the second node:
>
>
> #0 0xfee4d1be in msg2ipcchan (m=0xffffffff, ch=0x8091498) at cl_msg.c:2028
> 2028 if (ch->ops->send(ch, imsg) != IPC_OK) {
> (gdb) where
> #0 0xfee4d1be in msg2ipcchan (m=0xffffffff, ch=0x8091498) at cl_msg.c:2028
> #1 0xfedc51cb in hb_api_signoff (cinfo=0xffffffff, need_destroy_chan=1) at client_lib.c:470
> #2 0xfedc5351 in hb_api_delete (ci=0x80a0dc0) at client_lib.c:501
> #3 0x0805f1a2 in main (argc=1, argv=0x8047770) at main.c:216
>
> (gdb) p *ch
> $2 = {ch_status = 134814800, farside_pid = -1, ch_private = 0x807b9e0, ops = 0xffffffff, msgpad = 0,
> bytes_remaining = 4294967295, should_send_block = 0, send_queue = 0xffffffff, recv_queue = 0xffffffff,
> pool = 0xffffffff, high_flow_mark = -1, low_flow_mark = -1, high_flow_userdata = 0xffffffff,
> low_flow_userdata = 0xffffffff, high_flow_callback = 0xffffffff, low_flow_callback = 0xffffffff, conntype = -1,
> failreason = '' <repeats 128 times>}
>
> (gdb) p *ch.ops
> Cannot access memory at address 0xffffffff
> (gdb)
ok, this is a little more serious.
looks like there is a problem in the heartbeat api :-(
can you create a bug for this please? If you use the "other"
component it will get assigned to the right person (Alan).
in the meantime, you can probably just comment out that line as the
cib is seconds away from exiting and is "just" cleaning up.
were there any errors before the crash?
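The "p *ch" output quoted above shows most pointer fields of the channel set to 0xffffffff and several integer fields set to -1. One possible reading, not confirmed in this thread, is that the structure was overwritten or already released before hb_api_signoff used it. A small diagnostic along these lines could flag such a structure before it is dereferenced. count_ff_bytes and looks_overwritten are hypothetical helpers, not part of heartbeat, and the one-half threshold is an arbitrary assumption.

```c
#include <stddef.h>

/* Hypothetical helper: count how many of the n bytes at p equal 0xff. */
static size_t count_ff_bytes(const void *p, size_t n)
{
    const unsigned char *b = p;
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (b[i] == 0xff) {
            count++;
        }
    }
    return count;
}

/*
 * Returns 1 when at least half of the n bytes at p are 0xff, else 0.
 * The threshold is an assumption chosen for illustration only.
 */
static int looks_overwritten(const void *p, size_t n)
{
    return n > 0 && count_ff_bytes(p, n) * 2 >= n;
}
```

Such a check is only a heuristic for debugging builds: a structure can legitimately contain -1 values, so a positive result is a hint to look closer, not proof of corruption.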
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/