AW: [Linux-ha-dev] Hb-2.08/stable: crmd crashes on startup under solaris10/i386

Otte, Joerg joerg.otte at nsn.com
Thu May 31 07:48:48 MDT 2007


I must report another crash, this time in crmd.
After a reboot, the following crash occurs every
minute, again and again.

This crash prevents Heartbeat from starting up.

fead4c7c strlen   (807254e, 8047698, 8046230, 0) + c
feb2d3cb vsnprintf (8046270, 1400, 8072548, 8047698) + 73
fef885b8 cl_log   (6, 8072548, 806e14d, 0, 80713fa, 8071412) + 58
080571af do_state_transition (1000000, 0, 8, 6, 8091b10, 0) + 21b
080579d2 s_crmd_fsa (d, 8090c58, 8047738, 805fc1b) + 20e
0805fc3d crm_fsa_trigger (0, 0, 8047778, fef9380f) + 2d
fef93857 G_TRIG_dispatch (8090c58, 0, 0, 0) + a7
fedbc77f g_main_context_dispatch (808eb80, ffffff9c, 8089d90, 0) + 1e7
fedbe065 g_main_context_iterate (1, 808c2c0, 8047858, fedbe141, 80554d7, 1) + 41d
fedbe2c0 g_main_loop_run (8089d78, 806f228, 806ca84, 806f1fd, 0, 0) + 19c
080554d7 crmd_init (80478a0, 80553f1, 8088fcc, 8088fdc, 0, 806ca02) + b3
08055748 main     (1, 80478d0, 80478d8) + f0
080552cc _start   (1, 8047a40, 0, 8047a66, 8047a78, 8047a87) + 80
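
Note on the traces in this thread: each one ends in strlen called from vsnprintf inside cl_log, which fits the diagnosis quoted further down (a NULL pointer passed for a %s conversion). Passing NULL to %s is undefined behaviour in C; some C libraries print a placeholder such as "(null)", while others dereference the pointer and crash. The sketch below shows the guard pattern used by the patches in this thread. SAFE_STR, format_log and format_disconnect are hypothetical names for illustration only; SAFE_STR stands in for crm_str(), whose real definition may differ, and format_log stands in for a cl_log-style variadic logger.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for crm_str(): never hands NULL to a %s conversion. */
#define SAFE_STR(s) ((s) != NULL ? (s) : "<null>")

/* Hypothetical stand-in for a cl_log-style logger: formats into a caller
 * supplied buffer and returns what vsnprintf returns. */
static int format_log(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf, len, fmt, ap);
    va_end(ap);
    return n;
}

/* Builds the "Disconnecting name/channel..." message from the cib patch,
 * guarding both string arguments so that either one may be NULL. */
static int format_disconnect(char *buf, size_t len,
                             const char *name, const char *channel_name)
{
    return format_log(buf, len, "Disconnecting %s/%s...",
                      SAFE_STR(name), SAFE_STR(channel_name));
}
```

Calling format_disconnect(buf, sizeof buf, NULL, "cib_rw") writes "Disconnecting <null>/cib_rw..." into buf instead of handing a NULL pointer to vsnprintf. The same guard would have to be applied at every log call whose %s argument can be unset, which is presumably why the crashes surface one call site at a time.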


-----Original Message-----
From: linux-ha-dev-bounces at lists.linux-ha.org [mailto:linux-ha-dev-bounces at lists.linux-ha.org] On Behalf Of ext Otte, Joerg
Sent: Monday, 14 May 2007 14:55
To: High-Availability Linux Development List
Subject: AW: [Linux-ha-dev] Hb-2.08/stable: tengine crashes under solaris10/i386

I just found another crash, this time in tengine.
It looks like a problem similar to the last ones:

bcm20-a:/ # pstack /var/ha/local/lib/heartbeat/cores/hacluster/core
core '/var/ha/local/lib/heartbeat/cores/hacluster/core' of 9048:        /usr/sfw/lib/python2.3/hrtbeat/tengine
 feb14c7c strlen   (8057b56, 8047378, 8045f10, 0) + c
 feb6d3cb vsnprintf (8045f50, 1400, 8057b30, 8047378) + 73
 fef8cde4 cl_log   (4, 8057b30, 8056980, 80afb60, 80affc0, 0) + 58
 08053dd1 match_graph_event (e, 80e1bc0, 80d19e0, 4, 1, 8074688) + 141
 080544c7 process_graph_event (80e1bc0, 80d19e0, 0, 0) + 2cb
 080547f0 extract_event (80df560, 80da4a0, 0, 6) + 2a4
 08054f89 te_update_diff (80e12c0, 80e1ce0, 740df4a0, fef06895, 80dbec0, 8068fc4) + 121
 fef0699a cib_native_notify (806fe00, 80e1ce0, 80ddbe0, fef27f79, 80e1ce0, 806f140) + 116
 feddaa18 g_list_foreach (806c000, fef06884, 80e1ce0, fef0a0c0, 1, 8068fc4) + 1c
 fef0685a cib_native_rcvmsg (806f140, 0, 8047548, fef069f2, 8072420, 8068fc4) + 202
 fef06a3b cib_native_dispatch (8072420, 806f140, 8047598, fef87956) + 57
 fef87b36 G_CH_dispatch_int (8072578, 0, 0, 0) + 252
 feddc77f g_main_context_dispatch (806e700, ffffff9c, 807ab90, 4) + 1e7
 fedde065 g_main_context_iterate (1, 80723e8, 8047678, fedde141, 805655d, 0) + 41d
 fedde2c0 g_main_loop_run (80698f0, 0, 8056c11, 8057689, 60, 80689e8) + 19c
 0805655d init_start (8047730, 80524f9, 8068bb8, 8068bcc, 0, 80567d6) + 3b1
 080566e7 main     (1, 8047768, 8047770) + c3
 080523d4 _start   (1, 80478e8, 0, 8047911, 8047920, 804799f) + 80



-----Original Message-----
From: linux-ha-dev-bounces at lists.linux-ha.org [mailto:linux-ha-dev-bounces at lists.linux-ha.org] On Behalf Of ext Andrew Beekhof
Sent: Tuesday, 8 May 2007 15:25
To: High-Availability Linux Development List
Subject: Re: [Linux-ha-dev] Hb-2.08/stable: cib crashes under solaris 10/i386

On 5/8/07, Otte, Joerg <joerg.otte at nsn.com> wrote:
> The patches received do work, so far no more crashes.
> Thanks again for the patches.

np - they've been committed for the next release.

please let me know if you encounter any other problems

>
>
>
> -----Original Message-----
> From: linux-ha-dev-bounces at lists.linux-ha.org [mailto:linux-ha-dev-bounces at lists.linux-ha.org] On Behalf Of ext Andrew Beekhof
> Sent: Friday, 4 May 2007 11:26
> To: High-Availability Linux Development List
> Subject: Re: [Linux-ha-dev] Hb-2.08/stable: cib crashes under solaris 10/i386
>
> try:
>
> diff -r 5f8bd92d10d4 crm/cib/main.c
> --- a/crm/cib/main.c    Thu May 03 11:32:57 2007 +0200
> +++ b/crm/cib/main.c    Fri May 04 11:26:57 2007 +0200
> @@ -559,7 +559,8 @@ disconnect_cib_client(gpointer key, gpoi
>
>         if(a_client->channel->ch_status == IPC_CONNECT) {
>                 crm_warn("Disconnecting %s/%s...",
> -                        a_client->name, a_client->channel_name);
> +                        crm_str(a_client->name),
> +                        crm_str(a_client->channel_name));
>                 a_client->channel->ops->disconnect(a_client->channel);
>         }
>  }
>
>
> On 5/4/07, Otte, Joerg <joerg.otte at nsn.com> wrote:
> > I just found another crash in cib; it looks like a similar problem:
> >
> > bcm20-a:/ # pstack /var/ha/local/lib/heartbeat/cores/hacluster/core
> > core '/var/ha/local/lib/heartbeat/cores/hacluster/core' of 592: /usr/sfw/lib/python2.3/heartbeat/cib
> >  fea54c7c strlen   (805fdf1, 8047588, 8046120, 0) + c
> >  feaad3cb vsnprintf (8046160, 1400, 805fddd, 8047588) + 73
> >  fef8cde4 cl_log   (4, 805fddd, 805ed86, 0, 805f866, 0) + 58
> >  0805dbca disconnect_cib_client (8086760, 807c6e0, 0, 8079a00, fefaedc0, 0) + b6
> >  fee210fc g_hash_table_foreach (8079200, 805db14, 0, 0) + 4c
> >  0805dc31 cib_shutdown (f, 0, 8047638, fef8846b) + 35
> >  fef884b9 G_SIG_dispatch (8074fe8, 0, 0, 0) + ad
> >  fee2c77f g_main_context_dispatch (8075058, ffffff9c, 80e4c80, f) + 1e7
> >  fee2e065 g_main_context_iterate (1, 80ba4a8, 8047718, fee2e141, 805e355, 8075058) + 41d
> >  fee2e2c0 g_main_loop_run (80742b8, 805d5d4, 0, 1, 0, 1) + 19c
> >  0805e355 init_start (8047770, 80540bd, 8073458, 807346c, 0, 805e5ba) + 59d
> >  0805e4ff main     (1, 80477a0, 80477a8) + f7
> >  08053f98 _start   (1, 8047920, 0, 8047945, 804795c, 8047984) + 80
> >
> >
> >
> > -----Original Message-----
> > From: linux-ha-dev-bounces at lists.linux-ha.org [mailto:linux-ha-dev-bounces at lists.linux-ha.org] On Behalf Of ext Andrew Beekhof
> > Sent: Thursday, 3 May 2007 11:29
> > To: High-Availability Linux Development List
> > Subject: Re: [Linux-ha-dev] Hb-2.08/stable: cib crashes under solaris 10/i386
> >
> > On 5/2/07, Otte, Joerg <joerg.otte at nsn.com> wrote:
> > > I am trying to get heartbeat 2.08/stable running under Solaris 10 /
> > > I386.
> > > OS: SunOS bcm20-a 5.10 Generic_125101-03 i86pc i386 i86pc
> > >
> > > Whereas a V1 configuration seems to work properly (I didn't go into details
> > > yet),
> > > I currently have the following problems with a V2 configuration:
> > >
> > > Case 1) "The cib process crashes with core dump on the second node."
> >
> > I wonder... could this be as simple as trying to print a NULL pointer
> > as a string?
> >
> > Any chance you could try this patch:
> >
> > --- a/crm/cib/notify.c  Thu May 03 10:03:54 2007 +0200
> > +++ b/crm/cib/notify.c  Thu May 03 11:26:05 2007 +0200
> > @@ -392,11 +392,13 @@ cib_replace_notify(crm_data_t *update, e
> >
> >         if(add_updates != del_updates) {
> >                 crm_info("Replaced: %d.%d.%d -> %d.%d.%d from %s",
> > -                         del_admin_epoch, del_epoch, del_updates,
> > -                        add_admin_epoch, add_epoch, add_updates, origin);
> > +                        del_admin_epoch, del_epoch, del_updates,
> > +                        add_admin_epoch, add_epoch, add_updates,
> > +                        crm_str(origin));
> >         } else if(diff != NULL) {
> >                 crm_info("Local-only Replace: %d.%d.%d from %s",
> > -                         add_admin_epoch, add_epoch, add_updates, origin);
> > +                        add_admin_epoch, add_epoch, add_updates,
> > +                        crm_str(origin));
> >         }
> >
> >         replace_msg = ha_msg_new(8);
> >
> >
> > > Case 2) "Heartbeat/Stonithd hangs on shutdown."
> > >
> > >
> > > Attached logs cover the following situations:
> > >
> > > Case 1)
> > > - Heartbeat on node-a ("bcm20-a") came up successfully with a fresh
> > > cib.xml. Resources are
> > >   successfully started.
> > > - When I then start Heartbeat on node b ("bcm20-b") the cib process
> > > crashes on node b.
> > >
> > > This is the stack dump of the cib process on node b:
> > >
> > > core '/var/ha/local/lib/heartbeat/cores/hacluster/core' of 576: /usr/sfw/lib/python2.3/heartbeat/cib
> > >  fea54c7c strlen   (8061466, 80472f8, 8045e90, 0) + c
> > >  feaad3cb vsnprintf (8045ed0, 1400, 806143c, 80472f8) + 73
> > >  fef8cde4 cl_log   (6, 806143c, 805e9a1, 0, 0, 0) + 58
> > >  080590af cib_replace_notify (80a8120, 0, 80aa740, 809bac0) + 1ab
> > >  08057383 cib_process_replace (80a6590, 11100000, 0, 809dee0, 80bf1f0, 80473f4) + 197
> > >  0805a44e cib_process_command (8085520, 8047460, 8047464, 1, fea549d0, 0) + 30e
> > >  0805af60 cib_process_request (8085520, 0, 1, 1, 0) + 1e4
> > >  0805c264 cib_peer_callback (8085520, 8075f08, 80475a8, fef02825) + 1d8
> > >  fef02839 read_msg_w_callbacks (8075f08, 0, 80475c8, fef025d1) + 209
> > >  fef02c26 rcvmsg   (8075f08, 0, 5, 0) + 1e
> > >  0805c02e cib_ha_dispatch (807a058, 8075f08, 8047668, fef87956) + 86
> > >  fef87b36 G_CH_dispatch_int (807ccd0, 0, 0, 0) + 252
> > >  fee2c77f g_main_context_dispatch (8075058, 0, 8080490, d) + 1e7
> > >  fee2e065 g_main_context_iterate (1, 80ba4a8, 8047748, fee2e141, 805e355, 8075058) + 41d
> > >  fee2e2c0 g_main_loop_run (80742b8, 805d5d4, 0, 1, 0, 1) + 19c
> > >  0805e355 init_start (80477a0, 80540bd, 8073458, 807346c, 0, 805e5ba) + 59d
> > >  0805e4ff main     (1, 80477cc, 80477d4) + f7
> > >  08053f98 _start   (1, 8047948, 0, 804796d, 8047984, 80479ac) + 80
> > >
> > > attached files: case1.bcm20-a.tar.gz case1.bcm20-b.tar.gz
> > >
> > > Case 2)
> > > When I shut down heartbeat, it tells me:
> > > > bcm20-a:/ # /etc/init.d/heartbeat stop
> > > > Stopping High-Availability services:
> > > > Done.
> > >
> > > But the following processes remain running:
> > > > bcm20-a:/ # ptree -a 1125
> > > > 1     /sbin/init
> > > >   1125  /usr/sfw/lib/python2.3/heartbeat/heartbeat
> > > >     1129  /usr/sfw/lib/python2.3/heartbeat/heartbeat
> > > >     1130  /usr/sfw/lib/python2.3/heartbeat/heartbeat
> > > >     1131  /usr/sfw/lib/python2.3/heartbeat/heartbeat
> > > >     1132  /usr/sfw/lib/python2.3/heartbeat/heartbeat
> > > >     1133  /usr/sfw/lib/python2.3/heartbeat/heartbeat
> > > >     1135  sh -c /usr/sfw/lib/python2.3/heartbeat/ccm
> > > >       1144  /usr/sfw/lib/python2.3/heartbeat/ccm
> > > >     1136  sh -c /usr/sfw/lib/python2.3/heartbeat/cib
> > > >       1147  /usr/sfw/lib/python2.3/heartbeat/cib
> > > >     1137  sh -c /usr/sfw/lib/python2.3/heartbeat/lrmd -r
> > > >       1143  /usr/sfw/lib/python2.3/heartbeat/lrmd -r
> > > >     1138  sh -c /usr/sfw/lib/python2.3/heartbeat/stonithd
> > > >       1145  /usr/sfw/lib/python2.3/heartbeat/stonithd
> > >
> > > stonithd has the following file descriptors still open:
> > > > bcm20-a:/ # pfiles 1145
> > > > 1145:   /usr/sfw/lib/python2.3/heartbeat/stonithd
> > > >   Current rlimit: 256 file descriptors
> > > >    0: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
> > > >       O_RDONLY|O_LARGEFILE
> > > >       /devices/pseudo/mm at 0:null
> > > >    1: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
> > > >       O_WRONLY|O_LARGEFILE
> > > >       /devices/pseudo/mm at 0:null
> > > >    2: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
> > > >       O_WRONLY|O_LARGEFILE
> > > >       /devices/pseudo/mm at 0:null
> > > >    3: S_IFDOOR mode:0444 dev:279,0 ino:53 uid:0 gid:0 size:0
> > > >       O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[106]
> > > >       /var/run/name_service_door
> > > >    4: S_IFCHR mode:0000 dev:270,0 ino:39275 uid:0 gid:0 rdev:21,88
> > > >       O_WRONLY FD_CLOEXEC
> > > >       /devices/pseudo/log at 0:conslog
> > > >    5: S_IFSOCK mode:0666 dev:276,0 ino:17792 uid:0 gid:0 size:0
> > > >       O_RDWR|O_NONBLOCK
> > > >         SOCK_STREAM
> > > >         SO_SNDBUF(16384),SO_RCVBUF(5120)
> > > >         sockname: AF_UNIX
> > > >         peername: AF_UNIX /var/ha/local/run/heartbeat/register
> > > >    6: S_IFSOCK mode:0666 dev:276,0 ino:20034 uid:0 gid:0 size:0
> > > >       O_RDWR|O_NONBLOCK
> > > >         SOCK_STREAM
> > > >         SO_SNDBUF(16384),SO_RCVBUF(5120)
> > > >         sockname: AF_UNIX /var/ha/local/run/heartbeat/stonithd
> > > >    7: S_IFSOCK mode:0666 dev:276,0 ino:19832 uid:0 gid:0 size:0
> > > >       O_RDWR|O_NONBLOCK
> > > >         SOCK_STREAM
> > > >         SO_SNDBUF(16384),SO_RCVBUF(5120)
> > > >         sockname: AF_UNIX /var/ha/local/run/heartbeat/stonithd_callback
> > >
> > >
> > > attached files: case2.bcm20-a.tar.gz
> > >
> > > Shutdown proceeds normally if I kill stonithd (1145).
> > >
> > >
> > >
> > > Any help would be appreciated.
> > >
> > > Joerg
> > >
> > > _______________________________________________________
> > > Linux-HA-Dev: Linux-HA-Dev at lists.linux-ha.org
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > > Home Page: http://linux-ha.org/
> > >
> > >
> > >

