[Linux-ha-dev] Hb-2.08/stable: cib crashes under solaris 10/i386

Otte, Joerg joerg.otte at nsn.com
Wed May 2 06:03:19 MDT 2007


I am trying to get heartbeat 2.08/stable running under Solaris 10 / i386.
OS: SunOS bcm20-a 5.10 Generic_125101-03 i86pc i386 i86pc

Whereas a V1 configuration seems to work properly (I haven't gone into
details yet), I currently have the following problems with a V2
configuration:

Case 1) "The cib process crashes with core dump on the second node."
Case 2) "Heartbeat/Stonithd hangs on shutdown."


Attached logs cover the following situations:

Case 1)
- Heartbeat on node a ("bcm20-a") came up successfully with a fresh
  cib.xml. Resources are successfully started.
- When I then start Heartbeat on node b ("bcm20-b"), the cib process
  crashes on node b.

This is the stack dump of the cib process on node b:

core '/var/ha/local/lib/heartbeat/cores/hacluster/core' of 576: /usr/sfw/lib/python2.3/heartbeat/cib
 fea54c7c strlen   (8061466, 80472f8, 8045e90, 0) + c
 feaad3cb vsnprintf (8045ed0, 1400, 806143c, 80472f8) + 73
 fef8cde4 cl_log   (6, 806143c, 805e9a1, 0, 0, 0) + 58
 080590af cib_replace_notify (80a8120, 0, 80aa740, 809bac0) + 1ab
 08057383 cib_process_replace (80a6590, 11100000, 0, 809dee0, 80bf1f0, 80473f4) + 197
 0805a44e cib_process_command (8085520, 8047460, 8047464, 1, fea549d0, 0) + 30e
 0805af60 cib_process_request (8085520, 0, 1, 1, 0) + 1e4
 0805c264 cib_peer_callback (8085520, 8075f08, 80475a8, fef02825) + 1d8
 fef02839 read_msg_w_callbacks (8075f08, 0, 80475c8, fef025d1) + 209
 fef02c26 rcvmsg   (8075f08, 0, 5, 0) + 1e
 0805c02e cib_ha_dispatch (807a058, 8075f08, 8047668, fef87956) + 86
 fef87b36 G_CH_dispatch_int (807ccd0, 0, 0, 0) + 252
 fee2c77f g_main_context_dispatch (8075058, 0, 8080490, d) + 1e7
 fee2e065 g_main_context_iterate (1, 80ba4a8, 8047748, fee2e141, 805e355, 8075058) + 41d
 fee2e2c0 g_main_loop_run (80742b8, 805d5d4, 0, 1, 0, 1) + 19c
 0805e355 init_start (80477a0, 80540bd, 8073458, 807346c, 0, 805e5ba) + 59d
 0805e4ff main     (1, 80477cc, 80477d4) + f7
 08053f98 _start   (1, 8047948, 0, 804796d, 8047984, 80479ac) + 80
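
The top three frames (strlen called from vsnprintf called from cl_log) would be consistent with cl_log() being handed a NULL or otherwise invalid string for a "%s" conversion in cib_replace_notify. Passing NULL to "%s" is undefined behaviour in C: glibc happens to print "(null)", whereas the Solaris libc dereferences the pointer. This is only an assumption; the trace alone does not prove it. A minimal sketch of a guard follows, where safe_str and format_notice are hypothetical names for illustration and are not taken from the heartbeat sources:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: return a printable placeholder instead of NULL,
 * so that a "%s" conversion never receives a NULL pointer. */
const char *safe_str(const char *s)
{
    return s != NULL ? s : "<null>";
}

/* Hypothetical stand-in for a cl_log()-style call that formats two
 * string arguments. Both are routed through safe_str() first.
 * Returns whatever snprintf() returns. */
int format_notice(char *buf, size_t len, const char *origin, const char *op)
{
    return snprintf(buf, len, "Replaced CIB: origin=%s op=%s",
                    safe_str(origin), safe_str(op));
}
```

Calling format_notice(buf, sizeof buf, NULL, "cib_replace") fills buf with a line containing "origin=<null>" instead of crashing. The same wrapping could be applied to each string argument of the real log call, if a NULL argument turns out to be the cause.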

attached files: case1.bcm20-a.tar.gz case1.bcm20-b.tar.gz

Case 2)
When I shut down heartbeat, it tells me:
> bcm20-a:/ # /etc/init.d/heartbeat stop
> Stopping High-Availability services:
> Done.

But the following processes remain running:
> bcm20-a:/ # ptree -a 1125
> 1     /sbin/init
>   1125  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>     1129  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>     1130  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>     1131  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>     1132  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>     1133  /usr/sfw/lib/python2.3/heartbeat/heartbeat
>     1135  sh -c /usr/sfw/lib/python2.3/heartbeat/ccm
>       1144  /usr/sfw/lib/python2.3/heartbeat/ccm
>     1136  sh -c /usr/sfw/lib/python2.3/heartbeat/cib
>       1147  /usr/sfw/lib/python2.3/heartbeat/cib
>     1137  sh -c /usr/sfw/lib/python2.3/heartbeat/lrmd -r
>       1143  /usr/sfw/lib/python2.3/heartbeat/lrmd -r
>     1138  sh -c /usr/sfw/lib/python2.3/heartbeat/stonithd
>       1145  /usr/sfw/lib/python2.3/heartbeat/stonithd

stonithd has the following file descriptors still open:
> bcm20-a:/ # pfiles 1145
> 1145:   /usr/sfw/lib/python2.3/heartbeat/stonithd
>   Current rlimit: 256 file descriptors
>    0: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
>       O_RDONLY|O_LARGEFILE
>       /devices/pseudo/mm@0:null
>    1: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
>       O_WRONLY|O_LARGEFILE
>       /devices/pseudo/mm@0:null
>    2: S_IFCHR mode:0666 dev:270,0 ino:6815752 uid:0 gid:3 rdev:13,2
>       O_WRONLY|O_LARGEFILE
>       /devices/pseudo/mm@0:null
>    3: S_IFDOOR mode:0444 dev:279,0 ino:53 uid:0 gid:0 size:0
>       O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[106]
>       /var/run/name_service_door
>    4: S_IFCHR mode:0000 dev:270,0 ino:39275 uid:0 gid:0 rdev:21,88
>       O_WRONLY FD_CLOEXEC
>       /devices/pseudo/log@0:conslog
>    5: S_IFSOCK mode:0666 dev:276,0 ino:17792 uid:0 gid:0 size:0
>       O_RDWR|O_NONBLOCK
>         SOCK_STREAM
>         SO_SNDBUF(16384),SO_RCVBUF(5120)
>         sockname: AF_UNIX
>         peername: AF_UNIX /var/ha/local/run/heartbeat/register
>    6: S_IFSOCK mode:0666 dev:276,0 ino:20034 uid:0 gid:0 size:0
>       O_RDWR|O_NONBLOCK
>         SOCK_STREAM
>         SO_SNDBUF(16384),SO_RCVBUF(5120)
>         sockname: AF_UNIX /var/ha/local/run/heartbeat/stonithd
>    7: S_IFSOCK mode:0666 dev:276,0 ino:19832 uid:0 gid:0 size:0
>       O_RDWR|O_NONBLOCK
>         SOCK_STREAM
>         SO_SNDBUF(16384),SO_RCVBUF(5120)
>         sockname: AF_UNIX /var/ha/local/run/heartbeat/stonithd_callback


attached files: case2.bcm20-a.tar.gz

Shutdown proceeds normally if I kill stonithd (1145).



Any help would be appreciated.

Joerg
-------------- next part --------------
A non-text attachment was scrubbed...
Name: case1.bcm20-a.tar.gz
Type: application/x-gzip
Size: 8880 bytes
Desc: case1.bcm20-a.tar.gz
Url : http://lists.community.tummy.com/pipermail/linux-ha-dev/attachments/20070502/caf693ee/case1.bcm20-a.tar-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: case1.bcm20-b.tar.gz
Type: application/x-gzip
Size: 7847 bytes
Desc: case1.bcm20-b.tar.gz
Url : http://lists.community.tummy.com/pipermail/linux-ha-dev/attachments/20070502/caf693ee/case1.bcm20-b.tar-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: case2.bcm20-a.tar.gz
Type: application/x-gzip
Size: 41084 bytes
Desc: case2.bcm20-a.tar.gz
Url : http://lists.community.tummy.com/pipermail/linux-ha-dev/attachments/20070502/caf693ee/case2.bcm20-a.tar-0001.bin

