[Linux-HA] stonith core dumping on Solaris -> Finally Solved + Fixed :-)

Andrew Voumard andrewv at melbpc.org.au
Mon Oct 10 06:29:06 MDT 2005


Hi,

*********************************************
I believe I have found the reason for stonith
(and perhaps other things in heartbeat) core
dumping on Solaris. Looking back in the
mailing lists, this one has been around for
a _long_ time.
*********************************************
Thanks to john_kavadias A T h o t m a i l D O T c o m,
whom I work with, who also helped find the problem.
(he also posts to this list sometimes).
*********************************************

Details:

Environment:
============

HA 1.2.3
Solaris 2.8
gcc 3.4.4

To Fix (for Solaris - the same fix would
work on Linux too, but it would waste a
little memory there):
========================================

In replace/scandir.c, change line 130 from:

if (copy = (struct dirent *) malloc (sizeof (struct dirent))

TO:

if (copy = (struct dirent *) malloc (sizeof (struct dirent) + 
strlen(entry->d_name))

WHY?:

At line 139, the code does a:

strcpy (copy->d_name, entry->d_name);

Which is fine in Linux land, where /usr/include/bits/dirent.h has:

================
struct dirent
   {
#ifndef __USE_FILE_OFFSET64
     __ino_t d_ino;
     __off_t d_off;
#else
     __ino64_t d_ino;
     __off64_t d_off;
#endif
     unsigned short int d_reclen;
     unsigned char d_type;
     char d_name[256];           /* We must not include limits.h! */
   };

But on Solaris:
===============

In /usr/include/sys/dirent.h:

/*
  * File-system independent directory entry.
  */
typedef struct dirent {
         ino_t           d_ino;          /* "inode number" of entry */
         off_t           d_off;          /* offset of disk directory entry */
         unsigned short  d_reclen;       /* length of this record */
         char            d_name[1];      /* name of file */
} dirent_t;


Note the 256-char d_name array at the *end* of the struct
in Linux, compared to the 1-char d_name array in the
Solaris variant!

So, on Solaris, dirent is used as a "header" in
an allocated chunk of memory, and the dirent struct must
be followed by strlen(d_name) bytes (d_name on Solaris
does contribute one char to the d_name string storage,
which can count towards storing the '\0' string terminator).

Without the additional allocation, whatever followed
in memory after the dirent struct was being overwritten
("clobbered") by the strcpy().
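
To make the sizing concrete, here is a minimal sketch of a copy
routine that allocates room for the trailing name. It is not the
code in replace/scandir.c: struct sol_dirent, sol_dirent_size() and
sol_dirent_copy() are hypothetical names, and the struct is only a
stand-in for the Solaris layout quoted above so that the sketch
behaves the same on any platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the Solaris layout quoted above. */
struct sol_dirent {
    long           d_ino;
    long           d_off;
    unsigned short d_reclen;
    char           d_name[1];   /* the name continues past the struct */
};

/* Bytes needed for the header fields plus the whole name and its '\0'. */
size_t sol_dirent_size(const char *name)
{
    return offsetof(struct sol_dirent, d_name) + strlen(name) + 1;
}

/* Copy an entry together with its trailing name.
 * Returns NULL if malloc fails; the caller frees the result. */
struct sol_dirent *sol_dirent_copy(const struct sol_dirent *entry)
{
    size_t used;
    size_t size;
    struct sol_dirent *copy;

    assert(entry != NULL);
    used = sol_dirent_size(entry->d_name);

    /* Never allocate less than the struct itself (short names). */
    size = used < sizeof(struct sol_dirent) ? sizeof(struct sol_dirent) : used;

    copy = malloc(size);
    if (copy == NULL)
        return NULL;

    /* One memcpy moves the header fields and the name, terminator included. */
    memcpy(copy, entry, used);
    return copy;
}
```

Sizing from offsetof() rather than sizeof() is one choice among
several; the sizeof (struct dirent) + strlen() form in the fix above
also leaves enough room, at the cost of a few spare bytes.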

At the bottom of scandir.c, it does a qsort also. Does this
move around dirent structures, or just pointers to them ? If
it's the former, then the trailing string will need to be
moved too.
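
For what it's worth, here is a small sketch of the "just pointers"
case. I have not confirmed which case replace/scandir.c falls into;
the sketch assumes the list is an array of pointers, as in the
standard scandir() namelist. struct rec and the rec_*() helpers are
hypothetical names made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record with a Solaris-style trailing name. */
struct rec {
    unsigned short d_reclen;
    char           d_name[1];
};

/* Start of the name stored after the header fields. */
const char *rec_name(const struct rec *r)
{
    return (const char *)r + offsetof(struct rec, d_name);
}

/* Allocate a record with room for the whole name; caller frees it. */
struct rec *rec_new(const char *name)
{
    size_t len = strlen(name) + 1;
    size_t size = offsetof(struct rec, d_name) + len;
    struct rec *r;

    if (size < sizeof(struct rec))
        size = sizeof(struct rec);
    r = malloc(size);
    if (r == NULL)
        return NULL;
    r->d_reclen = (unsigned short)size;
    memcpy((char *)r + offsetof(struct rec, d_name), name, len);
    return r;
}

/* qsort comparator: each array element is a "struct rec *", so a and b
 * point at pointers, not at the records themselves. */
int rec_ptr_cmp(const void *a, const void *b)
{
    const struct rec *ra = *(const struct rec *const *)a;
    const struct rec *rb = *(const struct rec *const *)b;
    return strcmp(rec_name(ra), rec_name(rb));
}

/* Sort an array of record pointers by name. qsort swaps only the
 * pointer-sized elements; the records and their names do not move. */
void rec_sort(struct rec **list, size_t n)
{
    assert(list != NULL || n == 0);
    qsort(list, n, sizeof(list[0]), rec_ptr_cmp);
}
```

In this arrangement the trailing names are safe, because qsort is
told the element size is sizeof(struct rec *). If the array held the
structs themselves, qsort would move only sizeof(struct rec) bytes
per element and the names would be left behind.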

I am not familiar with replace/scandir.c, but perhaps Alan
could check over it, to see if there are any other places
where struct dirent is effectively initialized, copied,
assigned to, or destroyed, as a similar fix would be needed
to "copy across" the "trailing" d_name data (on Solaris).

After fixing this against 1.2.3, it ran without core
dumping on Solaris. Note that we have had the same core
dump while testing with HA 2.0.2 on Solaris.

Thanks again for the software folks,
Andrew


