[LinuxFailSafe] crsd problem

rf@q-leap.de rf@q-leap.de
Wed, 2 Jul 2003 07:38:36 +0200


>>>>> "Paddy" == Padmanabhan Sreenivasan <paddy@sgi.com> writes:

    Paddy> rf@q-leap.de wrote:
    >>  >>>>> "Lars" == Lars Marowsky-Bree <lmb@suse.de> writes:

    Lars> On 2003-06-06T21:43:06, rf@q-leap.de said:
    >>  >> We have a problem with the crsd daemon. What happens is that always
    >> >> after a certain amount of time (approx. 6 days) after the crsd has
    >> >> started, there is a problem with its ipc communication. This can have
    >> >> the unfortunate effect, that resetting will not work anymore when a
    >> >> failover has to be done. The fact that this always happens after the
    >> >> same time period suggests that some integer counter is overflowing.
    >> 
    Lars> Does anything cleanout the /tmp directory and remove the IPC
    Lars> socket...?
    >>  No, the ipc file is still there after the error message (it is not a
    >> socket but a mmapped file):
    >> 
    >> -rwx------ 1 root root 8220 Jun 4 19:14 /var/run/failsafe/comm/crsd-ipc_ha-test-1

    Paddy> This is a bug in libcrs.so.

    Paddy> Workaround is to restart cluster processes.

    Paddy> or make the following source change and rebuild

    Paddy> FailSafe/cluster_services/lib/libcrs/src/crsl_register.c

Paddy,

Thanks for your reply. Unfortunately, the patch below won't fix the problem,
since the change is already included in the source code. Or should the correct
line be

      ipcclnt_ctl(newchan->ipchdl, CI_IPC_NON_BLOCK|CI_IPC_NON_PULSE); ?

It really looks like a counter overflow because the error always occurs at
exactly the same length of time after crsd has started.
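
As a rough way to test the overflow hypothesis, the sketch below computes how
long a counter of a given width takes to wrap at a given tick rate, and the
tick rate that would make a given width wrap after a given number of days.
The function names, the counter widths, and the tick rates are all
hypothetical; nothing here is taken from the FailSafe sources, and the real
counter (if there is one) may look quite different.

```c
#include <assert.h>

#define SECONDS_PER_DAY 86400.0

/* 2^bits as a double, computed by repeated doubling. */
double pow2(unsigned bits)
{
    double v = 1.0;
    while (bits-- > 0)
        v *= 2.0;
    return v;
}

/* Days until a counter with `bits` usable bits wraps around when it is
 * incremented `ticks_per_sec` times per second. */
double wrap_days(unsigned bits, double ticks_per_sec)
{
    assert(ticks_per_sec > 0.0);
    return pow2(bits) / ticks_per_sec / SECONDS_PER_DAY;
}

/* Tick rate (in Hz) at which a counter with `bits` usable bits would wrap
 * after exactly `days` days. */
double rate_for_wrap(unsigned bits, double days)
{
    assert(days > 0.0);
    return pow2(bits) / (days * SECONDS_PER_DAY);
}
```

For example, wrap_days(32, 1000.0) comes out at about 49.7 days and
wrap_days(29, 1000.0) at about 6.2 days. Comparing such values against the
measured interval between crsd start and the first ipc error would show
whether any plausible width and tick rate fits.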

Roland

    Paddy> @@ -137,7 +137,7 @@ goto done;
    Paddy>
    Paddy>      /* Make the handle nonblocking so we do not block trying to connect. */
    Paddy> -    ipcclnt_ctl(newchan->ipchdl, CI_IPC_NON_BLOCK|CI_IPC_NON_PULSE);
    Paddy> +    ipcclnt_ctl(newchan->ipchdl, CI_IPC_NON_BLOCK);
    Paddy>
    Paddy>      if ((err = ipcclnt_connect(newchan->ipchdl, crslsp->ipc_conn_file,
    Paddy>                                 CRS_DAEMON, CI_IPC_NOSIG)) != CI_SUCCESS)