[LinuxFailSafe] crsd problem
Padmanabhan Sreenivasan
paddy@engr.sgi.com
Wed, 02 Jul 2003 08:07:48 -0700
rf@q-leap.de wrote:
> >>>>> "Paddy" == Padmanabhan Sreenivasan <paddy@sgi.com> writes:
>
> Paddy> rf@q-leap.de wrote:
> >> >>>>> "Lars" == Lars Marowsky-Bree <lmb@suse.de> writes:
>
> Lars> On 2003-06-06T21:43:06, rf@q-leap.de said:
> >> >> We have a problem with the crsd daemon. What happens is that always
> >> >> after a certain amount of time (approx. 6 days) after the crsd has
> >> >> started, there is a problem with its ipc communication. This can have
> >> >> the unfortunate effect, that resetting will not work anymore when a
> >> >> failover has to be done. The fact that this always happens after the
> >> >> same time period suggests that some integer counter is overflowing.
> >>
> Lars> Does anything cleanout the /tmp directory and remove the IPC
> Lars> socket...?
> >> No, the ipc file is still there after the error message (it is not a
> >> socket but a mmapped file):
> >>
> >> -rwx------ 1 root root 8220 Jun 4 19:14 /var/run/failsafe/comm/crsd-ipc_ha-test-1
>
> Paddy> This is a bug in libcrs.so.
>
> Paddy> Workaround is to restart cluster processes.
>
> Paddy> or make the following source change and rebuild
>
> Paddy> FailSafe/cluster_services/lib/libcrs/src/crsl_register.c
>
> Paddy,
>
> thanks for your reply. Unfortunately, the patch below won't fix the problem,
> since it is already included in the source code. Or should the correct line be
>
> ipcclnt_ctl(newchan->ipchdl, CI_IPC_NON_BLOCK|CI_IPC_NON_PULSE); ?

Yes, the correct line should pass both flags, CI_IPC_NON_BLOCK|CI_IPC_NON_PULSE,
to ipcclnt_ctl().

Paddy
>
>
> It really looks like a counter overflow because the error always occurs at
> exactly the same length of time after crsd has started.
>
> Roland
>
> Paddy> @@ -137,7 +137,7 @@
> Paddy>      goto done;
> Paddy>
> Paddy>      /* Make the handle nonblocking so we do not block trying to connect. */
> Paddy> -    ipcclnt_ctl(newchan->ipchdl, CI_IPC_NON_BLOCK|CI_IPC_NON_PULSE);
> Paddy> +    ipcclnt_ctl(newchan->ipchdl, CI_IPC_NON_BLOCK);
> Paddy>
> Paddy>      if ((err = ipcclnt_connect(newchan->ipchdl,
> Paddy>          crslsp->ipc_conn_file, CRS_DAEMON, CI_IPC_NOSIG)) != CI_SUCCESS)
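
Roland's counter-overflow hypothesis can be sanity-checked with a little
arithmetic. The sketch below is only an illustration: the function names are
hypothetical, and the thread does not say which counter (if any) is involved,
how wide it is, or how often it is incremented. It relates a counter width and
an increment rate to the time until the counter wraps, and the other way round.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until an unsigned counter of `bits` width wraps around when it is
 * incremented `ticks_per_sec` times per second. Both values are assumptions
 * supplied by the caller; nothing here is taken from the FailSafe sources. */
double seconds_until_wrap(unsigned bits, double ticks_per_sec)
{
    assert(bits > 0 && bits < 64);
    assert(ticks_per_sec > 0.0);
    return (double)(UINT64_C(1) << bits) / ticks_per_sec;
}

/* Inverse relation: the increment rate implied by an observed wrap period
 * (for example, the time between daemon start and the first IPC error). */
double implied_ticks_per_sec(unsigned bits, double wrap_seconds)
{
    assert(bits > 0 && bits < 64);
    assert(wrap_seconds > 0.0);
    return (double)(UINT64_C(1) << bits) / wrap_seconds;
}
```

For example, implied_ticks_per_sec(32, 6.0 * 86400.0) is roughly 8285
increments per second, while seconds_until_wrap(32, 1000.0) is roughly
49.7 days, so a plain 32-bit millisecond counter would not by itself explain
a 6-day period. Since the reported period is only approximate, these numbers
narrow down candidates rather than identify the counter.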