[LinuxFailSafe] crsd problem
rf@q-leap.de
Wed, 2 Jul 2003 07:38:36 +0200
>>>>> "Paddy" == Padmanabhan Sreenivasan <paddy@sgi.com> writes:
Paddy> rf@q-leap.de wrote:
>> >>>>> "Lars" == Lars Marowsky-Bree <lmb@suse.de> writes:
Lars> On 2003-06-06T21:43:06, rf@q-leap.de said:
>> >> We have a problem with the crsd daemon. What happens is that always
>> >> after a certain amount of time (approx. 6 days) after the crsd has
>> >> started, there is a problem with its ipc communication. This can have
>> >> the unfortunate effect, that resetting will not work anymore when a
>> >> failover has to be done. The fact that this always happens after the
>> >> same time period suggests that some integer counter is overflowing.
>>
Lars> Does anything cleanout the /tmp directory and remove the IPC
Lars> socket...?
>> No, the ipc file is still there after the error message (it is not a
>> socket but a mmapped file):
>>
>> -rwx------ 1 root root 8220 Jun 4 19:14 /var/run/failsafe/comm/crsd-ipc_ha-test-1
Paddy> This is a bug in libcrs.so.
Paddy> Workaround is to restart cluster processes.
Paddy> or make the following source change and rebuild
Paddy> FailSafe/cluster_services/lib/libcrs/src/crsl_register.c
Paddy,
thanks for your reply. Unfortunately, the patch below won't fix the problem,
since that change is already included in the source code. Or should the
correct line be the following instead?
    ipcclnt_ctl(newchan->ipchdl, CI_IPC_NON_BLOCK|CI_IPC_NON_PULSE);
It really looks like a counter overflow because the error always occurs at
exactly the same length of time after crsd has started.
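
One rough way to narrow down the overflow hypothesis is to compare the
observed interval with the wrap time of a counter of a given width and tick
rate. The sketch below is only an illustration: the helper names
wrap_seconds and implied_tick_rate are hypothetical and not part of FailSafe,
and nothing in this thread identifies which counter (if any) is involved.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until an unsigned counter of `bits` bits wraps around when it
 * is incremented `ticks_per_sec` times per second.
 * Hypothetical helper for illustration, not FailSafe code. */
double wrap_seconds(unsigned bits, double ticks_per_sec)
{
    assert(bits >= 1 && bits <= 63 && ticks_per_sec > 0.0);
    return (double)((uint64_t)1 << bits) / ticks_per_sec;
}

/* Tick rate (in Hz) at which a counter of `bits` bits would wrap after
 * exactly `seconds` seconds.
 * Hypothetical helper for illustration, not FailSafe code. */
double implied_tick_rate(unsigned bits, double seconds)
{
    assert(bits >= 1 && bits <= 63 && seconds > 0.0);
    return (double)((uint64_t)1 << bits) / seconds;
}
```

For example, wrap_seconds(32, 1000.0) / 86400.0 gives about 49.7 days for a
32-bit millisecond counter, and implied_tick_rate(31, 6 * 86400.0) gives
roughly 4142 Hz for a signed 32-bit counter that wraps after six days.
Measuring the exact interval between the crsd start and the first error would
make such a comparison more meaningful than the approximate figure of 6 days.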
Roland
Paddy> @@ -137,7 +137,7 @@
Paddy>          goto done;
Paddy>      /* Make the handle nonblocking so we do not block trying to connect. */
Paddy> -    ipcclnt_ctl(newchan->ipchdl, CI_IPC_NON_BLOCK|CI_IPC_NON_PULSE);
Paddy> +    ipcclnt_ctl(newchan->ipchdl, CI_IPC_NON_BLOCK);
Paddy>      if ((err = ipcclnt_connect(newchan->ipchdl,
Paddy>          crslsp->ipc_conn_file, CRS_DAEMON, CI_IPC_NOSIG)) != CI_SUCCESS)