[LinuxFailSafe] problem resetting node

Tabitha Taylor tabtaylor@excite.com
Tue, 24 Sep 2002 11:29:33 -0400 (EDT)


Could anyone help me out with these crsd errors?  I have no
idea what they mean.


I am using the latest FailSafe build, version 1.0.4, from oss.sgi.com:

PACKAGE_NAME = FailSafe
PACKAGE_RELEASE = 1
PACKAGE_VERSION = 1.0.4
PACKAGE_DISTRIBUTION = SGI 1.0


I can reset the node with no problem using the stonith command from the command line.


rpc.cfg file contents:

/dev/ttyS0 HA2 0


Running this command:

/usr/local/sbin/stonith -t rps10  HA2

resets the node HA2 every time.



Here is the crsd_HA1 log after trying to reset HA2 from within FailSafe:

Tue Sep 24 11:01:07.210  Reset request from sysctlrKill:1 (local) for HA2:2.
Tue Sep 24 11:01:07.245  Running pre-reset power fail test.
Tue Sep 24 11:01:07.246  Powerfail (node HA2) called with status = CRSRS_NORESPONSE:6, last heartbeat = 13835050774238853472, reset request time = 4617690025151166504.
Tue Sep 24 11:01:07.246  Port type = stonith status = Enabled CRSMS_LIVE:3, command fail time = 0 last response time = 0.
Tue Sep 24 11:01:07.246  Min net delay = 250 max net delay = 500.
Tue Sep 24 11:01:07.246  Maximum safe network failure detection interval = 300000.
Tue Sep 24 11:01:07.246  Maximum safe serial line failure detection interval = 15000.
Tue Sep 24 11:01:07.247  Power fail algorithm returning reset status = CRSRS_NORESPONSE:6.
Tue Sep 24 11:01:07.247  Pre-reset power fail test failed, issuing real reset.
Tue Sep 24 11:01:17.300  RESET FAILED.
Tue Sep 24 11:01:17.300  Reset returned with err = CI_CRSERR_NOTFOUND.
Tue Sep 24 11:01:17.301  Reset failed, running post-reset power fail test.
Tue Sep 24 11:01:17.301  Powerfail (node HA2) called with status = CRSRS_NORESPONSE:6, last heartbeat = 13835050774238853472, reset request time = 4617690025151166504.
Tue Sep 24 11:01:17.301  Port type = stonith status = Enabled CRSMS_LIVE:3, command fail time = 0 last response time = 0.
Tue Sep 24 11:01:17.301  Min net delay = 250 max net delay = 500.
Tue Sep 24 11:01:17.301  Maximum safe network failure detection interval = 300000.
Tue Sep 24 11:01:17.301  Maximum safe serial line failure detection interval = 15000.
Tue Sep 24 11:01:17.302  Power fail algorithm returning reset status = CRSRS_NORESPONSE:6.
Tue Sep 24 11:01:17.302  Post-reset power fail test failed.
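
For anyone skimming the log above, the entries that carry a symbolic status or error code can be pulled out mechanically. The following is only a minimal Python sketch, not part of FailSafe: the function names `parse_entries` and `find_codes` are hypothetical, and the timestamp and code patterns are assumptions inferred from the lines pasted in this message rather than from any documented crsd log format.

```python
import re

# Assumed timestamp shape, taken from the pasted log: "Tue Sep 24 11:01:07.210".
TS = re.compile(
    r"^\s*([A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3})\s*(.*)$"
)
# Assumed shape of symbolic codes, e.g. CRSRS_NORESPONSE:6 or CI_CRSERR_NOTFOUND.
CODE = re.compile(r"\b(?:CRS|CI_)[A-Z_]+(?::\d+)?")


def parse_entries(text):
    """Group each timestamp with its message, including wrapped continuation lines."""
    entries = []
    for line in text.splitlines():
        match = TS.match(line)
        if match:
            # A new entry starts; the message may be on this line or the next one.
            entries.append([match.group(1), match.group(2).strip()])
        elif entries and line.strip():
            # Continuation line: append it to the most recent entry.
            entries[-1][1] = (entries[-1][1] + " " + line.strip()).strip()
    return [tuple(entry) for entry in entries]


def find_codes(entries):
    """Return (timestamp, [codes]) for every entry that mentions a symbolic code."""
    return [(ts, CODE.findall(msg)) for ts, msg in entries if CODE.search(msg)]


# A few lines copied from the log in this message.
sample = """Tue Sep 24 11:01:07.247
 Power fail algorithm returning reset status = CRSRS_NORESPONSE:6.
Tue Sep 24 11:01:17.300
RESET FAILED.
Tue Sep 24 11:01:17.300
Reset returned with err = CI_CRSERR_NOTFOUND.
"""

if __name__ == "__main__":
    for ts, codes in find_codes(parse_entries(sample)):
        print(ts, codes)
```

The sketch only extracts the codes; it does not interpret them, since what CRSRS_NORESPONSE and CI_CRSERR_NOTFOUND mean is exactly the open question here.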


Any idea what is going wrong?

Thanks,

Tabitha


