[Linux-ha-dev] errors with 76c25be5c854

Alan Robertson alanr at unix.sh
Thu Dec 6 08:33:41 MST 2007


Tadashiro Yoshida wrote:
> Hi,
> 
> We detected some errors in the ComponentFail test while running CTS with a dev version. It might be a problem with CTS's message handling.
> Please check whether something is wrong.
> 
> Dev version: 76c25be5c854
> # python CTSlab.py -v2 -r -c
>   --facility local7 -L /var/log/ha-log-local7 500 2>&1 | tee cts.log
> 
> -----------
> Nov 26 19:22:09 x3650a CTS: Running test ComponentFail (x3650b) [16]
> Nov 26 19:22:10 x3650b heartbeat: [27967]: WARN:
>                 Managed /usr/lib64/heartbeat/stonithd process 27980
>                 killed by signal 9 [SIGKILL - Kill, unblockable].
> Nov 26 19:22:10 x3650b heartbeat: [27967]: ERROR:
>                 Respawning client "/usr/lib64/heartbeat/stonithd":
> Nov 26 19:22:10 x3650b heartbeat: [27967]: info:
>                 Starting child client "/usr/lib64/heartbeat/stonithd"(0,0)
> Nov 26 19:22:10 x3650b stonithd: [30753]: notice:
>                 /usr/lib64/heartbeat/stonithd start up successfully.
>    :
> Nov 26 19:32:41 x3650a CTS: Patterns not found:
>                 ['x3650c crmd:.*LOST:.* x3650b ',
>                  'Updating node state to member for x3650b']
> Nov 26 19:32:41 x3650a CTS: Test ComponentFail failed
>                 [reason:Didn't find all expected patterns]
> Nov 26 19:32:41 x3650a CTS: Test ComponentFail (x3650b) [FAILED]
> -----------

I think this was a pattern problem in the messages-to-ignore, which I 
believe is now fixed.
http://hg.linux-ha.org/dev/rev/e4a4c6fd5649
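
For readers unfamiliar with this kind of log checking, here is a minimal
sketch of the idea in Python. It is not the actual CTS code: the function
names missing_patterns and bad_news are hypothetical, and the only things
taken from the report above are the two expected patterns and the style of
the log lines. It shows how an expected regex can go unmatched, and how an
ignore list suppresses an error line that the test itself provokes.

```python
import re

def missing_patterns(expected, log_lines):
    """Return the expected regexes that match no line in log_lines."""
    return [p for p in expected
            if not any(re.search(p, line) for line in log_lines)]

def bad_news(log_lines, bad_patterns, ignore_patterns):
    """Return lines that match a bad pattern and no ignore pattern."""
    bad = [re.compile(p) for p in bad_patterns]
    ignore = [re.compile(p) for p in ignore_patterns]
    return [line for line in log_lines
            if any(rx.search(line) for rx in bad)
            and not any(rx.search(line) for rx in ignore)]

# Log lines in the style of the report above (joined onto single lines).
log = [
    'Nov 26 19:22:10 x3650b heartbeat: [27967]: ERROR: '
    'Respawning client "/usr/lib64/heartbeat/stonithd":',
    'Nov 26 19:22:10 x3650b stonithd: [30753]: notice: '
    '/usr/lib64/heartbeat/stonithd start up successfully.',
]

expected = ['x3650c crmd:.*LOST:.* x3650b ',
            'Updating node state to member for x3650b']

not_found = missing_patterns(expected, log)
reported = bad_news(log, [r'ERROR:'], [r'Respawning client .*stonithd'])
```

With these inputs neither expected pattern is found, which corresponds to
the "Patterns not found" message, while the respawn ERROR line is filtered
out by the ignore pattern. The ignore regex here is an assumption chosen
for illustration, not the one in the changeset above.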

> 
> Besides, it seems there are some failures in the stonithd testing, although the final message says it succeeded.
> 
> -----------
> Nov 26 19:54:11 x3650a CTS: BadNews: Nov 26 19:46:26 x3650b stonithd: [26162]:
>   CRIT: command ssh -q -x -n -l root "x3650c" "echo 'sleep 2;
>   /sbin/reboot -nf' | SHELL=/bin/sh at now >/dev/null 2>&1" failed
> Nov 26 19:54:11 x3650a CTS: BadNews: Nov 26 19:46:53 x3650b stonithd: [23258]: 
>   ERROR: Failed to STONITH the node x3650c: optype=RESET,
>   op_result=TIMEOUT
> Nov 26 19:54:11 x3650a CTS: BadNews: Nov 26 19:46:53 x3650b tengine: [26116]:
>   ERROR: tengine_stonith_callback: Stonith of x3650c failed (2)...
>   aborting transition.
> -----------

I need to change my testing setup to look at this.  I'd heard a rumor 
that this was happening, but it wasn't happening to me, and no bugzilla 
was filed.  But, I'm pretty sure it's an indication of a fault in the 
stonith ssh module.

I changed the code to fail-fast, which is vastly safer when you don't 
have STONITH available, and not harmful when you have real STONITH. 
However, if the ssh STONITH module can't connect to the machine it will 
show a failure like this one.  So, I think the thing to do is to figure 
out how to report success in this case - in the testing STONITH module.
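
As a rough illustration of that idea (and only that; the real ssh STONITH
plugin is not written this way), the sketch below shows a test-only reset
helper that treats "could not connect" as success. The name ssh_reset, the
injectable run parameter, and the 10-second timeout are all assumptions
made for this example. It relies on ssh's documented behaviour of exiting
with status 255 when ssh itself fails, which cannot be distinguished from
a remote command that happens to exit 255.

```python
import subprocess

# Remote command taken from the BadNews log line quoted above.
REBOOT_CMD = ("echo 'sleep 2; /sbin/reboot -nf' | "
              "SHELL=/bin/sh at now >/dev/null 2>&1")

def ssh_reset(host, run=subprocess.run, timeout=10):
    """Hypothetical test-only reset: an unreachable host counts as success.

    `run` is injectable so the logic can be exercised without a network.
    This is only reasonable for a testing module, never for real fencing.
    """
    cmd = ["ssh", "-q", "-x", "-n", "-l", "root", host, REBOOT_CMD]
    try:
        result = run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        # No answer within the timeout: assume the node is already down.
        return True
    if result.returncode == 255:
        # ssh reports its own failures (e.g. cannot connect) as 255.
        return True
    # Otherwise the exit status is that of the remote command.
    return result.returncode == 0

class FakeResult:
    """Stand-in for subprocess.CompletedProcess in the examples below."""
    def __init__(self, returncode):
        self.returncode = returncode

def fake_timeout(cmd, timeout):
    raise subprocess.TimeoutExpired(cmd, timeout)

ok = ssh_reset("x3650c", run=lambda cmd, timeout: FakeResult(0))
unreachable = ssh_reset("x3650c", run=lambda cmd, timeout: FakeResult(255))
remote_failed = ssh_reset("x3650c", run=lambda cmd, timeout: FakeResult(1))
timed_out = ssh_reset("x3650c", run=fake_timeout)
```

Only a genuine failure of the remote command is reported as a failed
reset; connection failures and timeouts are reported as success, which is
the behaviour proposed above for the testing module.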


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce

