[Linux-HA] Resource becomes unmanaged when trying to reboot server
Tobias Appel
tappel at eso.org
Fri Jan 9 02:34:22 MST 2009
On Mon, 2008-12-22 at 19:42 +0100, Dejan Muhamedagic wrote:
> Hi,
>
> On Mon, Dec 22, 2008 at 12:18:06PM +0100, Tobias Appel wrote:
> > Hi,
> >
> > sorry to bug you guys again before Christmas but I have a very weird
> > error.
> > I have a 2 node setup with drbd and Heartbeat 2.14. One resource group
> > which contains Nagios (something like BigBrother).
> >
> > Now I configured everything and did some tests with starting and stopping
> > the heartbeat service on the servers - the failover worked.
> >
> > But if I run 'shutdown -r now' on the active node the server will not
> > reboot and the resource group will not be moved to the passive node.
> > When I run crm_mon I can see:
> > nagios-core (lsb:nagios): Started node01 (unmanaged) FAILED
> >
> > The server will do nothing then. It will not reboot, the rest of the
> > resource group is still running! The log file from nagios tells me it
> > shut down correctly. I did browse through the big big ha-log but I
> > couldn't find anything that would help me.
> >
> > pengine[27246]: 2008/12/22_11:47:11 WARN: unpack_rsc_op: Processing
> > failed op nagios-core_stop_0 on node01: Error
> >
> > I really have no idea what to look for or what to do.
>
> A resource failed to stop. That's typically a reason to kill the
> node, but you probably don't have stonith set up. If a resource
> can't be stopped and there's no stonith enabled, then that
> resource can't be started anywhere.
>
> Thanks,
>
> Dejan
Hi,
and happy new year everybody - just came back from holiday.
You are right, I don't have stonith enabled because I don't really
understand it fully yet. I know what it means and what it should do, but
I thought it only works as fencing in conjunction with a UPS or
fibre-channel switch device.
It is correct that the problem is that the resource cannot be stopped -
or at least the CRM thinks it cannot be stopped. I had the same problem
with the RedHat Cluster Software on the same server - it also could not
stop the nagios resource and the cluster was in a failed state.
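Since nagios itself logs a clean shutdown, one thing worth checking is
whether the init script reports that stop with the exit codes the cluster
expects. A rough sketch of such a check follows. It assumes the lsb:nagios
resource maps to /etc/init.d/nagios (the path is an assumption, and the
helper names are made up for illustration), and it relies on the LSB
convention that 'stop' exits 0, also when the service is already stopped,
and that 'status' exits 3 when the service is not running. Note that it
really stops nagios, so it should only be run while the cluster is not
managing the resource.

```python
import subprocess

# Assumed path: an lsb:nagios resource normally maps to this init script.
INIT_SCRIPT = "/etc/init.d/nagios"


def run_action(action, script=INIT_SCRIPT):
    """Run one init-script action and return its exit code."""
    return subprocess.call([script, action])


def check_stop_compliance(run=run_action):
    """Return a list of problems found with the stop/status exit codes."""
    problems = []
    # 'stop' has to exit 0 when the service was stopped successfully.
    if run("stop") != 0:
        problems.append("stop returned non-zero")
    # 'status' on a stopped service should exit 3 (not running).
    if run("status") != 3:
        problems.append("status did not return 3 after stop")
    # Stopping an already stopped service still has to count as success.
    if run("stop") != 0:
        problems.append("second stop returned non-zero")
    return problems
```

Calling check_stop_compliance() as root on the node returns an empty list
if the three checks pass. Any entry in the list points at an exit code that
the cluster would probably treat as a failed stop.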
Now what you are saying is that stonith would be my solution. When I
turn off one cluster node and the resource goes into an unmanaged state,
the other node could declare it as dead and go online?
Can anyone please point me to a stonith how-to which is not based on a
UPS or something like this? I also can't find much about it in the book
from Dr. Schwartzkopff :(
This would be really helpful.
Thanks in advance,
Tobias