[Linux-HA] Heartbeat cannot stop
beekhof at gmail.com
Thu Nov 15 02:39:06 MST 2007
On Nov 15, 2007, at 6:22 AM, Junko IKEDA wrote:
>> The root cause seems to be that heartbeat is not providing client
>> status messages (to say that the crmd processes are active) once the
>> split-brain heals.
>> crmd: 2007/11/08_10:38:43 info: join_make_offer: Peer process dl380g5c is not active (yet?)
>> crmd: 2007/11/08_10:40:11 WARN: do_state_transition: Only 1 of 2 cluster nodes are eligible to run resources - continue 0
>> Because of this, the crm doesn't consider dl380g5c online and the PE
>> can't shut it down.
>> I think you need to file a bug for Alan about this.
> I found a similar case.
> While recovering from a split brain,
> one node could not join the membership in the end.
> crmd: 2007/11/15_14:04:11 debug: crmd_ha_msg_callback: Ignoring HA message (op=noop) from prec370d: not in our membership list (size=1)
According to the ccm on prec370e, prec370d really isn't part of the
cluster... what does the other node think?
This looks like some sort of communications or ccm bug. If you attach the
logs from prec370d, it might be possible to say which.
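In the meantime, a rough way to compare what each node thinks is to count the
"not in our membership list" messages in the logs from both sides. This is
only a sketch: the pattern is modelled on the crmd debug line quoted above,
other heartbeat versions may word it differently, and the function name is
made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the crmd debug line quoted in this thread.
# Assumption: the raw log has the message on a single line.
IGNORED = re.compile(
    r"Ignoring HA message \(op=(?P<op>\w+)\) from (?P<node>\S+): "
    r"not in our membership list \(size=(?P<size>\d+)\)"
)

def count_ignored_peers(lines):
    """Count, per sending node, the HA messages that crmd ignored."""
    counts = Counter()
    for line in lines:
        match = IGNORED.search(line)
        if match:
            counts[match.group("node")] += 1
    return counts

# Two lines taken from the logs quoted in this thread.
sample = [
    "crmd: 2007/11/15_14:04:11 debug: crmd_ha_msg_callback: Ignoring HA "
    "message (op=noop) from prec370d: not in our membership list (size=1)",
    "crmd: 2007/11/08_10:38:43 info: join_make_offer: Peer process "
    "dl380g5c is not active (yet?)",
]
print(count_ignored_peers(sample))
```

Running it over the logs from each node shows which peers that node is
ignoring; if prec370d is also ignoring prec370e, each side has dropped the
other from its membership.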
> and it keeps looping through state transitions,
> from S_FINALIZE_JOIN -> S_INTEGRATION to S_INTEGRATION ->
> and so on.
Yeah, given the circumstances (conflicting data from heartbeat and the
ccm), that is unfortunately to be expected.
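To make "conflicting data" concrete, here is a toy model of the disagreement.
It is not crmd's actual logic, just an illustration: the function name is
hypothetical, and I'm assuming the size=1 membership list on prec370e
contains only prec370e itself.

```python
def conflicting_nodes(heartbeat_peers, ccm_members):
    """Return the nodes that one layer reports and the other does not."""
    return set(heartbeat_peers) ^ set(ccm_members)

# View from prec370e, as far as the quoted log shows:
heartbeat_peers = {"prec370d", "prec370e"}  # HA messages from prec370d do arrive
ccm_members = {"prec370e"}                  # assumed content of the size=1 list

print(sorted(conflicting_nodes(heartbeat_peers, ccm_members)))
```

As long as that set is non-empty, the two layers disagree about who is in the
cluster, which is the situation described above.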
> Even worse, the system was rebooted for unexplained reasons...
> Message from syslogd at prec370d at Thu Nov 15 14:06:03 2007 ...
> prec370d heartbeat: : EMERG: Rebooting system. Reason:
That's Alan's new suicide code in action... you'll have to take its
existence up with him.
> I think crmd is not the underlying cause of this case...
> This case is hard to reproduce; it seems to be a matter of timing.
> The logs were very big, so filed them here;