[Linux-HA] New problem(s) with heartbeat 2.0.3 and STONITH
Sun Jiang Dong
hasjd at cn.ibm.com
Fri Oct 28 04:15:38 MDT 2005
Alan Robertson wrote:
> Stefan Peinkofer wrote:
>
>> Hello everybody,
>>
>> unfortunately I have new problems with the heartbeat 2.0.3 cvs version
>> and stonith.
>>
>> I ran a cvs heartbeat which was checked out on 2005-10-18 and
>> encountered a problem with stonithd which was killed by signal 11.
>> The effects were that the stonith resources were NOT_ACTIVE and when I
>> initiated a split brain no node could fence the other off.
>>
>> I thought maybe it was already fixed in cvs and checked out a version today
>> (2005-10-26). But unfortunately this version seems to contain an even
>> worse problem with stonith.
>>
>> After I started heartbeat on the two nodes and waited until it had started
>> up completely, I initiated the split brain situation. I had expected this
>> to work because both stonith resources were active.
>>
>> In the logs I saw:
>> Oct 26 17:30:53 spock pengine: [20031]: WARN: mask(stages.c:stage6):
>> Scheduling Node sarek for STONITH
>> That's what I want :)
>> But then the following message appeared:
>> Oct 26 17:31:03 spock tengine: [20030]: ERROR: stonithd_node_fence:
>> cannot add field to ha_msg.
>
>
> This is some kind of an issue in the lib/fencing/stonithd_lib.c file
>
> if ( (ha_msg_add_int(request, F_STONITHD_OPTYPE, op->optype) != HA_OK)
>   || (ha_msg_add(request, F_STONITHD_NODE, op->node_name) != HA_OK)
>   || (op->node_uuid == NULL
>       || ha_msg_add(request, F_STONITHD_NODE_UUID, op->node_uuid) != HA_OK)
>   || (op->private_data == NULL
>       || ha_msg_add(request, F_STONITHD_PDATA, op->private_data) != HA_OK)
>   || (ha_msg_add_int(request, F_STONITHD_TIMEOUT, op->timeout) != HA_OK) ) {
>         stdlib_log(LOG_ERR, "stonithd_node_fence: "
>                    "cannot add field to ha_msg.");
>         ZAPMSG(request);
>         return ST_FAIL;
> }
>
> My guess is that op->node_name or op->optype is NULL. The code should
> have validated those. Since they're critical, and they come from
> who-knows-where (meaning some doofus user process), they should
> definitely have been error checked, and there should be a clear message
> about their errors.
>
The cause should be op->private_data == NULL. Treating that as an error is not reasonable.
I'll fix it.
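
A minimal sketch of the kind of check discussed here, not the actual
patch. It assumes node_name and node_uuid are required and private_data
is optional. The mock_msg and mock_op types, the mock_add helpers, the
field names and build_fence_request are hypothetical stand-ins so the
example compiles on its own; they are not the heartbeat API. The integer
fields are left out for brevity.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the heartbeat constants and types. */
#define HA_OK      1
#define HA_FAIL    0
#define ST_OK      0
#define ST_FAIL    (-1)
#define MAX_FIELDS 8

struct mock_msg {
    int nfields;
    const char *names[MAX_FIELDS];
    const char *values[MAX_FIELDS];
};

struct mock_op {
    int optype;
    int timeout;
    const char *node_name;     /* assumed required */
    const char *node_uuid;     /* assumed required */
    const char *private_data;  /* assumed optional, may be NULL */
};

/* Mock "add string field": fails on a NULL value or a full message. */
static int mock_add(struct mock_msg *m, const char *name, const char *value)
{
    assert(m != NULL && name != NULL);
    if (value == NULL || m->nfields >= MAX_FIELDS)
        return HA_FAIL;
    m->names[m->nfields] = name;
    m->values[m->nfields] = value;
    m->nfields++;
    return HA_OK;
}

/* Optional field: a NULL value is skipped instead of being an error. */
static int mock_add_optional(struct mock_msg *m, const char *name,
                             const char *value)
{
    return value == NULL ? HA_OK : mock_add(m, name, value);
}

static int build_fence_request(struct mock_msg *request,
                               const struct mock_op *op)
{
    /* Validate required inputs first, with a message that names them. */
    if (request == NULL || op == NULL
        || op->node_name == NULL || op->node_uuid == NULL) {
        fprintf(stderr, "build_fence_request: "
                "node_name and node_uuid are required.\n");
        return ST_FAIL;
    }

    if (mock_add(request, "node", op->node_name) != HA_OK
        || mock_add(request, "node_uuid", op->node_uuid) != HA_OK
        || mock_add_optional(request, "pdata", op->private_data) != HA_OK) {
        fprintf(stderr, "build_fence_request: "
                "cannot add field to message.\n");
        return ST_FAIL;
    }
    return ST_OK;
}
```

With this shape a request whose private_data is NULL still goes out,
while a missing node name is rejected before anything is added to the
message.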
> Things I don't quite understand...
> UUIDs are normally special portable binary values with their own type in
> the structure world... Having this be a string violates the law of
> least surprise. If they're not really uuids, then they shouldn't be
> CALLED uuids.
There is a long story regarding this; it's required by Andrew.
>
> Normally private_data is also binary. If either of these is actually
> binary, then this would also be wrong. Having them be strings violates
> the law of least surprise... So, as a design element, it's odd to have
> them not be binary blobs. Of course, sending the private data as binary
> would cause its own problems with portability.
Yes.
>
> But, renaming it to private_string_data or something would alleviate the
> confusion, and make it clearer.
It makes sense; I'll rename it.
>
>
--
BRs,
Sun Jiang Dong