[Linux-ha-dev] Is this a problem of heartbeat not delivering client notifications in some cases ?

Andrew Beekhof beekhof at gmail.com
Fri Jun 8 03:01:44 MDT 2007


On 6/6/07, kisalay <kisalay at gmail.com> wrote:
> Hi,
>
>  I am running 2.0.8 linux-ha on a 2 node system.
>  I ran into a problem while failover from one node to another.
>
>  The sequence of actions:
>  1. The two nodes are running on unequal h/w. int01 is running on dell-2950
> with single cpu 4 gb ram. int02 is running on dell-2950 with dual cpu and
> 8gb ram.
>  2. The node int02 was initially active and int01 was standby.

as in just not running any resources, or really in standby mode, where
it's not allowed to run resources?

>  3. In my setup, whenever the failover happens, the node taking over
> restarts the earlier active heartbeat to make it forget all the failcounts.
> This is done in the start scripts of the first resource.
>  4. Due to resource failovers ( process kills ), the failover happens from
> int02 to int01.
>  5. When int01 becomes active, int02 is still the dc and is running pengine
> and tengine.
>  6. When int01 starts its resources, the restart is issued to the heartbeat
> on the int02.

automatically or manually?

>  7. When the int02 heartbeat is shutting down, following logs are seen:
>  May 27 17:02:47 indica-int02 cib: [1686]: info: cib_shutdown: Disconnected
> 0 clients
>  May 27 17:02:47 indica-int02 cib: [1686]: info: cib_process_disconnect: All
> clients disconnected...
>  May 27 17:02:47 indica-int02 cib: [1686]: info: initiate_exit: Sending
> disconnect notification to 2 peers...
>  May 27 17:02:52 indica-int02 cib: [1686]: notice: cib_force_exit: Forcing
> exit!

this means that when it sent out a message saying "i'm outta here",
it didn't get a response from its peer before giving up (about 5
seconds later, going by the timestamps) - which is odd but not
necessarily a problem

>  May 27 17:02:52 indica-int02 cib: [1686]: info: terminate_ha_connection:
> cib_force_exit: Disconnecting heartbeat
>  May 27 17:02:52 indica-int02 cib: [1686]: info: cib_ha_connection_destroy:
> Heartbeat disconnection complete... exiting
>  May 27 17:02:52 indica-int02 cib: [1686]: info: uninitializeCib: The CIB
> has been deallocated.
>
>  This suggests that cib force exited on int02. Is it because of an anomaly?
>
>  8. When int02 comes up, the following logs are seen:
>  May 27 17:04:01 indica-int02 crmd: [21363]: info: do_state_transition:
> indica-int02.pune.nevisnetworks.com: State transition
> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE
> origin=route_message ]
>  May 27 17:04:01 indica-int02 cib: [21735]: info: write_cib_contents: Wrote
> version 0.12.441 of the CIB to disk (digest:
> f1cf5bc300318744927e2fa7c6a48d75)
>  May 27 17:04:03 indica-int02 cib: [21359]: WARN: cib_peer_callback:
> Discarding cib_shutdown_req message (529d0) from
> indica-int01.pune.nevisnetworks.com: not in our membership
>  May 27 17:04:08 indica-int02 cib: [21359]: WARN: cib_peer_callback:
> Discarding cib_update message (529dd) from
> indica-int01.pune.nevisnetworks.com: not in our membership
>
>  also at similar time, the int01 logs say:
>  May 27 17:04:15 indica-int01 crmd: [32642]: info: ccm_event_detail: NEW
> MEMBERSHIP: trans=13, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
>  May 27 17:04:15 indica-int01 crmd: [32642]: info: ccm_event_detail:
> CURRENT: indica-int01.pune.nevisnetworks.com [nodeid=0,
> born=13]
>  May 27 17:04:15 indica-int01 cib: [32638]: info: cib_diff_notify: Update
> (client: 21363, call:23): 0.12.427 -> 0.12.428 (ok)
>  May 27 17:04:15 indica-int01 crmd: [32642]: info: ccm_event_detail:
> LOST:    indica-int02.pune.nevisnetworks.com [nodeid=1,
> born=10]
>  May 27 17:04:15 indica-int01 crmd: [32642]: info: do_election_check: Still
> waiting on 1 non-votes (1 total)
>
>  So on one hand, int02 fails to recognise int01 as being in the cluster,
> and on the other hand, int01 reports that int02 is offline.

looks to me like the restart of heartbeat is happening too fast for
the CCM to handle.
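
if you want to see how often the cib throws away peer messages like
this, here is a rough Python sketch that pulls those warnings out of a
log. the function name and the regex are only my illustration, based on
nothing more than the message format in your logs above; it expects the
raw log text (not the quoted, ">"-prefixed copy in this mail) and only
undoes plain line wrapping.

```python
import re

# Assumption: the message format matches the WARN lines quoted in this
# thread, e.g. "Discarding cib_update message (529dd) from <host>: not in
# our membership". Other heartbeat versions may word this differently.
DISCARD = re.compile(
    r"Discarding (\S+) message \((\w+)\) from\s+([^\s:]+): not in our membership"
)


def find_discards(log_text):
    """Return (message_type, sender) pairs for every discarded peer message."""
    # Collapse all whitespace so entries wrapped over several lines still match.
    flat = " ".join(log_text.split())
    return [(m.group(1), m.group(3)) for m in DISCARD.finditer(flat)]
```

a burst of these right after a restart would fit the "restart is too
fast for the CCM" theory; it doesn't prove it.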

>
>  9. After this, int01 has the following logs:
>  May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail: NEW
> MEMBERSHIP: trans=15, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
>  May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail:
> CURRENT: indica-int02.pune.nevisnetworks.com [nodeid=1,
> born=1]
>  May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail:
> CURRENT: indica-int01.pune.nevisnetworks.com [nodeid=0,
> born=15]
>  May 27 17:04:19 indica-int01 crmd: [32642]: info: ccm_event_detail:
> NEW:     indica-int02.pune.nevisnetworks.com [nodeid=1,
> born=1]
>  May 27 17:04:19 indica-int01 crmd: [32642]: info: do_election_check: Still
> waiting on 2 non-votes (2 total)
>  May 27 17:04:19 indica-int01 crmd: [32642]: notice:
> crmd_ha_status_callback: Status update: Node
> indica-int02.pune.nevisnetworks.com now has status [init]
>
>  10. The resources have failed over to int01, but int02 fails to
> recognise int01.
>  The pengine and tengine are restarted on int02 and they try to restart the
> resources on int02 as well, while they are still running on int01.

classic split-brain behavior i'm afraid :-(
the biggest problem being that there is no reason for the CCM to think
this is a split-brain

>  I went through the release notes which says:
>  - When running a cluster of nodes of very different speeds temporary
>  membership anomalies may occasionally be seen. These correct
>  themselves and don't appear to be harmful. They typically
>  include a message something like this:
>  WARN: Ignoring HA message (op=vote) from XXX: not in our
>  membership list
>
>  and also through the description of bug 1367.  Is the problem I saw somehow
> related to these issues already reported? If so.. then is there any
> deterministic way of avoiding/reproducing the issue?

not that i'm happy about saying this, but maybe try sticking a "sleep
30" between when you stop and start heartbeat.  that should give the
CCM time to sort itself out before the node comes back again.
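
a minimal sketch of that stop / pause / start sequence, in Python. the
init script path, the function name and the injectable run/sleep
parameters are assumptions for illustration, not something taken from
your setup. it only restarts heartbeat on the machine it runs on, so how
the command reaches the other node is left out.

```python
import subprocess
import time

# Assumption: heartbeat is controlled through a SysV init script at this
# path. Adjust it to whatever your distribution actually installs.
HEARTBEAT_INIT = "/etc/init.d/heartbeat"


def restart_with_pause(pause_seconds=30, run=subprocess.check_call, sleep=time.sleep):
    """Stop heartbeat, wait pause_seconds, then start it again.

    run and sleep are parameters only so the sequence can be exercised
    without touching a real cluster node.
    """
    run([HEARTBEAT_INIT, "stop"])    # raises CalledProcessError if stop fails
    sleep(pause_seconds)             # give the CCM time to settle
    run([HEARTBEAT_INIT, "start"])


# Usage on the node being restarted (needs root):
#     restart_with_pause(30)
```

the 30 seconds is a guess, not a tuned value.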

and out of interest, why are you restarting heartbeat?

