[LinuxFailSafe] database inconsistent after restart
Martin Bene
martin.bene@icomedias.com
Thu, 1 Aug 2002 18:10:31 +0200
Hi,
I'm seeing some quite strange effects with my FailSafe (1.04) cluster after
a complete shutdown (UPS failure).
After both systems were started up, cdbd on the second node started to
receive database update(s), but never finished:
Thu Aug 1 13:39:34.653 webc2 cdbd - Checking quorum with 2 members for any unknown members.
Thu Aug 1 13:39:34.653 webc2 cdbd - All quorum member machines are known to us.
Thu Aug 1 13:39:34.653 webc2 cdbd - Starting to receive CDB sync series from machine 1
Thu Aug 1 13:56:05.849 webc2 cdbd - terminating on signal 15 (pid 1436)
(I gave up after 15 minutes, and shut down node webc2).
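For anyone wanting to check their own logs for the same symptom, here is a
minimal sketch in Python (not part of FailSafe) that looks for a CDB sync
series that was started but never finished. The regular expressions assume
the cdbd log line format quoted above, and the function name
unfinished_syncs is purely illustrative.

```python
import re

# Assumed line format, taken from the cdbd log lines quoted in this post:
# "Thu Aug 1 13:39:34.653 webc2 cdbd - <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} \w{3} +\d+ [\d:.]+) (?P<host>\S+) cdbd - (?P<msg>.*)$"
)
START_RE = re.compile(r"^Starting to receive CDB sync series from machine (\d+)")
FINISH_RE = re.compile(r"^Finished receiving CDB sync series from machine (\d+)")


def unfinished_syncs(lines):
    """Return (timestamp, host, machine) for each sync start with no later finish."""
    pending = {}  # (host, source machine) -> timestamps of unmatched starts
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a cdbd line in the assumed format
        host, msg = m.group("host"), m.group("msg")
        start = START_RE.match(msg)
        finish = FINISH_RE.match(msg)
        if start:
            pending.setdefault((host, start.group(1)), []).append(m.group("stamp"))
        elif finish:
            starts = pending.get((host, finish.group(1)))
            if starts:
                starts.pop()  # pair this finish with the most recent start
            # a finish with no recorded start (log excerpt cut off) is ignored
    return [
        (stamp, host, machine)
        for (host, machine), stamps in pending.items()
        for stamp in stamps
    ]


if __name__ == "__main__":
    import sys

    # Usage: python check_sync.py < logfile
    for stamp, host, machine in unfinished_syncs(sys.stdin):
        print(f"{host}: sync from machine {machine} started {stamp}, never finished")
```

Fed the four log lines above, this should flag the sync that started at
13:39:34.653 on webc2; a finish message without a matching start is simply
skipped.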
I then started the cluster using just node1, started FailSafe services on
node1, and now have all services running on node1.
Now, starting up node2:
Thu Aug 1 17:44:18.628 webc2 cdbd - Local CDB obsolete for new quorum: cluster id: 0x00000000.0x3cd80f3a274e58f0, master: 1, sequence: 57, member count: 2, members: 2, 1
Thu Aug 1 17:44:47.178 webc2 cdbd - Finished receiving CDB sync series from machine 1
Thu Aug 1 17:44:47.178 webc2 cdbd - Starting to receive CDB sync series from machine 1
Thu Aug 1 17:44:47.186 webc2 cdbd - Finished receiving CDB sync series from machine 1
Thu Aug 1 17:44:47.186 webc2 cdbd - Checking quorum with 2 members for any unknown members.
Thu Aug 1 17:44:47.186 webc2 cdbd - All quorum member machines are known to us.
Thu Aug 1 17:44:47.229 webc2 cdbd - New quorum: cluster id: 0x00000000.0x3cd80f3a274e58f0, master: 1, sequence: 57, member count: 2, members: 2, 1
Thu Aug 1 17:44:47.229 webc2 cdbd - New quorum: cluster id: 0x00000000.0x3cd80f3a274e58f0, master: 1, sequence: 57, member count: 2, members: 2, 1
Thu Aug 1 17:44:47.233 webc2 cdbd - CDB on node webc2 (2) marked current in fs2proc_machine_register_1
As you can see, node2 reported that the database replication finished OK;
node2 should now have the same view of what's going on in the cluster as
node1 has, shouldn't it? Fact is: it doesn't.
When running the GUI connected to node1, I see the cluster in status
"Normal", node1 as "UP", node2 as "inactive" and both resource groups
running & online on node1.
Connecting the GUI to node2, I see node1 as "unknown", node2 as "inactive"
and both resource groups in "online ready" status.
Node2 as inactive is correct - I haven't started HA services on that node
yet, but I'm not at all sure that actually doing so would be a good idea
given the mismatched status information I get from the two nodes.
Any idea what's going wrong here?
Thanks, Martin