[LinuxFailSafe] database inconsistent after restart

Martin Bene martin.bene@icomedias.com
Thu, 1 Aug 2002 18:10:31 +0200


Hi,

I'm seeing some quite strange effects with my (1.04) failsafe cluster after a
complete shutdown (UPS failure).

After starting up both systems, cdbd on the 2nd node started to receive
database update(s), but never finished:

Thu Aug  1 13:39:34.653 webc2 cdbd  - Checking quorum with 2 members for any unknown members.
Thu Aug  1 13:39:34.653 webc2 cdbd  - All quorum member machines are known to us.
Thu Aug  1 13:39:34.653 webc2 cdbd  - Starting to receive CDB sync series from machine 1
Thu Aug  1 13:56:05.849 webc2 cdbd  - terminating on signal 15 (pid 1436)

(I gave up after 15 minutes, and shut down node webc2).
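
For anyone who wants to check their own cdbd log for the same symptom, here is
a minimal Python sketch that reports sync series which were started but never
reported as finished. It assumes each log entry sits on a single line and uses
exactly the message texts shown above; the function name unfinished_syncs is
made up for illustration and is not part of FailSafe.

```python
import re

# Message prefixes as they appear in the cdbd log excerpts in this post.
START = "Starting to receive CDB sync series"
FINISH = "Finished receiving CDB sync series"

# Assumed layout: "<weekday> <month> <day> <time> <host> cdbd  - <message>"
LINE = re.compile(
    r"^(?P<stamp>\w{3} \w{3}\s+\d+ [\d:.]+) (?P<host>\S+) cdbd\s+- (?P<msg>.*)$"
)


def unfinished_syncs(lines):
    """Return (timestamp, host) for every sync series with no matching finish."""
    open_sync = {}   # host -> timestamp of the sync series currently in progress
    unfinished = []
    for raw in lines:
        m = LINE.match(raw.strip())
        if not m:
            continue  # not a cdbd entry in the assumed layout
        host, msg = m["host"], m["msg"]
        if msg.startswith(START):
            if host in open_sync:
                # A new series began before the previous one finished.
                unfinished.append((open_sync[host], host))
            open_sync[host] = m["stamp"]
        elif msg.startswith(FINISH):
            # A finish without a recorded start is ignored.
            open_sync.pop(host, None)
    # Anything still open at the end of the log never finished.
    unfinished.extend((stamp, host) for host, stamp in open_sync.items())
    return unfinished
```

Feeding it the four log lines above should flag the sync series that began at
13:39:34.653 on webc2, since the next entry is the termination on signal 15
rather than a "Finished receiving" message.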

I started the cluster using just node1, started failsafe services on node1
and now have all services running on node1.

Next, I started up node2:

Thu Aug  1 17:44:18.628 webc2 cdbd  - Local CDB obsolete for new quorum: cluster id: 0x00000000.0x3cd80f3a274e58f0, master: 1, sequence: 57, member count: 2, members:  2,  1
Thu Aug  1 17:44:47.178 webc2 cdbd  - Finished receiving CDB sync series from machine 1
Thu Aug  1 17:44:47.178 webc2 cdbd  - Starting to receive CDB sync series from machine 1
Thu Aug  1 17:44:47.186 webc2 cdbd  - Finished receiving CDB sync series from machine 1
Thu Aug  1 17:44:47.186 webc2 cdbd  - Checking quorum with 2 members for any unknown members.
Thu Aug  1 17:44:47.186 webc2 cdbd  - All quorum member machines are known to us.
Thu Aug  1 17:44:47.229 webc2 cdbd  - New quorum: cluster id: 0x00000000.0x3cd80f3a274e58f0, master: 1, sequence: 57, member count: 2, members:  2,  1
Thu Aug  1 17:44:47.229 webc2 cdbd  - New quorum: cluster id: 0x00000000.0x3cd80f3a274e58f0, master: 1, sequence: 57, member count: 2, members:  2,  1
Thu Aug  1 17:44:47.233 webc2 cdbd  - CDB on node webc2 (2) marked current in fs2proc_machine_register_1

As you can see, node2 reported that the db replication finished OK, so node2
should now have the same view of what's going on in the cluster as node1
has, shouldn't it? Fact is: it doesn't.

When running the GUI connected to node1, I see the cluster in status
"Normal", node1 as "UP", node2 as "inactive" and both resource groups
running & online on node1.

When connecting the GUI to node2, I see node1 as "unknown", node2 as
"inactive" and both resource groups in "online ready" status.

Node2 showing as "inactive" is correct, since I haven't started HA services
on that node yet. However, I'm not at all sure that actually doing so would
be a good idea, given the mismatched status information I get from the two
nodes.

Any idea what's going wrong here?

Thanks, Martin