[LinuxFailSafe] Question about membership_compute_and_validate:
Kashif Shaikh
kshaikh@consensys.com
20 Dec 2002 17:22:56 -0500
On Fri, 2002-12-13 at 18:54, Padmanabhan Sreenivasan wrote:
>
> There can be software failures. Due to bugs, packets can be
> sent but not received by a server.
Agreed, but I encountered something else. I recently set up a
4-node cluster to watch cms in action. When a transient cluster
partition (including reset network) occurs, resulting in a 2/2 split,
AND cms restarts (new incarnation) on one node on one side of the
partition, you get a weird but sensible distribution of recvsets:
0xC, 0xF (new inc), 0xB, 0xB. Packets are still rejected between nodes
after the cluster partition is gone; this lasts until the cmsds go
lonely and restart with a new inc.
The point here is to show everyone that my original reasoning was wrong,
and that it is possible to end up with different recvsets.
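
To make the recvset values above easier to read, here is a minimal
sketch that decodes each bitmask into node indices and checks whether
all nodes agree. It assumes bit i of a recvset stands for node i, and
that the four values are listed in node order; neither is stated above,
and the function names are made up for illustration, not taken from
the cms source.

```python
def decode_recvset(mask, num_nodes=4):
    """Return the node indices whose bit is set in mask.

    Assumption: bit i corresponds to node i (not confirmed by the post).
    """
    return [i for i in range(num_nodes) if mask & (1 << i)]


def recvsets_agree(recvsets):
    """True only if every node reported the same recvset."""
    return len(set(recvsets)) == 1


# The values reported in the post; the node order is an assumption.
observed = [0xC, 0xF, 0xB, 0xB]

for node, mask in enumerate(observed):
    print(f"node {node}: recvset {mask:#x} -> nodes {decode_recvset(mask)}")

print("all recvsets agree:", recvsets_agree(observed))
```

Under that bit mapping, 0xC covers nodes 2 and 3, 0xF covers all four
nodes, and 0xB covers nodes 0, 1 and 3, so the four views cannot be
reconciled without a restart.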
A couple of quirks in how failsafe confirms membership: it is kind of
odd that you can only have one posted membership waiting to be
confirmed while you have a proposed membership waiting in the queue.
Depending on how long the failstop enforcement process takes (max
default 60 seconds), a node which was declared down earlier could be UP
by now. A posted membership from time t1 should be dropped when a new
valid membership arrives at time t2.
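
The "drop the older posted membership" idea can be sketched as a
single-slot holder where a newer valid membership displaces the one
still waiting for confirmation. This is only an illustration of the
suggestion, not how failsafe works today; Membership, PostedSlot and
offer are hypothetical names, and membership validation is assumed to
have happened before offer() is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Membership:
    proposed_at: float  # time the membership was proposed (t1, t2, ...)
    members: int        # bitmask of nodes believed to be UP


class PostedSlot:
    """Holds at most one posted membership awaiting confirmation."""

    def __init__(self) -> None:
        self.posted: Optional[Membership] = None

    def offer(self, new: Membership) -> Optional[Membership]:
        """Post `new` and return whichever membership was dropped, if any.

        A membership that is not newer than the posted one is itself
        dropped; otherwise the older posted membership is dropped.
        """
        if self.posted is not None and new.proposed_at <= self.posted.proposed_at:
            return new  # stale proposal, keep what is already posted
        dropped = self.posted
        self.posted = new
        return dropped


slot = PostedSlot()
at_t1 = Membership(proposed_at=1.0, members=0xB)  # one node declared down
at_t2 = Membership(proposed_at=2.0, members=0xF)  # that node is UP again

slot.offer(at_t1)
dropped = slot.offer(at_t2)
print("dropped:", dropped)
print("posted: ", slot.posted)
```

With this rule, the membership posted at t1 never gets confirmed once
the t2 membership shows up, so a node that came back during failstop
enforcement is not confirmed as down.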
The way reset is handled is weird too: all nodes try to reset DOWN'd
nodes at the same time, which really isn't necessary since the local
crsd forwards reset requests to the appropriate remote crsd.
Kashif
>
> It is unlikely to see hardware causing such
> failures.
>
> Paddy
> >
> > Regards,
> >
> > Kashif Shaikh
> >
> > _______________________________________________
> > LinuxFailSafe mailing list
> > LinuxFailSafe@lists.community.tummy.com
> > http://lists.community.tummy.com/mailman/listinfo/linuxfailsafe