[Linux-HA] Feedback: Master/Slave RA for Postgres / Slony Cluster?
Dominik Klein
dk at in-telegence.net
Wed Nov 7 00:26:53 MST 2007
Hi Andrew,
thanks for your reply.
>> So I thought I could implement "demote" as "return 0", as "promote" on
>> the other machine will do the job anyway. Well, not the best idea as a
>> "monitor" action on the apparently demoted machine will still return
>> Master Status until "promote" on the second machine finished.
>
> What if the crm delayed the slave's monitor until after the other side
> was promoted... would that help significantly?
That would probably prevent one failed monitor action in this very
special case.
>> Furthermore, the switchover command will fail if the other machine is
>> not responding. In case the current master really has a problem, all
>> you can do to get a writable database on the current slave is to use the
>> failover command. But Linux-HA only knows "promote" and "demote".
>>
>> So I implemented promote and demote the following way:
>>
>> #### promote
>> if switchover_to_me
>> then
>>     # controlled switchover worked
>>     return 0
>> else
>>     # try once more; if it still fails, fall back to failover
>>     if ! switchover_to_me
>>     then
>>         failover_to_me
>>         return $?
>>     fi
>> fi
>> ####
>>
>> #### demote
>> switchover_to_other_machine
>> # don't care whether this works, as it cannot work if
>> # the other machine is not healthy
>> return 0
>> ####
>>
>> What you also need to know about slony-1 is the fact that you need to
>> resync the COMPLETE data after a failover. In slony-1 it is not
>> possible to let a failed node rejoin the slony-Cluster (even if it was
>> healthy when the failover command was issued). It has to fetch ALL
>> data from the new master. So you want to avoid failover if it is not
>> absolutely necessary.
>>
>> Up to now I thought my RA could handle a few cases and it turns out:
>> SOME it can handle (like master reboot or slave reboot or controlled
>> switchover). But something as simple as killing postgres on the master
>> machine causes a failover. Why?
>>
>> Say A is master, B is slave at this moment
>>
>> 1. monitor on A fails
>> 2. Linux-HA executes demote on A
>> -> As you see above, this will work even if it does nothing
>> 3. Linux-HA executes promote on B
>> -> This, as postgres on A is not running, will end up in a failover
>> (see above)
>
> Notifications might help.
> The Filesystem agent (when operating in OCFS2 mode) keeps a list of who
> its peers are.
> If you did the same then I think you'd be able to recognize that you're
> all alone and that it was ok to switchover_to_me instead.
Read my first post again. Switchover is not possible if the other
postgres instance is not available. The only way to make a single slave
the new master is to use the failover command.
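To illustrate, the promote logic described above (try a controlled switchover first, fall back to failover only when the peer cannot take part) could be sketched roughly like this in shell. `switchover_to_me` and `failover_to_me` are hypothetical stand-ins for the real Slony calls, and `PEER_HEALTHY` is an invented variable used only to simulate the state of the other machine; this is not the actual RA.

```shell
#!/bin/sh
# Sketch only: the two helpers below are hypothetical stand-ins, not real
# Slony commands. PEER_HEALTHY is an invented switch simulating the peer.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Assumption: a controlled switchover only works while the old master responds.
switchover_to_me() {
    [ "${PEER_HEALTHY:-0}" = "1" ]
}

# Assumption: failover works without the peer, but forces a full resync later.
failover_to_me() {
    return 0
}

promote() {
    if switchover_to_me; then
        echo "switchover"
        return $OCF_SUCCESS
    fi
    # Peer not available: failover is the only way to get a writable master.
    if failover_to_me; then
        echo "failover"
        return $OCF_SUCCESS
    fi
    return $OCF_ERR_GENERIC
}
```

With `PEER_HEALTHY=1` the call to `promote` takes the switchover path; with the peer down it ends in the failover path, which is the situation in step 3 of the quoted scenario.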
What *would* help here is:
1. monitor on A fails -> OCF_NOT_RUNNING
Now, instead of "demote A, promote B":
2. Stop/Start the resource on A
IIRC, "start" includes a monitor action (sometimes called a "probe" in
this case). This would report "OCF_RUNNING_MASTER", so the problem would
be solved.
On the other hand, this is probably a pretty big change in Linux-HA's
master/slave handling, and it should be discussed first.
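The proposed handling could be sketched as a small decision helper. This is only a sketch of the idea, not how the CRM behaves today: `next_action` is a hypothetical name, and the numeric values are the standard OCF return codes (7 for OCF_NOT_RUNNING, 8 for OCF_RUNNING_MASTER).

```shell
#!/bin/sh
# Sketch of the proposal only; next_action is a hypothetical helper and does
# not exist in Linux-HA.

OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

# Input: result of the probe that follows a local stop/start on the node
# whose master monitor just failed.
next_action() {
    probe_rc=$1
    case "$probe_rc" in
        "$OCF_RUNNING_MASTER")
            # Local restart brought the master back: no demote/promote needed.
            echo "keep_master"
            ;;
        *)
            # Restart did not help: fall back to today's demote A / promote B.
            echo "demote_and_promote_peer"
            ;;
    esac
}
```

For example, `next_action 8` yields `keep_master`, so B is never promoted and no Slony failover (with its full resync) is triggered.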
Regards
Dominik