[Linux-HA] Feedback: Master/Slave RA for Postgres / Slony Cluster?

Dominik Klein dk at in-telegence.net
Wed Nov 7 00:26:53 MST 2007


Hi Andrew

thanks for your reply.

>> So I thought I could implement "demote" as "return 0", as "promote" on 
>> the other machine will do the job anyway. Well, not the best idea as a 
>> "monitor" action on the apparently demoted machine will still return 
>> Master Status until "promote" on the second machine finished.
> 
> What if the crm delayed the slave's monitor until after the other side 
> was promoted... would that help significantly?

That would probably prevent one failed monitor action in this very 
special case.

>> Furthermore, the switchover command will fail if the other machine is 
>> not responding. In case the current master really has a problem, all 
>> you can do to get a writeable database on the current slave is to use the 
>> failover command. But Linux-HA only knows "promote" and "demote".
>>
>> So I implemented promote and demote the following way:
>>
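>> #### promote
>> # prefer a controlled switchover; fail over only as a last resort
>> if switchover_to_me
>> then
>>     return 0
>> else
>>     # second switchover attempt before falling back to failover
>>     if ! switchover_to_me
>>     then
>>         failover_to_me
>>         return $?
>>     fi
>>     return 0
>> fi
>> ####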
>>
>> #### demote
>> switchover_to_other_machine
>> # don't care if this works as it cannot work if
>> # the other machine is not healthy
>> return 0
>> ####
>>
>> What you also need to know about slony-1 is the fact that you need to 
>> resync the COMPLETE data after a failover. In slony-1 it is not 
>> possible to let a failed node rejoin the slony-Cluster (even if it was 
>> healthy when the failover command was issued). It has to fetch ALL 
>> data from the new master. So you want to avoid failover if it is not 
>> absolutely necessary.
>>
>> Up to now I thought my RA could handle a few cases, and it turns out 
>> it can handle SOME (like master reboot, slave reboot or controlled 
>> switchover). But something as simple as killing postgres on the master 
>> machine causes a failover. Why?
>>
>> Say A is master, B is slave at this moment
>>
>> 1. monitor on A fails
>> 2. Linux-HA executes demote on A
>> -> As you see above, this will work even if it does nothing
>> 3. Linux-HA executes promote on B
>> -> This, as postgres on A is not running, will end up in a failover 
>> (see above)
> 
> Notifications might help.
> The Filesystem agent (when operating in OCFS2 mode) keeps a list of who 
> its peers are.
> If you did the same then I think you'd be able to recognize that you're 
> all alone and that it was ok to switchover_to_me instead.

Read my first post again. Switchover is not possible if the other 
postgres instance is not available. The only way to make a single slave 
the new master is to use the failover command.
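
To make the constraint concrete, here is a minimal sketch of the promote logic under that rule. The function names switchover_to_me and failover_to_me are placeholders for the real slony-1 calls (stubbed here, not actual slony commands), and PEER_UP and LAST_ACTION are made-up variables that exist only so the control flow can be tried out:

```shell
#!/bin/sh
# Sketch only: the two slony helpers below are hypothetical stubs.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Assumption: 1 means the other postgres instance is reachable.
PEER_UP=${PEER_UP:-1}

switchover_to_me() {
    # A controlled switchover only works if the peer is available.
    [ "$PEER_UP" -eq 1 ]
}

failover_to_me() {
    # Stub that always succeeds. In reality the old master has to
    # fetch ALL data again afterwards, so this is the last resort.
    return 0
}

promote() {
    if switchover_to_me; then
        LAST_ACTION=switchover
        return $OCF_SUCCESS
    fi
    # Peer not available: failover is the only way to get a master.
    if failover_to_me; then
        LAST_ACTION=failover
        return $OCF_SUCCESS
    fi
    return $OCF_ERR_GENERIC
}
```

With PEER_UP=0 the sketch always ends up in the failover branch, which is exactly the case I would like to avoid when postgres on the master was merely killed.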

What *would* help here is:

1. monitor on A fails -> OCF_NOT_RUNNING
Now, instead of "demote A, promote B":
2. Stop/Start the resource on A
IIRC, "start" includes a monitor action (sometimes called a "probe" in 
this case). This would report OCF_RUNNING_MASTER, so A would simply stay 
master and no failover would be needed.
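
A rough sketch of the monitor behaviour this relies on, using the standard OCF return codes (0, 7 and 8). The helpers postgres_is_running and node_is_master are hypothetical stubs driven by the made-up variables PG_RUNNING and IS_MASTER; a real RA would have to query postgres and slony instead:

```shell
#!/bin/sh
# OCF return codes relevant for a master/slave monitor.
OCF_SUCCESS=0          # running as slave
OCF_NOT_RUNNING=7      # not running at all
OCF_RUNNING_MASTER=8   # running as master

# Hypothetical stubs, for illustration only.
postgres_is_running() { [ "$PG_RUNNING" -eq 1 ]; }
node_is_master()      { [ "$IS_MASTER" -eq 1 ]; }

monitor() {
    if ! postgres_is_running; then
        return $OCF_NOT_RUNNING
    fi
    # postgres is up: report the role slony has assigned to this node.
    if node_is_master; then
        return $OCF_RUNNING_MASTER
    fi
    return $OCF_SUCCESS
}
```

After postgres is killed on A, monitor returns OCF_NOT_RUNNING. If the resource were then restarted on A, the same monitor would return OCF_RUNNING_MASTER again, since no switchover or failover has moved the master role away from A.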

On the other hand, this is probably a pretty big change in Linux-HA's 
master/slave handling and should be discussed.

Regards
Dominik

