[Linux-ha-dev] pgsql RA improvements

Andrew Beekhof beekhof at gmail.com
Fri Feb 23 04:22:30 MST 2007


On 2/23/07, Keisuke MORI <kskmori at intellilink.co.jp> wrote:
> Hi,
>
> We have found several problems with the pgsql RA through our testing.
> It 'fails to failover' in some scenarios.
> I'm proposing a patch to fix them.
>
> Problem description:
>
> 1) The first 'monitor' may fail even if the postmaster was
>    successfully launched.
>
>    This is because 'start' of the pgsql RA may return before the
>    postmaster is ready to answer a psql query issued by
>    'monitor', since it only checks the existence of the postmaster
>    process. The postmaster can take a few minutes to become ready
>    to answer, particularly when it needs to recover the database
>    after a crash. Even if no recovery is necessary, we observed
>    that it sometimes fails in some of our test cases.
>
> 2) The postmaster fails to start up when a 'postmaster.pid' file
>    was left over from a previous crash.
>
> 3) 'stop' does not execute the fast mode shutdown effectively,
>    because it executes the immediate mode shutdown at the very
>    next moment.  The fast mode shutdown can take a few minutes
>    to finish flushing the database log.
>
>    This isn't a critical problem, but it may make the failover
>    take longer to complete (according to our
>    database team). It is preferable to wait for the fast
>    mode shutdown to complete for as long as possible.
>
>
> Proposals to fix:
>
> 1) In 'start', wait until the postmaster is ready to answer,
>    using the same check that 'monitor' does.
>    The maximum time to wait for startup to complete can be
>    customized by an additional parameter 'start_wait'.
>
> 2) Add cleanup code for 'postmaster.pid' on stop and before starting.
>
> 3) In 'stop', wait until the postmaster completes the fast
>    mode shutdown.
>    The maximum time to wait for shutdown to complete can be
>    customized by an additional parameter 'stop_wait'.
>
>
> The attached patch is for the latest -dev.

I'd be more inclined to go with something like the patch below.

The function of start_wait and stop_wait is just as easily achieved by
setting the action's timeout.  It's also harder to mess up (i.e. by
setting start_wait to longer than the start action's timeout).

diff -r 959f2c429fc3 resources/OCF/pgsql.in
--- a/resources/OCF/pgsql.in    Fri Feb 23 10:59:12 2007 +0100
+++ b/resources/OCF/pgsql.in    Fri Feb 23 12:18:53 2007 +0100
@@ -197,15 +197,12 @@ pgsql_start() {
        return $OCF_ERR_GENERIC
     fi

-    if ! pgsql_status
-    then
-       sleep 5
-       if ! pgsql_status
-       then
-           echo "ERROR: PostgreSQL is not running!"
-            return $OCF_ERR_GENERIC
-       fi
-    fi
+
+    active=1
+    while [ $active != 0 ]; do
+       pgsql_monitor
+       active=$?
+    done

     return $OCF_SUCCESS
 }
@@ -227,6 +224,13 @@ pgsql_stop() {
        runasowner "$PGCTL -D $PGDATA stop -m immediate > /dev/null 2>&1"
     fi

+    active=$OCF_SUCCESS
+    while [ $active != $OCF_NOT_RUNNING ]; do
+       pgsql_monitor
+       active=$?
+    done
+
+    rm -f $PIDFILE
     return $OCF_SUCCESS
 }
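
For illustration, here is a rough standalone sketch of the kind of
polling loop the patch relies on. It is not part of the RA:
fake_monitor, READY_AFTER, POLL_INTERVAL and wait_for_rc are
hypothetical names invented for this example, and fake_monitor merely
stands in for pgsql_monitor. The loop has no deadline of its own, on
the assumption that the action timeout configured in the cluster
aborts the operation if the wanted state is never reached.

```sh
#!/bin/sh
# Sketch only: fake_monitor is a hypothetical stand-in for pgsql_monitor.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

READY_AFTER=3   # pretend the postmaster answers on the 3rd check
calls=0

fake_monitor() {
    calls=$((calls + 1))
    if [ "$calls" -ge "$READY_AFTER" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

# Poll until the monitor returns the wanted return code ($1).
# Deliberately no deadline here: the cluster's action timeout is
# assumed to bound how long this may run.
wait_for_rc() {
    wanted=$1
    while :; do
        fake_monitor
        rc=$?
        [ "$rc" -eq "$wanted" ] && return 0
        sleep "${POLL_INTERVAL:-1}"
    done
}

wait_for_rc $OCF_SUCCESS
echo "monitor succeeded after $calls checks"
```

Running the check first and testing its result afterwards means the
loop needs no initial value for the return code, so it cannot be
skipped by accident. The same helper could wait for $OCF_NOT_RUNNING
on the stop side.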

