[Linux-ha-dev] [PATCH] Process monitor daemon

Keisuke MORI kskmori at intellilink.co.jp
Wed Jan 16 02:48:06 MST 2008


Hello all,

We have developed a new feature that detects a process failure directly
to reduce the failover time.

If you are interested, please try it and send me your comments.

See attached README for details about how to use this.
The patch is made for heartbeat-2.1.3.


Any comments and suggestions are really appreciated.

Best regards,

Keisuke MORI
NTT DATA Intellilink Corporation
-------------- next part --------------
                       Process monitor daemon

1. Introduction
    The purpose of this tool is to monitor individual processes and detect
their failure faster and with less overhead than a ResourceAgent (RA) can.
It watches the /proc/<PID> directories directly to detect process failures.
    You can monitor two or more processes per RA.
    It also works for monitoring the processes of clone or master/slave
resources.

    This tool is composed of 2 components:
       i) procd    : a daemon that monitors processes and reports their
                     failures. It runs as a respawn module.
      ii) procdctl : works as an RA.
                     You have to group it together with the RA of the
                     monitored resource, and describe the processes to
                     monitor as its attributes in cib.xml.
                     The fail-count of this RA is increased when a
                     process failure is detected.

2. Build
    To build this tool, do the following 3 steps.

      i) Apply the patch.
          $ cd <root directory of heartbeat's source>
          $ patch -p 1 < procd-1.0.patch
     ii) Make tools
          $ cd tools/
          $ make
          $ su
          # make install
          # exit
    iii) Make OCF
          $ cd ../resources/OCF/
          $ make
          $ su
          # make install
          # exit

3. How to Use
    To use this tool, do the following 3 steps.
      i) Add this line to /etc/ha.d/ha.cf.
           (for 32bitOS) respawn root /usr/lib/heartbeat/procd
           (for 64bitOS) respawn root /usr/lib64/heartbeat/procd
     ii) Rewrite the full path of the procd module in
         /usr/lib/ocf/resource.d/heartbeat/procdctl to match ha.cf.
           ex.) In the case of 64bitOS:
             (before) PROCD="/usr/lib/heartbeat/procd"
             (after)  PROCD="/usr/lib64/heartbeat/procd"
    iii) Edit cib.xml to group procdctl together with the RA of the monitored
         resource, and add the processes to monitor as its attributes.
         (See "4. How to specify process")

    NOTE:    The start method of procdctl makes procd start monitoring, and
          the stop method makes it stop. So, place procdctl below the RA of
          the monitored resource when you group them, because monitoring
          with procd has to start after the processes have started, and has
          to stop before they are stopped.

4. How to specify process
    Add the information about each process you want to monitor in an <nvpair>
tag of procdctl.
    ex.) In the case of PostgreSQL
    <nvpair id="pc1PostgreSQLDB1" name="proc_prmPc1PostgreSQLDB_1"
            value="/home/postgres/pgsql/bin/postgres -D /home/postgres/pgdata -p 5"/>

      id      : the attribute's id. A unique string (to follow the DTD).
      name    : the attribute's name. A unique string (to follow the DTD).
      value   : a character string that identifies the process, such as the
                command path and arguments. You can obtain this string by
                running one of the commands below while the monitored
                process is running, and then copy & paste it into cib.xml.
                  # cat /proc/<PID>/cmdline | tr '\000' '\040'
                or
                  # ps ax | grep postgres
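
    The cat/tr command above can also be sketched in Python. This is only an
    illustration and not part of the patch; the function names are
    hypothetical. The arguments in /proc/<PID>/cmdline are separated by NUL
    bytes, and each NUL is mapped to a space.

```python
from pathlib import Path


def cmdline_to_value(raw: bytes) -> str:
    """Turn raw /proc/<PID>/cmdline bytes into a space-separated string."""
    # Each argument is terminated by a NUL byte; mapping NUL to space is
    # what tr '\000' '\040' does.
    return raw.replace(b"\0", b" ").decode(errors="replace")


def read_cmdline(pid: int) -> str:
    """Read the command line of a running process (Linux only)."""
    return cmdline_to_value(Path(f"/proc/{pid}/cmdline").read_bytes())
```

    Because the last argument is also followed by a NUL, the result ends with
    a space, just as the tr output does. You may want to drop that trailing
    space when you paste the string into cib.xml.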

    NOTE: procd searches /proc/<PID>/cmdline with a prefix match to find the
          PID of the specified process. So, if there is only one postgres
          process on your server, it is enough to specify only the command
          path, like "/usr/local/pgsql/bin/postgres".

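    The prefix search described in the NOTE above might look like the
    following minimal sketch. It is not the code of procd itself; find_pids
    and its proc_root parameter are hypothetical names used for illustration.

```python
import os
from typing import List


def find_pids(prefix: str, proc_root: str = "/proc") -> List[int]:
    """Return the PIDs whose command line starts with the given prefix."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited or is not readable
        # NUL-separated arguments become one space-separated string.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        if cmdline.startswith(prefix):
            pids.append(int(entry))
    return sorted(pids)
```

    The sketch returns every match, which shows why a short prefix is only
    safe when a single process on the server starts with it.
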
    NOTE: You can monitor two or more processes per RA. If you want to monitor
          not only the postgres master process but also the writer process,
          the stats collector process and so on, add <nvpair> tags as follows.
          <nvpair id="pc1PostgreSQLDB2" name="proc_prmPc1PostgreSQLDB_2"
                  value="postgres: writer process"/>
          <nvpair id="pc1PostgreSQLDB3" name="proc_prmPc1PostgreSQLDB_3"
                  value="postgres: stats collector process"/>
          <nvpair ...(as you like).../>
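
    If you have many processes to list, the <nvpair> tags can be generated
    instead of typed by hand. The sketch below assumes the naming pattern
    seen in the examples (name="proc_<resource id>_<number>"); that pattern
    is inferred from the samples in this README and is not confirmed as a
    requirement of procdctl. proc_nvpairs is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def proc_nvpairs(id_prefix, resource_id, patterns):
    """Build one <nvpair> element per process pattern, numbered from 1."""
    pairs = []
    for n, pattern in enumerate(patterns, start=1):
        pairs.append(ET.Element("nvpair", {
            "id": f"{id_prefix}{n}",               # must be unique
            "name": f"proc_{resource_id}_{n}",     # assumed naming pattern
            "value": pattern,                      # command line prefix
        }))
    return pairs


patterns = [
    "/home/postgres/pgsql/bin/postgres -D /home/postgres/pgdata -p 5",
    "postgres: writer process",
    "postgres: stats collector process",
]
for pair in proc_nvpairs("pc1PostgreSQLDB", "prmPc1PostgreSQLDB", patterns):
    print(ET.tostring(pair, encoding="unicode"))
```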

5. The details of monitoring
    The procd daemon checks the PID directory of each process in /proc every
    second.
    It judges that the process is dead when either:
        The PID directory doesn't exist.
        The process's state in the /proc/<PID>/stat file is "X" or "Z".
    It then notifies heartbeat that 'procdctl' (which is in the same group
    as the RA of the dead process) has detected an error, as crm_resource -F
    does.
    The resources in this group then fail over to another server if possible.
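
    The two conditions above can be sketched as follows. This is an
    illustration in Python and not the C code of procd; parse_state, is_dead
    and the proc_root parameter are hypothetical names.

```python
import os


def parse_state(stat_text: str) -> str:
    """Return the one-letter state field from the text of /proc/<PID>/stat."""
    # The second field (the command name) is wrapped in parentheses and may
    # itself contain spaces or ')', so split after the LAST ')'.
    after_comm = stat_text[stat_text.rindex(")") + 1:]
    return after_comm.split()[0]


def is_dead(pid: int, proc_root: str = "/proc") -> bool:
    """Apply the two checks: missing PID directory, or state X / Z."""
    pid_dir = os.path.join(proc_root, str(pid))
    if not os.path.isdir(pid_dir):
        return True
    try:
        with open(os.path.join(pid_dir, "stat")) as f:
            state = parse_state(f.read())
    except (FileNotFoundError, ProcessLookupError):
        # The process vanished between the directory check and the read.
        return True
    return state in ("X", "Z")
```

    A monitor in this style would call is_dead() for every watched PID once
    per second and report to the cluster as soon as it returns True.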

6. Differences: RA only vs. procd
    Here are some results showing how much the failover time is reduced by
    this tool.
    The test conditions are as follows.
        i) The cluster consists of 2 nodes (ACT and SBY).
       ii) A shared disk is created, with PGDATA on it.
      iii) One group resource is set, consisting of Filesystem, IPaddr, and
           PostgreSQL.
               NOTE: In the "with procd" case, the procdctl RA is added to
                     them.
       iv) In the "with RA only" case, the monitor interval of the PostgreSQL
           RA is 30 sec.
        v) To apply CPU load, 'pgbench -S' (select only mode) is executed
           during the test. scaling_factor: 500, client: 100, transaction: 100.
           As a result, the CPU load (user + system) is about 70%.

    The following is the time required from killing the postgres process on
    the ACT node to PostgreSQL starting on the SBY node.

          |  with RA only  |  with procd   |
    ------+----------------+---------------+--
     1st. |      26   sec. |     11   sec. |
     2nd. |      35   sec. |     11   sec. |
     3rd. |      23   sec. |     10   sec. |
     4th. |      23   sec. |      8   sec. |
     5th. |      23   sec. |     11   sec. |
    ==========================================
     AVG. |      26.0 sec. |     10.2 sec. |
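
    The averages in the table can be re-checked with a few lines of Python.
    The numbers are copied from the five runs above; the relative reduction
    is derived from them and is not a separate measurement.

```python
from statistics import mean

ra_only = [26, 35, 23, 23, 23]     # seconds, "with RA only" column
with_procd = [11, 11, 10, 8, 11]   # seconds, "with procd" column

avg_ra = mean(ra_only)
avg_procd = mean(with_procd)
# Fraction of the failover time saved by using procd.
reduction = 1 - avg_procd / avg_ra

print(f"RA only: {avg_ra:.1f} sec., procd: {avg_procd:.1f} sec.")
print(f"failover time reduced by {reduction:.0%}")
```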

7. Appendix
    A sample cib.xml to monitor a PostgreSQL database.

    <group id="group0">
      <primitive id="prmApPostgreSQLDB" class="ocf" type="pgsql"
                 provider="heartbeat">
        <operations>
          <op id="apPostgreSQLDB_start" name="start" timeout="120s"
              on_fail="restart"/>
          <op id="apPostgreSQLDB_monitor" name="monitor" interval="30s"
              timeout="60s" on_fail="restart"/>
          <op id="apPostgreSQLDB_stop" name="stop" timeout="60s"
              on_fail="restart"/>
        </operations>
        <instance_attributes id="atrApPostgreSQLDB">
          <attributes>
            <nvpair id="pgctl01" name="pgctl"
                    value="/home/postgres/pgsql/bin/pg_ctl"/>
            <nvpair id="start_opt01" name="start_opt"
                    value="-p 5432 -h 172.16.251.250"/>
            <nvpair id="psql01" name="psql"
                    value="/home/postgres/pgsql/bin/psql"/>
            <nvpair id="pgdata01" name="pgdata" value="/home/postgres/pgdata"/>
            <nvpair id="pgdba01" name="pgdba" value="postgres"/>
            <nvpair id="pgdb01" name="pgdb" value="template1"/>
            <nvpair id="logfile01" name="logfile" value="/var/log/pgsql.log"/>
          </attributes>
        </instance_attributes>
      </primitive>
      <primitive id="prmPc1PostgreSQLDB" class="ocf" type="procdctl"
                 provider="heartbeat">
        <operations>
          <op id="pc1_start" name="start" timeout="60s" on_fail="restart"/>
          <op id="pc1_monitor" name="monitor" interval="10s" timeout="60s"
              on_fail="restart"/>
          <op id="pc1_stop" name="stop" timeout="60s" on_fail="restart"/>
        </operations>
        <instance_attributes id="atrPc1PostgreSQLDB">
          <attributes>
            <nvpair id="pc1PostgreSQLDB1" name="proc_prmPc1PostgreSQLDB_1"
                    value="/home/postgres/pgsql/bin/postgres -D /home/postgres/pgdata -p 5"/>
            <nvpair id="pc1PostgreSQLDB2" name="proc_prmPc1PostgreSQLDB_2"
                    value="postgres: writer process"/>
            <nvpair id="pc1PostgreSQLDB3" name="proc_prmPc1PostgreSQLDB_3"
                    value="postgres: stats collector process"/>
          </attributes>
        </instance_attributes>
      </primitive>
    </group>
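
    Before loading a group like the one above, it can be handy to check that
    procdctl is placed after the RA it watches and to list the process
    patterns it carries. The following is a small sketch under assumptions:
    check_group is a hypothetical helper, GROUP_XML is a shortened copy of
    the sample, and the "proc_" name prefix is inferred from the examples.

```python
import xml.etree.ElementTree as ET

GROUP_XML = """
<group id="group0">
  <primitive id="prmApPostgreSQLDB" class="ocf" type="pgsql"
             provider="heartbeat"/>
  <primitive id="prmPc1PostgreSQLDB" class="ocf" type="procdctl"
             provider="heartbeat">
    <instance_attributes id="atrPc1PostgreSQLDB">
      <attributes>
        <nvpair id="pc1PostgreSQLDB1" name="proc_prmPc1PostgreSQLDB_1"
                value="/home/postgres/pgsql/bin/postgres -D /home/postgres/pgdata -p 5"/>
      </attributes>
    </instance_attributes>
  </primitive>
</group>
"""


def check_group(xml_text):
    """Return (ordered, patterns) for one <group> element.

    ordered  : True when a procdctl primitive exists and is not the first
               member of the group.
    patterns : values of the procdctl nvpairs whose name starts with "proc_".
    """
    group = ET.fromstring(xml_text)
    primitives = group.findall("primitive")
    types = [p.get("type") for p in primitives]
    ordered = "procdctl" in types and types.index("procdctl") > 0
    patterns = [
        nv.get("value")
        for prim in primitives
        if prim.get("type") == "procdctl"
        for nv in prim.iter("nvpair")
        if nv.get("name", "").startswith("proc_")
    ]
    return ordered, patterns


ordered, patterns = check_group(GROUP_XML)
print("procdctl placed after another RA:", ordered)
for pattern in patterns:
    print("monitored pattern:", pattern)
```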
-------------- next part --------------
A non-text attachment was scrubbed...
Name: procd-1.0.patch
Type: text/x-patch
Size: 69160 bytes
Desc: procd-1.0.patch
Url : http://lists.community.tummy.com/pipermail/linux-ha-dev/attachments/20080116/45de0243/procd-1.0-0001.bin

