[Linux-ha-dev] [PATCH] Process monitor daemon
Keisuke MORI
kskmori at intellilink.co.jp
Wed Jan 16 02:48:06 MST 2008
Hello all,
We have developed a new feature that detects a process failure directly
to reduce the failover time.
If you are interested, please try it and send me your comments.
See the attached README for details on how to use it.
The patch is made for heartbeat-2.1.3.
Any comments and suggestions are really appreciated.
Best regards,
Keisuke MORI
NTT DATA Intellilink Corporation
-------------- next part --------------
Process monitor daemon
1. Introduction
The purpose of this tool is to monitor individual processes and detect their
failure faster and with less overhead than a ResourceAgent (RA) does. It watches
each process's /proc/<PID> directory directly to detect the failure.
You can monitor two or more processes per RA.
It also works for monitoring the processes of clone or master/slave resources.
This tool is composed of 2 components:
i) procd : a daemon module that monitors processes and reports their failures.
It runs as a respawn module.
ii) procdctl : works as an RA.
You have to group it together with the RA of the monitored resource,
and describe the processes to monitor
as its attributes in cib.xml.
The fail-count of this RA is increased when
a process failure is detected.
2. Build
To build this tool, do the following 3 steps.
i) Apply the patch.
$ cd <root directory of heartbeat's source>
$ patch -p 1 < procd-1.0.patch
ii) Make tools
$ cd tools/
$ make
$ su
# make install
# exit
iii) Make OCF
$ cd ../resources/OCF/
$ make
$ su
# make install
# exit
3. How to Use
To use this tool, do the following 3 steps.
i) Add this line to /etc/ha.d/ha.cf.
(for 32bitOS) respawn root /usr/lib/heartbeat/procd
(for 64bitOS) respawn root /usr/lib64/heartbeat/procd
ii) Rewrite the full path of the procd module in
/usr/lib/ocf/resource.d/heartbeat/procdctl to match ha.cf.
ex.) In the case of a 64bit OS:
(before) PROCD="/usr/lib/heartbeat/procd"
(after) PROCD="/usr/lib64/heartbeat/procd"
iii) Edit cib.xml to group procdctl together with the RA of the monitored
resource, and add the processes to monitor as procdctl's attributes.
(See "4. How to specify process")
NOTE: procdctl's start method makes procd start monitoring, and its stop
method makes procd stop monitoring. So, place procdctl below the monitored
resource's RA when you group them, because monitoring with procd has to
start after the processes have started, and has to stop before they are
stopped (a group starts its resources in the listed order and stops them in
the reverse order).
4. How to specify process
Add the information about each process that you want to monitor as an <nvpair>
tag of procdctl.
ex.) In the case of PostgreSQL
<nvpair id="pc1PostgreSQLDB1" name="proc_prmPc1PostgreSQLDB_1"
        value="/home/postgres/pgsql/bin/postgres -D /home/postgres/pgdata -p 5"/>
id : the attribute's id. A unique string (as required by the DTD).
name : the attribute's name. A unique string (as required by the DTD).
value : a character string that identifies the process, such as the command
path and arguments. You can obtain this string by executing one of the
following commands while the monitored process is running, and then copy &
paste it into cib.xml.
# cat /proc/<PID>/cmdline | tr '\000' '\040'
or
# ps ax | grep postgres
NOTE: procd searches /proc/<PID>/cmdline with a prefix match to find the pid
of the specified process. So, if only one postgres process runs on your server,
it is enough to describe only the command path, like
"/usr/local/pgsql/bin/postgres".
NOTE: You can monitor two or more processes per RA. If you want to monitor
not only postgres's master process but also the writer process, the stats
collector process and so on, add <nvpair> tags as follows.
<nvpair id="pc1PostgreSQLDB2" name="proc_prmPc1PostgreSQLDB_2"
        value="postgres: writer process"/>
<nvpair id="pc1PostgreSQLDB3" name="proc_prmPc1PostgreSQLDB_3"
        value="postgres: stats collector process"/>
<nvpair ...(as you like).../>
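The matching described in this section can be sketched in Python as follows.
This is only an illustration, not code from the patch: it assumes that procd
compares the configured value against the contents of /proc/<PID>/cmdline
(with NUL separators turned into spaces) using a plain string prefix test.
The names cmdline_to_string and find_pids and the proc_root parameter are
hypothetical.

```python
import os


def cmdline_to_string(raw):
    # Same effect as: tr '\000' '\040'
    # /proc/<PID>/cmdline separates (and terminates) arguments with NUL bytes.
    return raw.replace(b"\0", b" ").decode("utf-8", errors="replace")


def find_pids(pattern, proc_root="/proc"):
    """Return the sorted PIDs whose command line starts with `pattern`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                cmdline = cmdline_to_string(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if cmdline.startswith(pattern):
            pids.append(int(entry))
    return sorted(pids)
```

For example, find_pids("postgres: writer process") would list the PIDs whose
command line begins with that string. Because the test is a prefix test, a
short value such as a bare command path matches every process started from
that path, which is why a longer value is needed when several such processes
run on the same server.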
5. The details of monitoring
The procd daemon checks each monitored process's PID directory in /proc every
second.
It judges that the process is dead when:
The PID directory doesn't exist.
The process's state in the /proc/<PID>/stat file is "X" or "Z".
It then notifies heartbeat that 'procdctl' (which is in the same
group as the dead process's RA) has detected an error, as crm_resource -F does.
The resources in this group then fail over to another server if
possible.
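The two conditions above can be sketched in Python as follows. This is an
illustration under stated assumptions, not the code of procd: the names
parse_state and is_dead and the proc_root parameter are hypothetical, and the
sketch only performs a single check (the daemon is described as repeating the
check every second).

```python
import os

# Process states treated as "dead": X (dead) and Z (zombie).
DEAD_STATES = {"X", "Z"}


def parse_state(stat_text):
    """Extract the state field from the contents of /proc/<PID>/stat.

    The second field (the command name) is enclosed in parentheses and may
    itself contain spaces or ')', so the state is taken as the first token
    after the last ')'.
    """
    rest = stat_text[stat_text.rindex(")") + 1:]
    return rest.split()[0]


def is_dead(pid, proc_root="/proc"):
    """Return True if the PID directory is gone or the state is X or Z."""
    stat_path = os.path.join(proc_root, str(pid), "stat")
    try:
        with open(stat_path) as f:
            state = parse_state(f.read())
    except (FileNotFoundError, ProcessLookupError):
        return True  # the PID directory (or the process) no longer exists
    return state in DEAD_STATES
```

A zombie still has its /proc/<PID> directory, so checking only for the
directory would miss it; this is why the state field is examined as well.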
6. Differences: RA only vs. procd
Here are some results showing how much this tool reduces the failover time.
The test conditions are as follows.
i) The cluster consists of 2 nodes (ACT and SBY).
ii) Create a shared disk, and place PGDATA on it.
iii) Set up one group resource consisting of Filesystem, IPaddr, and
PostgreSQL.
NOTE: In the case with procd, the procdctl RA is added to them.
iv) In the case with RA only, the PostgreSQL RA's monitor interval is
30 sec.
v) To apply CPU load, execute 'pgbench -S' (select only mode) during
the test. scaling_factor: 500, client: 100, transaction: 100.
With this, the CPU load (user + system) is on the order of 70%.
The following is the time required from killing the postgres process on the
ACT node to PostgreSQL starting on the SBY node.
      | with RA only | with procd |
------+--------------+------------+
1st.  | 26 sec.      | 11 sec.    |
2nd.  | 35 sec.      | 11 sec.    |
3rd.  | 23 sec.      | 10 sec.    |
4th.  | 23 sec.      |  8 sec.    |
5th.  | 23 sec.      | 11 sec.    |
======+==============+============+
AVG.  | 26.0 sec.    | 10.2 sec.  |
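As a cross-check of the AVG. row, the averages can be recomputed from the five
measurements in the table. The short Python sketch below does only that; the
variable names are arbitrary and the difference it prints is derived from the
table, not an additional measurement.

```python
# Failover times in seconds, copied from the table above.
ra_only = [26, 35, 23, 23, 23]     # "with RA only" column
with_procd = [11, 11, 10, 8, 11]   # "with procd" column


def mean(values):
    return sum(values) / len(values)


avg_ra = mean(ra_only)
avg_procd = mean(with_procd)
saved = avg_ra - avg_procd  # average reduction of the failover time

print(f"RA only: {avg_ra:.1f} sec., procd: {avg_procd:.1f} sec., "
      f"saved: {saved:.1f} sec.")
```

The recomputed averages agree with the table (26.0 sec. and 10.2 sec.), which
corresponds to an average reduction of about 15.8 sec. in these five runs.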
7. Appendix
A sample cib.xml to monitor a PostgreSQL database.
<group id="group0">
  <primitive id="prmApPostgreSQLDB" class="ocf" type="pgsql" provider="heartbeat">
    <operations>
      <op id="apPostgreSQLDB_start" name="start" timeout="120s" on_fail="restart"/>
      <op id="apPostgreSQLDB_monitor" name="monitor" interval="30s" timeout="60s" on_fail="restart"/>
      <op id="apPostgreSQLDB_stop" name="stop" timeout="60s" on_fail="restart"/>
    </operations>
    <instance_attributes id="atrApPostgreSQLDB">
      <attributes>
        <nvpair id="pgctl01" name="pgctl" value="/home/postgres/pgsql/bin/pg_ctl"/>
        <nvpair id="start_opt01" name="start_opt" value="-p 5432 -h 172.16.251.250"/>
        <nvpair id="psql01" name="psql" value="/home/postgres/pgsql/bin/psql"/>
        <nvpair id="pgdata01" name="pgdata" value="/home/postgres/pgdata"/>
        <nvpair id="pgdba01" name="pgdba" value="postgres"/>
        <nvpair id="pgdb01" name="pgdb" value="template1"/>
        <nvpair id="logfile01" name="logfile" value="/var/log/pgsql.log"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive id="prmPc1PostgreSQLDB" class="ocf" type="procdctl" provider="heartbeat">
    <operations>
      <op id="pc1_start" name="start" timeout="60s" on_fail="restart"/>
      <op id="pc1_monitor" name="monitor" interval="10s" timeout="60s" on_fail="restart"/>
      <op id="pc1_stop" name="stop" timeout="60s" on_fail="restart"/>
    </operations>
    <instance_attributes id="atrPc1PostgreSQLDB">
      <attributes>
        <nvpair id="pc1PostgreSQLDB1" name="proc_prmPc1PostgreSQLDB_1"
                value="/home/postgres/pgsql/bin/postgres -D /home/postgres/pgdata -p 5"/>
        <nvpair id="pc1PostgreSQLDB2" name="proc_prmPc1PostgreSQLDB_2"
                value="postgres: writer process"/>
        <nvpair id="pc1PostgreSQLDB3" name="proc_prmPc1PostgreSQLDB_3"
                value="postgres: stats collector process"/>
      </attributes>
    </instance_attributes>
  </primitive>
</group>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: procd-1.0.patch
Type: text/x-patch
Size: 69160 bytes
Desc: procd-1.0.patch
Url : http://lists.community.tummy.com/pipermail/linux-ha-dev/attachments/20080116/45de0243/procd-1.0-0001.bin