I'm having trouble getting cluster monitoring set up and configured on a 3-node multi-state cluster. I'm using Florian's blog post as an example: http://floriancrouzat.net/2013/01/monitor-a-pacemaker-cluster-with-ocfpacemakerclustermon-andor-external-agent/

When I create the primitive resource, it starts on one of my nodes but spawns multiple instances of crm_mon. I don't see anything that would cause it to spawn multiple instances; it's very odd behavior.

I'm also looking for some clarification on what this resource provides. It looks to me like it starts crm_mon in daemon mode, which updates an .html file and, with -E, runs an external script. But the resource itself doesn't seem to trigger anything when another resource changes state; it only acts if the crm_mon process (monitored by PID) fails and has to be restarted. If this is correct, what is the best practice for monitoring the state of additional resources?
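
For reference, this is roughly how I understand the -E hook: crm_mon runs the external script and passes event details through CRM_notify_* environment variables. Below is a minimal sketch of such a script, assuming those variable names as described in the blog post (I have not verified them on this cluster). The function name format_event and the log path in the comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical external agent for crm_mon -E (illustrative sketch only).
# Assumption: crm_mon exports CRM_notify_* variables for the event it reports.
format_event() {
    # Fall back to "?" for any variable crm_mon did not set.
    echo "node=${CRM_notify_node:-?} rsc=${CRM_notify_rsc:-?} task=${CRM_notify_task:-?} rc=${CRM_notify_rc:-?} desc=${CRM_notify_desc:-?}"
}

# When installed as the -E script, each event could be appended to a log, e.g.:
#   format_event >> /var/log/pcmk_events.log   (hypothetical path)
```

If a script like this logged a line for every resource event, that would show whether crm_mon only reacts to its own restarts or to state changes of other resources as well.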



Below are some additional data points. 

Creating the Resource

[root@pgdb2 tmp]# crm configure primitive SNMPMon ocf:pacemaker:ClusterMon \
>         params user="root" update="30" extra_options="-E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net" \
>         op monitor on-fail="restart" interval="60"

Manual crm_mon output

Last updated: Thu May  9 10:24:30 2013
Last change: Thu May  9 10:20:49 2013 via cibadmin on pgdb2.example.com
Stack: cman
Current DC: pgdb1.example.com - partition with quorum
Version: 1.1.8-7.el6-394e906
3 Nodes configured, unknown expected votes
6 Resources configured.

Node pgdb1.example.com: standby
Online: [ pgdb2.example.com pgdb3.example.com ]

 PG_REP_VIP	(ocf::heartbeat:IPaddr2):	Started pgdb2.example.com
 PG_CLI_VIP	(ocf::heartbeat:IPaddr2):	Started pgdb2.example.com
 Master/Slave Set: msPGSQL [PGSQL]
     Masters: [ pgdb2.example.com ]
     Slaves: [ pgdb3.example.com ]
     Stopped: [ PGSQL:2 ]
 SNMPMon	(ocf::pacemaker:ClusterMon):	Started pgdb3.example.com

ps output checking for the crm_mon process on pgdb3

[root@pgdb3 tmp]# ps aux | grep crm_mon
root     16097  0.0  0.0  82624  2784 ?        S    10:20   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
root     16099  0.0  0.0  82624  2660 ?        S    10:20   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
root     16104  0.0  0.0  82624  2448 ?        S    10:20   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
root     16515  0.0  0.0 103244   852 pts/0    S+   10:21   0:00 grep crm_mon
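
To tell which of these processes the resource agent is actually tracking, the PID stored in /tmp/ClusterMon_SNMPMon.pid can be compared against the running crm_mon PIDs. A rough sketch follows; classify_pids is a hypothetical helper name, and I'm assuming pgrep is available on the node.

```shell
#!/bin/sh
# Label each PID read from stdin as "tracked" (matches the PID file) or
# "untracked". $1 is the path to the PID file written via crm_mon -p.
classify_pids() {
    pidfile="$1"
    # An unreadable or missing PID file leaves $tracked empty,
    # so every PID is then reported as untracked.
    tracked=$(cat "$pidfile" 2>/dev/null)
    while read -r pid; do
        [ -n "$pid" ] || continue
        if [ "$pid" = "$tracked" ]; then
            echo "$pid tracked"
        else
            echo "$pid untracked"
        fi
    done
}

# Usage on the node running the resource:
#   pgrep -x crm_mon | classify_pids /tmp/ClusterMon_SNMPMon.pid
```

With the three PIDs shown above, at most one of them can match the PID file, so the other two would be reported as untracked.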

Output from corosync.log

May 09 10:20:51 [3100] pgdb3.cha.arin.net       lrmd:     info: process_lrmd_get_rsc_info:      Resource 'SNMPMon' not found (3 active resources)
May 09 10:20:51 [3100] pgdb3.cha.arin.net       lrmd:     info: process_lrmd_rsc_register:      Added 'SNMPMon' to the rsc list (4 active resources)
May 09 10:20:52 [3103] pgdb3.cha.arin.net       crmd:     info: services_os_action_execute:     Managed ClusterMon_meta-data_0 process 16010 exited with rc=0
May 09 10:20:52 [3103] pgdb3.cha.arin.net       crmd:   notice: process_lrm_event:      LRM operation SNMPMon_monitor_0 (call=61, rc=7, cib-update=28, confirmed=true) not running
May 09 10:20:52 [3103] pgdb3.cha.arin.net       crmd:   notice: process_lrm_event:      LRM operation SNMPMon_start_0 (call=64, rc=0, cib-update=29, confirmed=true) ok
May 09 10:20:52 [3103] pgdb3.cha.arin.net       crmd:   notice: process_lrm_event:      LRM operation SNMPMon_monitor_60000 (call=67, rc=0, cib-update=30, confirmed=false) ok
