[Pacemaker] ClusterMon Resource starting multiple instances of crm_mon

Sun May 12 19:24:52 EDT 2013

On 10/05/2013, at 8:08 PM, Steven Bambling <smbambling at arin.net> wrote:

> 
> On May 10, 2013, at 5:35 AM, Steven Bambling <smbambling at arin.net> wrote:
> 
>> 
>> On May 9, 2013, at 8:05 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>> 
>>> 
>>> On 10/05/2013, at 12:40 AM, Steven Bambling <smbambling at arin.net> wrote:
>>> 
>>>> I'm having trouble getting cluster monitoring set up and configured on a 3-node multi-state cluster. I'm using Florian's blog post as an example: http://floriancrouzat.net/2013/01/monitor-a-pacemaker-cluster-with-ocfpacemakerclustermon-andor-external-agent/
>>>> 
>>>> When I create the primitive resource, it starts on one of my nodes but spawns multiple instances of crm_mon. I don't see any reason for it to spawn multiple instances; it's very odd behavior.
>>> 
>>> If you run:
>>> 
>>> /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>> 
>>> manually a few times, what happens?  Multiple processes?
>> 
>> Yep, for some reason it's spawning multiple processes.
>> 
>> [root at pgdb3 ~]# ps aux | grep crm_mon
>> root     30678  0.0  0.0 103244   856 pts/0    S+   05:30   0:00 grep crm_mon
>> [root at pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> [root at pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> [root at pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> [root at pgdb3 ~]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> [root at pgdb3 ~]# ps aux | grep crm_mon
>> root     30772  0.0  0.0  82744  2816 ?        S    05:30   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> root     30781  0.0  0.0  82744  2668 ?        S    05:30   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> root     30784  0.0  0.0  82744  2476 ?        S    05:30   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>> root     31134  0.0  0.0 103244   856 pts/0    S+   05:30   0:00 grep crm_mon
>> 
>> But the .pid file in the tmp dir lists only 1 PID:
>> [root at pgdb3 ~]# cat /tmp/ClusterMon_SNMPMon.pid
>>   30772
> 
> I take that back. I double-checked, and the SNMPMon resource was still started, which was creating the multiple processes. After I stopped the resource, I pkill'd all the crm_mon processes and then ran the command again manually. Now it seems to squash the additional processes and allows only 1 process to be running.
> 
> [root at pgdb3 tmp]# ps aux | grep crm_mon
> root     30955  0.0  0.0  82492  2632 pts/0    S    06:05   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> root     31991  0.0  0.0 103244   852 pts/0    S+   06:05   0:00 grep crm_mon
> [root at pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root at pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root at pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root at pgdb3 tmp]# /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> [root at pgdb3 tmp]# ps aux | grep crm_mon
> root     30955  0.0  0.0  82492  2632 pts/0    S    06:05   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
> root     32545  0.0  0.0 103244   856 pts/0    S+   06:06   0:00 grep crm_mon

It's possibly due to a stale pid file.
I believe the following patches should fix the problem.

+ Andrew Beekhof (12 days ago) 479c5cc: Fix: crm_mon: Check if a process can be daemonized before forking so the parent can report an error 
+ Andrew Beekhof (12 days ago) e549770: Fix: crm_mon: Ensure stale pid files are updated when a new process is started 

https://github.com/beekhof/pacemaker/commit/e549770
https://github.com/beekhof/pacemaker/commit/479c5cc
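
To make the stale pid file idea concrete: a pid file is stale when the PID it records no longer belongs to a running process. The following is a minimal sketch of such a check from the shell, not the code from the patches above. The function name pidfile_state is hypothetical, and it assumes the file holds a single PID, possibly padded with whitespace.

```shell
#!/bin/sh
# Hypothetical helper: classify a pid file as "missing", "stale", or "live".
# Assumption: the file contains one PID, optionally surrounded by whitespace.
pidfile_state() {
    pidfile=$1

    # No file at all: nothing has recorded a PID.
    [ -f "$pidfile" ] || { echo missing; return 0; }

    # Strip whitespace so a padded PID is handled.
    pid=$(tr -d '[:space:]' < "$pidfile")

    # An empty or non-numeric value cannot refer to a process.
    case $pid in
        ''|*[!0-9]*) echo stale; return 0 ;;
    esac

    # kill -0 delivers no signal; it succeeds only if the PID exists
    # and the caller is permitted to signal it.
    if kill -0 "$pid" 2>/dev/null; then
        echo live
    else
        echo stale
    fi
}
```

For example, `pidfile_state /tmp/ClusterMon_SNMPMon.pid` could be run on the affected node. A "live" result only means that some process currently has that PID; it does not prove the process is crm_mon, so comparing against `ps` output is still worthwhile.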

Weird that it works from the command line but not the resource agent.
Are the permissions on /tmp/ClusterMon_SNMPMon.pid ok?

> 
> STEVE
> 
>> 
>>> 
>>>> 
>>>> I was also looking for some clarification on what this resource provides. It looks to me like it kicks off crm_mon in daemon mode, which updates a .html file and, with -E, runs an external script. But the resource itself doesn't trigger anything if another resource changes state; it acts only if the crm_mon process (monitored by PID) fails and has to be restarted.
>>> 
>>> Correct, it just updates the html file which you can see in your browser.
>>> Or, with -E, it can send an email or snmp alert.
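
As an illustration of the -E option discussed here, the following is a minimal sketch of what an external script might do. It assumes that crm_mon exports the event details as CRM_notify_* environment variables before invoking the script, which is how the Pacemaker 1.1 series behaves to my knowledge; check the crm_mon documentation for your version. The function names format_event and is_unexpected are hypothetical, and this is not the pcmk_snmp_helper.sh used in this thread.

```shell
#!/bin/sh
# Hypothetical external agent sketch for "crm_mon -E <script>".
# Assumption: crm_mon sets CRM_notify_node, CRM_notify_rsc, CRM_notify_task,
# CRM_notify_rc, CRM_notify_target_rc and CRM_notify_desc in the environment.

# Print a one-line summary of the event, with placeholders for unset values.
format_event() {
    printf 'node=%s rsc=%s task=%s rc=%s target_rc=%s desc=%s\n' \
        "${CRM_notify_node:-unknown}" \
        "${CRM_notify_rsc:-unknown}" \
        "${CRM_notify_task:-unknown}" \
        "${CRM_notify_rc:-?}" \
        "${CRM_notify_target_rc:-?}" \
        "${CRM_notify_desc:-}"
}

# Succeed only when the actual return code differs from the expected one,
# so that routine successful operations can be skipped.
is_unexpected() {
    [ "${CRM_notify_rc:-0}" != "${CRM_notify_target_rc:-0}" ]
}

# A script body could then forward only the unexpected events, e.g.:
#   is_unexpected && format_event | logger -t pcmk_event
```

Filtering on rc versus target_rc is a design choice for this sketch: it keeps the output quiet for successful operations while still reporting failures of other resources such as a VIP or a multi-state database.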
>>> 
>>>> If this is correct, what is the best practice for monitoring additional resource states?
>>> 
>>> Define "additional"?
>>> If the resource fails we'll normally recover it automatically.
>> An example of an additional resource would be a VIP using IPaddr2. I also have a multi-state pgsql resource, so if the resource fails, it will either try to restart or promote another node in the cluster to Master.
>> 
>> v/r
>> 
>> STEVE
>> 
>>> 
>>>> 
>>>> v/r
>>>> 
>>>> STEVE
>>>> 
>>>> 
>>>> Below are some additional data points. 
>>>> 
>>>> 
>>>> Creating the Resource
>>>> 
>>>> [root at pgdb2 tmp]# crm configure primitive SNMPMon ocf:pacemaker:ClusterMon \
>>>>>     params user="root" update="30" extra_options="-E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net" \
>>>>>     op monitor on-fail="restart" interval="60"
>>>> 
>>>> 
>>>> Manual crm_mon output
>>>> 
>>>> Last updated: Thu May  9 10:24:30 2013
>>>> Last change: Thu May  9 10:20:49 2013 via cibadmin on pgdb2.example.com
>>>> Stack: cman
>>>> Current DC: pgdb1.example.com - partition with quorum
>>>> Version: 1.1.8-7.el6-394e906
>>>> 3 Nodes configured, unknown expected votes
>>>> 6 Resources configured.
>>>> 
>>>> 
>>>> Node pgdb1.example.com: standby
>>>> Online: [ pgdb2.example.com pgdb3.example.com ]
>>>> 
>>>> PG_REP_VIP	(ocf::heartbeat:IPaddr2):	Started pgdb2.example.com
>>>> PG_CLI_VIP	(ocf::heartbeat:IPaddr2):	Started pgdb2.example.com
>>>> Master/Slave Set: msPGSQL [PGSQL]
>>>>  Masters: [ pgdb2.example.com ]
>>>>  Slaves: [ pgdb3.example.com ]
>>>>  Stopped: [ PGSQL:2 ]
>>>> SNMPMon	(ocf::pacemaker:ClusterMon):	Started pgdb3.example.com
>>>> 
>>>> PS to check for process on pgdb3
>>>> 
>>>> [root at pgdb3 tmp]# ps aux | grep crm_mon
>>>> root     16097  0.0  0.0  82624  2784 ?        S    10:20   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>>> root     16099  0.0  0.0  82624  2660 ?        S    10:20   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>>> root     16104  0.0  0.0  82624  2448 ?        S    10:20   0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
>>>> root     16515  0.0  0.0 103244   852 pts/0    S+   10:21   0:00 grep crm_mon
>>>> 
>>>> Output from corosync.log
>>>> 
>>>> May 09 10:20:51 [3100] pgdb3.cha.arin.net       lrmd:     info: process_lrmd_get_rsc_info:      Resource 'SNMPMon' not found (3 active resources)
>>>> May 09 10:20:51 [3100] pgdb3.cha.arin.net       lrmd:     info: process_lrmd_rsc_register:      Added 'SNMPMon' to the rsc list (4 active resources)
>>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net       crmd:     info: services_os_action_execute:     Managed ClusterMon_meta-data_0 process 16010 exited with rc=0
>>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net       crmd:   notice: process_lrm_event:      LRM operation SNMPMon_monitor_0 (call=61, rc=7, cib-update=28, confirmed=true) not running
>>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net       crmd:   notice: process_lrm_event:      LRM operation SNMPMon_start_0 (call=64, rc=0, cib-update=29, confirmed=true) ok
>>>> May 09 10:20:52 [3103] pgdb3.cha.arin.net       crmd:   notice: process_lrm_event:      LRM operation SNMPMon_monitor_60000 (call=67, rc=0, cib-update=30, confirmed=false) ok
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>> 
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>> 
>>> 