[Pacemaker] monitor operation stopped running

Mon Jan 17 04:42:57 EST 2011

On Fri, Dec 17, 2010 at 10:56 AM, Chris Picton <chris at ecntelecoms.com> wrote:
> On Thu, 16 Dec 2010 08:27:51 +0100, Andrew Beekhof wrote:
>
>> On Wed, Dec 15, 2010 at 8:30 AM, Chris Picton
>
>>> Why would a resource cleanup remove the resource from the lrm, even
>>> though it is still running correctly,
>>
>> That's what cleanup does.
>> What is supposed to happen next however, is that the cluster runs a
>> non-recurring monitor operation to re-determine the current state of the
>> cluster and go from there.
>> Also, any recurring actions should have been cancelled at the point the
>> resource was removed from the lrm.
>>
>> What versions of pacemaker and cluster-glue do you have?  Distro?
>>
>
> I am using the clusterlabs rpms
> pacemaker-1.0.9.1-1.15.el5
> cluster-glue-1.0.6-1.6.el5
>
> I see the following in the output of crm_mon -rf1t (I'm only showing the
> resources which are showing rc != 0)
> * Node sbc-tpna2-06.ecntelecoms.za.net:  pingd=100
>   megaswitch:5: migration-threshold=1000000
>    + (53) probe: last-rc-change='Fri Nov 26 09:17:38 2010' last-run='Fri
> Nov 26 09:17:38 2010' exec-time=30ms queue-time=0ms rc=1 (unknown error)
>    + (55) stop: last-rc-change='Fri Nov 26 09:17:41 2010' last-run='Fri
> Nov 26 09:17:41 2010' exec-time=20ms queue-time=0ms rc=0 (ok)
>    + (56) start: last-rc-change='Fri Nov 26 09:17:42 2010' last-run='Fri
> Nov 26 09:17:42 2010' exec-time=1040ms queue-time=0ms rc=0 (ok)
>    + (57) monitor: interval=8000ms last-rc-change='Fri Nov 26 09:17:44
> 2010' last-run='Fri Nov 26 09:17:44 2010' exec-time=260ms queue-time=0ms
> rc=0 (ok)
> * Node sbc-tpna2-05.ecntelecoms.za.net:  pingd=100
>   megaswitch:4: migration-threshold=1000000
>    + (58) probe: last-rc-change='Fri Nov 26 09:17:38 2010' last-run='Fri
> Nov 26 09:17:38 2010' exec-time=30ms queue-time=0ms rc=1 (unknown error)
>    + (60) stop: last-rc-change='Fri Nov 26 09:17:41 2010' last-run='Fri
> Nov 26 09:17:41 2010' exec-time=20ms queue-time=0ms rc=0 (ok)
>    + (61) start: last-rc-change='Fri Nov 26 09:17:42 2010' last-run='Fri
> Nov 26 09:17:42 2010' exec-time=1040ms queue-time=0ms rc=0 (ok)
>    + (62) monitor: interval=8000ms last-rc-change='Fri Nov 26 09:17:44
> 2010' last-run='Fri Nov 26 09:17:44 2010' exec-time=260ms queue-time=0ms
> rc=0 (ok)
>
> Would this affect the result of the 'non-recurring monitor
> operation' (the probe operations having rc=1)

Definitely.  They tell us the resource is unhealthy and needs to be stopped.
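
To make that concrete, here is a rough sketch of how a probe result can be
read, using the standard OCF exit codes (0 = success, 1 = generic error,
7 = not running). The function name interpret_probe is made up for
illustration and is not part of Pacemaker; the master/slave codes are left
out to keep it short.

```python
# Standard OCF resource agent exit codes (subset).
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def interpret_probe(rc):
    """Classify the exit code of a non-recurring monitor (probe).

    Hypothetical helper for illustration only; simplified to three outcomes.
    """
    if rc == OCF_SUCCESS:
        return "running"   # resource is active on this node
    if rc == OCF_NOT_RUNNING:
        return "stopped"   # resource is cleanly stopped
    return "failed"        # any other code means the resource needs recovery


# The probes quoted above exited with rc=1, so they count as failures.
print(interpret_probe(1))
```

This matches the operation history quoted above: the probe returned rc=1 and
was followed by a stop and then a start.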

>
> I am not 100% sure why the errors are there - the log on the server for
> that day shows:
> ----
> Nov 26 09:17:39 sbc-tpna2-06 crmd: [29893]: info: do_lrm_rsc_op:
> Performing key=36:2184:7:c83a06e0-913e-4546-92e5-19f784dcaf5c
> op=megaswitch:5_monitor_0 )
> Nov 26 09:17:39 sbc-tpna2-06 lrmd: [29890]: info: rsc:megaswitch:5:53:
> probe
> Nov 26 09:17:39 sbc-tpna2-06 lrmd: [29890]: WARN: Managed
> megaswitch:5:monitor process 24823 exited with return code 1.
> Nov 26 09:17:39 sbc-tpna2-06 lrmd: [29890]: WARN: Managed
> megaswitch:5:monitor process 24823 exited with return code 1.
> Nov 26 09:17:39 sbc-tpna2-06 crmd: [29893]: info: process_lrm_event: LRM
> operation megaswitch:5_monitor_0 (call=53, rc=1, cib-update=68,
> confirmed=true) unknown error
> ----
>
> If they are affecting it, how would I clear them, so pacemaker sees
> everything as OK?

Clearing them won't help, because we'll just go and check the status
again - which will fail again.
You need to fix the agent.
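
As a minimal sketch of what a well-behaved monitor action could look like,
assuming the service writes a pidfile (the pidfile argument and the function
name monitor are assumptions, not taken from the megaswitch agent): the point
is to return 7 when the service is simply not running, and to keep 1 for
genuine errors.

```python
import os

# Standard OCF resource agent exit codes (subset).
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor(pidfile):
    """Return an OCF exit code for a service tracked by a pidfile.

    Hypothetical sketch: a real agent would usually also check that the
    service is actually answering requests.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return OCF_NOT_RUNNING      # no pidfile: treat as cleanly stopped
    except ValueError:
        return OCF_ERR_GENERIC      # pidfile exists but is unreadable garbage

    try:
        os.kill(pid, 0)             # signal 0 only checks that the pid exists
    except ProcessLookupError:
        return OCF_NOT_RUNNING      # stale pidfile, process is gone
    except PermissionError:
        return OCF_SUCCESS          # process exists but belongs to another user
    return OCF_SUCCESS
```

With something along these lines, a probe on a node where the service is not
running reports rc=7 instead of rc=1, so the cluster does not treat it as a
failure.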

>
> Thanks for the help
>
> Chris