[Pacemaker] monitor operation stopped running

Andrew Beekhof andrew at beekhof.net
Thu Dec 16 07:27:51 UTC 2010


On Wed, Dec 15, 2010 at 8:30 AM, Chris Picton <chris at ecntelecoms.com> wrote:
> On Tue, 14 Dec 2010 18:55:06 +0100, Dejan Muhamedagic wrote:
>
>> Hi,
>>
>> On Tue, Dec 14, 2010 at 12:16:22PM +0200, Chris Picton wrote:
>>> Hi
>>>
>>> I have noticed this happening a few times on several of my clusters.
>>> The monitor operation for some resources stops running, so resource
>>> failures are not detected.  If I edit the CIB and change something
>>> about the resource (generally the monitor interval), monitoring starts
>>> again, the failure is detected, and the resource restarts correctly.
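
(For reference, a rough sketch of that manual nudge via the crm shell; the
resource/clone id below is a guess based on the logs, and the shell syntax
should be verified against your pacemaker build:)

    # open the clone (or primitive) definition in $EDITOR and bump the
    # recurring monitor interval, e.g. interval="30s" -> interval="31s";
    # committing the change makes the cluster reschedule the monitor
    crm configure edit megaswitch-clone
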
>>>
>>> I am using pacemaker 1.0.9 live, and 1.0.10 in test.
>>>
>>> This has happened with both clone and non-clone resources.
>>>
>>> I have attached a log which shows the behaviour.  I have a resource
>>> (megaswitch) running cloned over 6 nodes.
>>>
>>> Until 06:48:22, the monitor is running correctly (the app logs the
>>> "Deleting context for MONTEST-" line when the monitor is run).  After
>>> that, the monitor is not run again on this node.
>>>
>>> I have the logs for the other nodes, if they are needed to try and
>>> debug this.
>>
>> Nov 28 06:48:26 sbc-tpna2-01 crmd: [4863]: info: do_lrm_invoke: Removing
>> resource megaswitch:3 from the LRM
>> Nov 28 06:48:26 sbc-tpna2-01 crmd: [4863]: info: do_lrm_invoke: Resource
>> 'megaswitch:3' deleted for 19511_crm_resource on
>> sbc-tpna2-06.ecntelecoms.za.net
>> Nov 28 06:48:26 sbc-tpna2-01 crmd: [4863]: info: notify_deleted: Notifying
>> 19511_crm_resource on sbc-tpna2-06.ecntelecoms.za.net that megaswitch:3
>> was deleted
>>
>> Somebody/something on sbc-tpna2-06.ecntelecoms.za.net ran crm_resource
>> (or perhaps the crm shell) and removed megaswitch from LRM. Any
>> suspicious cron jobs over there?
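
(A quick, hedged way to look for such a job; the paths below are the usual
suspects and may differ on your distro:)

    # anything scheduled that calls crm_resource or the crm shell?
    crontab -l
    grep -r "crm_resource\|crm resource" /etc/cron* /var/spool/cron 2>/dev/null
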
>
> on sbc-tpna2-06
> ---------------
> Nov 28 06:48:19 sbc-tpna2-06 crm_resource: [19476]: info: Invoked:
> crm_resource -C -r group_megaswitch:0 -H sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:21 sbc-tpna2-06 crm_resource: [19482]: info: Invoked:
> crm_resource -C -r group_megaswitch:1 -H sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:24 sbc-tpna2-06 crm_resource: [19506]: info: Invoked:
> crm_resource -C -r group_megaswitch:2 -H sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:24 sbc-tpna2-06 crmd: [29893]: ERROR: send_msg_via_ipc:
> Unknown Sub-system (19482_crm_resource)... discarding message.
> Nov 28 06:48:24 sbc-tpna2-06 crmd: [29893]: ERROR: send_msg_via_ipc:
> Unknown Sub-system (19482_crm_resource)... discarding message.
> Nov 28 06:48:26 sbc-tpna2-06 crm_resource: [19511]: info: Invoked:
> crm_resource -C -r group_megaswitch:3 -H sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19512]: info: write_cib_contents:
> Archived previous version as /var/lib/heartbeat/crm/cib-21.raw
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19512]: info: write_cib_contents:
> Wrote version 0.232.0 of the CIB to disk (digest:
> 6aaa4d35d37a179b8f42c7045220690a)
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19512]: info: retrieveCib: Reading
> cluster configuration from: /var/lib/heartbeat/crm/cib.tmgWhm
> (digest: /var/lib/heartbeat/crm/cib.NqXOtl)
> Nov 28 06:48:27 sbc-tpna2-06 cib: [29889]: info: Managed
> write_cib_contents process 19512 exited with return code 0.
> Nov 28 06:48:27 sbc-tpna2-06 attrd: [29892]: info: attrd_ha_callback:
> flush message from sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19527]: info: write_cib_contents:
> Archived previous version as /var/lib/heartbeat/crm/cib-22.raw
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19527]: info: write_cib_contents:
> Wrote version 0.233.0 of the CIB to disk (digest:
> 8e39a0b125878ab28f8bed81789f5a59)
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19527]: info: retrieveCib: Reading
> cluster configuration from: /var/lib/heartbeat/crm/cib.mwt8EZ
> (digest: /var/lib/heartbeat/crm/cib.hZ74d0)
> Nov 28 06:48:27 sbc-tpna2-06 cib: [29889]: info: Managed
> write_cib_contents process 19527 exited with return code 0.
> Nov 28 06:48:28 sbc-tpna2-06 crm_resource: [19528]: info: Invoked:
> crm_resource -C -r group_megaswitch:4 -H sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:30 sbc-tpna2-06 crm_resource: [19534]: info: Invoked:
> crm_resource -C -r group_megaswitch:5 -H sbc-tpna2-01.ecntelecoms.za.net
>
>
> It looks like a 'crm resource cleanup megaswitch-clone' command was
> executed
>
> The other nodes all log similar entries:
> ---
> sbc-tpna2-05.ecntelecoms.za.net.16.small:Nov 28 06:49:17 sbc-tpna2-05
> crmd: [30350]: info: do_lrm_invoke: Removing resource megaswitch:4 from
> the LRM
> sbc-tpna2-05.ecntelecoms.za.net.16.small-Nov 28 06:49:17 sbc-tpna2-05
> crmd: [30350]: info: do_lrm_invoke: Resource 'megaswitch:4' deleted for
> 19697_crm_resource on sbc-tpna2-06.ecntelecoms.za.net
> sbc-tpna2-05.ecntelecoms.za.net.16.small-Nov 28 06:49:17 sbc-tpna2-05
> crmd: [30350]: info: notify_deleted: Notifying 19697_crm_resource on
> sbc-tpna2-06.ecntelecoms.za.net that megaswitch:4 was deleted
> --
>
>
> So I have 2 questions:
> 1) Why would a resource cleanup remove the resource from the lrm, even
> though it is still running correctly,

That's what cleanup does.
What is supposed to happen next, however, is that the cluster runs a
one-off (non-recurring) monitor operation to re-determine the resource's
current state and goes on from there.
Also, any recurring actions should have been cancelled at the point
the resource was removed from the LRM.
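
In theory, forcing a re-probe of the affected node should get both the probe
and the recurring monitor scheduled again. A minimal sketch, assuming the
-P/--reprobe and -H options of your crm_resource behave as on stock 1.0;
worth testing first on a node that is known to be in the broken state:

    # re-detect resource state on one node; a successful probe should be
    # followed by the recurring monitor being re-added
    crm_resource --reprobe -H sbc-tpna2-01.ecntelecoms.za.net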

What versions of pacemaker and cluster-glue do you have?  Distro?

> and the monitor operations are
> succeeding?
> 2) How can I programmatically detect and fix this state, so I can put a
> cron job in place for now to 'fix' it?
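
Not something confirmed in this thread, but as a stopgap you could cron a
check like the sketch below. It assumes two things that should be verified
on an affected node first: that in the broken state the node's lrm section
of the CIB status no longer lists a recurring (non-zero interval) monitor
op for the resource, and that crm_resource --reprobe with -H re-triggers
the probe and monitor as described above. Resource and node names are
examples only.

    #!/bin/sh
    # Cron-able check: if this node's lrm history in the CIB no longer shows
    # a recurring monitor for the resource, force a re-probe of the node.
    RSC="megaswitch"          # base resource name (clone instances match too)
    NODE=$(uname -n)          # must match the cluster node name (FQDN here)

    if ! cibadmin -Q -o status 2>/dev/null \
         | sed -n "/uname=\"$NODE\"/,/<\/node_state>/p" \
         | grep "operation=\"monitor\"" \
         | grep -v "interval=\"0\"" \
         | grep -q "$RSC"; then
        logger -t monitor-check "no recurring monitor for $RSC on $NODE, reprobing"
        crm_resource --reprobe -H "$NODE"
    fi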
>
> Thanks for the help
>
> Chris
>
>



