[ClusterLabs] Strange monitor return code log for LSB resource

Thu Jul 4 23:50:14 EDT 2019

Thanks Jan for taking the time to provide such a detailed response. I was able to resolve an issue which caused these logs to disappear. The problem was that the LSB script was named "logging" and the daemon that it controlled was also called "logging". The init script uses "start-stop-daemon --name" to start and stop the logging daemon, but "start-stop-daemon --stop --name logging" was causing both the init script and daemon to be terminated, and hence an error code would be returned for the stop operation.

I have no idea why crm_resource logs were produced, or why the logs seemed to indicate that that it was treating an LSB resource as OCF (expected return code of 7 - not running), but the logs were no longer generated once I renamed the init script. I've not seen them since. There were definitely no external monitoring tools installed or in use (it's a very isolated system with very few resources installed), although I was using crm_monitor at the time. Does that run crm_resource under the hood?

Regards,
Harvey

________________________________________
From: Harvey Shepherd
Sent: Wednesday, 26 June 2019 9:26 a.m.
To: users at clusterlabs.org
Subject: Re: EXTERNAL: Re: [ClusterLabs] Strange monitor return code log for LSB resource

Thanks for your reply Andrei. There is no external monitoring software running. The logs that I posted are from the pacemaker log with debug enabled. The Linux hostname of the node is "ctr_qemu", but the Pacemaker name is "primary". It is actually the same node.

I've just run the same test again and interestingly this time the source of the monitor logs seems to be shared between "crm_resource" and "pacemaker-schedulerd"

Jun 25 21:18:31 ctr_qemu pacemaker-execd     [1234] (recurring_action_timer)    debug: Scheduling another invocation of logging_status_10000
Jun 25 21:18:31 ctr_qemu pacemaker-execd     [1234] (operation_finished)        debug: logging_status_10000:1181 - exited with rc=0
Jun 25 21:18:31 ctr_qemu pacemaker-execd     [1234] (operation_finished)        debug: logging_status_10000:1181:stdout [ Remote syslog service is running ]
Jun 25 21:18:31 ctr_qemu pacemaker-execd     [1234] (log_finished)      debug: finished - rsc:logging action:monitor call_id:127 pid:1181 exit-code:0 exec-time:0ms queue-time:0ms
Jun 25 21:18:35 ctr_qemu crm_resource        [1268] (determine_op_status)       debug: logging_monitor_10000 on primary returned 'not running' (7) instead of the expected value: 'ok' (0)
Jun 25 21:18:35 ctr_qemu crm_resource        [1268] (unpack_rsc_op_failure)     warning: Processing failed monitor of logging on primary: not running | rc=7
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (determine_op_status)       debug: logging_monitor_10000 on primary returned 'not running' (7) instead of the expected value: 'ok' (0)
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (unpack_rsc_op_failure)     warning: Processing failed monitor of logging on primary: not running | rc=7
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (common_print)      info: logging   (lsb:logging):      Started primary
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (common_apply_stickiness)   debug: Resource logging: preferring current location (node=primary, weight=1)
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (native_assign_node)        debug: Assigning primary to logging
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (LogActions)        info: Leave   logging   (Started primary)

Regards,
Harvey

________________________________________
From: Users <users-bounces at clusterlabs.org> on behalf of Andrei Borzenkov <arvidjaar at gmail.com>
Sent: Wednesday, 26 June 2019 4:47 a.m.
To: users at clusterlabs.org
Subject: EXTERNAL: Re: [ClusterLabs] Strange monitor return code log for LSB resource

25.06.2019 16:53, Harvey Shepherd пишет:
> Hi All,
>
>
> I have a 2 node cluster running under Pacemaker 2.0.2, with around 20 resources configured, the majority of which are LSB resources, but there are also a few OCF ones. One of the LSB resources is controlled via an init script called "logging" and runs only on the master node. The CIB configuration for it is as follows:
>
>
>     <primitive id="logging" class="lsb" type="logging">
>       <operations>
>         <op name="monitor" interval="10s" id="logging-monitor-10s"/>
>         <op name="start" interval="0" id="logging-start-30s"/>
>         <op name="stop" interval="0" on-fail="restart" id="logging-stop-30s"/>
>       </operations>
>     </primitive>
>
>
> There is a global setting which sets the default timeout:
>
>
>   <op_defaults>
>     <meta_attributes id="op-options">
>       <nvpair name="timeout" value="30s" id="op-options-timeout"/>
>     </meta_attributes>
>   </op_defaults>
>
>
> All of the other LSB resources are configured in the same way, and seem to work correctly, but for some reason I see the following recurring logs for the logging resource:
>
>
> Jun 25 13:16:22 ctr_qemu pacemaker-execd     [1234] (recurring_action_timer)    debug: Scheduling another invocation of logging_status_10000
> Jun 25 13:16:22 ctr_qemu pacemaker-execd     [1234] (operation_finished)        debug: logging_status_10000:8423 - exited with rc=0
> Jun 25 13:16:22 ctr_qemu pacemaker-execd     [1234] (operation_finished)        debug: logging_status_10000:8423:stdout [ Remote syslog service is running ]
> Jun 25 13:16:22 ctr_qemu pacemaker-execd     [1234] (log_finished)      debug: finished - rsc:logging action:monitor call_id:123 pid:8423 exit-code:0 exec-time:0ms queue-time:0ms
> Jun 25 13:16:23 ctr_qemu crm_resource        [8436] (determine_op_status)       debug: logging_monitor_10000 on primary returned 'not running' (7) instead of the expected value: 'ok' (0)

This is not from lrmd but from crm_resource command; also it says "on
primary" while log is on ctr_qemu. Do you have some monitoring tool that
calls crm_resource?

> Jun 25 13:16:23 ctr_qemu crm_resource        [8436] (unpack_rsc_op_failure)     warning: Processing failed monitor of logging on primary: not running | rc=7
> Jun 25 13:16:30 ctr_qemu crm_resource        [8571] (determine_op_status)       debug: logging_monitor_10000 on primary returned 'not running' (7) instead of the expected value: 'ok' (0)
> Jun 25 13:16:30 ctr_qemu crm_resource        [8571] (unpack_rsc_op_failure)     warning: Processing failed monitor of logging on primary: not running | rc=7
> Jun 25 13:16:32 ctr_qemu pacemaker-execd     [1234] (recurring_action_timer)    debug: Scheduling another invocation of logging_status_10000
> Jun 25 13:16:32 ctr_qemu pacemaker-execd     [1234] (operation_finished)        debug: logging_status_10000:8670 - exited with rc=0
> Jun 25 13:16:32 ctr_qemu pacemaker-execd     [1234] (operation_finished)        debug: logging_status_10000:8670:stdout [ Remote syslog service is running ]
> Jun 25 13:16:32 ctr_qemu pacemaker-execd     [1234] (log_finished)      debug: finished - rsc:logging action:monitor call_id:123 pid:8670 exit-code:0 exec-time:0ms queue-time:0ms
> Jun 25 13:16:33 ctr_qemu crm_resource        [8683] (determine_op_status)       debug: logging_monitor_10000 on primary returned 'not running' (7) instead of the expected value: 'ok' (0)
> Jun 25 13:16:33 ctr_qemu crm_resource        [8683] (unpack_rsc_op_failure)     warning: Processing failed monitor of logging on primary: not running | rc=7
> Jun 25 13:16:40 ctr_qemu crm_resource        [8818] (determine_op_status)       debug: logging_monitor_10000 on primary returned 'not running' (7) instead of the expected value: 'ok' (0)
> Jun 25 13:16:40 ctr_qemu crm_resource        [8818] (unpack_rsc_op_failure)     warning: Processing failed monitor of logging on primary: not running | rc=7
>
> Pacemaker is reporting failed resource actions, but fail-count is not incremented for the resource:
>
>
> Migration Summary:
> * Node primary:
> * Node secondary:
>
> Failed Resource Actions:
> * logging_monitor_10000 on primary 'not running' (7): call=119, status=complete, exitreason='',
>     last-rc-change='Tue Jun 25 13:13:12 2019', queued=0ms, exec=0ms
>
>
> I have checked the operation of the LSB script manually and it always correctly exits with a return code of 0 when I run it manually, indicating that the resource is running. So my questions are:
>
>
> 1. Why does Pacemaker seem to be running a monitor operation in parallel with a status operation, with conflicting results? A monitor operation returning 7 "not running" would only make sense for an OCF resource, but it is clearly defined as LSB in the CIB.
>
> 2. Why does the status operation always return 0 (running) and the monitor operation always returns 7 (not running)?
>
> 2. Why is fail-count not being incremented even though failures are being logged?
>
>
> I would really appreciate any pointers that anyone could give me. Perhaps I've made an error in the configuration.
>
>
>
> Thanks,
>
> Harvey Shepherd
>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/