[ClusterLabs] Corosync/pacemaker seeing monitored process as FAILED
Klaus Wenninger
kwenning at redhat.com
Mon Jan 2 00:43:15 EST 2017
Hi Suresh!
Have you tried lsb-status in a shell?
Does it show anything interesting or is it hanging?
Regards,
Klaus
On 12/30/2016 08:45 AM, Suresh Rajagopalan wrote:
> Cluster running centos 6.8 with pacemaker/corosync. This config was
> running well for quite sometime. All of a sudden we see regular
> restarts of the monitored process where corosync thinks it has
> failed(even though it really has not failed). I am showing the
> relevant logs and config below. Any pointers appreciated as it is not
> clear why this would occur.
>
> Thanks
> Suresh
>
> Dec 28 13:18:20 [2198] a.b.com <http://a.b.com> pengine: info:
> LogActions: Leave mycustomprog (Started a.b.com <http://a.b.com>)
> Dec 28 13:22:03 [2199] a.b.com <http://a.b.com> crmd: info:
> process_lrm_event: Operation mycustomprog_monitor_10000: not running
> (node=a.b.com <http://a.b.com>, call=29, rc=7, cib-update=1427,
> confirmed=false)
> Dec 28 13:22:03 [2199] a.b.com <http://a.b.com> crmd: notice:
> process_lrm_event: a.b.com-mycustomprog_monitor_10000:29 [
> mycustomprogram (pid 15657) is running...\n ]
> Dec 28 13:22:03 [2194] a.b.com <http://a.b.com> cib: info:
> cib_perform_op: ++ /cib/status/node_state[@id='a.b.com
> <http://a.b.com>']/lrm[@id='a.b.com
> <http://a.b.com>']/lrm_resources/lrm_resource[@id='mycustomprog']:
> <lrm_rsc_op id="mycustomprog_last_failure_0"
> operation_key="mycustomprog_monitor_10000" operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10"
> transition-key="7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7"
> transition-magic="0
> Dec 28 13:22:03 [2199] a.b.com <http://a.b.com> crmd: info:
> abort_transition_graph: Transition aborted by
> mycustomprog_monitor_10000 'create' on a.b.com <http://a.b.com>: Old
> event (magic=0:7;7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7,
> cib=0.48.2038786, source=process_graph_event:605, 1)
> Dec 28 13:22:03 [2199] a.b.com <http://a.b.com> crmd: info:
> update_failcount: Updating failcount for mycustomprog on a.b.com
> <http://a.b.com> after failed monitor: rc=7 (update=value++,
> time=1482931323)
> Dec 28 13:22:03 [2199] a.b.com <http://a.b.com> crmd: info:
> process_graph_event: Detected action (462.7)
> mycustomprog_monitor_10000.29=not running: failed
> Dec 28 13:22:03 [2197] a.b.com <http://a.b.com> attrd: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-mycustomprog (1)
> Dec 28 13:22:03 [2197] a.b.com <http://a.b.com> attrd: notice:
> attrd_perform_update: Sent update 18: fail-count-mycustomprog=1
> Dec 28 13:22:03 [2194] a.b.com <http://a.b.com> cib: info:
> cib_perform_op: ++ /cib/status/node_state[@id='a.b.com
> <http://a.b.com>']/transient_attributes[@id='a.b.com
> <http://a.b.com>']/instance_attributes[@id='status-a.b.com
> <http://status-a.b.com>']: <nvpair
> id="status-a.b.com-fail-count-mycustomprog"
> name="fail-count-mycustomprog" value="1"/>
> Dec 28 13:22:03 [2197] a.b.com <http://a.b.com> attrd: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> last-failure-mycustomprog (1482931323)
> Dec 28 13:22:03 [2197] a.b.com <http://a.b.com> attrd: notice:
> attrd_perform_update: Sent update 20:
> last-failure-mycustomprog=1482931323
> Dec 28 13:22:03 [2197] a.b.com <http://a.b.com> attrd: notice:
> attrd_perform_update: Sent update 20:
> last-failure-mycustomprog=1482931323
> Dec 28 13:22:03 [2194] a.b.com <http://a.b.com> cib: info:
> cib_perform_op: ++ /cib/status/node_state[@id='a.b.com
> <http://a.b.com>']/transient_attributes[@id='a.b.com
> <http://a.b.com>']/instance_attributes[@id='status-a.b.com
> <http://status-a.b.com>']: <nvpair
> id="status-a.b.com-last-failure-mycustomprog"
> name="last-failure-mycustomprog" value="1482931323"/>
> Dec 28 13:22:04 [2199] a.b.com <http://a.b.com> crmd: info:
> abort_transition_graph: Transition aborted by
> status-a.b.com-fail-count-mycustomprog, fail-count-mycustomprog=1:
> Transient attribute change (create cib=0.48.2038787,
> source=abort_unless_down:329, path=/cib/status/node_state[@id='a.b.com
> <http://a.b.com>']/transient_attributes[@id='a.b.com
> <http://a.b.com>']/instance_attributes[@id='status-a.b.com
> <http://status-a.b.com>
> Dec 28 13:22:04 [2199] a.b.com <http://a.b.com> crmd: info:
> abort_transition_graph: Transition aborted by
> status-a.b.com-last-failure-mycustomprog,
> last-failure-mycustomprog=1482931323: Transient attribute change
> (create cib=0.48.2038788, source=abort_unless_down:329,
> path=/cib/status/node_state[@id='a.b.com
> <http://a.b.com>']/transient_attributes[@id='a.b.com
> <http://a.b.com>']/instance_attributes[@id='status-macshii00002-hva.gs.r11.
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: warning:
> unpack_rsc_op_failure: Processing failed op monitor for
> mycustomprog on a.b.com <http://a.b.com>: not running (7)
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: info:
> native_print: mycustomprog (lsb:mycustomprog):
> FAILED a.b.com <http://a.b.com>
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: info:
> get_failcount_full: mycustomprog has failed 1 times on a.b.com
> <http://a.b.com>
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: info:
> common_apply_stickiness: mycustomprog can fail 999999 more times
> on a.b.com <http://a.b.com> before being forced off
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: info:
> RecurringOp: Start recurring monitor (10s) for mycustomprog on
> a.b.com <http://a.b.com>
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: notice:
> LogActions: Recover mycustomprog (Started a.b.com <http://a.b.com>)
> Dec 28 13:22:04 [2199] a.b.com <http://a.b.com> crmd: notice:
> te_rsc_command: Initiating action 5: stop mycustomprog_stop_0 on
> a.b.com <http://a.b.com> (local)
> Dec 28 13:22:04 [2196] a.b.com <http://a.b.com> lrmd: info:
> cancel_recurring_action: Cancelling lsb operation
> mycustomprog_status_10000Dec 28 13:18:20 [2198] a.b.com
> <http://a.b.com> pengine: info: LogActions: Leave
> mycustomprog (Started a.b.com <http://a.b.com>)
> Dec 28 13:22:03 [2199] a.b.com <http://a.b.com> crmd: info:
> process_lrm_event: Operation mycustomprog_monitor_10000: not running
> (node=a.b.com <http://a.b.com>, call=29, rc=7, cib-update=1427,
> confirmed=false)
> Dec 28 13:22:03 [2199] a.b.com <http://a.b.com> crmd: notice:
> process_lrm_event: a.b.com-mycustomprog_monitor_10000:29 [
> mycustomprogram (pid 15657) is running...\n ]
> Dec 28 13:22:03 [2194] a.b.com <http://a.b.com> cib: info:
> cib_perform_op: ++ /cib/status/node_state[@id='a.b.com
> <http://a.b.com>']/lrm[@id='a.b.com
> <http://a.b.com>']/lrm_resources/lrm_resource[@id='mycustomprog']:
> <lrm_rsc_op id="mycustomprog_last_failure_0"
> operation_key="mycustomprog_monitor_10000" operation="monitor"
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10"
> transition-key="7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7"
> transition-magic="0
> Dec 28 13:22:03 [2199] a.b.com <http://a.b.com> crmd: info:
> abort_transition_graph: Transition aborted by
> mycustomprog_monitor_10000 'create' on a.b.com <http://a.b.com>: Old
> event (magic=0:7;7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7,
> cib=0.48.2038786, source=process_graph_event:605, 1)
> Dec 28 13:22:03 [2199] a.b.com <http://a.b.com> crmd: info:
> update_failcount: Updating failcount for mycustomprog on a.b.com
> <http://a.b.com> after failed monitor: rc=7 (update=value++,
> time=1482931323)
> Dec 28 13:22:03 [2199] a.b.com <http://a.b.com> crmd: info:
> process_graph_event: Detected action (462.7)
> mycustomprog_monitor_10000.29=not running: failed
> Dec 28 13:22:03 [2197] a.b.com <http://a.b.com> attrd: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-mycustomprog (1)
> Dec 28 13:22:03 [2197] a.b.com <http://a.b.com> attrd: notice:
> attrd_perform_update: Sent update 18: fail-count-mycustomprog=1
> Dec 28 13:22:03 [2194] a.b.com <http://a.b.com> cib: info:
> cib_perform_op: ++ /cib/status/node_state[@id='a.b.com
> <http://a.b.com>']/transient_attributes[@id='a.b.com
> <http://a.b.com>']/instance_attributes[@id='status-a.b.com
> <http://status-a.b.com>']: <nvpair
> id="status-a.b.com-fail-count-mycustomprog"
> name="fail-count-mycustomprog" value="1"/>
> Dec 28 13:22:03 [2197] a.b.com <http://a.b.com> attrd: notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> last-failure-mycustomprog (1482931323)
> Dec 28 13:22:03 [2197] a.b.com <http://a.b.com> attrd: notice:
> attrd_perform_update: Sent update 20:
> last-failure-mycustomprog=1482931323
> Dec 28 13:22:03 [2197] a.b.com <http://a.b.com> attrd: notice:
> attrd_perform_update: Sent update 20:
> last-failure-mycustomprog=1482931323
> Dec 28 13:22:03 [2194] a.b.com <http://a.b.com> cib: info:
> cib_perform_op: ++ /cib/status/node_state[@id='a.b.com
> <http://a.b.com>']/transient_attributes[@id='a.b.com
> <http://a.b.com>']/instance_attributes[@id='status-a.b.com
> <http://status-a.b.com>']: <nvpair
> id="status-a.b.com-last-failure-mycustomprog"
> name="last-failure-mycustomprog" value="1482931323"/>
> Dec 28 13:22:04 [2199] a.b.com <http://a.b.com> crmd: info:
> abort_transition_graph: Transition aborted by
> status-a.b.com-fail-count-mycustomprog, fail-count-mycustomprog=1:
> Transient attribute change (create cib=0.48.2038787,
> source=abort_unless_down:329, path=/cib/status/node_state[@id='a.b.com
> <http://a.b.com>']/transient_attributes[@id='a.b.com
> <http://a.b.com>']/instance_attributes[@id='status-a.b.com
> <http://status-a.b.com>
> Dec 28 13:22:04 [2199] a.b.com <http://a.b.com> crmd: info:
> abort_transition_graph: Transition aborted by
> status-a.b.com-last-failure-mycustomprog,
> last-failure-mycustomprog=1482931323: Transient attribute change
> (create cib=0.48.2038788, source=abort_unless_down:329,
> path=/cib/status/node_state[@id='a.b.com
> <http://a.b.com>']/transient_attributes[@id='a.b.com
> <http://a.b.com>']/instance_attributes[@id='status-a.b.com
> <http://status-a.b.com>
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: warning:
> unpack_rsc_op_failure: Processing failed op monitor for
> mycustomprog on a.b.com <http://a.b.com>: not running (7)
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: info:
> native_print: mycustomprog (lsb:mycustomprog):
> FAILED a.b.com <http://a.b.com>
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: info:
> get_failcount_full: mycustomprog has failed 1 times on a.b.com
> <http://a.b.com>
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: info:
> common_apply_stickiness: mycustomprog can fail 999999 more times
> on a.b.com <http://a.b.com> before being forced off
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: info:
> RecurringOp: Start recurring monitor (10s) for mycustomprog on
> a.b.com <http://a.b.com>
> Dec 28 13:22:04 [2198] a.b.com <http://a.b.com> pengine: notice:
> LogActions: Recover mycustomprog (Started a.b.com <http://a.b.com>)
> Dec 28 13:22:04 [2199] a.b.com <http://a.b.com> crmd: notice:
> te_rsc_command: Initiating action 5: stop mycustomprog_stop_0 on
> a.b.com <http://a.b.com> (local)
> Dec 28 13:22:04 [2196] a.b.com <http://a.b.com> lrmd: info:
> cancel_recurring_action: Cancelling lsb operation
> mycustomprog_status_10000
>
>
>
> pcs config:
>
> source settings.rc
>
> pcs property set stonith-enabled=false
> pcs property set no-quorum-policy=ignore
> pcs resource create ClusterIP2 IPaddr2 ip=$MYVIP cidr_netmask=$NETMASKVIP1
> pcs resource create ClusterIP3 IPaddr2 ip=$MYVIP2
> cidr_netmask=$NETMASKVIP2
> pcs resource create mycustomprog lsb:mycustomprog op monitor
> interval="10s"
> pcs constraint colocation add ClusterIP3 with ClusterIP2 INFINITY
> pcs constraint colocation add mycustomprog with ClusterIP2 INFINITY
> pcs property set start-failure-is-fatal=false
> pcs resource defaults resource-stickiness=100
> pcs constraint colocation add chkhealth with ClusterIP2 INFINITY
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
More information about the Users
mailing list