[ClusterLabs] Corosync/pacemaker seeing monitored process as FAILED

Suresh Rajagopalan suresh at hedviginc.com
Mon Jan 2 11:32:53 EST 2017


Hi Klaus

"service mycustomprog status" returns fine, that is, no errors. It does not
hang.
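For what it's worth, Pacemaker's lsb resource class judges the status action purely by its exit code, never by what it prints, so a script that prints "is running" can still be reported as not running if it exits non-zero. A minimal sketch of the distinction (the "service mycustomprog status" call is the hypothetical check from this thread; "true"/"false" stand in for a status action):

```shell
# Sketch only: the lsb class reads the exit code of "status"
# (0 = running, 3 = stopped per the LSB spec) and ignores stdout.
status_rc() {
    "$@" >/dev/null 2>&1   # discard output, keep only the exit code
    echo $?
}

# On the node one would run:  service mycustomprog status; echo $?
# Here 'true' stands in for a status action that exits 0 (healthy),
# 'false' for one that exits non-zero despite printing "running".
echo "healthy rc=$(status_rc true)"
echo "failing rc=$(status_rc false)"
```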

Suresh


On Sun, Jan 1, 2017 at 9:43 PM, Klaus Wenninger <kwenning at redhat.com> wrote:

> Hi Suresh!
>
> Have you tried lsb-status in a shell?
> Does it show anything interesting or is it hanging?
>
> Regards,
> Klaus
>
> On 12/30/2016 08:45 AM, Suresh Rajagopalan wrote:
> > Cluster running CentOS 6.8 with pacemaker/corosync. This config had been
> > running well for quite some time. All of a sudden we see regular
> > restarts of the monitored process, where corosync thinks it has
> > failed (even though it really has not). The relevant logs and config
> > are below. Any pointers appreciated, as it is not clear why this
> > would occur.
> >
> > Thanks
> > Suresh
> >
> > Dec 28 13:18:20 [2198] a.b.com    pengine:     info: LogActions:  Leave   mycustomprog       (Started a.b.com)
> > Dec 28 13:22:03 [2199] a.b.com       crmd:     info: process_lrm_event:   Operation mycustomprog_monitor_10000: not running (node=a.b.com, call=29, rc=7, cib-update=1427, confirmed=false)
> > Dec 28 13:22:03 [2199] a.b.com       crmd:   notice: process_lrm_event:   a.b.com-mycustomprog_monitor_10000:29 [ mycustomprogram (pid  15657) is running...\n ]
> > Dec 28 13:22:03 [2194] a.b.com        cib:     info: cib_perform_op:      ++ /cib/status/node_state[@id='a.b.com']/lrm[@id='a.b.com']/lrm_resources/lrm_resource[@id='mycustomprog']:  <lrm_rsc_op id="mycustomprog_last_failure_0" operation_key="mycustomprog_monitor_10000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7" transition-magic="0
> > Dec 28 13:22:03 [2199] a.b.com       crmd:     info: abort_transition_graph:      Transition aborted by mycustomprog_monitor_10000 'create' on a.b.com: Old event (magic=0:7;7:462:0:a9dbbd47-975b-4aee-8b4a-de56e0a8e7a7, cib=0.48.2038786, source=process_graph_event:605, 1)
> > Dec 28 13:22:03 [2199] a.b.com       crmd:     info: update_failcount:    Updating failcount for mycustomprog on a.b.com after failed monitor: rc=7 (update=value++, time=1482931323)
> > Dec 28 13:22:03 [2199] a.b.com       crmd:     info: process_graph_event: Detected action (462.7) mycustomprog_monitor_10000.29=not running: failed
> > Dec 28 13:22:03 [2197] a.b.com      attrd:   notice: attrd_trigger_update:        Sending flush op to all hosts for: fail-count-mycustomprog (1)
> > Dec 28 13:22:03 [2197] a.b.com      attrd:   notice: attrd_perform_update:        Sent update 18: fail-count-mycustomprog=1
> > Dec 28 13:22:03 [2194] a.b.com        cib:     info: cib_perform_op:      ++ /cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com']:  <nvpair id="status-a.b.com-fail-count-mycustomprog" name="fail-count-mycustomprog" value="1"/>
> > Dec 28 13:22:03 [2197] a.b.com      attrd:   notice: attrd_trigger_update:        Sending flush op to all hosts for: last-failure-mycustomprog (1482931323)
> > Dec 28 13:22:03 [2197] a.b.com      attrd:   notice: attrd_perform_update:        Sent update 20: last-failure-mycustomprog=1482931323
> > Dec 28 13:22:03 [2197] a.b.com      attrd:   notice: attrd_perform_update:        Sent update 20: last-failure-mycustomprog=1482931323
> > Dec 28 13:22:03 [2194] a.b.com        cib:     info: cib_perform_op:      ++ /cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com']:  <nvpair id="status-a.b.com-last-failure-mycustomprog" name="last-failure-mycustomprog" value="1482931323"/>
> > Dec 28 13:22:04 [2199] a.b.com       crmd:     info: abort_transition_graph:      Transition aborted by status-a.b.com-fail-count-mycustomprog, fail-count-mycustomprog=1: Transient attribute change (create cib=0.48.2038787, source=abort_unless_down:329, path=/cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com
> > Dec 28 13:22:04 [2199] a.b.com       crmd:     info: abort_transition_graph:      Transition aborted by status-a.b.com-last-failure-mycustomprog, last-failure-mycustomprog=1482931323: Transient attribute change (create cib=0.48.2038788, source=abort_unless_down:329, path=/cib/status/node_state[@id='a.b.com']/transient_attributes[@id='a.b.com']/instance_attributes[@id='status-a.b.com
> > Dec 28 13:22:04 [2198] a.b.com    pengine:  warning: unpack_rsc_op_failure:       Processing failed op monitor for mycustomprog on a.b.com: not running (7)
> > Dec 28 13:22:04 [2198] a.b.com    pengine:     info: native_print:        mycustomprog       (lsb:mycustomprog):     FAILED a.b.com
> > Dec 28 13:22:04 [2198] a.b.com    pengine:     info: get_failcount_full:  mycustomprog has failed 1 times on a.b.com
> > Dec 28 13:22:04 [2198] a.b.com    pengine:     info: common_apply_stickiness:     mycustomprog can fail 999999 more times on a.b.com before being forced off
> > Dec 28 13:22:04 [2198] a.b.com    pengine:     info: RecurringOp:  Start recurring monitor (10s) for mycustomprog on a.b.com
> > Dec 28 13:22:04 [2198] a.b.com    pengine:   notice: LogActions:  Recover mycustomprog       (Started a.b.com)
> > Dec 28 13:22:04 [2199] a.b.com       crmd:   notice: te_rsc_command:      Initiating action 5: stop mycustomprog_stop_0 on a.b.com (local)
> > Dec 28 13:22:04 [2196] a.b.com       lrmd:     info: cancel_recurring_action:     Cancelling lsb operation
> > mycustomprog_status_10000
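A note on the rc values in those logs: the monitor result is interpreted as an OCF return code, and rc=7 is OCF_NOT_RUNNING, which is exactly why the policy engine logs "not running (7)" and schedules the stop/start recovery even though the script's own output says "is running". A small sketch of the mapping (only the codes relevant here; names are from the OCF standard):

```shell
# OCF return codes as Pacemaker interprets a monitor result.
# rc=7 in the crmd/pengine lines above is what triggers recovery.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS (running)" ;;
        7) echo "OCF_NOT_RUNNING (stopped)" ;;
        *) echo "other error ($1)" ;;
    esac
}

ocf_rc_name 7    # the value reported for mycustomprog_monitor_10000
```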
> >
> >
> >
> > pcs config:
> >
> > source settings.rc
> >
> > pcs property set stonith-enabled=false
> > pcs property set no-quorum-policy=ignore
> > pcs resource create ClusterIP2 IPaddr2 ip=$MYVIP cidr_netmask=$NETMASKVIP1
> > pcs resource create ClusterIP3 IPaddr2 ip=$MYVIP2 cidr_netmask=$NETMASKVIP2
> > pcs resource create mycustomprog lsb:mycustomprog op monitor
> > interval="10s"
> > pcs constraint colocation add ClusterIP3 with ClusterIP2 INFINITY
> > pcs constraint colocation add mycustomprog with ClusterIP2 INFINITY
> > pcs property set start-failure-is-fatal=false
> > pcs resource defaults resource-stickiness=100
> > pcs constraint colocation add chkhealth with ClusterIP2 INFINITY
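Not part of the poster's config, just a hedged sketch: with start-failure-is-fatal=false already set, the monitor's failure handling could additionally be tuned through standard Pacemaker meta attributes, and the failcount can be inspected or reset while debugging (resource name matches the config above; values are illustrative):

```shell
# Hypothetical tuning, not from the original thread:
# let the failcount expire instead of accumulating, and cap how many
# failures may occur before the resource is moved off the node.
pcs resource update mycustomprog meta failure-timeout=60s migration-threshold=3

# Inspect / reset the failcount while debugging the false monitor failures:
pcs resource failcount show mycustomprog
pcs resource cleanup mycustomprog
```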
> >
> >
> >
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>