[Pacemaker] Help understanding what went wrong

Alex Samad - Yieldbroker Alex.Samad at yieldbroker.com
Wed Oct 29 21:06:31 EDT 2014


Hi

It seems I hit the same problem on my prod cluster:

Oct 30 01:21:53 prodrp1 lrmd[2536]:  warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 17918) timed out
Oct 30 01:21:53 prodrp1 lrmd[2536]:  warning: operation_finished: ybrpstat_monitor_5000:17918 - timed out after 20000ms
Oct 30 01:21:53 prodrp1 crmd[2539]:    error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (4384) Timed Out (timeout=20000ms)
Oct 30 01:21:53 prodrp1 attrd[2537]:   notice: attrd_cs_dispatch: Update relayed from prodrp2

This is the first record of the failure in the logs on prodrp1.
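
One thing worth keeping in mind when reading these lrmd messages: the timestamp is when the timeout fired, not when the monitor started, so the run that overran was launched roughly 20 seconds earlier. A quick sketch of the arithmetic (the year is an assumption, since syslog omits it):

```shell
# Timestamp of the lrmd "timed out" message above, as seconds since the epoch
# (year 2014 assumed; syslog timestamps don't carry it). -u avoids TZ skew.
reported=$(date -u -d "2014-10-30 01:21:53" +%s)

# timeout=20000ms, from the crmd process_lrm_event message
budget=20

# The monitor invocation that overran must have been launched ~20s earlier
started=$(( reported - budget ))
date -u -d "@$started" "+%b %d %H:%M:%S"   # -> Oct 30 01:21:33
```

So the monitor that was reported dead at 01:21:53 was the one launched around 01:21:33, which is why the report can appear just seconds after an apparently successful run.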

The issue is that prodrp1 was the master for the IP.

I am wondering if it has something to do with it being a cloned resource?

So the cluster realises there is a problem, and while it is handling that, the cluster check monitor also fails. My assumption is that the monitor does some sort of cluster-wide check, which stalls while the cluster is in the middle of recovery.
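
If the theory is that the agent's monitor does a slow cluster-wide check, one way to test it outside Pacemaker is to run the action by hand under the same 20s budget the lrmd enforces. A rough sketch follows; the agent path is an assumption based on class=ocf provider=yb type=proxy, and `sleep 1` is just a stand-in command so the sketch runs anywhere:

```shell
# Budget matching the op's timeout=20s; lrmd kills the monitor after this long.
budget=20

# Stand-in for the real monitor call, which would be something like
#   OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/yb/proxy monitor
# (path assumed from class=ocf provider=yb type=proxy -- adjust to your install)
cmd="sleep 1"

start=$(date +%s)
timeout "$budget" $cmd
rc=$?
elapsed=$(( $(date +%s) - start ))

if [ "$rc" -eq 124 ]; then
    # GNU timeout exits 124 when the command overran -- the same condition
    # lrmd reports as "timed out after 20000ms"
    echo "monitor timed out after ${budget}s"
else
    echo "monitor rc=$rc after ${elapsed}s"
fi
```

Running this in a loop while the cluster is failing over would show whether the monitor's runtime blows out exactly when a transition is in flight.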

A

> -----Original Message-----
> From: Alex Samad - Yieldbroker [mailto:Alex.Samad at yieldbroker.com]
> Sent: Thursday, 30 October 2014 11:43 AM
> To: pacemaker at oss.clusterlabs.org
> Subject: [Pacemaker] Help understanding what went wrong
> 
> Hi
> 
> 
> rpm -qa | grep pace
> pacemaker-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-cli-1.1.10-14.el6_5.3.x86_64
> pacemaker-1.1.10-14.el6_5.3.x86_64
> 
> CentOS 6.5
> 
> 
> I have a 2 node cluster
> pcs config
> Cluster Name: ybrp
> Corosync Nodes:
> 
> Pacemaker Nodes:
>  wwwrp1 wwwrp2
> 
> Resources:
>  Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=10.32.43.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport
>   Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s
>   Operations: start on-fail=restart interval=0s timeout=60s (ybrpip-start-interval-0s)
>               monitor on-fail=restart interval=5s timeout=20s (ybrpip-monitor-interval-5s)
>               stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
>  Clone: ybrpstat-clone
>   Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1
>   Resource: ybrpstat (class=ocf provider=yb type=proxy)
>    Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpstat-monitor-interval-5s)
> 
> Stonith Devices:
> Fencing Levels:
> 
> Location Constraints:
> Ordering Constraints:
>   start ybrpstat-clone then start ybrpip (Mandatory) (id:order-ybrpstat-clone-ybrpip-mandatory)
> Colocation Constraints:
>   ybrpip with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-ybrpstat-clone-INFINITY)
> 
> Cluster Properties:
>  cluster-infrastructure: cman
>  dc-version: 1.1.10-14.el6_5.3-368c726
>  last-lrm-refresh: 1414629002
>  no-quorum-policy: ignore
>  stonith-enabled: false
> 
> 
> 
> Basically one node (wwwrp1) died from a hardware failure.
> 
> I can see in the logs that the cluster wants to bring the IP address over, and it
> seems to do so, but:
> 
> 
> 
> ## this seems to be the ip address moving
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
> 
> 
> ## this seems to be where it checks other stuff
> Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
> Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
> Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> # this is where the monitor times out, but it doesn't look like 20000ms .. the initial try was at Oct 30 01:19:43
> Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
> Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
> Oct 30 01:19:44 wwwrp2 crmd[2462]:    error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
> Oct 30 01:19:44 wwwrp2 crmd[2462]:  warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)
> 
> I'm guessing that because the monitor timed out, the resource was marked as failed. I need to know
> why it timed out; the script has never timed out before, in production or in testing.
> 
> Am I reading this right?  The reason the resource (the IP address) didn't fail over
> was that there was no ybrpstat instance running on wwwrp2, and the reason for
> that was that its monitor action failed/timed out.
> 
> Thanks
> Alex
> 
> 
> === logs ==
> 
> Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Start ybrpip#011(wwwrp1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Recover ybrpstat:0#011(Started wwwrp1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated Transition 6827: /var/lib/pacemaker/pengine/pe-input-2.bz2
> Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Start ybrpip#011(wwwrp2)
> Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Stop ybrpstat:0#011(wwwrp1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated Transition 6828: /var/lib/pacemaker/pengine/pe-input-3.bz2
> Oct 30 01:19:42 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 7: start ybrpip_start_0 on wwwrp2 (local)
> Oct 30 01:19:42 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 1: stop ybrpstat_stop_0 on wwwrp1
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
> Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
> Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
> Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
> Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
> Oct 30 01:19:44 wwwrp2 crmd[2462]:    error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
> Oct 30 01:19:44 wwwrp2 crmd[2462]:  warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)
> Oct 30 01:19:44 wwwrp2 crmd[2462]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (1)
> Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_perform_update: Sent update 12543: fail-count-ybrpstat=1
> Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414592384)
> Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_perform_update: Sent update 12545: last-failure-ybrpstat=1414592384
> Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Restart ybrpip#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated Transition 6829: /var/lib/pacemaker/pengine/pe-input-4.bz2
> Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Restart ybrpip#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated Transition 6830: /var/lib/pacemaker/pengine/pe-input-5.bz2
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



