[Pacemaker] Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover.

Mon Aug 13 11:56:09 EDT 2012

Hi Mario,

On Thu, Aug 09, 2012 at 10:52:39AM -0300, Mario Penners wrote:
> Hi Parshvi,
> 
> just a quick-shot and without analyzing your mail in detail: find
> attached an edited version of the IPaddr2 RA.
> 
> I was trying to use the original script a while agho, and basically
> nothing worked: It did not recognize the link failures (due to the way
> how the test was implemented it would only work if you have not more
> than 1 IP per interface), there was no proper support for bonding, the
> IP addresses would not be shifted ....
> 
> 
> I did some (very minor) changes to ge the script working for us. Just
> have a shot at it if you want, maybe replacing the RA will already solve
> your problem.

Do you think you have anything to contribute? There's currently
an effort going on to update the IPaddr2:
https://github.com/ClusterLabs/resource-agents/pull/97.

According to the comments at the top of your script, it would be
good to have your patches on board too. But your RA seems to be
based on an outdated version of IPaddr2.

Cheers,

Dejan

> Cheers,
> Mario  
> 
> 
> On Thu, 2012-08-09 at 05:44 +0000, Parshvi wrote:
> > Parshvi <parshvi.17 at ...> writes:
> > 
> > > 
> > > Hi,
> > > 
> > > The monitor operation of IPaddr2 rsc agent is timing out.
> > > Interval: 5s
> > > Timeout: 60s
> > > The timeout was increased from an earlier 20s to now 60s. Even then, there are 
> > > multiple logs of monitor op. timing out.
> > > 
> > > 1) What can cause the monitor to take so long ?
> > > 2) Looking at the pe-input, what contributes to the operation time ? Is it 
> > just 
> > > the exec-time or exec-time + queue-time ?
> > > 3) Any solution proposed ?
> > > 
> > > I have lrm pe-input when the timeout was configured at 20s:
> > > Here, is pe-input snapshot where monitor op. timed out (with timeout=20s)
> > > 
> > > <lrm_resource id="Group_1_ClusterIP" type="IPaddr2" class="ocf" 
> > > provider="heartbeat">
> > >             <lrm_rsc_op id="Group_1_ClusterIP_monitor_0" operation="monitor" 
> > > crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-
> > > key="28:0:7:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > > magic="0:7;28:0:7:6b445452-980a-455f-8616-7bd12f20843e" call-id="10" rc-
> > code="7" 
> > > op-status="0" interval="0" last-run="1343738096" last-rc-change="1343738096" 
> > > exec-time="20" queue-time="30" op-digest="f22a042c86b227078b239707d4e4d4a2"/>
> > > 
> > >             <lrm_rsc_op id="Group_1_ClusterIP_start_0" operation="start" crm-
> > > debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
> > > key="87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > > magic="0:0;87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83503" 
> > rc-
> > > code="0" op-status="0" interval="0" last-run="1343928908" last-rc-
> > > change="1343928908" exec-time="280" queue-time="20" op-
> > > digest="f22a042c86b227078b239707d4e4d4a2"/>
> > > 
> > >             <lrm_rsc_op id="Group_1_ClusterIP_monitor_5000" 
> > operation="monitor" 
> > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
> > > key="12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > > magic="2:-2;12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83504" 
> > rc-
> > > code="-2" op-status="2" interval="5000" last-rc-change="1343928921" exec-
> > > time="20000" queue-time="0" op-digest="79c3bdd01c6e0fd819484536a54bf7a2"/>
> > > (Please note exec-time=20000)
> > > 
> > 
> > Following are the details of packages:
> > cluster-glue: 1.0.6 (1c87a0c58c59fc384b93ec11476cefdbb6ddc1e1)
> > resource-agents: # Build version: 7a11934b142d1daf42a04fbaa0391a3ac47cee4c
> > CRM Version: 1.0.12 (unknown)
> > pacemaker 1.0.12-1.el5.centos - (none) x86_64
> > corosync 1.2.7-1.1.el5 - (none) x86_64
> > resource-agents 1.0.4-1.1.el5 - (none) x86_64
> > 
> > There are 4 virtual IP resources configued:
> > 
> > Out of these, 3 recover after a monitor timeout but one Virtual IP rsc does not 
> > recover. Following are the logs that are observed:
> > 
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: run_graph: Transition 63579 
> > (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=8, 
> > Source=/var/lib/pengine/pe-input-1660.bz2): Terminated
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: ERROR: te_graph_trigger: Transition 
> > failed: terminated
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Graph 63579 (9 
> > actions in 9 synapses): batch-limit=30 jobs, network-delay=60000ms
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 0 was 
> > confirmed (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 1 is pending 
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 8]: 
> > Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 103]: 
> > Pending (id: Rsc2_stop_0, loc: CSS-FU-1, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 2 is pending 
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 97]: 
> > Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 8]: 
> > Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 3 is pending 
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 98]: 
> > Pending (id: Rsc1_GroupClusterIP_monitor_1000, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 97]: 
> > Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 4 is pending 
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 99]: 
> > Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 97]: 
> > Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 5 is pending 
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 100]: 
> > Pending (id: Rsc3_monitor_1000, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 99]: 
> > Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 6 is pending 
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 101]: 
> > Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 97]: 
> > Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 7 is pending 
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 102]: 
> > Pending (id: Rsc4_monitor_1000, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 101]: 
> > Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 8 is pending 
> > (priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 36]: 
> > Pending (id: all_stopped, type: pseduo, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 8]: 
> > Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: te_graph_trigger: Transition 63579 
> > is now complete
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: notify_crmd: Transition 63579 
> > status: done - <null>
> > Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: do_state_transition: State 
> > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
> > cause=C_FSA_INTERNAL origin=notify_crmd ]
> > 
> > 1) Please help as to why a monitor is timing out ?
> > 2) Why does one of the VIP's fails to recover after a timeout ?
> > 
> > 
> > 
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 

> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org