[Pacemaker] Missing lrm_opstatus

Dejan Muhamedagic dejanmm at fastmail.fm
Tue Oct 5 07:50:30 EDT 2010


Hi,

On Tue, Oct 05, 2010 at 11:18:37AM +0200, Andrew Beekhof wrote:
> Dejan: looks like something in the lrm library.
> Any idea why the message doesn't contain lrm_opstatus?

Becase this monitor operation never run. Which seems to be a
plausible explanation since the start-delay is set to 600s.

> lrm_targetrc also looks strange.

That translates to CHANGED, i.e. notify the requester only on
status change. That should be what the requester itself sent with
the operation create request.

Thanks,

Dejan

> On Thu, Sep 30, 2010 at 9:41 PM, Ron Kerry <rkerry at sgi.com> wrote:
> > Folks -
> >
> > I am seeing the following message sequence that results in a "bogus"
> > declaration of monitor failures for two resources very quickly after a
> > failover completes (from hendrix to genesis) with all resources coming up.
> > The scenario is the same for both resources.
> >
> > CXFS resource monitor invoked after a successful start but the response is
> > "faked" likely due to the start-delay defined for monitoring.
> >
> > Sep 30 10:23:33 genesis crmd: [12176]: info: te_rsc_command: Initiating
> > action 8: monitor CXFS_monitor_30000 on genesis (local)
> > Sep 30 10:23:33 genesis crmd: [12176]: info: do_lrm_rsc_op: Performing
> > key=8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c op=CXFS_monitor_30000 )
> > Sep 30 10:23:33 genesis lrmd: [12173]: debug: on_msg_perform_op: an
> > operation operation monitor[15] on ocf::cxfs::CXFS for client 12176, its
> > parameters: CRM_meta_name=[monitor] CRM_meta_start_delay=[600000]
> > crm_feature_set=[3.0.2] CRM_meta_on_fail=[restar..
> > Sep 30 10:23:33 genesis crmd: [12176]: info: do_lrm_rsc_op: Faking
> > confirmation of CXFS_monitor_30000: execution postponed for over 5 minutes
> > Sep 30 10:23:33 genesis crmd: [12176]: info: send_direct_ack: ACK'ing
> > resource op CXFS_monitor_30000 from
> > 8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c: lrm_invoke-lrmd-1285860213-14
> > Sep 30 10:23:33 genesis crmd: [12176]: info: process_te_message: Processing
> > (N)ACK lrm_invoke-lrmd-1285860213-14 from genesis
> > Sep 30 10:23:33 genesis crmd: [12176]: info: match_graph_event: Action
> > CXFS_monitor_30000 (8) confirmed on genesis (rc=0)
> >
> > Similar sequence for the TMF resource ...
> >
> > Sep 30 10:23:44 genesis crmd: [12176]: info: te_rsc_command: Initiating
> > action 12: monitor TMF_monitor_60000 on genesis (local)
> > Sep 30 10:23:44 genesis crmd: [12176]: info: do_lrm_rsc_op: Performing
> > key=12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c op=TMF_monitor_60000 )
> > Sep 30 10:23:44 genesis lrmd: [12173]: debug: on_msg_perform_op: an
> > operation operation monitor[19] on ocf::tmf::TMF for client 12176, its
> > parameters: admin_emails=[rkerry at sgi.com] loader_hosts=[ibm3494cps]
> > devgrpnames=[ibm3592] loader_names=[ibm3494] loa...
> > Sep 30 10:23:44 genesis crmd: [12176]: info: do_lrm_rsc_op: Faking
> > confirmation of TMF_monitor_60000: execution postponed for over 5 minutes
> > Sep 30 10:23:44 genesis crmd: [12176]: info: send_direct_ack: ACK'ing
> > resource op TMF_monitor_60000 from
> > 12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c: lrm_invoke-lrmd-1285860224-19
> > Sep 30 10:23:44 genesis crmd: [12176]: info: process_te_message: Processing
> > (N)ACK lrm_invoke-lrmd-1285860224-19 from genesis
> > Sep 30 10:23:44 genesis crmd: [12176]: info: match_graph_event: Action
> > TMF_monitor_60000 (12) confirmed on genesis (rc=0)
> >
> > TMF monitor operation state gets an error. Note that the operation id
> > matches the above invocation.
> >
> > Sep 30 10:26:12 genesis lrmd: [12173]: debug: on_msg_get_state:state of rsc
> > TMF is LRM_RSC_IDLE
> > Sep 30 10:26:12 genesis crmd: [12176]: WARN: msg_to_op(1326): failed to get
> > the value of field lrm_opstatus from a ha_msg
> > Sep 30 10:26:12 genesis crmd: [12176]: info: msg_to_op: Message follows:
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG: Dumping message with 16
> > fields
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[0] : [lrm_t=op]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[1] : [lrm_rid=TMF]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[2] : [lrm_op=monitor]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[3] : [lrm_timeout=630000]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[4] : [lrm_interval=60000]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[5] : [lrm_delay=600000]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[6] : [lrm_copyparams=0]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[7] : [lrm_t_run=0]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[8] : [lrm_t_rcchange=0]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[9] : [lrm_exec_time=0]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[10] : [lrm_queue_time=0]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[11] : [lrm_targetrc=-2]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[12] : [lrm_app=crmd]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[13] :
> > [lrm_userdata=12:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[14] :
> > [(2)lrm_param=0x6000000000081260(331 413)]
> >
> > Same for the CXFS monitor operation state ...
> >
> > Sep 30 10:26:12 genesis lrmd: [12173]: debug: on_msg_get_state:state of rsc
> > CXFS is LRM_RSC_IDLE
> > Sep 30 10:26:12 genesis crmd: [12176]: WARN: msg_to_op(1326): failed to get
> > the value of field lrm_opstatus from a ha_msg
> > Sep 30 10:26:12 genesis crmd: [12176]: info: msg_to_op: Message follows:
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG: Dumping message with 16
> > fields
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[0] : [lrm_t=op]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[1] : [lrm_rid=CXFS]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[2] : [lrm_op=monitor]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[3] : [lrm_timeout=630000]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[4] : [lrm_interval=30000]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[5] : [lrm_delay=600000]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[6] : [lrm_copyparams=0]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[7] : [lrm_t_run=0]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[8] : [lrm_t_rcchange=0]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[9] : [lrm_exec_time=0]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[10] : [lrm_queue_time=0]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[11] : [lrm_targetrc=-2]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[12] : [lrm_app=crmd]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[13] :
> > [lrm_userdata=8:1:0:68ec92a5-8ae4-4b71-a37b-f348916fba9c]
> > Sep 30 10:26:12 genesis crmd: [12176]: info: MSG[14] :
> > [(2)lrm_param=0x6000000000030320(214 254)]
> >
> > In both cases, the lrm_opstatus field appears to be missing!!
> >
> > This results in updating the fail-count and last-failure values for these
> > resources.
> >
> > Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending
> > flush op to all hosts for: fail-count-TMF (<null>)
> > Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending
> > flush op to all hosts for: last-failure-TMF (<null>)
> > Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending
> > flush op to all hosts for: last-failure-CXFS (<null>)
> > Sep 30 10:26:12 genesis attrd: [12174]: info: attrd_trigger_update: Sending
> > flush op to all hosts for: fail-count-CXFS (<null>)
> >
> > This causes pengine to failover the resources thinking there was a monitor
> > failure when there was not.
> >
> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource
> > CXFS        (Started genesis -> hendrix)
> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource
> > VirtualIP   (Started genesis -> hendrix)
> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource
> > TMF (Started genesis -> hendrix)
> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource
> > DMF (Started genesis -> hendrix)
> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource
> > DMFMAN      (Started genesis -> hendrix)
> > Sep 30 10:26:12 genesis pengine: [12175]: notice: LogActions: Move resource
> > DMFSOAP     (Started genesis -> hendrix)
> >
> > --
> >
> > Ron Kerry         rkerry at sgi.com
> > Field Technical Support - SGI Federal
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs:
> > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> >
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




More information about the Pacemaker mailing list