[Pacemaker] stonith monitor timeout not restartable

Dejan Muhamedagic dejanmm at fastmail.fm
Wed Nov 24 05:47:52 EST 2010


Hi,

On Fri, Nov 19, 2010 at 02:19:02PM -0500, Ron Kerry wrote:
> I have a customer running pacemaker/openais from the SLE11-HAE
> distribution. On occasion we have a stonith clone instance timeout.
> It becomes unrestartable and it is not generally recoverable unless
> we completely stop openais on both nodes and restart. I have an
> example from the logs below. Is this a known issue?

Didn't see this so far. Strange, this code has been there for
quite some time. Can you please open a bugzilla.

Thanks,

Dejan

>  >> This is the stonith operation that has the problem
> Nov 15 23:08:35 dmf2 lrmd: [83870]: debug: rsc:stonith-sgi-ipmi:1:10: monitor
> Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
> Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: stonithd_signon: creating connection
> Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: sending out the signon msg.
> Nov 15 23:08:35 dmf2 stonithd: [83868]: debug: client
> STONITH_RA_EXEC_98416 (pid=98416) succeeded to signon to stonithd.
> Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: signed on to stonithd.
> Nov 15 23:08:35 dmf2 stonithd: [83868]: debug: client
> STONITH_RA_EXEC_98416 [pid: 98416] requests a resource operation
> monitor on stonith-sgi-ipmi:1 (external/sgi-ipmi)
> Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: waiting for the stonithRA reply msg.
> Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: a stonith RA operation queue to run, call_id=98417.
> Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: stonithd_receive_ops_result: begin
> Nov 15 23:08:35 dmf2 stonithd: [98417]: debug: external_status: called.
> Nov 15 23:08:35 dmf2 stonithd: [98417]: debug: external_run_cmd:
> Calling '/usr/lib64/stonith/plugins/external/sgi-ipmi status'
> 
>  >> the stonith operation times out!!
> Nov 15 23:08:50 dmf2 stonithd: [83868]: WARN:
> external_stonith-sgi-ipmi:1_monitor process (PID 98417) timed out
> (try 1).  Killing with signal SIGTERM (15).
> Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN:
> external_stonith-sgi-ipmi:1_monitor process (PID 98417) timed out
> (try 2).  Killing with signal SIGKILL (9).
> Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: Managed
> external_stonith-sgi-ipmi:1_monitor process 98417 killed by signal 9
> [SIGKILL - Kill, unblockable].
> Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: Child process
> external_stonith-sgi-ipmi:1_monitor [98417] exited, its exit code: 0
> when signo=9.
> Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: A STONITH operation timed out.
> Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: stonith-sgi-ipmi:1's
> (external/sgi-ipmi) op monitor finished. op_result=2
> Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 (pid=98416) signed off
> Nov 15 23:08:55 dmf2 crmd: [83873]: info: process_lrm_event: LRM
> operation stonith-sgi-ipmi:1_monitor_15000 (call=10, rc=2,
> cib-update=118, confirmed=false) invalid parameter
>   >> a timeout is interpreted as an invalid parameter
> Nov 15 23:08:55 dmf2 crmd: [83873]: debug: te_update_diff:
> Processing diff (cib_modify): 0.453.14 -> 0.453.15 (S_IDLE)
> Nov 15 23:08:55 dmf2 crmd: [83873]: debug: get_xpath_object: No
> match for //cib_update_result//diff-added//crm_config in
> /notify/cib_update_result/diff
> Nov 15 23:08:55 dmf2 crmd: [83873]: info: process_graph_event:
> Action stonith-sgi-ipmi:1_monitor_15000/32
> (0:2;32:7:0:0600ce86-cbe8-475c-add7-32ee6b812e21) initiated by a
> different transitioner
> Nov 15 23:08:55 dmf2 crmd: [83873]: info: abort_transition_graph:
> process_graph_event:456 - Triggered transition abort (complete=1,
> tag=lrm_rsc_op, id=stonith-sgi-ipmi:1_monitor_15000,
> magic=0:2;32:7:0:0600ce86-cbe8-475c-add7-32ee6b812e21, cib=0.453.15)
> : Foreign event
> 
>  >> updating the failcount and last failure time
> Nov 15 23:09:09 dmf2 crmd: [83873]: WARN: update_failcount: Updating
> failcount for stonith-sgi-ipmi:1 on dmf2 after failed monitor: rc=2
> (update=value++, time=1289880535)
> Nov 15 23:09:09 dmf2 attrd: [83871]: debug: attrd_local_callback:
> update message from crmd: fail-count-stonith-sgi-ipmi:1=value++
> Nov 15 23:09:09 dmf2 crmd: [83873]: debug: attrd_update: Sent
> update: fail-count-stonith-sgi-ipmi:1=value++ for dmf2
> Nov 15 23:09:09 dmf2 attrd: [83871]: info: find_hash_entry: Creating
> hash entry for fail-count-stonith-sgi-ipmi:1
> Nov 15 23:09:09 dmf2 crmd: [83873]: debug: attrd_update: Sent
> update: last-failure-stonith-sgi-ipmi:1=1289880535 for dmf2
> Nov 15 23:09:09 dmf2 attrd: [83871]: debug: attrd_local_callback:
> Supplied: value++, Current: (null), Stored: (null)
> ...
> 
>  >> the timeout is considered a hard error and it will not allow a restart!
> Nov 15 23:09:15 dmf2 pengine: [83872]: debug: unpack_rsc_op:
> stonith-sgi-ipmi:1_monitor_15000 on dmf2 returned 2 (invalid
> parameter) instead of the expected value: 0 (ok)
> Nov 15 23:09:15 dmf2 pengine: [83872]: ERROR: unpack_rsc_op: Hard
> error - stonith-sgi-ipmi:1_monitor_15000 failed with rc=2:
> Preventing stonith-sgi-ipmi-set from re-starting on dmf2
> Nov 15 23:09:37 dmf2 pengine: [83872]: debug: native_assign_node:
> All nodes for resource stonith-sgi-ipmi:1 are unavailable, unclean
> or shutting down (dmf1: 1, -1000000)
> Nov 15 23:09:37 dmf2 pengine: [83872]: WARN: native_color: Resource
> stonith-sgi-ipmi:1 cannot run anywhere
> Nov 15 23:09:37 dmf2 pengine: [83872]: debug: clone_color: Allocated
> 1 stonith-sgi-ipmi-set instances of a possible 2
> Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource CXFS  (Started dmf2)
> Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource Public-IP     (Started dmf2)
> Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource PDMO-IP       (Started dmf2)
> Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource Openvault     (Started dmf2)
> Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource DMF   (Started dmf2)
> Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource NFS   (Started dmf2)
> Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave
> resource stonith-sgi-ipmi:0 (Started dmf1)
> Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Stop resource stonith-sgi-ipmi:1     (dmf2)
> Nov 15 23:09:58 dmf2 pengine: [83872]: debug: unpack_rsc_op:
> stonith-sgi-ipmi:1_monitor_15000 on dmf2 returned 2 (invalid
> parameter) instead of the expected value: 0 (ok)
> Nov 15 23:09:58 dmf2 pengine: [83872]: ERROR: unpack_rsc_op: Hard
> error - stonith-sgi-ipmi:1_monitor_15000 failed with rc=2:
> Preventing stonith-sgi-ipmi-set from re-starting on dmf2
> Nov 15 23:09:58 dmf2 pengine: [83872]: WARN: unpack_rsc_op:
> Processing failed op stonith-sgi-ipmi:1_monitor_15000 on dmf2:
> invalid parameter (2)
> Nov 15 23:09:58 dmf2 pengine: [83872]: WARN: unpack_rsc_op:
> Processing failed op CXFS_monitor_0 on dmf2: unknown exec error (-2)
> Nov 15 23:09:58 dmf2 pengine: [83872]: notice: group_print:  Resource Group: dmfGroup
> Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:
> CXFS  (ocf::sgi:cxfs): Started dmf2
> Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:
> Public-IP (ocf::heartbeat:IPaddr2):       Started dmf2
> Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:
> PDMO-IP (ocf::heartbeat:IPaddr2):       Started dmf2
> Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:
> Openvault (ocf::sgi:openvault):   Started dmf2
> Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:      DMF   (ocf::sgi:dmf): Started dmf2
> Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:
> NFS   (ocf::heartbeat:nfsserver):     Started dmf2
> Nov 15 23:09:58 dmf2 pengine: [83872]: notice: clone_print:  Clone Set: stonith-sgi-ipmi-set
> Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource stonith-sgi-ipmi:0 active on dmf1
> Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource stonith-sgi-ipmi:0 active on dmf1
> Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource stonith-sgi-ipmi:1 active on dmf2
> Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:
> stonith-sgi-ipmi:1 (stonith:external/sgi-ipmi):    Started dmf2
> FAILED
> Nov 15 23:10:08 dmf2 pengine: [83872]: notice: short_print:      Started: [ dmf1 ]
> Nov 15 23:10:08 dmf2 pengine: [83872]: debug:
> common_apply_stickiness: Resource stonith-sgi-ipmi:0: preferring
> current location (node=dmf1, weight=1)
> Nov 15 23:10:08 dmf2 pengine: [83872]: debug:
> common_apply_stickiness: Resource CXFS: preferring current location
> (node=dmf2, weight=1)
> Nov 15 23:10:08 dmf2 pengine: [83872]: debug:
> common_apply_stickiness: Resource Public-IP: preferring current
> location (node=dmf2, weight=1)
> Nov 15 23:10:08 dmf2 pengine: [83872]: debug:
> common_apply_stickiness: Resource PDMO-IP: preferring current
> location (node=dmf2, weight=1)
> Nov 15 23:10:08 dmf2 pengine: [83872]: debug:
> common_apply_stickiness: Resource Openvault: preferring current
> location (node=dmf2, weight=1)
> Nov 15 23:10:08 dmf2 pengine: [83872]: debug:
> common_apply_stickiness: Resource DMF: preferring current location
> (node=dmf2, weight=1)
> Nov 15 23:10:08 dmf2 pengine: [83872]: debug:
> common_apply_stickiness: Resource NFS: preferring current location
> (node=dmf2, weight=1)
> Nov 15 23:10:08 dmf2 pengine: [83872]: debug:
> common_apply_stickiness: Resource stonith-sgi-ipmi:1: preferring
> current location (node=dmf2, weight=1)
> 
>  >> This implies the resource can still restart, but the operation is cancelled anyway
> Nov 15 23:10:08 dmf2 pengine: [83872]: info: get_failcount:
> stonith-sgi-ipmi-set has failed 1 times on dmf2
> Nov 15 23:10:08 dmf2 pengine: [83872]: notice:
> common_apply_stickiness: stonith-sgi-ipmi-set can fail 999999 more
> times on dmf2 before being forced off
> Nov 15 23:10:08 dmf2 pengine: [83872]: debug: native_assign_node:
> All nodes for resource stonith-sgi-ipmi:1 are unavailable, unclean
> or shutting down (dmf1: 1, -1000000)
> Nov 15 23:10:08 dmf2 lrmd: [83870]: info: cancel_op: operation
> monitor[10] on stonith::external/sgi-ipmi::stonith-sgi-ipmi:1 for
> client 83873, its parameters: CRM_meta_interval=[15000]
> nodelist=[dmf1;admin;admin;supermicro;10.7.8.201
> dmf2;admin;admin;supermicro;10CRM_meta_on_fail=[restart]
> CRM_meta_timeout=[15000] CRM_meta_clone_max=[2]
> crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false]
> CRM_meta_name=[monitor] CRM_meta_clone=[1]  cancelled
> Nov 15 23:10:08 dmf2 crmd: [83873]: debug: do_fsa_action: actions:trace:        // A_FINALIZE_TIMER_STOP
> Nov 15 23:10:08 dmf2 pengine: [83872]: WARN: native_color: Resource
> stonith-sgi-ipmi:1 cannot run anywhere
> Nov 15 23:10:08 dmf2 lrmd: [83870]: debug: on_msg_cancel_op: operation 10 cancelled
> Nov 15 23:10:08 dmf2 crmd: [83873]: debug: do_fsa_action: actions:trace:        // A_TE_INVOKE
> Nov 15 23:10:08 dmf2 pengine: [83872]: debug: clone_color: Allocated
> 1 stonith-sgi-ipmi-set instances of a possible 2
> Nov 15 23:10:08 dmf2 lrmd: [83870]: debug: on_msg_perform_op: add an
> operation operation stop[26] on
> stonith::external/sgi-ipmi::stonith-sgi-ipmi:1 for client 83873, its
> parameters: CRM_meta_timeout=[20000] CRM_meta_clone_max=[2]
> crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false]
> CRM_meta_clone=[1]  to the operation list.
> Nov 15 23:10:09 dmf2 crmd: [83873]: info: te_rsc_command: Initiating
> action 3: stop stonith-sgi-ipmi:1_stop_0 on dmf2 (local)
> Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: sending out the signon msg.
> Nov 15 23:10:09 dmf2 crmd: [83873]: debug: cancel_op: Cancelling op
> 10 for stonith-sgi-ipmi:1 (stonith-sgi-ipmi:1:10)
> Nov 15 23:10:09 dmf2 crmd: [83873]: debug: cancel_op: Op 10 for
> stonith-sgi-ipmi:1 (stonith-sgi-ipmi:1:10): cancelled
> Nov 15 23:10:09 dmf2 pengine: [83872]: notice: LogActions: Stop resource stonith-sgi-ipmi:1     (dmf2)
> Nov 15 23:10:09 dmf2 crmd: [83873]: info: do_lrm_rsc_op: Performing
> key=3:35:0:815c525b-e8f1-4bb4-a5eb-2229d1980cc1
> op=stonith-sgi-ipmi:1_stop_0 )
> Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client
> STONITH_RA_EXEC_102943 (pid=102943) succeeded to signon to stonithd.
> Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: signed on to stonithd.
> Nov 15 23:10:09 dmf2 lrmd: [102943]: info: Try to stop STONITH
> resource <rsc_id=stonith-sgi-ipmi:1> : Device=external/sgi-ipmi
> Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: waiting for the stonithRA reply msg.
> Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client
> STONITH_RA_EXEC_102943 [pid: 102943] requests a resource operation
> stop on stonith-sgi-ipmi:1 (external/sgi-ipmi)
> Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: external_destroy: called.
> Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: a stonith RA operation queue to run, call_id=102944.
> Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: stonithd_receive_ops_result: begin
> Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: Child process
> external_stonith-sgi-ipmi:1_stop [102944] exited, its exit code: 0
> when signo=0.
> Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: stonith-sgi-ipmi:1's
> (external/sgi-ipmi) op stop finished. op_result=0
> Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 (pid=102943) signed off
> Nov 15 23:10:09 dmf2 lrmd: [83870]: info: Managed
> stonith-sgi-ipmi:1:stop process 102943 exited with return code 0.
> Nov 15 23:10:09 dmf2 crmd: [83873]: info: process_lrm_event: LRM
> operation stonith-sgi-ipmi:1_stop_0 (call=26, rc=0, cib-update=122,
> confirmed=true) ok
> Nov 15 23:10:09 dmf2 crmd: [83873]: info: match_graph_event: Action
> stonith-sgi-ipmi:1_stop_0 (3) confirmed on dmf2 (rc=0)
> -- 
> 
> Ron Kerry         rkerry at sgi.com
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




More information about the Pacemaker mailing list