[Pacemaker] stonith monitor timeout not restartable

Ron Kerry rkerry at sgi.com
Fri Nov 19 14:19:02 EST 2010


I have a customer running pacemaker/openais from the SLE11-HAE distribution. On occasion we have a 
stonith clone instance timeout. It becomes unrestartable and it is not generally recoverable unless 
we completely stop openais on both nodes and restart. I have an example from the logs below. Is this 
a known issue?

  >> This is the stonith operation that has the problem
Nov 15 23:08:35 dmf2 lrmd: [83870]: debug: rsc:stonith-sgi-ipmi:1:10: monitor
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: stonithd_signon: creating connection
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: sending out the signon msg.
Nov 15 23:08:35 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 (pid=98416) succeeded to 
signon to stonithd.
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: signed on to stonithd.
Nov 15 23:08:35 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 [pid: 98416] requests a 
resource operation monitor on stonith-sgi-ipmi:1 (external/sgi-ipmi)
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: waiting for the stonithRA reply msg.
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: a stonith RA operation queue to run, call_id=98417.
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: stonithd_receive_ops_result: begin
Nov 15 23:08:35 dmf2 stonithd: [98417]: debug: external_status: called.
Nov 15 23:08:35 dmf2 stonithd: [98417]: debug: external_run_cmd: Calling 
'/usr/lib64/stonith/plugins/external/sgi-ipmi status'

  >> the stonith operation times out!!
Nov 15 23:08:50 dmf2 stonithd: [83868]: WARN: external_stonith-sgi-ipmi:1_monitor process (PID 
98417) timed out (try 1).  Killing with signal SIGTERM (15).
Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: external_stonith-sgi-ipmi:1_monitor process (PID 
98417) timed out (try 2).  Killing with signal SIGKILL (9).
Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: Managed external_stonith-sgi-ipmi:1_monitor process 
98417 killed by signal 9 [SIGKILL - Kill, unblockable].
Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: Child process external_stonith-sgi-ipmi:1_monitor 
[98417] exited, its exit code: 0 when signo=9.
Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: A STONITH operation timed out.
Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: stonith-sgi-ipmi:1's (external/sgi-ipmi) op monitor 
finished. op_result=2
Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 (pid=98416) signed off
Nov 15 23:08:55 dmf2 crmd: [83873]: info: process_lrm_event: LRM operation 
stonith-sgi-ipmi:1_monitor_15000 (call=10, rc=2, cib-update=118, confirmed=false) invalid parameter
   >> a timeout is interpreted as an invalid parameter
Nov 15 23:08:55 dmf2 crmd: [83873]: debug: te_update_diff: Processing diff (cib_modify): 0.453.14 -> 
0.453.15 (S_IDLE)
Nov 15 23:08:55 dmf2 crmd: [83873]: debug: get_xpath_object: No match for 
//cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
Nov 15 23:08:55 dmf2 crmd: [83873]: info: process_graph_event: Action 
stonith-sgi-ipmi:1_monitor_15000/32 (0:2;32:7:0:0600ce86-cbe8-475c-add7-32ee6b812e21) initiated by a 
different transitioner
Nov 15 23:08:55 dmf2 crmd: [83873]: info: abort_transition_graph: process_graph_event:456 - 
Triggered transition abort (complete=1, tag=lrm_rsc_op, id=stonith-sgi-ipmi:1_monitor_15000, 
magic=0:2;32:7:0:0600ce86-cbe8-475c-add7-32ee6b812e21, cib=0.453.15) : Foreign event

  >> updating the failcount and last failure time
Nov 15 23:09:09 dmf2 crmd: [83873]: WARN: update_failcount: Updating failcount for 
stonith-sgi-ipmi:1 on dmf2 after failed monitor: rc=2 (update=value++, time=1289880535)
Nov 15 23:09:09 dmf2 attrd: [83871]: debug: attrd_local_callback: update message from crmd: 
fail-count-stonith-sgi-ipmi:1=value++
Nov 15 23:09:09 dmf2 crmd: [83873]: debug: attrd_update: Sent update: 
fail-count-stonith-sgi-ipmi:1=value++ for dmf2
Nov 15 23:09:09 dmf2 attrd: [83871]: info: find_hash_entry: Creating hash entry for 
fail-count-stonith-sgi-ipmi:1
Nov 15 23:09:09 dmf2 crmd: [83873]: debug: attrd_update: Sent update: 
last-failure-stonith-sgi-ipmi:1=1289880535 for dmf2
Nov 15 23:09:09 dmf2 attrd: [83871]: debug: attrd_local_callback: Supplied: value++, Current: 
(null), Stored: (null)
...

  >> the timeout is considered a hard error and it will not allow a restart!
Nov 15 23:09:15 dmf2 pengine: [83872]: debug: unpack_rsc_op: stonith-sgi-ipmi:1_monitor_15000 on 
dmf2 returned 2 (invalid parameter) instead of the expected value: 0 (ok)
Nov 15 23:09:15 dmf2 pengine: [83872]: ERROR: unpack_rsc_op: Hard error - 
stonith-sgi-ipmi:1_monitor_15000 failed with rc=2: Preventing stonith-sgi-ipmi-set from re-starting 
on dmf2
Nov 15 23:09:37 dmf2 pengine: [83872]: debug: native_assign_node: All nodes for resource 
stonith-sgi-ipmi:1 are unavailable, unclean or shutting down (dmf1: 1, -1000000)
Nov 15 23:09:37 dmf2 pengine: [83872]: WARN: native_color: Resource stonith-sgi-ipmi:1 cannot run 
anywhere
Nov 15 23:09:37 dmf2 pengine: [83872]: debug: clone_color: Allocated 1 stonith-sgi-ipmi-set 
instances of a possible 2
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource CXFS  (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource Public-IP     (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource PDMO-IP       (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource Openvault     (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource DMF   (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource NFS   (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource stonith-sgi-ipmi:0 
(Started dmf1)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Stop resource stonith-sgi-ipmi:1     (dmf2)
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: unpack_rsc_op: stonith-sgi-ipmi:1_monitor_15000 on 
dmf2 returned 2 (invalid parameter) instead of the expected value: 0 (ok)
Nov 15 23:09:58 dmf2 pengine: [83872]: ERROR: unpack_rsc_op: Hard error - 
stonith-sgi-ipmi:1_monitor_15000 failed with rc=2: Preventing stonith-sgi-ipmi-set from re-starting 
on dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: WARN: unpack_rsc_op: Processing failed op 
stonith-sgi-ipmi:1_monitor_15000 on dmf2: invalid parameter (2)
Nov 15 23:09:58 dmf2 pengine: [83872]: WARN: unpack_rsc_op: Processing failed op CXFS_monitor_0 on 
dmf2: unknown exec error (-2)
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: group_print:  Resource Group: dmfGroup
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:      CXFS  (ocf::sgi:cxfs): 
Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:      Public-IP 
(ocf::heartbeat:IPaddr2):       Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:      PDMO-IP 
(ocf::heartbeat:IPaddr2):       Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:      Openvault 
(ocf::sgi:openvault):   Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:      DMF   (ocf::sgi:dmf): Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:      NFS   (ocf::heartbeat:nfsserver): 
     Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: clone_print:  Clone Set: stonith-sgi-ipmi-set
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource stonith-sgi-ipmi:0 active on dmf1
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource stonith-sgi-ipmi:0 active on dmf1
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource stonith-sgi-ipmi:1 active on dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:      stonith-sgi-ipmi:1 
(stonith:external/sgi-ipmi):    Started dmf2 FAILED
Nov 15 23:10:08 dmf2 pengine: [83872]: notice: short_print:      Started: [ dmf1 ]
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource stonith-sgi-ipmi:0: 
preferring current location (node=dmf1, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource CXFS: preferring 
current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource Public-IP: 
preferring current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource PDMO-IP: preferring 
current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource Openvault: 
preferring current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource DMF: preferring 
current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource NFS: preferring 
current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource stonith-sgi-ipmi:1: 
preferring current location (node=dmf2, weight=1)

  >> This implies the resource can still restart, but the operation is cancelled anyway
Nov 15 23:10:08 dmf2 pengine: [83872]: info: get_failcount: stonith-sgi-ipmi-set has failed 1 times 
on dmf2
Nov 15 23:10:08 dmf2 pengine: [83872]: notice: common_apply_stickiness: stonith-sgi-ipmi-set can 
fail 999999 more times on dmf2 before being forced off
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: native_assign_node: All nodes for resource 
stonith-sgi-ipmi:1 are unavailable, unclean or shutting down (dmf1: 1, -1000000)
Nov 15 23:10:08 dmf2 lrmd: [83870]: info: cancel_op: operation monitor[10] on 
stonith::external/sgi-ipmi::stonith-sgi-ipmi:1 for client 83873, its parameters: 
CRM_meta_interval=[15000] nodelist=[dmf1;admin;admin;supermicro;10.7.8.201 
dmf2;admin;admin;supermicro;10CRM_meta_on_fail=[restart] CRM_meta_timeout=[15000] 
CRM_meta_clone_max=[2] crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] 
CRM_meta_name=[monitor] CRM_meta_clone=[1]  cancelled
Nov 15 23:10:08 dmf2 crmd: [83873]: debug: do_fsa_action: actions:trace:        // A_FINALIZE_TIMER_STOP
Nov 15 23:10:08 dmf2 pengine: [83872]: WARN: native_color: Resource stonith-sgi-ipmi:1 cannot run 
anywhere
Nov 15 23:10:08 dmf2 lrmd: [83870]: debug: on_msg_cancel_op: operation 10 cancelled
Nov 15 23:10:08 dmf2 crmd: [83873]: debug: do_fsa_action: actions:trace:        // A_TE_INVOKE
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: clone_color: Allocated 1 stonith-sgi-ipmi-set 
instances of a possible 2
Nov 15 23:10:08 dmf2 lrmd: [83870]: debug: on_msg_perform_op: add an operation operation stop[26] on 
stonith::external/sgi-ipmi::stonith-sgi-ipmi:1 for client 83873, its parameters: 
CRM_meta_timeout=[20000] CRM_meta_clone_max=[2] crm_feature_set=[3.0.1] 
CRM_meta_globally_unique=[false] CRM_meta_clone=[1]  to the operation list.
Nov 15 23:10:09 dmf2 crmd: [83873]: info: te_rsc_command: Initiating action 3: stop 
stonith-sgi-ipmi:1_stop_0 on dmf2 (local)
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: sending out the signon msg.
Nov 15 23:10:09 dmf2 crmd: [83873]: debug: cancel_op: Cancelling op 10 for stonith-sgi-ipmi:1 
(stonith-sgi-ipmi:1:10)
Nov 15 23:10:09 dmf2 crmd: [83873]: debug: cancel_op: Op 10 for stonith-sgi-ipmi:1 
(stonith-sgi-ipmi:1:10): cancelled
Nov 15 23:10:09 dmf2 pengine: [83872]: notice: LogActions: Stop resource stonith-sgi-ipmi:1     (dmf2)
Nov 15 23:10:09 dmf2 crmd: [83873]: info: do_lrm_rsc_op: Performing 
key=3:35:0:815c525b-e8f1-4bb4-a5eb-2229d1980cc1 op=stonith-sgi-ipmi:1_stop_0 )
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 (pid=102943) succeeded 
to signon to stonithd.
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: signed on to stonithd.
Nov 15 23:10:09 dmf2 lrmd: [102943]: info: Try to stop STONITH resource <rsc_id=stonith-sgi-ipmi:1> 
: Device=external/sgi-ipmi
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: waiting for the stonithRA reply msg.
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 [pid: 102943] requests 
a resource operation stop on stonith-sgi-ipmi:1 (external/sgi-ipmi)
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: external_destroy: called.
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: a stonith RA operation queue to run, call_id=102944.
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: stonithd_receive_ops_result: begin
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: Child process external_stonith-sgi-ipmi:1_stop 
[102944] exited, its exit code: 0 when signo=0.
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: stonith-sgi-ipmi:1's (external/sgi-ipmi) op stop 
finished. op_result=0
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 (pid=102943) signed off
Nov 15 23:10:09 dmf2 lrmd: [83870]: info: Managed stonith-sgi-ipmi:1:stop process 102943 exited with 
return code 0.
Nov 15 23:10:09 dmf2 crmd: [83873]: info: process_lrm_event: LRM operation stonith-sgi-ipmi:1_stop_0 
(call=26, rc=0, cib-update=122, confirmed=true) ok
Nov 15 23:10:09 dmf2 crmd: [83873]: info: match_graph_event: Action stonith-sgi-ipmi:1_stop_0 (3) 
confirmed on dmf2 (rc=0)
-- 

Ron Kerry         rkerry at sgi.com




More information about the Pacemaker mailing list