[ClusterLabs] stonith in dual HMC environment
Alexander Markov
proforg at tic-tac.ru
Wed Mar 22 10:42:53 EDT 2017
> Please share your config along with the logs from the nodes that were
> affected.
I'm starting to think this is not about how the stonith resources are
defined. If the whole box goes down, together with all the logical
partitions defined on it, the HMC cannot determine whether an LPAR
(partition) is really dead or just inaccessible. That leaves the node in
UNCLEAN OFFLINE status, and pacemaker refuses to do anything until it is
resolved. Am I right?
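If so, I suppose the only way out is to acknowledge the fencing manually
once a human has verified on the HMC that the partition is really down.
A sketch of what I have in mind (untested here, assuming stonith_admin
and crmsh of this vintage accept these commands):

    # run on a surviving node, only after confirming by hand that the
    # LPAR is powered off
    stonith_admin --confirm crmapp02
    # or the crmsh equivalent:
    crm node clearstate crmapp02

Anyway, the simplest pacemaker config for my partitions is below.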
primitive sap_ASCS SAPInstance \
        params InstanceName=CAP_ASCS01_crmapp \
        op monitor timeout=60 interval=120 depth=0
primitive sap_D00 SAPInstance \
        params InstanceName=CAP_D00_crmapp \
        op monitor timeout=60 interval=120 depth=0
primitive sap_ip IPaddr2 \
        params ip=10.1.12.2 nic=eth0 cidr_netmask=24
primitive st_ch_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.9 \
        op start interval=0 timeout=300
primitive st_hq_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.8 \
        op start interval=0 timeout=300
group g_sap sap_ip sap_ASCS sap_D00 \
        meta target-role=Started
location l_ch_hq_hmc st_ch_hmc -inf: crmapp01
location l_st_hq_hmc st_hq_hmc -inf: crmapp02
location prefer_node_1 g_sap 100: crmapp01
property cib-bootstrap-options: \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        placement-strategy=balanced \
        expected-quorum-votes=2 \
        dc-version=1.1.12-f47ea56 \
        cluster-infrastructure="classic openais (with plugin)" \
        last-lrm-refresh=1490009096 \
        maintenance-mode=false
rsc_defaults rsc-options: \
        resource-stickiness=200 \
        migration-threshold=3
op_defaults op-options: \
        timeout=600 \
        record-pending=true
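By the way, I wonder whether a fencing topology would help here, so that
both HMCs are tried in turn for each node instead of each node depending
on a single device. A sketch in crm syntax (assuming each HMC can
actually reach and manage both partitions, which I have not verified):

    # the level 1 device is tried first; if it fails, level 2 is used
    fencing_topology \
            crmapp01: st_ch_hmc st_hq_hmc \
            crmapp02: st_hq_hmc st_ch_hmc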
Logs are pretty much going in circles: stonith cannot reset the logical
partition via the HMC, the node stays UNCLEAN OFFLINE, and resources are
shown as still running on the node that is down.
stonith-ng: error: log_operation: Operation 'reboot' [6942] (call
6 from crmd.4568) for host 'crmapp02' with device 'st_ch_hmc:0'
Trying: st_ch_hmc:0
stonith-ng: warning: log_operation: st_ch_hmc:0:6942 [ Performing:
stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng: warning: log_operation: st_ch_hmc:0:6942 [ failed:
crmapp02 3 ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to
execute fence_legacy (reboot). remaining timeout is 59
stonith-ng: info: update_remaining_timeout: Attempted to
execute agent fence_legacy (reboot) the maximum number of times (2)
stonith-ng: error: log_operation: Operation 'reboot' [6955] (call
6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc' re
Trying: st_hq_hmc
stonith-ng: warning: log_operation: st_hq_hmc:6955 [ Performing:
stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng: warning: log_operation: st_hq_hmc:6955 [ failed:
crmapp02 8 ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to
execute fence_legacy (reboot). remaining timeout is 60
stonith-ng: info: update_remaining_timeout: Attempted to
execute agent fence_legacy (reboot) the maximum number of times (2)
stonith-ng: error: log_operation: Operation 'reboot' [6976] (call
6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc:0'
stonith-ng: warning: log_operation: st_hq_hmc:0:6976 [ Performing:
stonith -t ibmhmc -T reset crmapp02 ]
stonith-ng: warning: log_operation: st_hq_hmc:0:6976 [ failed:
crmapp02 8 ]
stonith-ng: notice: stonith_choose_peer: Couldn't find anyone to
fence crmapp02 with <any>
stonith-ng: info: call_remote_stonith: None of the 1 peers are
capable of terminating crmapp02 for crmd.4568 (1)
stonith-ng: error: remote_op_done: Operation reboot of crmapp02 by
<no-one> for crmd.4568 at crmapp01.6bf66b9c: No route to host
crmd: notice: tengine_stonith_callback: Stonith operation
6/31:3700:0:b1fed277-9156-48da-8afd-35db672cd1c8: No route to
crmd: notice: tengine_stonith_callback: Stonith operation 6
for crmapp02 failed (No route to host): aborting transition.
crmd: notice: abort_transition_graph: Transition aborted: Stonith
failed (source=tengine_stonith_callback:699, 0)
crmd: notice: tengine_stonith_notify: Peer crmapp02 was not
terminated (reboot) by <anyone> for crmapp01: No route to host (re
crmd: notice: run_graph: Transition 3700 (Complete=1,
Pending=0, Fired=0, Skipped=18, Incomplete=2, Source=/var/lib/pacem
crmd: info: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_IN
pengine: info: process_pe_message: Input has not changed since
last time, not saving to disk
pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
pengine: info: determine_online_status_fencing: Node crmapp01 is
active
pengine: info: determine_online_status: Node crmapp01 is online
pengine: warning: pe_fence_node: Node crmapp02 will be fenced
because the node is no longer part of the cluster
pengine: warning: determine_online_status: Node crmapp02 is unclean
pengine: info: clone_print: Clone Set: cl_st_ch_hmc
[st_ch_hmc]
pengine: info: native_print: st_ch_hmc (stonith:ibmhmc):
Started crmapp02 (UNCLEAN)
pengine: info: short_print: Started: [ crmapp01 ]
pengine: info: clone_print: Clone Set: cl_st_hq_hmc
[st_hq_hmc]
pengine: info: native_print: st_hq_hmc (stonith:ibmhmc):
Started crmapp02 (UNCLEAN)
pengine: info: short_print: Started: [ crmapp01 ]
pengine: info: group_print: Resource Group: g_sap
pengine: info: native_print: sap_ip
(ocf::heartbeat:IPaddr2): Started crmapp02 (UNCLEAN)
pengine: info: native_print: sap_ASCS
(ocf::heartbeat:SAPInstance): Started crmapp02 (UNCLEAN)
pengine: info: native_print: sap_D00
(ocf::heartbeat:SAPInstance): Started crmapp02 (UNCLEAN)
pengine: info: native_color: Resource st_ch_hmc:1 cannot run
anywhere
pengine: info: native_color: Resource st_hq_hmc:1 cannot run
anywhere
pengine: warning: custom_action: Action st_ch_hmc:1_stop_0 on
crmapp02 is unrunnable (offline)
pengine: warning: custom_action: Action st_ch_hmc:1_stop_0 on
crmapp02 is unrunnable (offline)
pengine: warning: custom_action: Action st_hq_hmc:1_stop_0 on
crmapp02 is unrunnable (offline)
pengine: warning: custom_action: Action st_hq_hmc:1_stop_0 on
crmapp02 is unrunnable (offline)
pengine: warning: custom_action: Action sap_ip_stop_0 on crmapp02 is
unrunnable (offline)
pengine: warning: custom_action: Action sap_ASCS_stop_0 on crmapp02
is unrunnable (offline)
pengine: info: RecurringOp: Start recurring monitor (120s) for
sap_ASCS on crmapp01
pengine: warning: custom_action: Action sap_D00_stop_0 on crmapp02
is unrunnable (offline)
pengine: info: RecurringOp: Start recurring monitor (120s) for
sap_D00 on crmapp01
pengine: warning: stage6: Scheduling Node crmapp02 for STONITH
pengine: info: native_stop_constraints: st_ch_hmc:1_stop_0 is
implicit after crmapp02 is fenced
pengine: info: native_stop_constraints: st_hq_hmc:1_stop_0 is
implicit after crmapp02 is fenced
pengine: info: native_stop_constraints: sap_ip_stop_0 is implicit
after crmapp02 is fenced
pengine: info: native_stop_constraints: sap_ASCS_stop_0 is implicit
after crmapp02 is fenced
pengine: info: native_stop_constraints: sap_D00_stop_0 is implicit
after crmapp02 is fenced
pengine: info: LogActions: Leave st_ch_hmc:0 (Started
crmapp01)
pengine: notice: LogActions: Stop st_ch_hmc:1 (crmapp02)
pengine: info: LogActions: Leave st_hq_hmc:0 (Started
crmapp01)
pengine: notice: LogActions: Stop st_hq_hmc:1 (crmapp02)
pengine: notice: LogActions: Move sap_ip (Started crmapp02
-> crmapp01)
pengine: notice: LogActions: Move sap_ASCS (Started
crmapp02 -> crmapp01)
pengine: notice: LogActions: Move sap_D00 (Started crmapp02
-> crmapp01)
pengine: warning: process_pe_message: Calculated Transition 3701:
/var/lib/pacemaker/pengine/pe-warn-5.bz2
crmd: info: do_state_transition: State transition
S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC
crmd: notice: do_te_invoke: Processing graph 3701
(ref=pe_calc-dc-1489966722-3790) derived from
/var/lib/pacemaker/pengine/pe-warn-5.bz2
crmd: notice: te_fence_node: Executing reboot fencing operation
(31) on crmapp02 (timeout=60000)
stonith-ng: notice: handle_request: Client crmd.4568.9cd8bc8b wants
to fence (reboot) 'crmapp02' with device '(any)'
stonith-ng: notice: initiate_remote_stonith_op: Initiating
remote operation reboot for crmapp02: ed7f7eae-4836-451d-b146-d6243b5
stonith-ng: notice: get_capable_devices: stonith-timeout duration
60 is low for the current configuration. Consider raising it to 80
seconds
stonith-ng: notice: can_fence_host_with_device: st_hq_hmc can
fence (reboot) crmapp02: dynamic-list
stonith-ng: notice: can_fence_host_with_device: st_hq_hmc:0 can
fence (reboot) crmapp02: dynamic-list
stonith-ng: warning: log_action: fence_legacy[6987] stderr: [
ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[6987] stderr: [
Invalid config info for ibmhmc device ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to
execute fence_legacy (status). remaining timeout is 11
stonith-ng: warning: log_action: fence_legacy[6986] stderr: [
ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[6986] stderr: [
Invalid config info for ibmhmc device ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to
execute fence_legacy (list). remaining timeout is 11
stonith-ng: warning: log_action: fence_legacy[6994] stderr: [
ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[6994] stderr: [
Invalid config info for ibmhmc device ]
stonith-ng: info: update_remaining_timeout: Attempted to
execute agent fence_legacy (list) the maximum number of times (2) allowed
stonith-ng: warning: log_action: fence_legacy[6993] stderr: [
ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[6993] stderr: [
Invalid config info for ibmhmc device ]
stonith-ng: info: update_remaining_timeout: Attempted to
execute agent fence_legacy (status) the maximum number of times (2)
stonith-ng: notice: status_search_cb: Unkown result when
testing if st_ch_hmc can fence crmapp02: rc=-201
stonith-ng: info: process_remote_stonith_query: Query result 1
of 1 from crmapp01 for crmapp02/reboot (3 devices) ed7f7eae-4836-451d-b146-d6243b5c8bf3
stonith-ng: info: call_remote_stonith: Total remote op timeout
set to 180 for fencing of node crmapp02 for crmd.4568.ed7f7eae
stonith-ng: info: call_remote_stonith: Requesting that crmapp01
perform op reboot crmapp02 for crmd.4568 (216s, 0s)
stonith-ng: notice: get_capable_devices: stonith-timeout duration
60 is low for the current configuration. Consider raising it to
stonith-ng: notice: can_fence_host_with_device: st_hq_hmc can
fence (reboot) crmapp02: dynamic-list
stonith-ng: notice: can_fence_host_with_device: st_hq_hmc:0 can
fence (reboot) crmapp02: dynamic-list
stonith-ng: warning: log_action: fence_legacy[6999] stderr: [
ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[6999] stderr: [
Invalid config info for ibmhmc device ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to
execute fence_legacy (list). remaining timeout is 11
stonith-ng: warning: log_action: fence_legacy[7000] stderr: [
ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[7000] stderr: [
Invalid config info for ibmhmc device ]
stonith-ng: info: internal_stonith_action_execute: Attempt 2 to
execute fence_legacy (status). remaining timeout is 11
stonith-ng: warning: log_action: fence_legacy[7007] stderr: [
ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[7007] stderr: [
Invalid config info for ibmhmc device ]
stonith-ng: info: update_remaining_timeout: Attempted to
execute agent fence_legacy (list) the maximum number of times (2) allowed
stonith-ng: warning: log_action: fence_legacy[7008] stderr: [
ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
stonith-ng: warning: log_action: fence_legacy[7008] stderr: [
Invalid config info for ibmhmc device ]
stonith-ng: info: update_remaining_timeout: Attempted to
execute agent fence_legacy (status) the maximum number of times (2)
stonith-ng: notice: status_search_cb: Unkown result when
testing if st_ch_hmc can fence crmapp02: rc=-201
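For completeness: fence_legacy in the logs above is just driving the old
stonith(8) binary from cluster-glue, so the agent can also be exercised
by hand with pacemaker out of the picture. Roughly like this (assuming
stonith(8) takes the plugin parameters as name=value arguments):

    # check device status and the list of targets on the HQ HMC
    stonith -t ibmhmc ipaddr=10.1.2.8 -lS
    # the actual reset, same call the cluster performs
    stonith -t ibmhmc ipaddr=10.1.2.8 -T reset crmapp02

That should at least show whether the problem is the HMC itself or the
stonith configuration.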
--
Regards,
Alexander