[ClusterLabs] stonith in dual HMC environment

Ken Gaillot kgaillot at redhat.com
Fri Mar 24 16:01:45 UTC 2017


On 03/22/2017 09:42 AM, Alexander Markov wrote:
> 
>> Please share your config along with the logs from the nodes that were
>> affected.
> 
> I'm starting to think it's not about how to define stonith resources. If
> the whole box is down along with all the logical partitions defined on it,
> then the HMC cannot determine whether an LPAR (partition) is really dead
> or just inaccessible. This leads to an UNCLEAN OFFLINE node status and
> pacemaker refusing to do anything until it's resolved. Am I right? Anyway,
> the simplest pacemaker config from my partitions is below.

Yes, it looks like you are correct. The fence agent is returning an
error when pacemaker tries to use it to reboot crmapp02. From the stderr
in the logs, the message is "ssh: connect to host 10.1.2.9 port 22: No
route to host".

The first thing I'd try is making sure you can fence each node from the
command line by manually running the fence agent. I'm not sure how to do
that for the "stonith:" type agents.
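
That said, your logs do show what fence_legacy runs under the hood
("stonith -t ibmhmc -T reset crmapp02"), so something along these lines
should exercise the agent directly -- the parameter syntax here is my
best guess from stonith(8), so adjust as needed:

    # ask the device for its status and the hosts it can control
    stonith -t ibmhmc ipaddr=10.1.2.9 -S
    stonith -t ibmhmc ipaddr=10.1.2.9 -l

    # attempt an actual reset (this really will reboot the partition)
    stonith -t ibmhmc ipaddr=10.1.2.9 -T reset crmapp02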

Once that's working, make sure the cluster can do the same, by manually
running "stonith_admin -B $NODE" for each $NODE.

> 
> primitive sap_ASCS SAPInstance \
>     params InstanceName=CAP_ASCS01_crmapp \
>     op monitor timeout=60 interval=120 depth=0
> primitive sap_D00 SAPInstance \
>     params InstanceName=CAP_D00_crmapp \
>     op monitor timeout=60 interval=120 depth=0
> primitive sap_ip IPaddr2 \
>     params ip=10.1.12.2 nic=eth0 cidr_netmask=24

> primitive st_ch_hmc stonith:ibmhmc \
>     params ipaddr=10.1.2.9 \
>     op start interval=0 timeout=300
> primitive st_hq_hmc stonith:ibmhmc \
>     params ipaddr=10.1.2.8 \
>     op start interval=0 timeout=300

I see you have two stonith devices defined, but they don't specify which
nodes they can fence -- pacemaker will assume that either device can be
used to fence either node.

> group g_sap sap_ip sap_ASCS sap_D00 \
>     meta target-role=Started

> location l_ch_hq_hmc st_ch_hmc -inf: crmapp01
> location l_st_hq_hmc st_hq_hmc -inf: crmapp02

These constraints restrict which node monitors which device, not which
node the device can fence.

Assuming st_ch_hmc is intended to fence crmapp01, this will make sure
that crmapp02 monitors that device -- but you also want something like
pcmk_host_list=crmapp01 in the device configuration.
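
A sketch of what I mean -- note I'm guessing at which HMC manages which
partition, so swap the node names if I have the mapping backwards:

    primitive st_ch_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.9 pcmk_host_list=crmapp01 \
        op start interval=0 timeout=300
    primitive st_hq_hmc stonith:ibmhmc \
        params ipaddr=10.1.2.8 pcmk_host_list=crmapp02 \
        op start interval=0 timeout=300

With pcmk_host_list set, the cluster will only try a device against the
node(s) it lists, instead of assuming either device can fence either
node.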

> location prefer_node_1 g_sap 100: crmapp01
> property cib-bootstrap-options: \
>     stonith-enabled=true \
>     no-quorum-policy=ignore \
>     placement-strategy=balanced \
>     expected-quorum-votes=2 \
>     dc-version=1.1.12-f47ea56 \
>     cluster-infrastructure="classic openais (with plugin)" \
>     last-lrm-refresh=1490009096 \
>     maintenance-mode=false
> rsc_defaults rsc-options: \
>     resource-stickiness=200 \
>     migration-threshold=3
> op_defaults op-options: \
>     timeout=600 \
>     record-pending=true
> 
> The logs are pretty much going in circles: stonith cannot reset the
> logical partition via the HMC, the node stays UNCLEAN OFFLINE, and
> resources are shown as still running on the node that is down.
> 
> 
> stonith-ng:    error: log_operation:    Operation 'reboot' [6942] (call
> 6 from crmd.4568) for host 'crmapp02' with device 'st_ch_hmc:0'
> Trying: st_ch_hmc:0
> stonith-ng:  warning: log_operation:    st_ch_hmc:0:6942 [ Performing:
> stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng:  warning: log_operation:    st_ch_hmc:0:6942 [ failed:
> crmapp02 3 ]
> stonith-ng:     info: internal_stonith_action_execute:  Attempt 2 to
> execute fence_legacy (reboot). remaining timeout is 59
> stonith-ng:     info: update_remaining_timeout:         Attempted to
> execute agent fence_legacy (reboot) the maximum number of times (2)
> 
> stonith-ng:    error: log_operation:    Operation 'reboot' [6955] (call
> 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc' re
> Trying: st_hq_hmc
> stonith-ng:  warning: log_operation:    st_hq_hmc:6955 [ Performing:
> stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng:  warning: log_operation:    st_hq_hmc:6955 [ failed:
> crmapp02 8 ]
> stonith-ng:     info: internal_stonith_action_execute:  Attempt 2 to
> execute fence_legacy (reboot). remaining timeout is 60
> stonith-ng:     info: update_remaining_timeout:         Attempted to
> execute agent fence_legacy (reboot) the maximum number of times (2)
> 
> stonith-ng:    error: log_operation:    Operation 'reboot' [6976] (call
> 6 from crmd.4568) for host 'crmapp02' with device 'st_hq_hmc:0'
> 
> stonith-ng:  warning: log_operation:    st_hq_hmc:0:6976 [ Performing:
> stonith -t ibmhmc -T reset crmapp02 ]
> stonith-ng:  warning: log_operation:    st_hq_hmc:0:6976 [ failed:
> crmapp02 8 ]
> stonith-ng:   notice: stonith_choose_peer:      Couldn't find anyone to
> fence crmapp02 with <any>
> stonith-ng:     info: call_remote_stonith:      None of the 1 peers are
> capable of terminating crmapp02 for crmd.4568 (1)
> stonith-ng:    error: remote_op_done:   Operation reboot of crmapp02 by
> <no-one> for crmd.4568 at crmapp01.6bf66b9c: No route to host
> crmd:   notice: tengine_stonith_callback:         Stonith operation
> 6/31:3700:0:b1fed277-9156-48da-8afd-35db672cd1c8: No route to
> 
> crmd:   notice: tengine_stonith_callback:         Stonith operation 6
> for crmapp02 failed (No route to host): aborting transition.
> crmd:   notice: abort_transition_graph:   Transition aborted: Stonith
> failed (source=tengine_stonith_callback:699, 0)
> crmd:   notice: tengine_stonith_notify:   Peer crmapp02 was not
> terminated (reboot) by <anyone> for crmapp01: No route to host (re
> 
> crmd:   notice: run_graph:        Transition 3700 (Complete=1,
> Pending=0, Fired=0, Skipped=18, Incomplete=2, Source=/var/lib/pacem
> 
> crmd:     info: do_state_transition:      State transition
> S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_IN
> 
> pengine:     info: process_pe_message:       Input has not changed since
> last time, not saving to disk
> pengine:   notice: unpack_config:    On loss of CCM Quorum: Ignore
> pengine:     info: determine_online_status_fencing:  Node crmapp01 is
> active
> pengine:     info: determine_online_status:  Node crmapp01 is online
> pengine:  warning: pe_fence_node:    Node crmapp02 will be fenced
> because the node is no longer part of the cluster
> pengine:  warning: determine_online_status:  Node crmapp02 is unclean
> pengine:     info: clone_print:       Clone Set: cl_st_ch_hmc [st_ch_hmc]
> pengine:     info: native_print:          st_ch_hmc  (stonith:ibmhmc):  
> Started crmapp02 (UNCLEAN)
> pengine:     info: short_print:           Started: [ crmapp01 ]
> pengine:     info: clone_print:       Clone Set: cl_st_hq_hmc [st_hq_hmc]
> pengine:     info: native_print:          st_hq_hmc  (stonith:ibmhmc):  
> Started crmapp02 (UNCLEAN)
> pengine:     info: short_print:           Started: [ crmapp01 ]
> pengine:     info: group_print:       Resource Group: g_sap
> pengine:     info: native_print:          sap_ip    
> (ocf::heartbeat:IPaddr2):       Started crmapp02 (UNCLEAN)
> pengine:     info: native_print:          sap_ASCS  
> (ocf::heartbeat:SAPInstance):   Started crmapp02 (UNCLEAN)
> pengine:     info: native_print:          sap_D00   
> (ocf::heartbeat:SAPInstance):   Started crmapp02 (UNCLEAN)
> pengine:     info: native_color:     Resource st_ch_hmc:1 cannot run
> anywhere
> pengine:     info: native_color:     Resource st_hq_hmc:1 cannot run
> anywhere
> pengine:  warning: custom_action:    Action st_ch_hmc:1_stop_0 on
> crmapp02 is unrunnable (offline)
> pengine:  warning: custom_action:    Action st_ch_hmc:1_stop_0 on
> crmapp02 is unrunnable (offline)
> pengine:  warning: custom_action:    Action st_hq_hmc:1_stop_0 on
> crmapp02 is unrunnable (offline)
> pengine:  warning: custom_action:    Action st_hq_hmc:1_stop_0 on
> crmapp02 is unrunnable (offline)
> pengine:  warning: custom_action:    Action sap_ip_stop_0 on crmapp02 is
> unrunnable (offline)
> pengine:  warning: custom_action:    Action sap_ASCS_stop_0 on crmapp02
> is unrunnable (offline)
> pengine:     info: RecurringOp:       Start recurring monitor (120s) for
> sap_ASCS on crmapp01
> pengine:  warning: custom_action:    Action sap_D00_stop_0 on crmapp02
> is unrunnable (offline)
> pengine:     info: RecurringOp:       Start recurring monitor (120s) for
> sap_D00 on crmapp01
> pengine:  warning: stage6:   Scheduling Node crmapp02 for STONITH
> pengine:     info: native_stop_constraints:  st_ch_hmc:1_stop_0 is
> implicit after crmapp02 is fenced
> pengine:     info: native_stop_constraints:  st_hq_hmc:1_stop_0 is
> implicit after crmapp02 is fenced
> pengine:     info: native_stop_constraints:  sap_ip_stop_0 is implicit
> after crmapp02 is fenced
> pengine:     info: native_stop_constraints:  sap_ASCS_stop_0 is implicit
> after crmapp02 is fenced
> pengine:     info: native_stop_constraints:  sap_D00_stop_0 is implicit
> after crmapp02 is fenced
> pengine:     info: LogActions:       Leave   st_ch_hmc:0     (Started
> crmapp01)
> pengine:   notice: LogActions:       Stop    st_ch_hmc:1     (crmapp02)
> pengine:     info: LogActions:       Leave   st_hq_hmc:0     (Started
> crmapp01)
> pengine:   notice: LogActions:       Stop    st_hq_hmc:1     (crmapp02)
> pengine:   notice: LogActions:       Move    sap_ip  (Started crmapp02
> -> crmapp01)
> pengine:   notice: LogActions:       Move    sap_ASCS        (Started
> crmapp02 -> crmapp01)
> pengine:   notice: LogActions:       Move    sap_D00 (Started crmapp02
> -> crmapp01)
> pengine:  warning: process_pe_message:       Calculated Transition 3701:
> /var/lib/pacemaker/pengine/pe-warn-5.bz2
> crmd:     info: do_state_transition:      State transition
> S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC
> 
> crmd:   notice: do_te_invoke:     Processing graph 3701
> (ref=pe_calc-dc-1489966722-3790) derived from
> /var/lib/pacemaker/pengine/pe-warn-5.bz2
> crmd:   notice: te_fence_node:    Executing reboot fencing operation
> (31) on crmapp02 (timeout=60000)
> stonith-ng:   notice: handle_request:   Client crmd.4568.9cd8bc8b wants
> to fence (reboot) 'crmapp02' with device '(any)'
> stonith-ng:   notice: initiate_remote_stonith_op:       Initiating
> remote operation reboot for crmapp02: ed7f7eae-4836-451d-b146-d6243b5
> 
> stonith-ng:   notice: get_capable_devices:      stonith-timeout duration
> 60 is low for the current configuration. Consider raising it to 80 seconds
> stonith-ng:   notice: can_fence_host_with_device:       st_hq_hmc can
> fence (reboot) crmapp02: dynamic-list
> stonith-ng:   notice: can_fence_host_with_device:       st_hq_hmc:0 can
> fence (reboot) crmapp02: dynamic-list
> stonith-ng:  warning: log_action:       fence_legacy[6987] stderr: [
> ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
> stonith-ng:  warning: log_action:       fence_legacy[6987] stderr: [
> Invalid config info for ibmhmc device ]
> stonith-ng:     info: internal_stonith_action_execute:  Attempt 2 to
> execute fence_legacy (status). remaining timeout is 11
> stonith-ng:  warning: log_action:       fence_legacy[6986] stderr: [
> ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
> stonith-ng:  warning: log_action:       fence_legacy[6986] stderr: [
> Invalid config info for ibmhmc device ]
> stonith-ng:     info: internal_stonith_action_execute:  Attempt 2 to
> execute fence_legacy (list). remaining timeout is 11
> stonith-ng:  warning: log_action:       fence_legacy[6994] stderr: [
> ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
> stonith-ng:  warning: log_action:       fence_legacy[6994] stderr: [
> Invalid config info for ibmhmc device ]
> stonith-ng:     info: update_remaining_timeout:         Attempted to
> execute agent fence_legacy (list) the maximum number of times (2) allowed
> stonith-ng:  warning: log_action:       fence_legacy[6993] stderr: [
> ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
> stonith-ng:  warning: log_action:       fence_legacy[6993] stderr: [
> Invalid config info for ibmhmc device ]
> stonith-ng:     info: update_remaining_timeout:         Attempted to
> execute agent fence_legacy (status) the maximum number of times (2)
> 
> stonith-ng:   notice: status_search_cb:         Unkown result when
> testing if st_ch_hmc can fence crmapp02: rc=-201
> stonith-ng:     info: process_remote_stonith_query:     Query result 1
> of 1 from crmapp01 for crmapp02/reboot (3 devices) ed7f7eae-4836-
> 451d-b146-d6243b5c8bf3
> stonith-ng:     info: call_remote_stonith:      Total remote op timeout
> set to 180 for fencing of node crmapp02 for crmd.4568.ed7f7eae
> stonith-ng:     info: call_remote_stonith:      Requesting that crmapp01
> perform op reboot crmapp02 for crmd.4568 (216s, 0s)
> stonith-ng:   notice: get_capable_devices:      stonith-timeout duration
> 60 is low for the current configuration. Consider raising it to
> 
> stonith-ng:   notice: can_fence_host_with_device:       st_hq_hmc can
> fence (reboot) crmapp02: dynamic-list
> stonith-ng:   notice: can_fence_host_with_device:       st_hq_hmc:0 can
> fence (reboot) crmapp02: dynamic-list
> stonith-ng:  warning: log_action:       fence_legacy[6999] stderr: [
> ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
> stonith-ng:  warning: log_action:       fence_legacy[6999] stderr: [
> Invalid config info for ibmhmc device ]
> stonith-ng:     info: internal_stonith_action_execute:  Attempt 2 to
> execute fence_legacy (list). remaining timeout is 11
> stonith-ng:  warning: log_action:       fence_legacy[7000] stderr: [
> ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
> stonith-ng:  warning: log_action:       fence_legacy[7000] stderr: [
> Invalid config info for ibmhmc device ]
> stonith-ng:     info: internal_stonith_action_execute:  Attempt 2 to
> execute fence_legacy (status). remaining timeout is 11
> stonith-ng:  warning: log_action:       fence_legacy[7007] stderr: [
> ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
> stonith-ng:  warning: log_action:       fence_legacy[7007] stderr: [
> Invalid config info for ibmhmc device ]
> stonith-ng:     info: update_remaining_timeout:         Attempted to
> execute agent fence_legacy (list) the maximum number of times (2) allowed
> stonith-ng:  warning: log_action:       fence_legacy[7008] stderr: [
> ssh: connect to host 10.1.2.9 port 22: No route to host^M ]
> stonith-ng:  warning: log_action:       fence_legacy[7008] stderr: [
> Invalid config info for ibmhmc device ]
> stonith-ng:     info: update_remaining_timeout:         Attempted to
> execute agent fence_legacy (status) the maximum number of times (2)
> 
> stonith-ng:   notice: status_search_cb:         Unkown result when
> testing if st_ch_hmc can fence crmapp02: rc=-201
> 
> 
> -- 
> Regards,
> Alexander
> 