[Pacemaker] stonith configured but not happening

Andreas Kurz andreas at hastexo.com
Tue Oct 18 09:40:05 EDT 2011


Hello,

On 10/18/2011 02:59 PM, Brian J. Murrell wrote:
> I have a pacemaker 1.0.10 installation on rhel5 but I can't seem to
> manage to get a working stonith configuration.  I have tested my stonith
> device manually using the stonith command and it works fine.  What
> doesn't seem to be happening is pacemaker/stonithd actually asking for a
> stonith.  In my log I get:
> 
> Oct 18 08:54:23 mds1 stonithd: [4645]: ERROR: Failed to STONITH the node
> oss1: optype=RESET, op_result=TIMEOUT
> Oct 18 08:54:23 mds1 crmd: [4650]: info: tengine_stonith_callback:
> call=-975, optype=1, node_name=oss1, result=2, node_list=,
> action=17:1023:0:4e12e206-e0be-4915-bfb8-b4e052057f01
> Oct 18 08:54:23 mds1 crmd: [4650]: ERROR: tengine_stonith_callback:
> Stonith of oss1 failed (2)... aborting transition.
> Oct 18 08:54:23 mds1 crmd: [4650]: info: abort_transition_graph:
> tengine_stonith_callback:402 - Triggered transition abort (complete=0) :
> Stonith failed
> Oct 18 08:54:23 mds1 crmd: [4650]: info: update_abort_priority: Abort
> priority upgraded from 0 to 1000000
> Oct 18 08:54:23 mds1 crmd: [4650]: info: update_abort_priority: Abort
> action done superceeded by restart
> Oct 18 08:54:23 mds1 crmd: [4650]: info: run_graph:
> ====================================================
> Oct 18 08:54:23 mds1 crmd: [4650]: notice: run_graph: Transition 1023
> (Complete=2, Pending=0, Fired=0, Skipped=7, Incomplete=0,
> Source=/var/lib/pengine/pe-warn-5799.bz2): Stopped
> Oct 18 08:54:23 mds1 crmd: [4650]: info: te_graph_trigger: Transition
> 1023 is now complete
> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: All 1
> cluster nodes are eligible to run resources.
> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_pe_invoke: Query 1307:
> Requesting the current CIB: S_POLICY_ENGINE
> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_pe_invoke_callback: Invoking
> the PE: query=1307, ref=pe_calc-dc-1318942463-1164, seq=16860, quorate=0
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct 18 08:54:23 mds1 pengine: [4649]: info: unpack_config: Node scores:
> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: pe_fence_node: Node oss1
> will be fenced because it is un-expectedly down
> Oct 18 08:54:23 mds1 pengine: [4649]: info:
> determine_online_status_fencing: #011ha_state=active, ccm_state=false,
> crm_state=online, join_state=pending, expected=member
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: determine_online_status:
> Node oss1 is unclean
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: pe_fence_node: Node mds2
> will be fenced because it is un-expectedly down
> Oct 18 08:54:23 mds1 pengine: [4649]: info:
> determine_online_status_fencing: #011ha_state=active, ccm_state=false,
> crm_state=online, join_state=pending, expected=member
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: determine_online_status:
> Node mds2 is unclean
> Oct 18 08:54:23 mds1 pengine: [4649]: info:
> determine_online_status_fencing: Node oss2 is down
> Oct 18 08:54:23 mds1 pengine: [4649]: info: determine_online_status:
> Node mds1 is online
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print:
> MGS_2#011(ocf::hydra:Target):#011Started mds1
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print:
> testfs-MDT0000_3#011(ocf::hydra:Target):#011Started mds2
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print:
> testfs-OST0000_4#011(ocf::hydra:Target):#011Started oss1
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: clone_print:  Clone Set:
> fencing
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: short_print:      Stopped:
> [ st-pm:0 st-pm:1 st-pm:2 st-pm:3 ]
> Oct 18 08:54:23 mds1 pengine: [4649]: info: get_failcount:
> testfs-MDT0000_3 has failed 10 times on mds1
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: common_apply_stickiness:
> testfs-MDT0000_3 can fail 999990 more times on mds1 before being forced off
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource
> testfs-OST0000_4 cannot run anywhere
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource
> st-pm:0 cannot run anywhere
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource
> st-pm:1 cannot run anywhere
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource
> st-pm:2 cannot run anywhere
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource
> st-pm:3 cannot run anywhere
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Action
> testfs-MDT0000_3_stop_0 on mds2 is unrunnable (offline)
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Marking node
> mds2 unclean
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: RecurringOp:  Start
> recurring monitor (120s) for testfs-MDT0000_3 on mds1
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Action
> testfs-OST0000_4_stop_0 on oss1 is unrunnable (offline)
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Marking node
> oss1 unclean
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: stage6: Scheduling Node oss1
> for STONITH
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_stop_constraints:
> testfs-OST0000_4_stop_0 is implicit after oss1 is fenced
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: stage6: Scheduling Node mds2
> for STONITH
> Oct 18 08:54:23 mds1 pengine: [4649]: info: native_stop_constraints:
> testfs-MDT0000_3_stop_0 is implicit after mds2 is fenced
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource
> MGS_2#011(Started mds1)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Move resource
> testfs-MDT0000_3#011(Started mds2 -> mds1)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Stop resource
> testfs-OST0000_4#011(oss1)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource
> st-pm:0#011(Stopped)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource
> st-pm:1#011(Stopped)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource
> st-pm:2#011(Stopped)
> Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource
> st-pm:3#011(Stopped)

None of your fencing clones is running, so there is no stonith device available to carry out the reset and the operation times out.
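You can see it in the pengine output you posted: all st-pm instances are "Stopped" and "cannot run anywhere". A quick way to confirm the clone state from the shell (just an illustration, not specific to your setup):

  # one-shot cluster status; the fencing clone should show as
  # Started on every node, not Stopped
  crm_mon -1 | grep -A1 "Clone Set: fencing"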

> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> Oct 18 08:54:23 mds1 crmd: [4650]: info: unpack_graph: Unpacked
> transition 1024: 9 actions in 9 synapses
> Oct 18 08:54:23 mds1 crmd: [4650]: info: do_te_invoke: Processing graph
> 1024 (ref=pe_calc-dc-1318942463-1164) derived from
> /var/lib/pengine/pe-warn-5800.bz2
> Oct 18 08:54:23 mds1 crmd: [4650]: info: te_pseudo_action: Pseudo action
> 15 fired and confirmed
> Oct 18 08:54:23 mds1 crmd: [4650]: info: te_fence_node: Executing reboot
> fencing operation (17) on oss1 (timeout=60000)
> Oct 18 08:54:23 mds1 stonithd: [4645]: info: client tengine [pid: 4650]
> requests a STONITH operation RESET on node oss1
> Oct 18 08:54:23 mds1 stonithd: [4645]: info: we can't manage oss1,
> broadcast request to other nodes
> Oct 18 08:54:23 mds1 stonithd: [4645]: info: Broadcasting the message
> succeeded: require others to stonith node oss1.
> Oct 18 08:54:23 mds1 pengine: [4649]: WARN: process_pe_message:
> Transition 1024: WARNINGs found during PE processing. PEngine Input
> stored in: /var/lib/pengine/pe-warn-5800.bz2
> Oct 18 08:54:23 mds1 pengine: [4649]: info: process_pe_message:
> Configuration WARNINGs found during PE processing.  Please run
> "crm_verify -L" to identify issues.
> 
> My configuration is:
> 
> # crm configure show
> node mds1
> node mds2
> node oss1
> node oss2
> primitive MGS_2 ocf:hydra:Target \
>     meta target-role="Started" \
>     operations $id="MGS_2-operations" \
>     op monitor interval="120" timeout="60" \
>     op start interval="0" timeout="300" \
>     op stop interval="0" timeout="300" \
>     params target="MGS"
> primitive st-pm stonith:external/powerman \
>     params serverhost="192.168.122.1:10101" poweroff="0"
> primitive testfs-MDT0000_3 ocf:hydra:Target \
>     meta target-role="Started" \
>     operations $id="testfs-MDT0000_3-operations" \
>     op monitor interval="120" timeout="60" \
>     op start interval="0" timeout="300" \
>     op stop interval="0" timeout="300" \
>     params target="testfs-MDT0000"
> primitive testfs-OST0000_4 ocf:hydra:Target \
>     meta target-role="Started" \
>     operations $id="testfs-OST0000_4-operations" \
>     op monitor interval="120" timeout="60" \
>     op start interval="0" timeout="300" \
>     op stop interval="0" timeout="300" \
>     params target="testfs-OST0000"
> clone fencing st-pm
> location MGS_2-primary MGS_2 20: mds1
> location MGS_2-secondary MGS_2 10: mds2
> location testfs-MDT0000_3-primary testfs-MDT0000_3 20: mds2
> location testfs-MDT0000_3-secondary testfs-MDT0000_3 10: mds1
> location testfs-OST0000_4-primary testfs-OST0000_4 20: oss1
> location testfs-OST0000_4-secondary testfs-OST0000_4 10: oss2
> property $id="cib-bootstrap-options" \
>     no-quorum-policy="ignore" \
>     expected-quorum-votes="4" \
>     symmetric-cluster="false" \

I'd expect this to be the problem ... if you insist on using an asymmetric
cluster, you must add a location score for every resource on each node it
should be allowed to run on. So add a location constraint for the fencing
clone for each node, or use a symmetric cluster instead.
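Something along these lines should do it (just a sketch -- the constraint
IDs and scores below are placeholders, pick whatever fits your setup):

  # allow the fencing clone to run on all four nodes of the
  # opt-in (asymmetric) cluster
  location fencing-on-mds1 fencing 100: mds1
  location fencing-on-mds2 fencing 100: mds2
  location fencing-on-oss1 fencing 100: oss1
  location fencing-on-oss2 fencing 100: oss2

Or drop the opt-in behaviour altogether:

  # make the cluster symmetric (opt-out), so resources may run
  # anywhere unless a constraint says otherwise
  crm configure property symmetric-cluster="true"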

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

>     cluster-infrastructure="openais" \
>     dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
>     stonith-enabled="true"
> 
> Any ideas why stonith is failing?
> 
> Cheers,
> b.
> 
> 
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




