[Pacemaker] stonith configured but not happening

Brian J. Murrell brian at interlinx.bc.ca
Tue Oct 18 08:59:09 EDT 2011


I have a Pacemaker 1.0.10 installation on RHEL 5 but I can't seem to
get a working STONITH configuration.  I have tested my STONITH device
manually using the stonith command and it works fine (a rough sketch of
that test is below, after my configuration).  What doesn't seem to be
happening is pacemaker/stonithd actually carrying out the STONITH.  In
my log I get:

Oct 18 08:54:23 mds1 stonithd: [4645]: ERROR: Failed to STONITH the node
oss1: optype=RESET, op_result=TIMEOUT
Oct 18 08:54:23 mds1 crmd: [4650]: info: tengine_stonith_callback:
call=-975, optype=1, node_name=oss1, result=2, node_list=,
action=17:1023:0:4e12e206-e0be-4915-bfb8-b4e052057f01
Oct 18 08:54:23 mds1 crmd: [4650]: ERROR: tengine_stonith_callback:
Stonith of oss1 failed (2)... aborting transition.
Oct 18 08:54:23 mds1 crmd: [4650]: info: abort_transition_graph:
tengine_stonith_callback:402 - Triggered transition abort (complete=0) :
Stonith failed
Oct 18 08:54:23 mds1 crmd: [4650]: info: update_abort_priority: Abort
priority upgraded from 0 to 1000000
Oct 18 08:54:23 mds1 crmd: [4650]: info: update_abort_priority: Abort
action done superceeded by restart
Oct 18 08:54:23 mds1 crmd: [4650]: info: run_graph:
====================================================
Oct 18 08:54:23 mds1 crmd: [4650]: notice: run_graph: Transition 1023
(Complete=2, Pending=0, Fired=0, Skipped=7, Incomplete=0,
Source=/var/lib/pengine/pe-warn-5799.bz2): Stopped
Oct 18 08:54:23 mds1 crmd: [4650]: info: te_graph_trigger: Transition
1023 is now complete
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: All 1
cluster nodes are eligible to run resources.
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_pe_invoke: Query 1307:
Requesting the current CIB: S_POLICY_ENGINE
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_pe_invoke_callback: Invoking
the PE: query=1307, ref=pe_calc-dc-1318942463-1164, seq=16860, quorate=0
Oct 18 08:54:23 mds1 pengine: [4649]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Oct 18 08:54:23 mds1 pengine: [4649]: info: unpack_config: Node scores:
'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: pe_fence_node: Node oss1
will be fenced because it is un-expectedly down
Oct 18 08:54:23 mds1 pengine: [4649]: info:
determine_online_status_fencing: #011ha_state=active, ccm_state=false,
crm_state=online, join_state=pending, expected=member
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: determine_online_status:
Node oss1 is unclean
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: pe_fence_node: Node mds2
will be fenced because it is un-expectedly down
Oct 18 08:54:23 mds1 pengine: [4649]: info:
determine_online_status_fencing: #011ha_state=active, ccm_state=false,
crm_state=online, join_state=pending, expected=member
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: determine_online_status:
Node mds2 is unclean
Oct 18 08:54:23 mds1 pengine: [4649]: info:
determine_online_status_fencing: Node oss2 is down
Oct 18 08:54:23 mds1 pengine: [4649]: info: determine_online_status:
Node mds1 is online
Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print:
MGS_2#011(ocf::hydra:Target):#011Started mds1
Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print:
testfs-MDT0000_3#011(ocf::hydra:Target):#011Started mds2
Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print:
testfs-OST0000_4#011(ocf::hydra:Target):#011Started oss1
Oct 18 08:54:23 mds1 pengine: [4649]: notice: clone_print:  Clone Set:
fencing
Oct 18 08:54:23 mds1 pengine: [4649]: notice: short_print:      Stopped:
[ st-pm:0 st-pm:1 st-pm:2 st-pm:3 ]
Oct 18 08:54:23 mds1 pengine: [4649]: info: get_failcount:
testfs-MDT0000_3 has failed 10 times on mds1
Oct 18 08:54:23 mds1 pengine: [4649]: notice: common_apply_stickiness:
testfs-MDT0000_3 can fail 999990 more times on mds1 before being forced off
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource
testfs-OST0000_4 cannot run anywhere
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource
st-pm:0 cannot run anywhere
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource
st-pm:1 cannot run anywhere
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource
st-pm:2 cannot run anywhere
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource
st-pm:3 cannot run anywhere
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Action
testfs-MDT0000_3_stop_0 on mds2 is unrunnable (offline)
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Marking node
mds2 unclean
Oct 18 08:54:23 mds1 pengine: [4649]: notice: RecurringOp:  Start
recurring monitor (120s) for testfs-MDT0000_3 on mds1
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Action
testfs-OST0000_4_stop_0 on oss1 is unrunnable (offline)
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Marking node
oss1 unclean
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: stage6: Scheduling Node oss1
for STONITH
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_stop_constraints:
testfs-OST0000_4_stop_0 is implicit after oss1 is fenced
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: stage6: Scheduling Node mds2
for STONITH
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_stop_constraints:
testfs-MDT0000_3_stop_0 is implicit after mds2 is fenced
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource
MGS_2#011(Started mds1)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Move resource
testfs-MDT0000_3#011(Started mds2 -> mds1)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Stop resource
testfs-OST0000_4#011(oss1)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource
st-pm:0#011(Stopped)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource
st-pm:1#011(Stopped)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource
st-pm:2#011(Stopped)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource
st-pm:3#011(Stopped)
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Oct 18 08:54:23 mds1 crmd: [4650]: info: unpack_graph: Unpacked
transition 1024: 9 actions in 9 synapses
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_te_invoke: Processing graph
1024 (ref=pe_calc-dc-1318942463-1164) derived from
/var/lib/pengine/pe-warn-5800.bz2
Oct 18 08:54:23 mds1 crmd: [4650]: info: te_pseudo_action: Pseudo action
15 fired and confirmed
Oct 18 08:54:23 mds1 crmd: [4650]: info: te_fence_node: Executing reboot
fencing operation (17) on oss1 (timeout=60000)
Oct 18 08:54:23 mds1 stonithd: [4645]: info: client tengine [pid: 4650]
requests a STONITH operation RESET on node oss1
Oct 18 08:54:23 mds1 stonithd: [4645]: info: we can't manage oss1,
broadcast request to other nodes
Oct 18 08:54:23 mds1 stonithd: [4645]: info: Broadcasting the message
succeeded: require others to stonith node oss1.
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: process_pe_message:
Transition 1024: WARNINGs found during PE processing. PEngine Input
stored in: /var/lib/pengine/pe-warn-5800.bz2
Oct 18 08:54:23 mds1 pengine: [4649]: info: process_pe_message:
Configuration WARNINGs found during PE processing.  Please run
"crm_verify -L" to identify issues.

My configuration is:

# crm configure show
node mds1
node mds2
node oss1
node oss2
primitive MGS_2 ocf:hydra:Target \
    meta target-role="Started" \
    operations $id="MGS_2-operations" \
    op monitor interval="120" timeout="60" \
    op start interval="0" timeout="300" \
    op stop interval="0" timeout="300" \
    params target="MGS"
primitive st-pm stonith:external/powerman \
    params serverhost="192.168.122.1:10101" poweroff="0"
primitive testfs-MDT0000_3 ocf:hydra:Target \
    meta target-role="Started" \
    operations $id="testfs-MDT0000_3-operations" \
    op monitor interval="120" timeout="60" \
    op start interval="0" timeout="300" \
    op stop interval="0" timeout="300" \
    params target="testfs-MDT0000"
primitive testfs-OST0000_4 ocf:hydra:Target \
    meta target-role="Started" \
    operations $id="testfs-OST0000_4-operations" \
    op monitor interval="120" timeout="60" \
    op start interval="0" timeout="300" \
    op stop interval="0" timeout="300" \
    params target="testfs-OST0000"
clone fencing st-pm
location MGS_2-primary MGS_2 20: mds1
location MGS_2-secondary MGS_2 10: mds2
location testfs-MDT0000_3-primary testfs-MDT0000_3 20: mds2
location testfs-MDT0000_3-secondary testfs-MDT0000_3 10: mds1
location testfs-OST0000_4-primary testfs-OST0000_4 20: oss1
location testfs-OST0000_4-secondary testfs-OST0000_4 10: oss2
property $id="cib-bootstrap-options" \
    no-quorum-policy="ignore" \
    expected-quorum-votes="4" \
    symmetric-cluster="false" \
    cluster-infrastructure="openais" \
    dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
    stonith-enabled="true"
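
For reference, the manual test I mentioned was along these lines (typed
from memory, so the exact invocation may not be verbatim), first listing
the nodes the device knows about and then resetting one:

# stonith -t external/powerman serverhost="192.168.122.1:10101" poweroff="0" -l
# stonith -t external/powerman serverhost="192.168.122.1:10101" poweroff="0" -T reset oss1

Both behaved as expected when run by hand.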

Any ideas why stonith is failing?

Cheers,
b.

