[Pacemaker] stonith sbd problem

Tue Aug 10 04:16:05 EDT 2010

hi,

i have the following configuration:

node lnx0047a
node lnx0047b
primitive lnx0101a ocf:heartbeat:KVM \
        params name="lnx0101a" \
        meta allow-migrate="1" target-role="Started" \
        op migrate_from interval="0" timeout="3600s" \
        op migrate_to interval="0" timeout="3600s" \
        op monitor interval="10s" \
        op stop interval="0" timeout="360s"
primitive lnx0102a ocf:heartbeat:KVM \
        params name="lnx0102a" \
        meta allow-migrate="1" target-role="Started" \
        op migrate_from interval="0" timeout="3600s" \
        op migrate_to interval="0" timeout="3600s" \
        op monitor interval="10s" \
        op stop interval="0" timeout="360s"
primitive pingd ocf:pacemaker:pingd \
        params host_list="192.168.136.100" multiplier="100" \
        op monitor interval="15s" timeout="5s"
primitive sbd_fence stonith:external/sbd \
        params sbd_device="/dev/hdisk-4652-38b5" stonith-timeout="60s"
clone fence sbd_fence \
        meta target-role="Started"
clone pingdclone pingd \
        meta globally-unique="false" target-role="Started"
location lnx0101a_ip lnx0101a \
        rule $id="lnx0101a_ip-rule" -inf: not_defined pingd or pingd lte 0
location lnx0102a_ip lnx0102a \
        rule $id="lnx0102a_ip-rule" -inf: not_defined pingd or pingd lte 0
property $id="cib-bootstrap-options" \
        dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="true" \
        stonith-action="reboot" \
        no-quorum-policy="ignore" \
        default-resource-stickiness="1000" \
        last-lrm-refresh="1281364675"

-------------------------------
during a cluster test i disabled the interface that pingd is listening on 
on node lnx0047a. lnx0047b then shows "Node lnx0047a: UNCLEAN (offline)" 
and the stonith command is executed:

/var/log/messages:
...
Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: pe_fence_node: Node 
lnx0047a will be fenced because it is un-expectedly down
...
Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action 
lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking 
node lnx0047a unclean
Aug  9 16:25:05 lnx0047b pengine: [22211]: notice: RecurringOp:  Start 
recurring monitor (10s) for lnx0102a on lnx0047b
Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action 
pingd:0_stop_0 on lnx0047a is unrunnable (offline)
Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking 
node lnx0047a unclean
Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Action 
sbd_fence:0_stop_0 on lnx0047a is unrunnable (offline)
Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: custom_action: Marking 
node lnx0047a unclean
Aug  9 16:25:05 lnx0047b pengine: [22211]: WARN: stage6: Scheduling Node 
lnx0047a for STONITH
Aug  9 16:25:05 lnx0047b pengine: [22211]: info: native_stop_constraints: 
lnx0102a_stop_0 is implicit after lnx0047a is fenced
Aug  9 16:25:05 lnx0047b pengine: [22211]: info: native_stop_constraints: 
pingd:0_stop_0 is implicit after lnx0047a is fenced
....
Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: 
initiate_remote_stonith_op: Initiating remote operation reboot for 
lnx0047a: ee3d0c69-067a-423b-88bc-6d661a2b3254
Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: log_data_element: 
stonith_query: Query <stonith_command t="stonith-ng" 
st_async_id="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_op="st_query" 
st_callid="0" st_callopt="0" 
st_remote_op="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_target="lnx0047a" 
st_device_action="reboot" 
st_clientid="eba960fb-ef44-4ffb-a017-d5e01177b4ec" src="lnx0047b" seq="32" 
/>
Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: 
can_fence_host_with_device: sbd_fence:1 can fence lnx0047a: dynamic-list
Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_query: Found 1 
matching devices for 'lnx0047a'
Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_command: 
Processed st_query from lnx0047b: rc=1
Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: call_remote_stonith: 
Requesting that lnx0047b perform op reboot lnx0047a
Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: log_data_element: 
stonith_fence: Exec <stonith_command t="stonith-ng" 
st_async_id="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_op="st_fence" 
st_callid="0" st_callopt="0" 
st_remote_op="ee3d0c69-067a-423b-88bc-6d661a2b3254" st_target="lnx0047a" 
st_device_action="reboot" src="lnx0047b" seq="34" />
Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: 
can_fence_host_with_device: sbd_fence:1 can fence lnx0047a: dynamic-list
Aug  9 16:25:26 lnx0047b stonith-ng: [22207]: info: stonith_fence: Found 1 
matching devices for 'lnx0047a'
Aug  9 16:25:26 lnx0047b pengine: [22211]: WARN: process_pe_message: 
Transition 6: WARNINGs found during PE processing. PEngine Input stored 
in: /var/lib/pengine/pe-warn-102.bz2
Aug  9 16:25:26 lnx0047b pengine: [22211]: info: process_pe_message: 
Configuration WARNINGs found during PE processing.  Please run "crm_verify 
-L" to identify issues.
Aug  9 16:25:26 lnx0047b sbd: [23278]: info: reset successfully delivered 
to lnx0047a
Aug  9 16:25:27 lnx0047b sbd: [23845]: info: lnx0047a owns slot 1
Aug  9 16:25:27 lnx0047b sbd: [23845]: info: Writing reset to node slot 
lnx0047a
....
-------
ps -eaf:
...
root     24002 24001  0 16:25 ?        00:00:00 stonith -t external/sbd 
sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T 
reset lnx0047a
root     24007 24002  0 16:25 ?        00:00:00 /bin/bash 
/usr/lib64/stonith/plugins/external/sbd reset lnx0047a
root     24035 22192  0 16:25 ?        00:00:00 
/usr/lib64/heartbeat/stonithd
...

lnx0047a reboots successfully, but during the image startup of lnx0047a 
several stonith commands are still being executed on the online 
cluster node:

$ ps -eaf|grep ston
root     22207 22192  0 16:15 ?        00:00:00 
/usr/lib64/heartbeat/stonithd
root     23272 23271  0 16:25 ?        00:00:00 stonith -t external/sbd 
sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T 
reset lnx0047a
root     23277 23272  0 16:25 ?        00:00:00 /bin/bash 
/usr/lib64/stonith/plugins/external/sbd reset lnx0047a
root     23340 23339  0 16:26 ?        00:00:00 stonith -t external/sbd 
sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T 
reset lnx0047a
root     23345 23340  0 16:26 ?        00:00:00 /bin/bash 
/usr/lib64/stonith/plugins/external/sbd reset lnx0047a
root     23438 23437  0 16:26 ?        00:00:00 stonith -t external/sbd 
sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T 
reset lnx0047a
root     23443 23438  0 16:26 ?        00:00:00 /bin/bash 
/usr/lib64/stonith/plugins/external/sbd reset lnx0047a

after lnx0047a is up again it gets stonithed automatically by lnx0047b, 
although the cluster isn't up and running (autostart watchdog)

-----------------
so, i'm unable to start lnx0047a until i manually kill all the stonith 
processes on lnx0047b. 
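
for reference, a minimal sketch (python, not part of the cluster setup) of 
how the leftover stonith processes could be picked out of "ps -eaf" output 
before killing them by hand. the function name pending_sbd_resets and the 
SAMPLE text are made up for illustration; SAMPLE is the ps listing from 
above with the mail line wraps removed. it only lists PIDs, it does not 
kill anything.

```python
# Sketch: find lingering "stonith -t external/sbd ... -T reset <node>"
# processes in `ps -eaf` output. Names here are illustrative assumptions.

SAMPLE = """\
root     22207 22192  0 16:15 ?        00:00:00 /usr/lib64/heartbeat/stonithd
root     23272 23271  0 16:25 ?        00:00:00 stonith -t external/sbd sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T reset lnx0047a
root     23277 23272  0 16:25 ?        00:00:00 /bin/bash /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
root     23340 23339  0 16:26 ?        00:00:00 stonith -t external/sbd sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T reset lnx0047a
root     23345 23340  0 16:26 ?        00:00:00 /bin/bash /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
root     23438 23437  0 16:26 ?        00:00:00 stonith -t external/sbd sbd_device /dev/hdisk-4652-38b5 stonith-timeout 60s nodename lnx0047a -T reset lnx0047a
root     23443 23438  0 16:26 ?        00:00:00 /bin/bash /usr/lib64/stonith/plugins/external/sbd reset lnx0047a
"""


def pending_sbd_resets(ps_output, node):
    """Return the PIDs of top-level external/sbd reset commands for `node`."""
    pids = []
    for line in ps_output.splitlines():
        # ps -eaf columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        pid, cmd = fields[1], fields[7]
        args = cmd.split()
        # Match only the "stonith" wrapper, not its /bin/bash plugin child.
        is_sbd = args[:3] == ["stonith", "-t", "external/sbd"]
        is_reset = args[-2:] == ["reset", node]
        if is_sbd and is_reset:
            pids.append(int(pid))
    return pids


if __name__ == "__main__":
    for pid in pending_sbd_resets(SAMPLE, "lnx0047a"):
        print(pid)
```

matching only the "stonith" parent and not the /bin/bash plugin child is a 
choice made for this sketch; the child PIDs could be collected the same way.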

during the reboot cycle of lnx0047a the resources aren't able to start on 
lnx0047b:

$ crm_verify -LV
crm_verify[27816]: 2010/08/09_16:25:41 WARN: pe_fence_node: Node lnx0047a 
will be fenced because it is un-expectedly down
crm_verify[27816]: 2010/08/09_16:25:41 WARN: determine_online_status: Node 
lnx0047a is unclean
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action 
lnx0101a_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node 
lnx0047a unclean
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action 
lnx0102a_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node 
lnx0047a unclean
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action 
pingd:0_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node 
lnx0047a unclean
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Action 
sbd_fence:1_stop_0 on lnx0047a is unrunnable (offline)
crm_verify[27816]: 2010/08/09_16:25:41 WARN: custom_action: Marking node 
lnx0047a unclean
crm_verify[27816]: 2010/08/09_16:25:41 WARN: stage6: Scheduling Node 
lnx0047a for STONITH

###############
any ideas on the stonith problem?
any ideas on the "unrunnable" problem?

regards