[Pacemaker] Reboot node with stonith after killing a corosync-process?

Tom Tux tomtux80 at gmail.com
Fri Apr 15 03:05:28 EDT 2011


I can reproduce this behavior:

- On node02, which had no resources online, I killed all corosync
processes with "killall -9 corosync".
- Node02 was rebooted via STONITH.
- On node01, I can see the following lines in the message log (the
"stage6" line schedules the STONITH):

It seems to me that node01 recognized that the cluster processes on
node02 were not shut down properly, and its response in this case is
to STONITH the node. Can this behavior be disabled? If so, with which
setting?
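
(The only cluster-wide knobs I have found so far are the
stonith-enabled and startup-fencing properties. A sketch of how I
would set them with the crm shell, assuming turning fencing off is
even acceptable, which it generally is not in production:)

    # Disable all fencing cluster-wide (NOT recommended; the cluster
    # can then no longer recover resources from unclean nodes):
    crm configure property stonith-enabled=false

    # Only skip fencing of nodes that have not yet been seen since
    # the cluster started:
    crm configure property startup-fencing=false

    # Check the current values:
    crm configure show | grep -E 'stonith-enabled|startup-fencing'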

<<
...
Apr 15 08:30:32 node01 pengine: [6152]: notice: unpack_config: On loss
of CCM Quorum: Ignore
Apr 15 08:30:32 node01 pengine: [6152]: WARN: pe_fence_node: Node
node02 will be fenced because it is un-expectedly down
Apr 15 08:30:32 node01 pengine: [6152]: WARN: determine_online_status:
Node node02 is unclean
...
Apr 15 08:30:32 node01 pengine: [6152]: WARN: custom_action: Action
res_stonith_node01_stop_0 on node02 is unrunnable (offline)
Apr 15 08:30:32 node01 pengine: [6152]: WARN: custom_action: Marking
node node02 unclean
Apr 15 08:30:32 node01 pengine: [6152]: WARN: stage6: Scheduling Node
node02 for STONITH
...
cause=C_IPC_MESSAGE origin=handle_response ]
Apr 15 08:30:32 node01 crmd: [6153]: info: unpack_graph: Unpacked
transition 4: 5 actions in 5 synapses
Apr 15 08:30:32 node01 crmd: [6153]: info: do_te_invoke: Processing
graph 4 (ref=pe_calc-dc-1302849032-37) derived from
/var/lib/pengine/pe-warn-7315.bz2
Apr 15 08:30:32 node01 crmd: [6153]: info: te_pseudo_action: Pseudo
action 21 fired and confirmed
Apr 15 08:30:32 node01 crmd: [6153]: info: te_pseudo_action: Pseudo
action 24 fired and confirmed
Apr 15 08:30:32 node01 crmd: [6153]: info: te_fence_node: Executing
reboot fencing operation (26) on node02 (timeout=60000)
Apr 15 08:30:32 node01 stonith-ng: [6148]: info:
initiate_remote_stonith_op: Initiating remote operation reboot for
node02: 8190cf2d-d876-45d1-8e4d-e620e19ca354
...
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_query: Query
<stonith_command t="stonith-ng"
st_async_id="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_op="st_query"
st_callid="0" st_callopt="0"
st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_target="node02"
st_device_action="reboot"
st_clientid="983fd169-277a-457d-9985-f30f4320542e" st_timeout="6000"
src="node01" seq="1" />
Apr 15 08:30:32 node01 stonith-ng: [6148]: info:
can_fence_host_with_device: Refreshing port list for
res_stonith_node02
Apr 15 08:30:32 node01 stonith-ng: [6148]: WARN: parse_host_line:
Could not parse (0 0):
Apr 15 08:30:32 node01 stonith-ng: [6148]: info:
can_fence_host_with_device: res_stonith_node02 can fence node02:
dynamic-list
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_query: Found
1 matching devices for 'node02'
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: call_remote_stonith:
Requesting that node01 perform op reboot node02
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_fence: Exec
<stonith_command t="stonith-ng"
st_async_id="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_op="st_fence"
st_callid="0" st_callopt="0"
st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_target="node02"
st_device_action="reboot" st_timeout="54000" src="node01" seq="3" />
Apr 15 08:30:32 node01 stonith-ng: [6148]: info:
can_fence_host_with_device: res_stonith_node02 can fence node02:
dynamic-list
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_fence: Found
1 matching devices for 'node02'
Apr 15 08:30:32 node01 pengine: [6152]: WARN: process_pe_message:
Transition 4: WARNINGs found during PE processing. PEngine Input
stored in: /var/lib/pengine/pe-warn-7315.bz2
Apr 15 08:30:32 node01 external/ipmi[19297]: [19310]: debug: ipmitool
output: Chassis Power Control: Reset
Apr 15 08:30:33 node01 stonith-ng: [6148]: info: log_operation:
Operation 'reboot' [19292] for host 'node02' with device
'res_stonith_node02' returned: 0 (call 0 from (null))
Apr 15 08:30:33 node01 stonith-ng: [6148]: info:
process_remote_stonith_exec: ExecResult <st-reply
st_origin="stonith_construct_async_reply" t="stonith-ng"
st_op="st_notify" st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354"
st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith
-t external/ipmi -T reset node02 success: node02 0 " src="node01"
seq="4" />
Apr 15 08:30:33 node01 stonith-ng: [6148]: info: remote_op_done:
Notifing clients of 8190cf2d-d876-45d1-8e4d-e620e19ca354 (reboot of
node02 from 983fd169-277a-457d-9985-f30f4320542e by node01): 1, rc=0
Apr 15 08:30:33 node01 stonith-ng: [6148]: info:
stonith_notify_client: Sending st_fence-notification to client
6153/5395a0da-71b3-4437-b284-f10a8470fce6
Apr 15 08:30:33 node01 crmd: [6153]: info:
tengine_stonith_callback: StonithOp <st-reply
st_origin="stonith_construct_async_reply" t="stonith-ng"
st_op="reboot" st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354"
st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith
-t external/ipmi -T reset node02 success: node02 0 " src="node01"
seq="4" state="1" st_target="node02" />
Apr 15 08:30:33 node01 crmd: [6153]: info: tengine_stonith_callback:
Stonith operation 2/26:4:0:25562131-e2c3-4dd8-8be7-a2237e7ad015: OK
(0)
Apr 15 08:30:33 node01 crmd: [6153]: info: tengine_stonith_callback:
Stonith of node02 passed
Apr 15 08:30:33 node01 crmd: [6153]: info: send_stonith_update:
Sending fencing update 85 for node02
Apr 15 08:30:33 node01 crmd: [6153]: notice: crmd_peer_update: Status
update: Client node02/crmd now has status [offline] (DC=true)
Apr 15 08:30:33 node01 crmd: [6153]: info: check_join_state:
crmd_peer_update: Membership changed since join started: 172 -> 176
...
>>
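
(The fencing itself clearly worked: the external/ipmi agent reports
"Chassis Power Control: Reset" and the operation returns rc=0. For
what it's worth, the same reset can be tested outside the cluster
with the stonith(8) CLI from cluster-glue, mirroring the "Performing:"
line in the log; the IPMI parameters below are placeholders, not my
real values:)

    # Drive the external/ipmi plugin by hand; hostname/ipaddr/userid/
    # passwd/interface are plugin parameters, the values are made up.
    stonith -t external/ipmi \
        hostname=node02 ipaddr=192.168.1.102 \
        userid=admin passwd=secret interface=lan \
        -T reset node02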

OS: SLES11-SP1-HAE
Clusterglue: cluster-glue: 1.0.7 (3e3d209f9217f8e517ed1ab8bb2fdd576cc864be)
dc-version="1.1.5-5ce2879aa0d5f43d01629bc20edc6868a9352002"
Installed RPMs: libpacemaker3-1.1.5-5.5.5 libopenais3-1.1.4-5.4.3
pacemaker-mgmt-2.0.0-0.5.5 cluster-glue-1.0.7-6.6.3
openais-1.1.4-5.4.3 pacemaker-1.1.5-5.5.5
pacemaker-mgmt-client-2.0.0-0.5.5

Thanks a lot.
Tom



2011/4/15 Andrew Beekhof <andrew at beekhof.net>:
> Impossible to say without logs.  Sounds strange though.
>
> On Fri, Apr 15, 2011 at 7:17 AM, Tom Tux <tomtux80 at gmail.com> wrote:
>> Hi
>>
>> I have a two-node cluster (stonith enabled). On one node I tried
>> stopping openais (/etc/init.d/openais stop), but it hung, so I
>> killed all running corosync processes (killall -9 corosync).
>> Afterward, I started openais on this node again (rcopenais start).
>> After a few seconds, this node was stonith'ed and rebooted.
>>
>> My question hereby:
>> Is this normal behavior? If so, is it because I killed the hanging
>> corosync processes, and after starting openais again the cluster
>> recognized an unclean state on this node?
>>
>> Thanks a lot.
>> Tom
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
>



