[Pacemaker] Reboot node with stonith after killing a corosync-process?

Dominik Klein dk at in-telegence.net
Fri Apr 15 06:09:54 EDT 2011


Hi

On 04/15/2011 09:05 AM, Tom Tux wrote:
> I can reproduce this behavior:
> 
> - On node02, which had no resources online, I killed all corosync
> processes with "killall -9 corosync".
> - Node02 was rebooted via stonith
> - On node01, I can see the following lines in the message-log (line 6
> schedules the STONITH):
> 
> To me it seems that node01 recognized that the cluster processes on
> node02 were not shut down properly, and its reaction in that case is
> to stonith the node. Can this behavior be disabled? Which setting?

The cluster cannot distinguish between a node that has lost power, one
with a broken network, and one where someone killed corosync.

To the surviving node, the other one is simply dead, and stonith makes
sure it really is.
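
How quickly the survivor gives up on a silent peer is governed by the
totem timing in corosync.conf; a rough sketch with purely illustrative
values (not necessarily the SLES defaults):

  # /etc/corosync/corosync.conf (excerpt)
  totem {
          version: 2
          # ms without a totem token before the peer is declared failed
          token: 5000
          token_retransmits_before_loss_const: 10
          # how long to wait for consensus before starting a new
          # membership round; must be larger than token
          consensus: 6000
  }

Tuning these values only changes how fast the peer is declared dead,
not whether it is.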

That's expected behavior, and I guess it will not change.
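
The only cluster-wide setting I know of that suppresses the fence is
the stonith-enabled property, and that turns fencing off completely,
which you really do not want on shared data. Just to show where the
setting lives (crm shell on SLES11 HAE; treat this as a sketch):

  # show the current value
  crm configure show | grep stonith-enabled

  # disables fencing for the whole cluster -- not recommended
  crm configure property stonith-enabled=false

  # low-level equivalent
  crm_attribute --type crm_config --name stonith-enabled --update false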

Regards
Dominik

> <<
> ...
> Apr 15 08:30:32 node01 pengine: [6152]: notice: unpack_config: On loss
> of CCM Quorum: Ignore
> Apr 15 08:30:32 node01 pengine: [6152]: WARN: pe_fence_node: Node
> node02 will be fenced because it is un-expectedly down
> Apr 15 08:30:32 node01 pengine: [6152]: WARN: determine_online_status:
> Node node02 is unclean
> ...
> Apr 15 08:30:32 node01 pengine: [6152]: WARN: custom_action: Action
> res_stonith_node01_stop_0 on node02 is unrunnable (offline)
> Apr 15 08:30:32 node01 pengine: [6152]: WARN: custom_action: Marking
> node node02 unclean
> Apr 15 08:30:32 node01 pengine: [6152]: WARN: stage6: Scheduling Node
> node02 for STONITH
> ...
> ause=C_IPC_MESSAGE origin=handle_response ]
> Apr 15 08:30:32 node01 crmd: [6153]: info: unpack_graph: Unpacked
> transition 4: 5 actions in 5 synapses
> Apr 15 08:30:32 node01 crmd: [6153]: info: do_te_invoke: Processing
> graph 4 (ref=pe_calc-dc-1302849032-37) derived from
> /var/lib/pengine/pe-warn-7315.bz2
> Apr 15 08:30:32 node01 crmd: [6153]: info: te_pseudo_action: Pseudo
> action 21 fired and confirmed
> Apr 15 08:30:32 node01 crmd: [6153]: info: te_pseudo_action: Pseudo
> action 24 fired and confirmed
> Apr 15 08:30:32 node01 crmd: [6153]: info: te_fence_node: Executing
> reboot fencing operation (26) on node02 (timeout=60000)
> Apr 15 08:30:32 node01 stonith-ng: [6148]: info:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> node02: 8190cf2d-d876-45d1-8e4d-e620e19ca354
> ...
> Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_query: Query
> <stonith_command t="stonith-ng"
> st_async_id="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_op="st_query"
> st_callid="0" st_callopt="0"
> st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_target="node02"
> st_device_action="reboot"
> st_clientid="983fd169-277a-457d-9985-f30f4320542e" st_timeout="6000"
> src="node01" seq="1" />
> Apr 15 08:30:32 node01 stonith-ng: [6148]: info:
> can_fence_host_with_device: Refreshing port list for
> res_stonith_node02
> Apr 15 08:30:32 node01 stonith-ng: [6148]: WARN: parse_host_line:
> Could not parse (0 0):
> Apr 15 08:30:32 node01 stonith-ng: [6148]: info:
> can_fence_host_with_device: res_stonith_node02 can fence node02:
> dynamic-list
> Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_query: Found
> 1 matching devices for 'node02'
> Apr 15 08:30:32 node01 stonith-ng: [6148]: info: call_remote_stonith:
> Requesting that node01 perform op reboot node02
> Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_fence: Exec
> <stonith_command t="stonith-ng"
> st_async_id="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_op="st_fence"
> st_callid="0" st_callopt="0"
> st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_target="node02"
> st_device_action="reboot" st_timeout="54000" src="node01" seq="3" />
> Apr 15 08:30:32 node01 stonith-ng: [6148]: info:
> can_fence_host_with_device: res_stonith_node02 can fence node02:
> dynamic-list
> Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_fence: Found
> 1 matching devices for 'node02'
> Apr 15 08:30:32 node01 pengine: [6152]: WARN: process_pe_message:
> Transition 4: WARNINGs found during PE processing. PEngine Input
> stored in: /var/lib/pengine/pe-warn-7315.bz2
> Apr 15 08:30:32 node01 external/ipmi[19297]: [19310]: debug: ipmitool
> output: Chassis Power Control: Reset
> Apr 15 08:30:33 node01 stonith-ng: [6148]: info: log_operation:
> Operation 'reboot' [19292] for host 'node02' with device
> 'res_stonith_node02' returned: 0 (call 0 from (null))
> Apr 15 08:30:33 node01 stonith-ng: [6148]: info:
> process_remote_stonith_exec: ExecResult <st-reply
> st_origin="stonith_construct_async_reply" t="stonith-ng"
> st_op="st_notify" st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354"
> st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith
> -t external/ipmi -T reset node02 success: node02 0 " src="node01"
> seq="4" />
> Apr 15 08:30:33 node01 stonith-ng: [6148]: info: remote_op_done:
> Notifing clients of 8190cf2d-d876-45d1-8e4d-e620e19ca354 (reboot of
> node02 from 983fd169-277a-457d-9985-f30f4320542e by node01): 1, rc=0
> Apr 15 08:30:33 node01 stonith-ng: [6148]: info:
> stonith_notify_client: Sending st_fence-notification to client
> 6153/5395a0da-71b3-4437-b284-f10a8470fce6
> Apr 15 08:30:33 node01 crmd: [6153]: info:
> tengine_stonith_callback: StonithOp <st-reply
> st_origin="stonith_construct_async_reply" t="stonith-ng"
> st_op="reboot" st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354"
> st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith
> -t external/ipmi -T reset node02 success: node02 0 " src="node01"
> seq="4" state="1" st_target="node02" />
> Apr 15 08:30:33 node01 crmd: [6153]: info: tengine_stonith_callback:
> Stonith operation 2/26:4:0:25562131-e2c3-4dd8-8be7-a2237e7ad015: OK
> (0)
> Apr 15 08:30:33 node01 crmd: [6153]: info: tengine_stonith_callback:
> Stonith of node02 passed
> Apr 15 08:30:33 node01 crmd: [6153]: info: send_stonith_update:
> Sending fencing update 85 for node02
> Apr 15 08:30:33 node01 crmd: [6153]: notice: crmd_peer_update: Status
> update: Client node02/crmd now has status [offline] (DC=true)
> Apr 15 08:30:33 node01 crmd: [6153]: info: check_join_state:
> crmd_peer_update: Membership changed since join started: 172 -> 176
> ...
>>>
> 
> OS: SLES11-SP1-HAE
> Clusterglue: cluster-glue: 1.0.7 (3e3d209f9217f8e517ed1ab8bb2fdd576cc864be)
> dc-version="1.1.5-5ce2879aa0d5f43d01629bc20edc6868a9352002"
> Installed RPMs: libpacemaker3-1.1.5-5.5.5 libopenais3-1.1.4-5.4.3
> pacemaker-mgmt-2.0.0-0.5.5 cluster-glue-1.0.7-6.6.3
> openais-1.1.4-5.4.3 pacemaker-1.1.5-5.5.5
> pacemaker-mgmt-client-2.0.0-0.5.5
> 
> Thanks a lot.
> Tom
> 
> 
> 
> 2011/4/15 Andrew Beekhof <andrew at beekhof.net>:
>> Impossible to say without logs.  Sounds strange though.
>>
>> On Fri, Apr 15, 2011 at 7:17 AM, Tom Tux <tomtux80 at gmail.com> wrote:
>>> Hi
>>>
>>> I have a two-node cluster (stonith enabled). On one node I tried
>>> stopping openais (/etc/init.d/openais stop), but it hung, so I
>>> killed all running corosync processes (killall -9 corosync).
>>> Afterward, I started openais on this node again (rcopenais start).
>>> After a few seconds, this node was stonith'ed and rebooted.
>>>
>>> My question:
>>> Is this normal behavior? If so, is it because I killed the hanging
>>> corosync processes, so that after starting openais again the
>>> cluster recognized an unclean state on this node?
>>>
>>> Thanks a lot.
>>> Tom



