[Pacemaker] a situation where pacemaker refuses to stop

Brian J. Murrell brian at interlinx.bc.ca
Sat Feb 23 08:15:27 EST 2013


I seem to have found a situation where pacemaker (pacemaker-1.1.7-6.el6.x86_64)
refuses to stop (i.e. service pacemaker stop) on EL6.

The state of the 2-node cluster was that the node being asked to stop
(node2) was continually trying to stonith the other node (node1), which
was not running corosync/pacemaker (yet).  node2 kept looping on the
stonith operation for node1 because no stonith resource had been set up
for node1 (yet).

The log on node2 simply repeats this over and over again:

stonith-ng[20695]:    error: remote_op_done: Operation reboot of node1 by <no-one> for node2[d4e76f3a-42ed-4576-975e-b805ac30c04a]: Operation timed out
crmd[20699]:     info: tengine_stonith_callback: StonithOp <remote-op state="0" st_target="node1" st_op="reboot" />
crmd[20699]:   notice: tengine_stonith_callback: Stonith operation 110 for node1 failed (Operation timed out): aborting transition.
crmd[20699]:     info: abort_transition_graph: tengine_stonith_callback:454 - Triggered transition abort (complete=0) : Stonith failed
crmd[20699]:   notice: tengine_stonith_notify: Peer node1 was not terminated (reboot) by <anyone> for node2: Operation timed out (ref=18e93407-4efa-4b97-99e1-b331591598ef)
crmd[20699]:   notice: run_graph: ==== Transition 108 (Complete=2, Pending=0, Fired=0, Skipped=4, Incomplete=0, Source=/var/lib/pengine/pe-warn-3.bz2): Stopped
crmd[20699]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
pengine[20698]:   notice: unpack_config: On loss of CCM Quorum: Ignore
pengine[20698]:  warning: stage6: Scheduling Node node1 for STONITH
pengine[20698]:   notice: stage6: Scheduling Node node2 for shutdown
pengine[20698]:   notice: LogActions: Stop    st-fencing	(node2)
pengine[20698]:  warning: process_pe_message: Transition 109: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-3.bz2
crmd[20699]:   notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
pengine[20698]:   notice: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
crmd[20699]:     info: do_te_invoke: Processing graph 109 (ref=pe_calc-dc-1361624958-120) derived from /var/lib/pengine/pe-warn-3.bz2
crmd[20699]:   notice: te_fence_node: Executing reboot fencing operation (7) on node1 (timeout=60000)
stonith-ng[20695]:     info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 96b06897-5ba7-46c3-b9d2-797113df2812
stonith-ng[20695]:     info: can_fence_host_with_device: Refreshing port list for st-fencing
stonith-ng[20695]:     info: can_fence_host_with_device: st-fencing can not fence node1: dynamic-list
stonith-ng[20695]:     info: stonith_command: Processed st_query from node2: rc=0

and while that's repeating, the "service pacemaker stop" is producing:

node2# service pacemaker stop
Signaling Pacemaker Cluster Manager to terminate:          [  OK  ]
Waiting for cluster services to unload

I suppose this will continue forever until I either manually force
pacemaker down or fix up the cluster config to allow the stonith
operation to succeed.  In an environment where pacemaker is being
controlled by another process, this is clearly an undesirable
situation.
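
To illustrate the controlling-process concern, here is a minimal sketch of
how such a process might bound its wait instead of blocking indefinitely.
It assumes timeout(1) from GNU coreutils is available; run_bounded is a
hypothetical helper name, not anything pacemaker provides, and the
120-second limit in the example is arbitrary.

```shell
#!/bin/sh
# Sketch only: bound how long a caller waits for a command to finish.
# run_bounded is a hypothetical helper, not part of pacemaker.
# Assumes timeout(1) from GNU coreutils, which exits with status 124
# when the time limit is reached.
run_bounded() {
    limit="$1"
    shift
    status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "gave up after ${limit}s waiting for: $*" >&2
    fi
    return "$status"
}

# Example (not executed here):
#   run_bounded 120 service pacemaker stop
```

This only frees the caller: timing out the init script does not stop
pacemaker itself, so the caller would still have to force pacemaker down
or fix the stonith configuration.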

Is this behavior (the shutdown hanging while pacemaker spins trying
to stonith) expected?

Cheers,
b.

