[Pacemaker] why so long to stonith?
Brian J. Murrell
brian at interlinx.bc.ca
Tue Apr 23 15:34:40 EDT 2013
Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed
(-KILL) corosync on a peer node. Pacemaker seemed to take a long time
to transition to stonithing it though after noticing it was AWOL:
Apr 23 19:05:20 node2 corosync[1324]: [TOTEM ] A processor failed, forming new configuration.
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 188: memb=1, new=0, lost=1
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: lost: node1 4252674240
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 188: memb=1, new=0, lost=0
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: update_member: Node 4252674240/node1 is now: lost
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 188 to 2 children
Apr 23 19:05:21 node2 corosync[1324]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 23 19:05:21 node2 corosync[1324]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) ; members(old:2 left:1)
Apr 23 19:05:21 node2 corosync[1324]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 23 19:05:21 node2 crmd[1634]: notice: ais_dispatch_message: Membership 188: quorum lost
Apr 23 19:05:21 node2 crmd[1634]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:05:21 node2 cib[1629]: notice: ais_dispatch_message: Membership 188: quorum lost
Apr 23 19:05:21 node2 cib[1629]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:08:31 node2 crmd[1634]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Apr 23 19:08:31 node2 pengine[1633]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 23 19:08:31 node2 pengine[1633]: warning: pe_fence_node: Node node1 will be fenced because the node is no longer part of the cluster
Apr 23 19:08:31 node2 pengine[1633]: warning: determine_online_status: Node node1 is unclean
Apr 23 19:08:31 node2 pengine[1633]: warning: custom_action: Action MGS_e4a31b_stop_0 on node1 is unrunnable (offline)
Apr 23 19:08:31 node2 pengine[1633]: warning: stage6: Scheduling Node node1 for STONITH
Apr 23 19:08:31 node2 pengine[1633]: notice: LogActions: Move MGS_e4a31b#011(Started node1 -> node2)
Apr 23 19:08:31 node2 crmd[1634]: notice: te_fence_node: Executing reboot fencing operation (15) on node1 (timeout=60000)
Apr 23 19:08:31 node2 stonith-ng[1630]: notice: handle_request: Client crmd.1634.642b9c6e wants to fence (reboot) 'node1' with device '(any)'
Apr 23 19:08:31 node2 stonith-ng[1630]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node1: fb431eb4-789c-41bc-903e-4041d50e93b4 (0)
Apr 23 19:08:31 node2 pengine[1633]: warning: process_pe_message: Calculated Transition 115: /var/lib/pacemaker/pengine/pe-warn-7.bz2
Apr 23 19:08:41 node2 stonith-ng[1630]: notice: log_operation: Operation 'reboot' [27682] (call 0 from crmd.1634) for host 'node1' with device 'st-node1' returned: 0 (OK)
Apr 23 19:08:41 node2 stonith-ng[1630]: notice: remote_op_done: Operation reboot of node1 by node2 for crmd.1634 at node2.fb431eb4: OK
Apr 23 19:08:41 node2 crmd[1634]: notice: tengine_stonith_callback: Stonith operation 3/15:115:0:c118573c-84e3-48bd-8dc9-40de24438385: OK (0)
Apr 23 19:08:41 node2 crmd[1634]: notice: tengine_stonith_notify: Peer node1 was terminated (st_notify_fence) by node2 for node2: OK (ref=fb431eb4-789c-41bc-903e-4041d50e93b4) by client crmd.1634
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 192: memb=1, new=0, lost=0
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 192: memb=2, new=1, lost=0
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: update_member: Node 4252674240/node1 is now: member
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: NEW: node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 192 to 2 children
Apr 23 19:09:03 node2 corosync[1324]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: update_member: 0x1efc040 Node 4252674240 (node1) born on: 192
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 192 to 2 children
As you can see, 3 minutes and 10 seconds went by before pacemaker
transitioned from noticing the node unresponsive to stonithing it.
This smacks of some kind of mis-configured timeout but I'm not aware
of any timeout that would have this effect.
Thoughts?
b.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 263 bytes
Desc: OpenPGP digital signature
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130423/317a3f1b/attachment-0002.sig>
More information about the Pacemaker
mailing list