[Pacemaker] timeout rebooting with stonith_sbd
Alexandr A. Alexandrov
shurrman at gmail.com
Tue May 14 10:49:40 EDT 2013
Hi!
I have a two-node cluster (virtual machines) with several resources and
shared storage.
When connectivity is lost (for a reason I still need to debug), here is
what I get (I am skipping unrelated messages):
May 14 16:49:21 wcs2 corosync[27531]: [TOTEM ] The token was lost in
the OPERATIONAL state.
May 14 16:49:21 wcs2 corosync[27531]: [TOTEM ] A processor failed,
forming new configuration.
Why is corosync connectivity lost? There was nothing suspicious in the
logs at all.
May 14 16:49:36 wcs2 corosync[27531]: [VOTEQ ] node 739269211 state=2,
votes=1, expected=2
May 14 16:49:36 wcs2 corosync[27531]: [VOTEQ ] node 739269212 state=1,
votes=1, expected=2
May 14 16:49:36 wcs2 corosync[27531]: [QUORUM] This node is within the
non-primary component and will NOT provide any services.
May 14 16:49:36 wcs2 corosync[27531]: [QUORUM] Members[1]: 739269212
May 14 16:49:36 wcs2 corosync[27531]: [QUORUM] sending quorum
notification to (nil), length = 52
May 14 16:49:36 wcs2 crmd[11381]: warning: match_down_event: No match
for shutdown action on 739269211
May 14 16:49:36 wcs2 crmd[11381]: notice: peer_update_callback:
Stonith/shutdown of wcs1 not matched
What does that warning mean?
May 14 16:49:37 wcs2 pengine[27574]: notice: unpack_config: On loss of
CCM Quorum: Ignore
May 14 16:49:37 wcs2 pengine[27574]: warning: pe_fence_node: Node wcs1
will be fenced because stonith_sbd is thought to be active there
May 14 16:49:37 wcs2 pengine[27574]: warning: custom_action: Action
stonith_sbd_stop_0 on wcs1 is unrunnable (offline)
May 14 16:49:37 wcs2 pengine[27574]: warning: stage6: Scheduling Node
wcs1 for STONITH
May 14 16:49:37 wcs2 pengine[27574]: notice: LogActions: Move
stonith_sbd#011(Started wcs1 -> wcs2)
All resources were active on node wcs2 (the survivor); stonith_sbd was
active on node wcs1.
May 14 16:49:37 wcs2 crmd[11381]: notice: te_fence_node: Executing
reboot fencing operation (38) on wcs1 (timeout=60000)
May 14 16:49:37 wcs2 stonith-ng[27571]: notice: handle_request: Client
crmd.11381.a02439c4 wants to fence (reboot) 'wcs1' with device '(any)'
May 14 16:49:37 wcs2 stonith-ng[27571]: notice:
initiate_remote_stonith_op: Initiating remote operation reboot for wcs1:
37151815-2182-42fa-b32e-86288b18085b (0)
Now, as these are actually virtual machines, the reboot takes place quite
quickly:
May 14 16:49:46 wcs2 crmd[11381]: notice: pcmk_quorum_notification:
Membership 1000: quorum acquired (2)
May 14 16:49:46 wcs2 crmd[11381]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node wcs1[739269211] - state is now member
May 14 16:50:05 wcs2 crmd[11381]: notice: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
May 14 16:50:07 wcs2 attrd[27573]: notice: attrd_local_callback:
Sending full refresh (origin=crmd)
May 14 16:50:07 wcs2 attrd[27573]: notice: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
May 14 16:50:49 wcs2 stonith-ng[27571]: error: remote_op_done:
Operation reboot of wcs1 by wcs2 for crmd.11381 at wcs2.37151815: Timer expired
May 14 16:50:49 wcs2 crmd[11381]: notice: tengine_stonith_callback:
Stonith operation 11/38:2655:0:8f1636b7-dd1d-470c-b645-65a9c8743a69:
Timer expired (-62)
May 14 16:50:49 wcs2 crmd[11381]: notice: tengine_stonith_callback:
Stonith operation 11 for wcs1 failed (Timer expired): aborting transition.
May 14 16:50:49 wcs2 crmd[11381]: notice: tengine_stonith_notify: Peer
wcs1 was not terminated (st_notify_fence) by wcs2 for wcs2: Timer
expired (ref=37151815-2182-42fa-b32e-86288b18085b) by client crmd.11381
But why does the reboot operation timer expire?
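To put numbers on the timing, here is a minimal Python sketch that measures the gap between the fencing request and the "Timer expired" error. The timestamps and the 60000 ms timeout are copied from the log lines in this message. The helper names (parse_syslog_time, elapsed_seconds) are my own, and the year is assumed to be 2013 because syslog does not record it.

```python
from datetime import datetime

# Timestamps copied from the log excerpts in this message (syslog format, no year).
FENCE_REQUESTED = "May 14 16:49:37"  # te_fence_node: Executing reboot fencing operation
NODE_REJOINED = "May 14 16:49:46"    # crm_update_peer_state: wcs1 state is now member
TIMER_EXPIRED = "May 14 16:50:49"    # remote_op_done: Timer expired
REQUESTED_TIMEOUT_MS = 60000         # timeout=60000 from the te_fence_node line


def parse_syslog_time(stamp: str, year: int = 2013) -> datetime:
    """Parse a syslog timestamp; the year is passed in because syslog omits it."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two syslog timestamps."""
    return (parse_syslog_time(end) - parse_syslog_time(start)).total_seconds()


until_rejoin = elapsed_seconds(FENCE_REQUESTED, NODE_REJOINED)
until_expiry = elapsed_seconds(FENCE_REQUESTED, TIMER_EXPIRED)

print(f"request -> wcs1 rejoined: {until_rejoin:.0f} s")
print(f"request -> timer expired: {until_expiry:.0f} s "
      f"(requested timeout: {REQUESTED_TIMEOUT_MS / 1000:.0f} s)")
```

So wcs1 was back as a member 9 seconds after the request, yet the operation was only reported as failed 72 seconds after the request, which is longer than the 60 seconds shown in the te_fence_node line.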