[ClusterLabs] Q (SLES11 SP4): Delay after node came up (info: throttle_send_command: New throttle mode: 0000 (was ffffffff))

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Jan 16 05:25:04 EST 2017


Hi!

I have a question: The following happened in our 3-node cluster (n1, n2, n3):
n3 was DC, n2 was taken offline, n2 came online again, n1 rebooted (went offline/online), then n2 rebooted (offline/online).

I observed a significant delay after all three nodes were online before resources were started. Actually, the start seemed to be triggered by some crm restart action on n3.

Logs on n3 (DC) look like this:
cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/359, version=1.99.1)
crmd:   notice: handle_request:   Current ping state: S_TRANSITION_ENGINE
(...many more...)
stonith-ng:     info: plugin_handle_membership:         Membership 3328: quorum retained
crmd:     info: plugin_handle_membership:         Membership 3328: quorum retained
[...]
stonith-ng:     info: plugin_handle_membership:         Membership 3328: quorum retained
[...]
cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/365, version=1.99.3)
crmd:     info: crmd_cs_dispatch:         Setting expected votes to 3
crmd:     info: plugin_handle_membership:         Membership 3328: quorum retained
[...]
crmd:     info: crmd_cs_dispatch:         Setting expected votes to 3
crmd:     info: do_state_transition:      State transition S_TRANSITION_ENGINE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
crmd:     info: do_dc_join_offer_one:     An unknown node joined - (re-)offer to any unconfirmed nodes
(??? What's that?)
crmd:     info: join_make_offer:  Making join offers based on membership 3328
crmd:     info: join_make_offer:  join-2: Sending offer to n2
crmd:     info: crm_update_peer_join:     join_make_offer: Node n2[739512325] - join-2 phase 0 -> 1
crmd:     info: join_make_offer:  Skipping n1: already known 4
crmd:     info: join_make_offer:  Skipping n3: already known 4
crmd:   notice: abort_transition_graph:   Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
cib:     info: cib_process_request:      Completed cib_modify operation for section crm_config: OK (rc=0, origin=local/crmd/375, version=1.99.5)
crmd:     info: crm_update_peer_join:     do_dc_join_filter_offer: Node n2[739512325] - join-2 phase 1 -> 2
crmd:     info: crm_update_peer_expected:         do_dc_join_filter_offer: Node n2[739512325] - expected state is now member (was down)
crmd:     info: abort_transition_graph:   Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
crmd:     info: do_state_transition:      State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
crmd:     info: crmd_join_phase_log:      join-2: n2=integrated
crmd:     info: crmd_join_phase_log:      join-2: n1=confirmed
crmd:     info: crmd_join_phase_log:      join-2: n3=confirmed
crmd:   notice: do_dc_join_finalize:      join-2: Syncing the CIB from n2 to the rest of the cluster
[...]
cib:     info: cib_process_replace:      Replaced 1.99.5 with 1.99.5 from n2
cib:     info: cib_process_request:      Completed cib_replace operation for section 'all': OK (rc=0, origin=n2/crmd/376, version=1.99.5)
crmd:     info: crm_update_peer_join:     finalize_join_for: Node n2[739512325] - join-2 phase 2 -> 3
crmd:     info: do_log:   FSA: Input I_WAIT_FOR_EVENT from do_te_invoke() received in state S_FINALIZE_JOIN
crmd:     info: abort_transition_graph:   Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
[...]
cib:     info: cib_file_write_with_digest:       Wrote version 1.99.0 of the CIB to disk (digest: 6e71ae6f4a1d2619cc64c91d40f55a32)
(??? We already had 1.99.5)
cib:     info: cib_process_request:      Completed cib_modify operation for section status: OK (rc=0, origin=n2/attrd/3, version=1.99.6)
crmd:     info: crm_update_peer_join:     do_dc_join_ack: Node n2[739512325] - join-2 phase 3 -> 4
crmd:     info: do_dc_join_ack:   join-2: Updating node state to member for n2
[...]
crmd:   notice: handle_request:   Current ping state: S_FINALIZE_JOIN
crmd:     info: do_log:   FSA: Input I_WAIT_FOR_EVENT from do_te_invoke() received in state S_FINALIZE_JOIN
[...]
crmd:     info: do_state_transition:      State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
crmd:   notice: crm_update_quorum:        Updating quorum status to true (call=385)
crmd:     info: abort_transition_graph:   Transition aborted: Peer Cancelled (source=do_te_invoke:161, 0)
[...]
crmd:   notice: handle_request:   Current ping state: S_POLICY_ENGINE
[...]
(the following messages repeat for a long time)
crmd:   notice: handle_request:   Current ping state: S_POLICY_ENGINE
crmd:     info: do_log:   FSA: Input I_WAIT_FOR_EVENT from do_te_invoke() received in state S_POLICY_ENGINE
crmd:     info: abort_transition_graph:   Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
crmd:     info: do_log:   FSA: Input I_WAIT_FOR_EVENT from do_te_invoke() received in state S_POLICY_ENGINE
crmd:     info: abort_transition_graph:   Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
crmd:     info: do_log:   FSA: Input I_WAIT_FOR_EVENT from do_te_invoke() received in state S_POLICY_ENGINE
crmd:     info: abort_transition_graph:   Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
[...]
crmd:     info: abort_transition_graph:   Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
lrmd:     info: log_finished:     finished - rsc:prm_xen_v09 action:stop call_id:146 pid:6420 exit-code:0 exec-time:331841ms queue-time:0ms
[...]
crmd:   notice: handle_request:   Current ping state: S_POLICY_ENGINE
pengine:     info: determine_online_status_fencing:  Node n1 is active
pengine:     info: determine_online_status:  Node n1 is online
pengine:     info: determine_online_status_fencing:  Node n3 is active
pengine:     info: determine_online_status:  Node n3 is online
pengine:     info: determine_online_status_fencing:  Node n2 is active
pengine:     info: determine_online_status:  Node n2 is online
(now the cluster seemed to become active...)

On node n2:
[...]
crmd:   notice: do_state_transition:      State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
attrd:   notice: attrd_perform_update:     Sent update 3: shutdown=0
[...]
crmd:     info: throttle_send_command:    New throttle mode: 0000 (was ffffffff)
(...Here is the big delay, during which nothing seemed to happen; more on the throttle line below)
cib:     info: cib_perform_op:   Diff: --- 1.99.9 2
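
For context, my rough understanding of the throttle_send_command line above: crmd periodically maps the local CPU load against the load-threshold cluster option (80% by default, if I remember correctly) and announces the resulting mode to the DC, which uses it to limit how many jobs it schedules on that node; 0000 would mean "no throttling", and ffffffff looks like an initial "unknown" value. A simplified illustration of that idea, NOT the actual crmd code (the core-count scaling and the step factors below are my assumptions):

# Illustrative sketch only, NOT the crmd implementation: derive a
# throttle mode from /proc/loadavg and a load-threshold-style setting.
# The core-count scaling and the 0.5/1.5/2.0 factors are assumptions.
import multiprocessing

THROTTLE_NONE, THROTTLE_LOW, THROTTLE_MED, THROTTLE_HIGH, THROTTLE_EXTREME = \
    0x0000, 0x0001, 0x0010, 0x0100, 0x1000

def throttle_mode(load_threshold=0.80):
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])   # 1-minute load average
    target = load_threshold * multiprocessing.cpu_count()
    if load1 >= 2.0 * target:
        return THROTTLE_EXTREME
    if load1 >= 1.5 * target:
        return THROTTLE_HIGH
    if load1 >= target:
        return THROTTLE_MED
    if load1 >= 0.5 * target:
        return THROTTLE_LOW
    return THROTTLE_NONE

print("New throttle mode: %.4x" % throttle_mode())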

On n1 the delay was between these two lines:
cib:     info: cib_process_ping:         Reporting our current digest to n3: 497fb3b47c84a24f0cc7e127b219fdae for 1.99.9 (0xc7cbb0 0)
cib:     info: cib_perform_op:   Diff: --- 1.99.9 2

What might be causing this? The delay was almost three minutes.
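
If exact numbers are useful, the gap can be pulled straight from the raw syslog on n1 with something like this rough script (a sketch only; the timestamp parsing assumes a standard "Mon DD HH:MM:SS" syslog prefix and the literal message texts quoted above, so adjust to the actual log format):

# Sketch: measure the gap between the cib_process_ping line and the
# following cib_perform_op "Diff: --- 1.99.9" line in a syslog file.
# Assumes each line starts with a "Mon DD HH:MM:SS" syslog timestamp.
import sys
from datetime import datetime

def stamp(line):
    return datetime.strptime(" ".join(line.split()[:3]), "%b %d %H:%M:%S")

first = last = None
for line in open(sys.argv[1]):
    if "Reporting our current digest to n3" in line:
        first = stamp(line)
    elif first and "cib_perform_op:" in line and "Diff: --- 1.99.9" in line:
        last = stamp(line)
        break

if first and last:
    print("delay: %s" % (last - first))

(Run it against the node's syslog, e.g. /var/log/messages or wherever the cluster logs end up.)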

Software is the latest version for SLES11 SP4 (pacemaker-1.1.12-18.1).

Regards,
Ulrich