[ClusterLabs] Q (SLES11 SP4): Delay after node came up (info: throttle_send_command: New throttle mode: 0000 (was ffffffff))
Ken Gaillot
kgaillot at redhat.com
Thu Feb 9 17:16:47 EST 2017
On 01/16/2017 04:25 AM, Ulrich Windl wrote:
> Hi!
>
> I have a question: The following happened in our 3-node cluster (n1, n2, n3):
> n3 was DC, n2 was offlined, n2 came online again, n1 rebooted (went offline/online), then n2 rebooted (offline/online)
>
> I observed a significant delay after all three nodes were online before resources were started. Actually the start seemed to be triggered by some crm restart action on n3.
>
> Logs on n3 (DC) look like this:
> cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/359, version=1.99.1)
> crmd: notice: handle_request: Current ping state: S_TRANSITION_ENGINE
> (...many more...)
> stonith-ng: info: plugin_handle_membership: Membership 3328: quorum retained
> crmd: info: plugin_handle_membership: Membership 3328: quorum retained
> [...]
> stonith-ng: info: plugin_handle_membership: Membership 3328: quorum retained
> [...]
> cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/365, version=1.99.3)
> crmd: info: crmd_cs_dispatch: Setting expected votes to 3
> crmd: info: plugin_handle_membership: Membership 3328: quorum retained
> [...]
> crmd: info: crmd_cs_dispatch: Setting expected votes to 3
> crmd: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
> crmd: info: do_dc_join_offer_one: An unknown node joined - (re-)offer to any unconfirmed nodes
> (??? What's that?)
This is normal when a node joins the cluster. The DC's cluster layer
detects any joins, and the DC's crmd responds to that by offering
membership to the new node(s).
> crmd: info: join_make_offer: Making join offers based on membership 3328
> crmd: info: join_make_offer: join-2: Sending offer to n2
> crmd: info: crm_update_peer_join: join_make_offer: Node n2[739512325] - join-2 phase 0 -> 1
> crmd: info: join_make_offer: Skipping n1: already known 4
> crmd: info: join_make_offer: Skipping n3: already known 4
Above we can see that n1 and n3 already have confirmed membership, and
the newly joined n2 gets offered membership.
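For reference, the numbers in those "phase X -> Y" messages are the
crmd's join phases. Going by the crm_join_phase enum in the 1.1 source
(my reading of it; treat the exact mapping as an assumption), a sketch:

    # Join phases as they appear in the "phase X -> Y" log fragments.
    JOIN_PHASES = {
        0: "none",        # no join in progress
        1: "welcomed",    # offer sent by the DC (join_make_offer)
        2: "integrated",  # join request accepted (do_dc_join_filter_offer)
        3: "finalized",   # CIB synced to the node (finalize_join_for)
        4: "confirmed",   # node state updated to member (do_dc_join_ack)
    }

    # "join-2 phase 0 -> 1" for n2 is the offer going out, and
    # "Skipping n1: already known 4" means n1 is already confirmed.
    print(JOIN_PHASES[1], JOIN_PHASES[4])

So a joining node walks 0 -> 1 -> 2 -> 3 -> 4, which is exactly the
progression visible for n2 further down.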
> crmd: notice: abort_transition_graph: Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
This is one of the common log messages I think can be improved. "Peer
Halt" in this case does not mean the peer halted, but rather that the
transition was halted due to a peer event (in this case the join).
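To make that concrete, a hypothetical sketch (the names here are mine,
not Pacemaker's actual API): the transition engine tracks one in-flight
graph, and a peer-level event simply flags it for re-evaluation with a
reason string.

    class TransitionEngine:
        # Illustration only; not Pacemaker's real implementation.
        def __init__(self):
            self.active_graph = None

        def abort(self, reason):
            # The reason ("Peer Halt", "Peer Cancelled", ...) names the
            # trigger; the policy engine then computes a new graph.
            print("Transition aborted: %s" % reason)
            self.active_graph = None

    te = TransitionEngine()
    te.abort("Peer Halt")  # a peer joined; nothing actually halted

The message records why the current graph was abandoned, not a node
failure.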
> cib: info: cib_process_request: Completed cib_modify operation for section crm_config: OK (rc=0, origin=local/crmd/375, version=1.99.5)
> crmd: info: crm_update_peer_join: do_dc_join_filter_offer: Node n2[739512325] - join-2 phase 1 -> 2
> crmd: info: crm_update_peer_expected: do_dc_join_filter_offer: Node n2[739512325] - expected state is now member (was down)
> crmd: info: abort_transition_graph: Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
> crmd: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
> crmd: info: crmd_join_phase_log: join-2: n2=integrated
> crmd: info: crmd_join_phase_log: join-2: n1=confirmed
> crmd: info: crmd_join_phase_log: join-2: n3=confirmed
> crmd: notice: do_dc_join_finalize: join-2: Syncing the CIB from n2 to the rest of the cluster
> [...]
> cib: info: cib_process_replace: Replaced 1.99.5 with 1.99.5 from n2
> cib: info: cib_process_request: Completed cib_replace operation for section 'all': OK (rc=0, origin=n2/crmd/376, version=1.99.5)
> crmd: info: crm_update_peer_join: finalize_join_for: Node n2[739512325] - join-2 phase 2 -> 3
> crmd: info: do_log: FSA: Input I_WAIT_FOR_EVENT from do_te_invoke() received in state S_FINALIZE_JOIN
> crmd: info: abort_transition_graph: Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
> [...]
> cib: info: cib_file_write_with_digest: Wrote version 1.99.0 of the CIB to disk (digest: 6e71ae6f4a1d2619cc64c91d40f55a32)
> (??? We already had 1.99.5)
Only .0's are written to disk -- the .x's contain updates to dynamic
information (like the status section) and are in-memory only.
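Put differently, the version is a triple, admin_epoch.epoch.num_updates,
and num_updates resets to 0 whenever the configuration itself changes.
A small sketch of how the two versions in these logs relate (the field
names are Pacemaker's; the helper is just for illustration):

    def parse_cib_version(v):
        # CIB versions are "admin_epoch.epoch.num_updates"
        admin_epoch, epoch, num_updates = (int(x) for x in v.split("."))
        return admin_epoch, epoch, num_updates

    on_disk   = parse_cib_version("1.99.0")   # cib_file_write_with_digest
    in_memory = parse_cib_version("1.99.5")   # live CIB at that moment

    # Same admin_epoch and epoch, so the on-disk configuration is current;
    # only num_updates differs, and those are in-memory status updates.
    assert on_disk[:2] == in_memory[:2] and on_disk[2] == 0

So 1.99.0 on disk and 1.99.5 in memory are the same configuration.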
> cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=n2/attrd/3, version=1.99.6)
> crmd: info: crm_update_peer_join: do_dc_join_ack: Node n2[739512325] - join-2 phase 3 -> 4
> crmd: info: do_dc_join_ack: join-2: Updating node state to member for n2
> [...]
> crmd: notice: handle_request: Current ping state: S_FINALIZE_JOIN
> crmd: info: do_log: FSA: Input I_WAIT_FOR_EVENT from do_te_invoke() received in state S_FINALIZE_JOIN
> [...]
> crmd: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
> crmd: notice: crm_update_quorum: Updating quorum status to true (call=385)
> crmd: info: abort_transition_graph: Transition aborted: Peer Cancelled (source=do_te_invoke:161, 0)
> [...]
> crmd: notice: handle_request: Current ping state: S_POLICY_ENGINE
> [...]
> (the following messages repeat for a long time)
> crmd: notice: handle_request: Current ping state: S_POLICY_ENGINE
> crmd: info: do_log: FSA: Input I_WAIT_FOR_EVENT from do_te_invoke() received in state S_POLICY_ENGINE
> crmd: info: abort_transition_graph: Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
> crmd: info: do_log: FSA: Input I_WAIT_FOR_EVENT from do_te_invoke() received in state S_POLICY_ENGINE
> crmd: info: abort_transition_graph: Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
> crmd: info: do_log: FSA: Input I_WAIT_FOR_EVENT from do_te_invoke() received in state S_POLICY_ENGINE
> crmd: info: abort_transition_graph: Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
> [...]
> crmd: info: abort_transition_graph: Transition aborted: Peer Halt (source=do_te_invoke:168, 0)
> lrmd: info: log_finished: finished - rsc:prm_xen_v09 action:stop call_id:146 pid:6420 exit-code:0 exec-time:331841ms queue-time:0ms
> [...]
> crmd: notice: handle_request: Current ping state: S_POLICY_ENGINE
> pengine: info: determine_online_status_fencing: Node n1 is active
> pengine: info: determine_online_status: Node n1 is online
> pengine: info: determine_online_status_fencing: Node n3 is active
> pengine: info: determine_online_status: Node n3 is online
> pengine: info: determine_online_status_fencing: Node n2 is active
> pengine: info: determine_online_status: Node n2 is online
> (now the cluster seems to become active...)
>
> On node n2:
> [...]
> crmd: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> attrd: notice: attrd_perform_update: Sent update 3: shutdown=0
> [...]
> crmd: info: throttle_send_command: New throttle mode: 0000 (was ffffffff)
> (...Here is the big delay when nothing seemed to happen)
> cib: info: cib_perform_op: Diff: --- 1.99.9 2
>
> On n1 the delay was between these two lines:
> cib: info: cib_process_ping: Reporting our current digest to n3: 497fb3b47c84a24f0cc7e127b219fdae for 1.99.9 (0xc7cbb0 0)
> cib: info: cib_perform_op: Diff: --- 1.99.9 2
>
> What may be causing this? The delay was almost three minutes.
Unfortunately, I don't have any ideas here.
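One aside, though: the throttle message from the subject line is most
likely not the culprit by itself. throttle_send_command reports the
node's load-based job throttling level to the DC. Going by the 1.1
throttle code (again my reading; treat the exact constants as an
assumption):

    THROTTLE_MODES = {
        0x0000: "none",    # no significant load; no job throttling
        0x0001: "low",
        0x0010: "medium",
        0x0100: "high",
        0x1000: "extreme",
    }
    UNSET = 0xffffffff  # sentinel before the first load report

    # "New throttle mode: 0000 (was ffffffff)" then reads: first load
    # report after startup, and n2 is not throttling jobs at all.
    print(THROTTLE_MODES[0x0000])

If anything, 0000 suggests the node was idle during the delay.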
> Software is the latest version for SLES11 SP4 (pacemaker-1.1.12-18.1).
>
> Regards,
> Ulrich