[Pacemaker] "Election Timeout" and node became the "Pending" state.

Andrew Beekhof andrew at beekhof.net
Thu Oct 7 04:04:47 EDT 2010


On Tue, Oct 5, 2010 at 6:44 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> Hi,
>
> We tested a complicated node-failure scenario.
>
> An "Election Timeout" error occurred during the test.
>
>  * Pacemaker:pacemaker-1.0.9.1
>  * heartbeat-3.0.3-2.3.el5
>  * cluster-glue:cluster-glue-1.0.6-1.6.el5
>  * resource-agents-1.0.3-1.0.dev.b7a3b1973ba7
>
> We tested it with the following procedure.
>
> Step1) Start all nodes.
> Step2) On node cgl49, we trigger a monitor error for prmApPostgreSQLDB1.
> Step3) Node cgl54 fences node cgl49 via STONITH.
> Step4) During Step3, we kill the master process on node cgl54.
> Step5) Node cgl54 reboots.
> Step6) STONITH is carried out against node cgl49.
> Step7) Node cgl53 is promoted to DC.
> Step8) STONITH is attempted against node cgl49 again.
>       However, because node cgl49 can be fenced only from node cgl54, the STONITH operation
> times out and loops.
>
> ============
> Last updated: Mon Aug 30 14:40:58 2010
> Stack: Heartbeat
> Current DC: cgl53 (a07bcfc0-7aee-4382-9a2b-711b9c93e7e9) - partition WITHOUT quorum
> Version: 1.0.9-74392a28b7f3 stable-1.0 tip
> 4 Nodes configured, unknown expected votes
> 16 Resources configured.
> ============
>
> Node cgl49 (979c05ea-442b-4f53-9ba7-6cb7e82f30ac): UNCLEAN (offline)
> Node cgl54 (9bea1025-3cbe-481f-830d-a24dfc7f0374): UNCLEAN (offline)
> Online: [ cgl50 cgl53 ]
>
> Step9) When node cgl54 comes back, a DC election is held, but an error occurs here.
>
>  * cgl50 node
>  crmd: [32110]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
>  crmd: [32110]: info: update_dc: Unset DC cgl53
>  (snip)
>  cgl50 crmd: [32110]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
>
>  * cgl53 node
>  crmd: [1325]: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
>  cgl53 crmd: [1325]: info: update_dc: Unset DC cgl53
>  (snip)
>  crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
>  (snip)
>  crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
>  (snip)
>  crmd: [1325]: info: crmd_ha_msg_filter: Another DC detected: cgl50 (op=join_offer)
>
>
> Step10) Node cgl53 enters the "Pending" state.
> It returns to the "online" state after the pending STONITH operation times out.
>
> Why did the "Election Timeout" occur?

Possibly the ccm membership hasn't fully recovered.

> Why did node cgl53 enter the "Pending" state?

This usually happens when we know the node is up, but we couldn't
complete the crm-level negotiation necessary for it to run resources.
Possibly it's in a bad state waiting for something to start, or its
replies are being lost.
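
To see how often the election timer fired on each node and whether a
second DC was seen, something like the following could be run over the
crmd log. This is only a rough sketch: the regular expressions are taken
from the excerpts quoted above and may not match other crmd versions, and
the names summarize and SAMPLE are hypothetical helpers, not part of
Pacemaker.

```python
import re
from collections import Counter

# Patterns copied from the quoted log excerpts (assumption: other crmd
# versions may word these messages differently).
TRANSITION = re.compile(r"State transition (\S+) -> (\S+)")
TIMEOUT = re.compile(r"ERROR: crm_timer_popped: Election Timeout")
OTHER_DC = re.compile(r"Another DC detected: (\S+)")

# The cgl53 excerpt from the report, with the wrapped line rejoined.
SAMPLE = [
    "crmd: [1325]: info: do_state_transition: State transition "
    "S_INTEGRATION -> S_ELECTION [ input=I_ELECTION "
    "cause=C_FSA_INTERNAL origin=do_election_count_vote ]",
    "cgl53 crmd: [1325]: info: update_dc: Unset DC cgl53",
    "crmd: [1325]: ERROR: crm_timer_popped: Election Timeout "
    "(I_ELECTION_DC) just popped!",
    "crmd: [1325]: ERROR: crm_timer_popped: Election Timeout "
    "(I_ELECTION_DC) just popped!",
    "crmd: [1325]: info: crmd_ha_msg_filter: Another DC detected: "
    "cgl50 (op=join_offer)",
]


def summarize(lines):
    """Count state transitions, election timeouts and rival DC sightings."""
    transitions = Counter()
    timeouts = 0
    rival_dcs = Counter()
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            transitions[(match.group(1), match.group(2))] += 1
        if TIMEOUT.search(line):
            timeouts += 1
        match = OTHER_DC.search(line)
        if match:
            rival_dcs[match.group(1)] += 1
    return transitions, timeouts, rival_dcs


if __name__ == "__main__":
    transitions, timeouts, rival_dcs = summarize(SAMPLE)
    print("transitions:", dict(transitions))
    print("election timeouts:", timeouts)
    print("other DCs seen:", dict(rival_dcs))
```

Repeated timeouts followed by "Another DC detected" on the same node
would at least be consistent with the membership not having settled, but
the counts alone don't prove it.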

> This may be a ccm problem.
> In addition, the same problem may already have been reported.
>
>
>  * Because the log file is large, I filed the same details in Bugzilla.
>  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2502

OK, I'll follow up there.



