[Pacemaker] "Election Timeout" and a node entering the "Pending" state.

renayama19661014 at ybb.ne.jp
Tue Oct 5 00:44:11 EDT 2010


Hi,

We tested a complicated node failure scenario.

An "Election Timeout" error occurred during the test.

 * Pacemaker:pacemaker-1.0.9.1
 * heartbeat-3.0.3-2.3.el5 
 * cluster-glue:cluster-glue-1.0.6-1.6.el5 
 * resource-agents-1.0.3-1.0.dev.b7a3b1973ba7 

We tested with the following procedure.

Step1) Start all nodes.
Step2) On the cgl49 node, we cause a monitor error of prmApPostgreSQLDB1.
Step3) The cgl49 node is fenced (STONITH) by the cgl54 node.
Step4) During Step3, we kill the master process on the cgl54 node.
Step5) The cgl54 node reboots.
Step6) The cgl49 node is fenced (STONITH).
Step7) The cgl53 node is promoted to DC.
Step8) STONITH of the cgl49 node is attempted again.
       However, because only the cgl54 node can STONITH the cgl49 node, the STONITH operation
times out and is retried in a loop.

============
Last updated: Mon Aug 30 14:40:58 2010
Stack: Heartbeat
Current DC: cgl53 (a07bcfc0-7aee-4382-9a2b-711b9c93e7e9) - partition WITHOUT quorum
Version: 1.0.9-74392a28b7f3 stable-1.0 tip
4 Nodes configured, unknown expected votes
16 Resources configured.
============

Node cgl49 (979c05ea-442b-4f53-9ba7-6cb7e82f30ac): UNCLEAN (offline)
Node cgl54 (9bea1025-3cbe-481f-830d-a24dfc7f0374): UNCLEAN (offline)
Online: [ cgl50 cgl53 ]

Step9) When the cgl54 node recovers, a DC election is performed, but an error occurs at this point.

 * cgl50 node 
 crmd: [32110]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
 crmd: [32110]: info: update_dc: Unset DC cgl53
 (snip)
 cgl50 crmd: [32110]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!

 * cgl53 node 
 crmd: [1325]: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
 cgl53 crmd: [1325]: info: update_dc: Unset DC cgl53
 (snip)
 crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
 (snip)
 crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
 (snip)
 crmd: [1325]: info: crmd_ha_msg_filter: Another DC detected: cgl50 (op=join_offer)


Step10) The cgl53 node enters the "Pending" state.
The cgl53 node returns to the "online" state after the pending STONITH operation times out.
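
The node states seen during these steps can be collected from crm_mon output with a small script. The following is only a minimal sketch in Python, assuming the two line formats that appear in the crm_mon output in this mail ("Node NAME (UUID): STATUS" and "Online: [ ... ]"). The function name parse_node_status is hypothetical, and other formats, such as how a pending node is displayed, are not handled.

```python
import re

# Sample text copied from the crm_mon output in this mail.
SAMPLE = """\
Node cgl49 (979c05ea-442b-4f53-9ba7-6cb7e82f30ac): UNCLEAN (offline)
Node cgl54 (9bea1025-3cbe-481f-830d-a24dfc7f0374): UNCLEAN (offline)
Online: [ cgl50 cgl53 ]
"""

# Assumption: only these two line formats are recognized.
NODE_LINE = re.compile(r"^Node\s+(\S+)\s+\(([0-9a-f-]+)\):\s+(.+)$")
ONLINE_LINE = re.compile(r"^Online:\s+\[\s*(.*?)\s*\]$")


def parse_node_status(text):
    """Return {node_name: status} for the line formats shown above."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        m = NODE_LINE.match(line)
        if m:
            # Keep the status text exactly as crm_mon printed it.
            status[m.group(1)] = m.group(3)
            continue
        m = ONLINE_LINE.match(line)
        if m:
            for name in m.group(1).split():
                status[name] = "online"
    return status


if __name__ == "__main__":
    for name, state in sorted(parse_node_status(SAMPLE).items()):
        print(name, state)
```

Running this at each step would show when a node leaves the UNCLEAN state, which may help to line up the state changes with the crmd logs.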

Why did the "Election Timeout" occur?
Why did the cgl53 node enter the "Pending" state?

This may be a problem in ccm.
Also, the same problem may already have been reported.


 * Because the log file is large, I have also registered this report in Bugzilla.
  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2502
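
For anyone reading the large log, a small filter may help to find the relevant lines. This is only a rough sketch in Python, assuming the crmd line layout seen in the excerpts in this mail ("crmd: [PID]: LEVEL: function: message", with an optional host name in front). The function name count_election_timeouts is hypothetical.

```python
import re
from collections import Counter

# Log lines copied from the excerpts in this mail.
SAMPLE_LOG = """\
 cgl50 crmd: [32110]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
 crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
 crmd: [1325]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
 crmd: [1325]: info: crmd_ha_msg_filter: Another DC detected: cgl50 (op=join_offer)
"""

# Assumption: every crmd line contains "crmd: [PID]: LEVEL: function: message".
CRMD_LINE = re.compile(r"crmd: \[(\d+)\]: (\w+): (\w+): (.*)$")


def count_election_timeouts(lines):
    """Count 'Election Timeout' messages per crmd PID."""
    counts = Counter()
    for line in lines:
        m = CRMD_LINE.search(line)
        if m and m.group(3) == "crm_timer_popped" and "Election Timeout" in m.group(4):
            counts[int(m.group(1))] += 1
    return counts


if __name__ == "__main__":
    for pid, n in count_election_timeouts(SAMPLE_LOG.splitlines()).items():
        print(pid, n)
```

The same function can be given an open log file instead of the sample lines, because it only iterates over lines.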

Best Regards,
Hideo Yamauchi.





