[Pacemaker] Need help with resolving very long election cycle

Shyam shyam.kaushik at gmail.com
Thu Feb 2 05:57:04 EST 2012


Hi Andreas,

Yes, this is only for testing. In this specific test the two VMs were not
running on the same host. We have two physical servers, each running one VM,
and the VMs run pacemaker/heartbeat. We reboot both physical servers (to
simulate a power failure) and then watch the two VMs negotiate.
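
For anyone who wants to check whether their own logs show the same loop,
here is a minimal Python sketch. It is not part of Pacemaker: the function
name summarize and the SAMPLE text are illustrative only, and the sample
lines are copied from the crmd log quoted further down in this thread. It
counts the FSA state transitions and lists the election rounds seen, on the
assumption that the log entries are unquoted syslog text.

```python
import re
from collections import Counter

# Patterns match the crmd messages quoted in this thread.
TRANSITION = re.compile(r"State transition (S_\w+) -> (S_\w+)")
ELECTION = re.compile(r"do_election_count_vote: Election (\d+)")


def summarize(log_text):
    """Return (Counter of state transitions, sorted list of election rounds)."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    transitions = Counter(TRANSITION.findall(flat))
    elections = sorted({int(n) for n in ELECTION.findall(flat)})
    return transitions, elections


SAMPLE = """\
Jan 26 22:03:20 vsa-0000009c-vc-1 crmd: [1134]: info: do_election_count_vote: Election 50 (owner: 00000156-0156-0000-2b91-000000000000) pass: vote from vsa-0000009c-vc-0 (Age)
Jan 26 22:03:20 vsa-0000009c-vc-1 crmd: [1134]: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Jan 26 22:03:21 vsa-0000009c-vc-1 crmd: [1134]: info: do_election_count_vote: Election 51 (owner: 00000156-0156-0000-2b91-000000000000) pass: vote from vsa-0000009c-vc-0 (Age)
Jan 26 22:03:22 vsa-0000009c-vc-1 crmd: [1134]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
"""

if __name__ == "__main__":
    transitions, elections = summarize(SAMPLE)
    for (old, new), count in sorted(transitions.items()):
        print(f"{old} -> {new}: {count}")
    print("election rounds seen:", elections)
```

A steadily climbing election number together with repeated
S_INTEGRATION -> S_ELECTION transitions is the pattern we were seeing.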

--Shyam

On Thu, Feb 2, 2012 at 3:38 PM, Andreas Kurz <andreas at hastexo.com> wrote:

> On 02/02/2012 04:45 AM, Shyam wrote:
> > Hi Andreas,
> >
> > Thanks for your reply.
> >
> > We are using pacemaker in a VM environment and were primarily checking
> > how it behaves when the two nodes hosting the clustered VMs reboot. It
> > apparently took a very long time doing the elections.
>
> Ok, but this is only for testing? For a production system the VMs
> running a cluster should not run on the same host as this would be a SPOF.
>
> >
> > I realized that we were using dc-deadtime at 5sec. After bumping this up
> > to 10sec, this long election cycle problem disappeared.
>
> ... interesting
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
> >
> > --Shyam
> >
> > On Thu, Feb 2, 2012 at 3:59 AM, Andreas Kurz <andreas at hastexo.com> wrote:
> >
> >     On 01/27/2012 12:21 PM, Shyam wrote:
> >     > Folks,
> >     >
> >     > We are constantly running into a long election cycle: when both nodes
> >     > of a 2-node cluster are rebooted simultaneously, they take a long time
> >     > running through the election loop.
> >
> >     why do you want to reboot them simultaneously? ... stop them one after
> >     another and this will work fine.
> >
> >     If you want to avoid time consuming resource movement use cluster
> >     property stop-all-resources prior to the serialized shutdown.
> >
> >     Regards,
> >     Andreas
> >
> >     --
> >     Need help with Pacemaker?
> >     http://www.hastexo.com/now
> >
> >     >
> >     > On one node pacemaker loops like:
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 crmd: [1134]: info: do_dc_takeover: Taking over DC status for this partition
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_readwrite: We are now in R/O mode
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_request: Operation complete: op cib_slave_all for section 'all' (origin=local/crmd/222, version=1.1.1): ok (rc=0)
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_readwrite: We are now in R/W mode
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/223, version=1.1.1): ok (rc=0)
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/224, version=1.1.1): ok (rc=0)
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/226, version=1.1.1): ok (rc=0)
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 crmd: [1134]: info: do_dc_join_offer_all: join-25: Waiting on 2 outstanding join acks
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/228, version=1.1.1): ok (rc=0)
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 crmd: [1134]: info: config_query_callback: Checking for expired actions every 900000ms
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 crmd: [1134]: info: do_election_count_vote: Election 50 (owner: 00000156-0156-0000-2b91-000000000000) pass: vote from vsa-0000009c-vc-0 (Age)
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 crmd: [1134]: info: update_dc: Set DC to vsa-0000009c-vc-1 (3.0.1)
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 crmd: [1134]: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> >     > Jan 26 22:03:20 vsa-0000009c-vc-1 crmd: [1134]: info: update_dc: Unset DC vsa-0000009c-vc-1
> >     > Jan 26 22:03:21 vsa-0000009c-vc-1 crmd: [1134]: info: do_election_count_vote: Election 51 (owner: 00000156-0156-0000-2b91-000000000000) pass: vote from vsa-0000009c-vc-0 (Age)
> >     > Jan 26 22:03:21 vsa-0000009c-vc-1 crmd: [1134]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message() received in state S_ELECTION
> >     > Jan 26 22:03:22 vsa-0000009c-vc-1 crmd: [1134]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> >     > Jan 26 22:03:22 vsa-0000009c-vc-1 crmd: [1134]: info: start_subsystem: Starting sub-system "pengine"
> >     > Jan 26 22:03:22 vsa-0000009c-vc-1 crmd: [1134]: WARN: start_subsystem: Client pengine already running as pid 1234
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 crmd: [1134]: info: do_dc_takeover: Taking over DC status for this partition
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_readwrite: We are now in R/O mode
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_request: Operation complete: op cib_slave_all for section 'all' (origin=local/crmd/231, version=1.1.1): ok (rc=0)
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_readwrite: We are now in R/W mode
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/232, version=1.1.1): ok (rc=0)
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/233, version=1.1.1): ok (rc=0)
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/235, version=1.1.1): ok (rc=0)
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 crmd: [1134]: info: do_dc_join_offer_all: join-26: Waiting on 2 outstanding join acks
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 cib: [1130]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/237, version=1.1.1): ok (rc=0)
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 crmd: [1134]: info: config_query_callback: Checking for expired actions every 900000ms
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 crmd: [1134]: info: do_election_count_vote: Election 52 (owner: 00000156-0156-0000-2b91-000000000000) pass: vote from vsa-0000009c-vc-0 (Age)
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 crmd: [1134]: info: update_dc: Set DC to vsa-0000009c-vc-1 (3.0.1)
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 crmd: [1134]: info: do_state_transition: State transition S_INTEGRATION -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> >     > Jan 26 22:03:26 vsa-0000009c-vc-1 crmd: [1134]: info: update_dc: Unset DC vsa-0000009c-vc-1
> >     > Jan 26 22:03:27 vsa-0000009c-vc-1 crmd: [1134]: info: do_election_count_vote: Election 53 (owner: 00000156-0156-0000-2b91-000000000000) pass: vote from vsa-0000009c-vc-0 (Age)
> >     > Jan 26 22:03:27 vsa-0000009c-vc-1 crmd: [1134]: WARN: do_log: FSA: Input I_JOIN_REQUEST from route_message() received in state S_ELECTION
> >     > Jan 26 22:03:28 vsa-0000009c-vc-1 crmd: [1134]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> >     > Jan 26 22:03:28 vsa-0000009c-vc-1 crmd: [1134]: info: start_subsystem: Starting sub-system "pengine"
> >     > Jan 26 22:03:28 vsa-0000009c-vc-1 crmd: [1134]: WARN: start_subsystem: Client pengine already running as pid 1234
> >     >
> >     > &  other node with
> >     > Jan 26 22:03:20 vsa-0000009c-vc-0 crmd: [1314]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
> >     > Jan 26 22:03:20 vsa-0000009c-vc-0 crmd: [1314]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> >     > Jan 26 22:03:20 vsa-0000009c-vc-0 crmd: [1314]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
> >     > Jan 26 22:03:21 vsa-0000009c-vc-0 crmd: [1314]: WARN: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION
> >     > Jan 26 22:03:21 vsa-0000009c-vc-0 crmd: [1314]: info: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> >     > Jan 26 22:03:21 vsa-0000009c-vc-0 crmd: [1314]: info: do_dc_release: DC role released
> >     > Jan 26 22:03:21 vsa-0000009c-vc-0 crmd: [1314]: info: do_te_control: Transitioner is now inactive
> >     > Jan 26 22:03:26 vsa-0000009c-vc-0 crmd: [1314]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
> >     > Jan 26 22:03:26 vsa-0000009c-vc-0 crmd: [1314]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> >     > Jan 26 22:03:26 vsa-0000009c-vc-0 crmd: [1314]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
> >     > Jan 26 22:03:27 vsa-0000009c-vc-0 crmd: [1314]: WARN: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION
> >     > Jan 26 22:03:27 vsa-0000009c-vc-0 crmd: [1314]: info: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> >     > Jan 26 22:03:27 vsa-0000009c-vc-0 crmd: [1314]: info: do_dc_release: DC role released
> >     > Jan 26 22:03:27 vsa-0000009c-vc-0 crmd: [1314]: info: do_te_control: Transitioner is now inactive
> >     >
> >     > This takes several minutes & finally breaks.
> >     >
> >     > Any pointers on what can be causing this?
> >     >
> >     > Thanks.
> >     >
> >     > --Shyam
> >     >
> >     >
> >     > _______________________________________________
> >     > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >     <mailto:Pacemaker at oss.clusterlabs.org>
> >     > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >     >
> >     > Project Home: http://www.clusterlabs.org
> >     > Getting started:
> >     http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >     > Bugs: http://bugs.clusterlabs.org