[Pacemaker] Problems when quorum lost for a short period of time

Lev Sidorenko levs at securemedia.co.nz
Thu Oct 3 15:25:30 EDT 2013


On Wed, 2013-10-02 at 10:40 +0200, Lars Marowsky-Bree wrote:
> On 2013-10-02T09:26:26, Lev Sidorenko <levs at securemedia.co.nz> wrote:
> 
> > It is actually 2 nodes for main+standby and another two nodes just to
> > provide quorum.
> 
> Like Andrew wrote, a third node would be enough for that purpose.
> 
> You might as well run an iSCSI target on that node (instead of the full
> cluster stack) and use sbd to provide fencing and a quorum protocol with
> implied self-fencing.
> 
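That is an interesting idea. For my own notes, I think the sbd route would
look roughly like the sketch below (untested here, and the device name is
just a placeholder for whatever the iSCSI LUN shows up as):

    # initialise sbd metadata on a small shared LUN exported over iSCSI
    # from the third machine (run once; device name is a placeholder)
    sbd -d /dev/disk/by-id/our-shared-lun create

    # then point a stonith resource at it, e.g. with the crm shell
    crm configure primitive stonith-sbd stonith:external/sbd \
            params sbd_device="/dev/disk/by-id/our-shared-lun"
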
> > I have no-quorum-policy="stop"
> 
> If you want to be more tolerant of blips, you might consider changing
> this to "freeze". Then you'll be fine - the surviving nodes will attain
> quorum and fence the node if the issue persists.
> 
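Switching no-quorum-policy to "freeze" sounds like a sensible interim
measure while we chase the network issue. If I understand correctly, the
change would be something like this (crm shell; I have not applied it yet):

    # switch the cluster-wide quorum policy from "stop" to "freeze" so that
    # running resources are left alone (not stopped or recovered) while
    # quorum is briefly lost
    crm configure property no-quorum-policy=freeze
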
> > So, sometimes the main node loses connection to the cluster and reports
> > "quorum lost", but after 1-2 seconds the connection re-establishes and
> > it reports "quorum retained".
> 
> The main problem of course is this. *Why* are you losing network
> connectivity so frequently that this is a problem? I assume you have
> multiple network interfaces? (Which certainly are cheaper to get than
> more nodes ...)

Yes, we are also investigating the network problem.

> 
> You should investigate and fix the underlying problem.
> 
> You can also tweak the timeouts in corosync.conf.

I found several options here:
http://linux.die.net/man/5/corosync.conf
which look like they can be used to increase the timeout before the
cluster detects a communication failure and triggers "no quorum". They
are:

- token
- merge
- fail_recv_const
- seqno_unchanged_const
- heartbeat_failures_allowed
- max_network_delay
- rrp_problem_count_timeout
- rrp_problem_count_threshold
- rrp_problem_count_mcast_threshold

Which of these is better to use in that situation?
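
For context, the kind of change I had in mind (a rough sketch only, with
guessed values - please correct me if these are the wrong knobs) would go
in the totem section of corosync.conf:

    totem {
            # existing settings (version, interface blocks, etc.) unchanged

            # token: time in ms corosync waits for the token before it
            # declares a membership change; the default is 1000 ms, so a
            # 1-2 second blip is currently enough to drop quorum
            token: 5000

            # consensus must stay larger than token (roughly 1.2 * token),
            # so raise it together with token
            consensus: 6000
    }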

> 
> 
> 
> Regards,
>     Lars
> 





