[ClusterLabs] temporary loss of quorum when member starts to rejoin

Tue Apr 7 17:02:30 EDT 2020

On Tue, 7 Apr 2020 14:13:35 -0400
Sherrard Burton <sb-clusterlabs at allafrica.com> wrote:

> On 4/7/20 1:16 PM, Andrei Borzenkov wrote:
> > 07.04.2020 00:21, Sherrard Burton wrote:
> >>>
> >>> It looks like a timing issue or race condition. After reboot, the node
> >>> manages to contact qnetd first, before the connection to the other node
> >>> is established. Qnetd behaves as documented: it sees two equal-size
> >>> partitions and favors the partition that includes the tie breaker (lowest
> >>> node id). So the existing node goes out of quorum. A second later both
> >>> nodes see each other and so quorum is regained.
> >>  
> > 
> > Define the right problem to solve?
> > 
> > My educated guess is that your problem is not corosync but pacemaker
> > stopping resources. In this case just do what has been done for years in
> > two-node clusters: set no-quorum-policy=ignore and rely on stonith to
> > resolve split brain.
> > 
> > I dropped the idea of using qdevice in a two-node cluster. If you have a
> > reliable stonith device it is not needed, and without stonith, relying on
> > watchdog suicide has too many problems.
> >   
> 
> Andrei,
> in a two-node cluster with stonith only, but no qdevice, how do you 
> avoid the dreaded stonith death match, and the resultant flip-flopping 
> of services?

In my understanding, two_node and wait_for_all should avoid this.

After node A has been fenced, node B keeps quorum thanks to two_node.
When A comes back, it will not be quorate until it is able to join the corosync
group, thanks to wait_for_all. No quorum, no fencing allowed: A cannot shoot B
while it is still on its own.
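
For reference, a minimal sketch of the quorum section of corosync.conf that
this relies on, assuming a two-node cluster with stonith and no qdevice. The
totem and nodelist sections are left out, and the values are only
illustrative, so check them against votequorum(5) for your version:

```
quorum {
    provider: corosync_votequorum
    # Lets the surviving node keep quorum when its peer is down or fenced.
    two_node: 1
    # Implied by two_node: 1; written out here only to make it explicit.
    # A freshly booted node is not quorate until it has seen its peer.
    wait_for_all: 1
}
```

As far as I know, two_node is not meant to be combined with a qdevice, so this
only applies to the stonith-only setup discussed here.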

But the best protection is to disable pacemaker on boot, so an admin can
investigate the situation and rejoin the node to the cluster safely.

Regards,