[ClusterLabs] temporary loss of quorum when member starts to rejoin

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Tue Apr 7 17:02:30 EDT 2020


On Tue, 7 Apr 2020 14:13:35 -0400
Sherrard Burton <sb-clusterlabs at allafrica.com> wrote:

> On 4/7/20 1:16 PM, Andrei Borzenkov wrote:
> > 07.04.2020 00:21, Sherrard Burton wrote:
> >>>
> >>> It looks like some timing issue or race condition. After a reboot,
> >>> the node manages to contact qnetd first, before the connection to the
> >>> other node is established. Qnetd behaves as documented - it sees two
> >>> equal-size partitions and favors the partition that includes the tie
> >>> breaker (lowest node id). So the existing node goes out of quorum. A
> >>> second later, both nodes see each other and quorum is regained.
> >>  
> > 
> > First, define the right problem to solve.
> > 
> > An educated guess is that your problem is not corosync but Pacemaker
> > stopping resources. In that case, just do what was done for years in
> > two-node clusters - set no-quorum-policy=ignore and rely on stonith
> > to resolve split brain.
> > 
> > I dropped the idea of using qdevice in a two-node cluster. If you
> > have a reliable stonith device, it is not needed, and without stonith,
> > relying on watchdog suicide has too many problems.
> >   
> 
> Andrei,
> in a two-node cluster with stonith only, but no qdevice, how do you 
> avoid the dreaded stonith death match, and the resultant flip-flopping 
> of services?
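
(A side note on Andrei's suggestion above: no-quorum-policy is a Pacemaker
cluster property, so setting it would look something like the following,
depending on which cluster shell you use:

    # with pcs
    pcs property set no-quorum-policy=ignore

    # or with crmsh
    crm configure property no-quorum-policy=ignore

Both set the same CIB property; this is only relevant if you go the
stonith-only route.)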

In my understanding, two_node and wait_for_all should avoid this.

After node A has been fenced, node B keeps quorum thanks to two_node.
When A comes back, as long as it is unable to join the corosync group, it
will not be quorate, thanks to wait_for_all. No quorum, no fencing allowed.
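
For illustration, a minimal corosync.conf quorum section for such a
two-node setup could look like this (a sketch, not a complete config):

    quorum {
        provider: corosync_votequorum
        # the surviving node keeps quorum when the peer is fenced or lost
        two_node: 1
        # implied by two_node, shown for clarity: a booting node stays
        # inquorate until it has seen the other member at least once
        wait_for_all: 1
    }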

But the best protection is to disable pacemaker on boot, so an admin can
investigate the situation and safely join the node back to the cluster.
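
On a systemd-based distribution, that usually amounts to something like:

    # keep the cluster stack from starting automatically after a reboot
    systemctl disable corosync pacemaker

    # once the node has been checked, start it by hand
    systemctl start corosync pacemaker

(pcs users can get the same effect with "pcs cluster disable" and
"pcs cluster start".)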

Regards,

