[ClusterLabs] resource start after network reconnected

Fri Nov 19 12:45:11 EST 2021

On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:

<snip>

> > If pacemaker tries to stop resources due to out of quorum
> > condition, you
> > could set suitable failure-timeout; this will be equivalent to
> > using "pcs
> > resource refresh". Keep in mind that pacemaker only checks for
> > failure-timeout expiration every cluster-recheck-interval (15 

That's true only for Pacemaker versions before 2.0.3; since 2.0.3,
the cluster rechecks as soon as the failure-timeout expires.

> > minutes by
> > default). This still is not directly related to network
> > availability, but
> > if network outage resulted in node going out of quorum, when
> > network is
> > back and node joined cluster again it will allow resources to be
> > started
> > on node.
> > 
> 
> When quorum is lost I want all the resources to stop.  The cluster is
> performing this step correctly for me.

That's fine as long as the cluster is working properly. If quorum is lost because one of the
nodes is malfunctioning -- maybe a device driver locked up the system,
or CPU wait is horrific due to an out-of-control process or disk
failure -- then that node will not know quorum has been lost and will
not stop resources. If the condition then clears up, suddenly you have
split-brain with two nodes running resources.

> 
> That cluster-recheck-interval would explain the intermittence I saw
> this
> morning.  If I set that to 1 minute would that cause any gross
> negative
> issues?

It increases CPU usage and IPC traffic. For Pacemaker 2.0.3 or later, I
definitely wouldn't bother. For older versions, 1 minute feels a bit
much; I would go with around 5 minutes.
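
As a rough sketch of that advice, something like the following could
check the installed version and only shorten the interval on older
releases. The helper name older_than_2_0_3 is made up for
illustration, and the script assumes GNU "sort -V" and that the first
line of "pacemakerd --version" reads "Pacemaker X.Y.Z"; verify both on
your system before relying on it.

```sh
#!/bin/sh
# Sketch only. Assumptions: GNU "sort -V" is available, the first line of
# "pacemakerd --version" looks like "Pacemaker X.Y.Z...", and 5min is the
# interval you want on older releases.

# Hypothetical helper: succeeds (exit 0) when version $1 is older than 2.0.3.
older_than_2_0_3() {
    v=${1%%-*}                      # drop any packaging suffix such as "-5.el8"
    [ -n "$v" ] && [ "$v" != "2.0.3" ] &&
        [ "$(printf '%s\n' "$v" 2.0.3 | sort -V | head -n 1)" = "$v" ]
}

# Only touch the cluster when the Pacemaker tools are actually installed.
if command -v pacemakerd >/dev/null 2>&1 && command -v pcs >/dev/null 2>&1; then
    ver=$(pacemakerd --version | awk 'NR == 1 { print $2 }')
    if older_than_2_0_3 "$ver"; then
        # Older releases notice an expired failure-timeout only at the next recheck.
        pcs property set cluster-recheck-interval=5min
    else
        echo "Pacemaker $ver rechecks when the timeout expires; leaving the interval alone"
    fi
fi
```

Run it on one cluster node as root; cluster properties are stored in
the CIB, so the change applies cluster-wide.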

> 
> Is there another setting besides cluster-recheck-interval to consider
> adjusting to start mysql when quorum is returned?
> 
> Thank you for the feedback.
> 
> -John

-- 
Ken Gaillot <kgaillot at redhat.com>