[ClusterLabs] resource start after network reconnected

Fri Nov 19 15:49:20 EST 2021

On Fri, 2021-11-19 at 14:57 -0500, john tillman wrote:
> > On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:
> > 
> > <snip>
> > 
> > > > If Pacemaker tries to stop resources due to an out-of-quorum
> > > > condition, you could set a suitable failure-timeout; this will
> > > > be equivalent to using "pcs resource refresh". Keep in mind that
> > > > Pacemaker only checks for failure-timeout expiration every
> > > > cluster-recheck-interval (15
> > 
> > That's true only for Pacemaker versions before 2.0.3; since 2.0.3,
> > the cluster rechecks as soon as the timeout hits.
> 
> I'm using pacemaker 2.0.5 and it is *not* starting MySQL when quorum
> is restored, at least not every time (~1 in 10).  So I have seen it
> work

That's due to a stop failure, not the recheck interval.

> before but I'm more willing to believe that there was a user error in
> that one successful sample.
> 
> We (actually a teammate) got mysql to start when quorum is restored.
> It required both setting the cluster-recheck-interval to something
> more frequent than 15 minutes and setting the mysql resource's
> failure-timeout to non-zero.  In our case we set both to 1 minute,
> with good results for the last few tests.  We can raise the interval
> to something greater than 1 minute, but for our tests 1 minute
> proves it out.

The failure-timeout is equivalent to running refresh when the timeout
hits. The cluster will then re-probe the status of the resource and
decide what, if anything, needs to be done about it.
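
As a rough sketch of what that looks like with pcs (not a tested recipe):
the resource ID "mysql", the 60s timeout, and the 5min recheck interval
are assumptions for illustration, and the run() wrapper is a hypothetical
helper that only prints the commands unless APPLY=1 is set. The version
check reflects the point above that the recheck interval only matters
for failure-timeout on Pacemaker older than 2.0.3.

```shell
#!/bin/sh
# Sketch only. "mysql" is an assumed resource ID and the values are examples.
RESOURCE="${RESOURCE:-mysql}"
FAILURE_TIMEOUT="${FAILURE_TIMEOUT:-60s}"
# Version as reported by "pacemakerd --version"; 2.0.5 is the one in this thread.
PCMK_VERSION="${PCMK_VERSION:-2.0.5}"

# Hypothetical helper: print the command by default, run it only if APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# True if version $1 >= version $2 (relies on "sort -V" from GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Let recorded failures expire on their own; on old Pacemaker versions,
# also shorten the recheck interval so the expiration is noticed sooner.
configure() {
    run pcs resource meta "$RESOURCE" "failure-timeout=$FAILURE_TIMEOUT"
    if ! version_ge "$PCMK_VERSION" 2.0.3; then
        run pcs property set cluster-recheck-interval=5min
    fi
}

# Manual equivalent of the timeout expiring: forget history and re-probe now.
refresh_now() {
    run pcs resource refresh "$RESOURCE"
}

configure
```

Run as-is, this only prints the commands it would issue, so they can be
reviewed first; refresh_now is left uncalled because it is the manual
alternative rather than part of the configuration.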

I can only see that working if the stop failure is transient -- i.e.,
either the stop actually succeeded but returned a failure code (or
maybe timed out), and when the failure timeout or refresh happens, the
re-probe sees the database is actually not running; or the stop really
does fail, but by the time the failure timeout or refresh happens,
another stop attempt after the re-probe is able to succeed.
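
To tell which of those cases applies, the recorded failure has to be
looked at before the failure timeout or a refresh clears it. A minimal
sketch, again assuming a resource ID of "mysql" and using a hypothetical
run() wrapper that prints the commands unless APPLY=1 is set:

```shell
#!/bin/sh
# Sketch only. "mysql" is an assumed resource ID; substitute your own.
RESOURCE="${RESOURCE:-mysql}"

# Hypothetical helper: print the command by default, run it only if APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Show the fail count for the resource, then a one-shot cluster status
# that includes fail counts and any failed actions (such as a failed stop).
show_failures() {
    run pcs resource failcount show "$RESOURCE"
    run crm_mon --one-shot --failcounts
}

show_failures
```

If a failed stop shows up there, the resource agent's own log output
from around that time is the place to look for why the stop failed.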

> 
> 
> > > > minutes by default). This is still not directly related to
> > > > network availability, but if a network outage caused the node
> > > > to lose quorum, then once the network is back and the node has
> > > > rejoined the cluster, resources will be allowed to start on
> > > > that node again.
> > > > 
> > > 
> > > When quorum is lost I want all the resources to stop.  The cluster
> > > is performing this step correctly for me.
> > 
> > As long as it's working properly. If quorum is lost because one of
> > the nodes is malfunctioning -- maybe a device driver locked up the
> > system, or CPU wait is horrific due to an out-of-control process or
> > disk failure -- then that node will not know quorum has been lost
> > and will not stop resources. If the condition then clears up,
> > suddenly you have split-brain with two nodes running resources.
> > 
> > > That cluster-recheck-interval would explain the intermittence I
> > > saw this morning.  If I set that to 1 minute, would that cause any
> > > gross negative issues?
> > 
> > It increases CPU usage and IPC traffic. For Pacemaker 2.0.3 or
> > later, I definitely wouldn't bother. For older versions, 1 minute
> > feels a bit much; I would go with around 5 minutes.
> > 
> > > Is there another setting besides cluster-recheck-interval to
> > > consider adjusting to start mysql when quorum is returned?
> > > 
> > > Thank you for the feedback.
> > > 
> > > -John
> > 
> > --
> > Ken Gaillot <kgaillot at redhat.com>
> > 
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > ClusterLabs home: https://www.clusterlabs.org/
> > 
> > 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>