[ClusterLabs] Antw: [EXT] no-quorum-policy=stop never executed, pacemaker stuck in election/integration, corosync running in "new membership" cycles with itself

Ken Gaillot kgaillot at redhat.com
Tue Jun 1 11:31:21 EDT 2021


On Tue, 2021-06-01 at 13:18 +0200, Ulrich Windl wrote:
> Hi!
> 
> I can't answer, but I doubt the usefulness of "no-quorum-policy=stop":
> If nodes lose quorum, they try to stop all resources, but "remain" in
> the cluster (they will respond to network queries, if any arrive).
> If one of those "stop"s fails, the other part of the cluster never
> knows.
> So what can be done? Should the "other (left)" part of the cluster
> start resources, assuming the "other (right)" part of the cluster had
> stopped resources successfully?


no-quorum-policy only affects what the non-quorate partition will do.
The quorate partition will still fence the non-quorate part if it is
able, regardless of no-quorum-policy, and won't recover resources until
fencing succeeds.
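
To make that split of responsibilities concrete, here is a toy model in
Python. It is only an illustration of the rule described above, not
Pacemaker's actual logic: the Partition class, the simple-majority quorum
check, and the partition_actions() helper are all hypothetical names made
up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    nodes: set          # node names visible in this partition
    total_nodes: int    # size of the full cluster (assumed known)

    @property
    def quorate(self) -> bool:
        # Simple majority only; real quorum handling has more options.
        return len(self.nodes) * 2 > self.total_nodes


def partition_actions(p, no_quorum_policy="stop", fencing_succeeded=False):
    """Toy model: list what a partition does after a split."""
    if not p.quorate:
        # no-quorum-policy only matters on the non-quorate side.
        return [f"apply no-quorum-policy={no_quorum_policy}"]
    # The quorate side fences regardless of no-quorum-policy ...
    actions = ["fence unseen nodes"]
    # ... and recovers resources only once fencing has succeeded.
    if fencing_succeeded:
        actions.append("recover resources")
    else:
        actions.append("wait for fencing before recovering resources")
    return actions
```

In this model, whether a "stop" fails on the non-quorate side never
enters the quorate side's decision; only the fencing result does.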

> 
> Regards,
> Ulrich
> 
> > > > Lars Ellenberg <lars.ellenberg at linbit.com> wrote on
> > > > 01.06.2021 at 12:52 in message
> <CANr6vz-rbS3BnuJsxhQzRnMpJe1u+NPhqp+ejNJWnHDScZwSRg at mail.gmail.com>:
> > pcmk 2.0.5, corosync 3.1.0, knet, rhel8
> > I know fencing "solves" this just fine.
> > 
> > what I'd like to understand though is: what exactly is corosync or
> > pacemaker waiting for here, why does it not manage to get to the
> > stage where it would even attempt to "stop" stuff?
> > 
> > two "rings" aka knet interfaces.
> > node isolation test with iptables,
> > INPUT/OUTPUT -j DROP on one interface, shortly after on the second
> > as well.
> > node loses quorum (obviously).
> > 
> > pacemaker is expected to no-quorum-policy=stop,
> > but is "stuck" in Election -> Integration,
> > while corosync "cycles" between "new membership" (with only itself,
> > obviously)
> > and "token has not been received in ...", "sync members ...", "new
> > membership has formed ..."
> > 
> > I would have expected corosync to come back with a "stable
> > non-quorate membership" of just itself
> > within a very short period of time, and pacemaker winning the
> > "election"/"integration" with just itself,
> > and then trying to call "stop" on everything it knows about.

That's what I'd expect, too. I'm guessing the corosync cycling is
what's causing the pacemaker cycling, so I'd focus on corosync first.
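
As a rough way to see how often corosync is cycling, something like the
following sketch could tally the messages quoted above in a log. The
substrings are taken from that quote; the exact wording in a given
corosync version may differ, so treat the patterns (and the helper name
count_membership_events) as assumptions to adjust.

```python
import re
from collections import Counter

# Substrings taken from the messages quoted in this thread; the exact
# wording logged by a particular corosync version is an assumption.
PATTERNS = {
    "token_timeout": re.compile(r"token has not been received", re.I),
    "sync_members": re.compile(r"sync members", re.I),
    "new_membership": re.compile(r"new membership", re.I),
}


def count_membership_events(lines):
    """Count how many lines match each pattern (a line may match several)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

It accepts any iterable of lines, so an open log file can be passed
directly. A new_membership count that keeps growing while the node is
isolated would match the cycling described above.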

> > I'm asking for hints what to look for in the logs, or how to drill
> > down further as to why that is not the case.
> > 
> >     Lars
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users 
> > 
> > ClusterLabs home: https://www.clusterlabs.org/ 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


