[ClusterLabs] Antw: [EXT] Re: Q: constrain or delay "probes"?

Reid Wahl nwahl at redhat.com
Mon Mar 8 02:42:03 EST 2021


Did the "active on too many nodes" message happen right after a probe? If
so, then it does sound like the probe returned code 0.

If a probe returned 0 and it **shouldn't** have done so, then either the
monitor operation needs to be redesigned, or resource-discovery=never (or
resource-discovery=exclusive) can be used to prevent the probe from
happening where it should not.
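
For example, a minimal sketch with pcs (untested; the resource name "rsc1",
the node name "node2", and the constraint id are placeholders):

    # Never probe (or run) rsc1 on node2
    pcs constraint location add rsc1-no-discovery-node2 rsc1 node2 \
        -INFINITY resource-discovery=never

With resource-discovery=exclusive instead, discovery is allowed only on the
nodes matching such constraints.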

If a probe returned 0 and it **should** have done so, but the stop
operation on the other node wasn't reflected in the CIB (so that the
resource still appeared to be active there), then that's odd.

A bug is certainly possible, though we can't say without more detail :)

On Sun, Mar 7, 2021 at 11:10 PM Ulrich Windl <
Ulrich.Windl at rz.uni-regensburg.de> wrote:

> >>> Reid Wahl <nwahl at redhat.com> wrote on 05.03.2021 at 21:22 in message
> <CAPiuu991O08DnaVkm9bc8N9BK-+NH9e0_F25o6DdiS5WZWGSsQ at mail.gmail.com>:
> > On Fri, Mar 5, 2021 at 10:13 AM Ken Gaillot <kgaillot at redhat.com> wrote:
> >
> >> On Fri, 2021-03-05 at 11:39 +0100, Ulrich Windl wrote:
> >> > Hi!
> >> >
> >> > I'm unsure what actually causes a problem I'm seeing (a resource was
> >> > "detected running" when it actually was not), but I'm sure a probe
> >> > started at cluster node startup cannot provide a useful result until
> >> > some other resource has been started. AFAIK there is no way to make a
> >> > probe obey ordering or colocation constraints, so the only workaround
> >> > seems to be a delay. However, I'm unsure whether probes can actually
> >> > be delayed.
> >> >
> >> > Ideas?
> >>
> >> Ordered probes are a thorny problem that we've never been able to come
> >> up with a general solution for. We do order certain probes where we
> >> have enough information to know it's safe. The problem is that it is
> >> very easy to introduce ordering loops.
> >>
> >> I don't remember if there are any workarounds.
> >>
> >
> > Maybe as a workaround:
> >   - Add an ocf:pacemaker:attribute resource ordered after, and colocated
> > with, rsc1
> >   - Then configure a location constraint for rsc2 with
> > resource-discovery=never and a score=-INFINITY rule whose expression (in
> > pseudocode) is "attribute is not set to the active value"
> >
> > I haven't tested but that might cause rsc2's probe to wait until rsc1 is
> > active.
> >
> > And of course, use the usual constraints/rules to ensure rsc2's probe
> > only runs on rsc1's node.
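> >
> > Roughly, with pcs (untested; "rsc1", "rsc2", and the attribute name
> > "opa-rsc1-active" are placeholders):
> >
> >     # Flag resource that sets a node attribute while it is active,
> >     # started after and kept with rsc1
> >     pcs resource create rsc1-active ocf:pacemaker:attribute \
> >         name=opa-rsc1-active active_value=1 inactive_value=0
> >     pcs constraint order start rsc1 then start rsc1-active
> >     pcs constraint colocation add rsc1-active with rsc1 INFINITY
> >     # Ban rsc2 (and skip its probe) wherever the attribute is not active
> >     pcs constraint location rsc2 rule resource-discovery=never \
> >         score=-INFINITY not_defined opa-rsc1-active or opa-rsc1-active ne 1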
> >
> >
> >> > Apart from that, I wonder whether some probe/monitor return code like
> >> > OCF_NOT_READY would make sense if the operation detects that it
> >> > cannot return a current status (so both "running" and "stopped" would
> >> > be as inadequate as "starting" and "stopping" would be, notwithstanding
> >> > the fact that the latter two do not exist).
> >>
> >
> > This seems logically reasonable, independent of any implementation
> > complexity and considerations of what we would do with that return code.
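> >
> > Purely as an illustration of the idea (OCF_NOT_READY does not exist
> > today, and the helper functions below are imaginary), a monitor action
> > might look like:
> >
> >     my_app_monitor() {
> >         # Hypothetical: no trustworthy answer is possible yet
> >         if ! prerequisites_ready; then
> >             return "$OCF_NOT_READY"   # proposed code, not real today
> >         fi
> >         # Normal probe logic once a real answer is possible
> >         if app_is_running; then
> >             return "$OCF_SUCCESS"
> >         else
> >             return "$OCF_NOT_RUNNING"
> >         fi
> >     }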
>
> Thanks for the proposal!
> The actual problem I was facing was that the cluster claimed some resource
> was running on two nodes at the same time, when in fact one node had
> been stopped properly (with all of its resources). The bad state in the CIB
> was most likely due to a software bug in pacemaker, but the probes run when
> the node restarted did not seem to prevent pacemaker from taking a really
> wrong "recovery action".
> My hope was that probes might update the CIB before some stupid action is
> done. Maybe it's just another software bug...
>
> Regards,
> Ulrich
>
> >
> >
> >> > Regards,
> >> > Ulrich
> >> --
> >> Ken Gaillot <kgaillot at redhat.com>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA