[ClusterLabs] Antw: [EXT] Re: Q: constrain or delay "probes"?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Mar 8 02:08:57 EST 2021


>>> Reid Wahl <nwahl at redhat.com> wrote on 05.03.2021 at 21:22 in message
<CAPiuu991O08DnaVkm9bc8N9BK-+NH9e0_F25o6DdiS5WZWGSsQ at mail.gmail.com>:
> On Fri, Mar 5, 2021 at 10:13 AM Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>> On Fri, 2021-03-05 at 11:39 +0100, Ulrich Windl wrote:
>> > Hi!
>> >
>> > I'm unsure what actually causes a problem I'm seeing (a resource was
>> > "detected running" when it actually was not), but I'm sure a probe
>> > started at cluster node startup cannot provide a useful result until
>> > some other resource has been started. AFAIK there is no way to make a
>> > probe obey ordering or colocation constraints, so the only workaround
>> > seems to be a delay. However, I'm unsure whether probes can actually
>> > be delayed.
>> >
>> > Ideas?
>>
>> Ordered probes are a thorny problem that we've never been able to come
>> up with a general solution for. We do order certain probes where we
>> have enough information to know it's safe. The problem is that it is
>> very easy to introduce ordering loops.
>>
>> I don't remember if there are any workarounds.
>>
> 
> Maybe as a workaround:
>   - Add an ocf:pacemaker:attribute resource ordered after and colocated
> with rsc1
>   - Then configure a location rule for rsc2 with resource-discovery=never
> and score=-INFINITY with expression (in pseudocode) "attribute is not set
> to its active value"
> 
> I haven't tested this, but it might cause rsc2's probe to wait until rsc1
> is active.
> 
> And of course, use the usual constraints/rules to ensure rsc2's probe only
> runs on rsc1's node.
> 
> 
>> > Apart from that, I wonder whether some probe/monitor return code like
>> > OCF_NOT_READY would make sense when the operation detects that it
>> > cannot determine a current status (so both "running" and "stopped"
>> > would be as inadequate as "starting" and "stopping" would be, despite
>> > the fact that the latter two do not exist).
>>
> 
> This seems logically reasonable, independent of any implementation
> complexity and considerations of what we would do with that return code.

Thanks for the proposal!
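For the record, here is how I would sketch that workaround as pcs commands
(untested, and "rsc1", "rsc2", the helper resource "rsc1-up" and the node
attribute "rsc1-active" are only placeholders):

    # Set node attribute "rsc1-active" to 1 once rsc1 is up on a node
    pcs resource create rsc1-up ocf:pacemaker:attribute \
        name=rsc1-active active_value=1 inactive_value=0
    pcs constraint order start rsc1 then start rsc1-up
    pcs constraint colocation add rsc1-up with rsc1
    # Ban rsc2, and skip its probe, on nodes where the attribute is not 1
    pcs constraint location rsc2 rule resource-discovery=never \
        score=-INFINITY rsc1-active ne 1

If that works as intended, the rule stops matching once the attribute is
set to 1, and only then does rsc2 become eligible for probing on that node.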
The actual problem I was facing was that the cluster claimed a resource was running on two nodes at the same time, when in fact one node had been stopped properly (with all its resources). The bad state in the CIB was most likely due to a software bug in pacemaker, but the probes run when the node was restarted did not seem to prevent pacemaker from performing a really wrong "recovery action".
My hope was that probes would update the CIB before some stupid action is taken. Maybe it's just another software bug...
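To illustrate what OCF_NOT_READY would be for, from a resource agent's
point of view (a purely hypothetical shell agent; OCF_NOT_READY does not
exist today, and the mount point and daemon name are made up):

    #!/bin/sh
    : ${OCF_ROOT:=/usr/lib/ocf}
    : ${OCF_FUNCTIONS_DIR:=${OCF_ROOT}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    my_monitor() {
        # The service state lives on shared storage; until that is
        # mounted, neither "running" nor "stopped" would be truthful.
        if ! mountpoint -q /srv/shared; then
            # Today an agent has to pick one of the existing codes:
            #   $OCF_NOT_RUNNING (7) - may wrongly report "stopped"
            #   $OCF_ERR_GENERIC (1) - reports "failed", triggers recovery
            # OCF_NOT_READY would instead mean "cannot tell yet".
            return $OCF_ERR_GENERIC
        fi
        pgrep -f my-daemon >/dev/null && return $OCF_SUCCESS
        return $OCF_NOT_RUNNING
    }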

Regards,
Ulrich

> 
> 
>> > Regards,
>> > Ulrich
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 
>>
>>
> 
> -- 
> Regards,
> 
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
