<div dir="ltr"><div>Did the "active on too many nodes" message happen right after a probe? If so, then it does sound like the probe returned code 0 (OCF_SUCCESS, meaning the resource was found running).</div><div><br></div><div>If a probe returned 0 and it **shouldn't** have done so, then either the resource agent's monitor operation needs to be redesigned, or resource-discovery=never (or resource-discovery=exclusive) can be used to prevent the probe from happening where it should not.</div><div><br></div><div>If a probe returned 0 and it **should** have done so, but the stop operation on the other node wasn't reflected in the CIB (so that the resource still appeared to be active there), then that's odd.<br></div><div><br></div><div>A bug is certainly possible, though we can't say without more detail :)<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Mar 7, 2021 at 11:10 PM Ulrich Windl <<a href="mailto:Ulrich.Windl@rz.uni-regensburg.de">Ulrich.Windl@rz.uni-regensburg.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">>>> Reid Wahl <<a href="mailto:nwahl@redhat.com" target="_blank">nwahl@redhat.com</a>> wrote on 05.03.2021 at 21:22 in message<br>

<<a href="mailto:CAPiuu991O08DnaVkm9bc8N9BK-%2BNH9e0_F25o6DdiS5WZWGSsQ@mail.gmail.com" target="_blank">CAPiuu991O08DnaVkm9bc8N9BK-+NH9e0_F25o6DdiS5WZWGSsQ@mail.gmail.com</a>>:<br>

> On Fri, Mar 5, 2021 at 10:13 AM Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>> wrote:<br>

> <br>

>> On Fri, 2021-03-05 at 11:39 +0100, Ulrich Windl wrote:<br>

>> > Hi!<br>

>> ><br>

>> > I'm unsure what actually causes a problem I see (a resource was<br>

>> > "detected running" when it actually was not), but I'm sure some probe<br>

>> > started on cluster node start cannot provide a useful result until<br>

>> > some other resource has been started. AFAIK there is no way to make a<br>

>> > probe obey ordering or colocation constraints, so the only workaround<br>

>> > seems to be a delay. However I'm unsure whether probes can actually<br>

>> > be delayed.<br>

>> ><br>

>> > Ideas?<br>

>><br>

>> Ordered probes are a thorny problem that we've never been able to come<br>

>> up with a general solution for. We do order certain probes where we<br>

>> have enough information to know it's safe. The problem is that it is<br>

>> very easy to introduce ordering loops.<br>

>><br>

>> I don't remember if there are any workarounds.<br>

>><br>

> <br>

> Maybe as a workaround:<br>

>   - Add an ocf:pacemaker:attribute resource ordered after and colocated with rsc1<br>

>   - Then configure a location rule for rsc2 with resource-discovery=never<br>

> and score=-INFINITY with expression (in pseudocode) "attribute is not set<br>

> to active value"<br>

> <br>

> I haven't tested but that might cause rsc2's probe to wait until rsc1 is<br>

> active.<br>

> <br>

> And of course, use the usual constraints/rules to ensure rsc2's probe only<br>

> runs on rsc1's node.<br>
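<br>
<div>An untested sketch of the workaround above, written as a CIB XML fragment. Everything named here is a placeholder rather than something taken from an actual cluster: the resource ID rsc1-active, the node attribute name rsc1_active, and all constraint, rule, and nvpair IDs are assumptions, and rsc1 and rsc2 stand for the two existing resources. The rule bans rsc2 wherever the attribute is undefined or differs from the active value, and resource-discovery=never on the same constraint is what is meant to suppress the probe there.</div>
<pre>
&lt;!-- In the resources section: a flag resource that sets a node attribute
     while it is active (hypothetical ID and attribute name) --&gt;
&lt;primitive id="rsc1-active" class="ocf" provider="pacemaker" type="attribute"&gt;
  &lt;instance_attributes id="rsc1-active-params"&gt;
    &lt;nvpair id="rsc1-active-name" name="name" value="rsc1_active"/&gt;
    &lt;nvpair id="rsc1-active-on" name="active_value" value="1"/&gt;
    &lt;nvpair id="rsc1-active-off" name="inactive_value" value="0"/&gt;
  &lt;/instance_attributes&gt;
&lt;/primitive&gt;

&lt;!-- In the constraints section: start the flag after rsc1 and keep it on rsc1's node --&gt;
&lt;rsc_order id="order-rsc1-then-flag" first="rsc1" then="rsc1-active"/&gt;
&lt;rsc_colocation id="colo-flag-with-rsc1" rsc="rsc1-active" with-rsc="rsc1" score="INFINITY"/&gt;

&lt;!-- Ban rsc2, and skip its probe, on nodes where the flag is not at the active value --&gt;
&lt;rsc_location id="loc-rsc2-needs-rsc1" rsc="rsc2" resource-discovery="never"&gt;
  &lt;rule id="loc-rsc2-needs-rsc1-rule" score="-INFINITY" boolean-op="or"&gt;
    &lt;expression id="loc-rsc2-flag-unset" attribute="rsc1_active" operation="not_defined"/&gt;
    &lt;expression id="loc-rsc2-flag-off" attribute="rsc1_active" operation="ne" value="1"/&gt;
  &lt;/rule&gt;
&lt;/rsc_location&gt;
</pre>
<div>Whether rsc2's probe is then scheduled once the attribute flips to the active value is exactly the part that has not been tested.</div>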

> <br>

> <br>

>> > Apart from that, I wonder whether some probe/monitor return code like<br>

>> > OCF_NOT_READY would make sense if the operation detects that it<br>

>> > cannot return a current status (so both "running" and "stopped" would<br>

>> > be as inadequate as "starting" and "stopping" would be (despite<br>

>> > the fact that the latter two do not exist)).<br>

>><br>

> <br>

> This seems logically reasonable, independent of any implementation<br>

> complexity and considerations of what we would do with that return code.<br>

<br>

Thanks for the proposal!<br>

The actual problem I was facing was that the cluster claimed some resource was running on two nodes at the same time, when actually one node had been stopped properly (with all its resources). The bad state in the CIB was most likely due to a software bug in pacemaker, but the probes run when the node restarted did not seem to prevent pacemaker from doing a really wrong "recovery action".<br>

My hope was that probes might update the CIB before some stupid action is taken. Maybe it's just another software bug...<br>

<br>

Regards,<br>

Ulrich<br>

<br>

> <br>

> <br>

>> > Regards,<br>

>> > Ulrich<br>

>> --<br>

>> Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>><br>

>><br>

>> _______________________________________________<br>

>> Manage your subscription:<br>

>> <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a> <br>

>><br>

>> ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a> <br>

>><br>

>><br>

> <br>

> -- <br>

> Regards,<br>

> <br>

> Reid Wahl, RHCA<br>

> Senior Software Maintenance Engineer, Red Hat<br>

> CEE - Platform Support Delivery - ClusterHA<br>

<br>

<br>

<br>

<br>


</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div>Regards,<br><br></div>Reid Wahl, RHCA<br></div><div>Senior Software Maintenance Engineer, Red Hat<br></div>CEE - Platform Support Delivery - ClusterHA</div></div></div></div></div></div></div></div></div></div></div></div></div></div>