[ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: constrain or delay "probes"?

Andrei Borzenkov arvidjaar at gmail.com
Mon Mar 8 05:46:51 EST 2021


On 08.03.2021 11:57, Ulrich Windl wrote:
>>>> Reid Wahl <nwahl at redhat.com> wrote on 08.03.2021 at 08:42 in message
> <CAPiuu9_V0-3k9k-Z8+z5u5t8bMh3sL3PzzdOLH9g8XCdmfqDow at mail.gmail.com>:
>> Did the "active on too many nodes" message happen right after a probe? If
>> so, then it does sound like the probe returned code 0.
> 
> Events were like this (I greatly condensed the logs):
> (DC h16 being stopped)
> Mar 05 09:53:45 h16 pacemaker-schedulerd[7189]:  notice:  * Migrate    prm_xen_v09              ( h16 -> h18 )
> Mar 05 09:54:23 h16 pacemaker-controld[7190]:  notice: Initiating migrate_to operation prm_xen_v09_migrate_to_0 locally on h16
> Mar 05 09:54:24 h16 libvirtd[8531]: internal error: Failed to send migration data to destination host
> Mar 05 09:54:24 h16 VirtualDomain(prm_xen_v09)[1834]: ERROR: v09: live migration to h18 failed: 1
> Mar 05 09:54:24 h16 pacemaker-controld[7190]:  notice: Transition 1000 action 125 (prm_xen_v09_migrate_to_0 on h16): expected 'ok' but got 'error'
> Mar 05 09:54:47 h16 pacemaker-schedulerd[7189]:  error: Resource prm_xen_v09 is active on 2 nodes (attempting recovery)
> (not really active on two nodes; DC recovers on h18 where v09 probably isn't running, but should stop on h16 first)
> Mar 05 09:54:47 h16 pacemaker-schedulerd[7189]:  notice:  * Recover    prm_xen_v09              (             h18 )
> Mar 05 09:54:47 h16 VirtualDomain(prm_xen_v09)[2068]: INFO: Issuing graceful shutdown request for domain v09.
> Mar 05 09:55:12 h16 pacemaker-execd[7187]:  notice: prm_xen_v09 stop (call 297, PID 2035) exited with status 0 (execution time 25101ms, queue time 0ms)
> Mar 05 09:55:12 h16 pacemaker-controld[7190]:  notice: Result of stop operation for prm_xen_v09 on h16: ok
> Mar 05 09:55:14 h16 pacemaker-controld[7190]:  notice: Transition 1001 aborted by operation prm_xen_v09_start_0 'modify' on h18: Event failed
> Mar 05 09:55:14 h16 pacemaker-controld[7190]:  notice: Transition 1001 action 117 (prm_xen_v09_start_0 on h18): expected 'ok' but got 'error'
> Mar 05 09:55:15 h16 pacemaker-schedulerd[7189]:  warning: Unexpected result (error: v09: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_v09 on h16 at Mar  5 09:54:23 2021
> 
> Mar 05 09:55:15 h18 pacemaker-execd[7129]:  notice: prm_xen_v09 stop (call 262, PID 46737) exited with status 0 (execution time 309ms, queue time 0ms)
> 
> (DC shut down)
> Mar 05 09:55:20 h16 pacemakerd[7183]:  notice: Shutdown complete
> Mar 05 09:55:20 h16 systemd[1]: Stopped Corosync Cluster Engine.
> 
> (node starting after being stopped)
> Mar 05 10:38:50 h16 systemd[1]: Starting Shared-storage based fencing daemon...
> Mar 05 10:38:50 h16 systemd[1]: Starting Corosync Cluster Engine...
> Mar 05 10:38:59 h16 pacemaker-controld[14022]:  notice: Quorum acquired
> Mar 05 10:39:00 h16 pacemaker-controld[14022]:  notice: State transition S_PENDING -> S_NOT_DC
> (this probe probably reported nonsense)
> Mar 05 10:39:02 h16 pacemaker-controld[14022]:  notice: Result of probe operation for prm_xen_v09 on h16: ok

So the resource agent thinks the resource is active.

> (DC noticed)
> Mar 05 10:39:02 h18 pacemaker-controld[7132]:  notice: Transition 5 action 58 (prm_xen_v09_monitor_0 on h16): expected 'not running' but got 'ok'
> (from now on probes should be more reliable)
> Mar 05 10:39:07 h16 systemd[1]: Started Virtualization daemon.
> (there is nothing to stop)
> Mar 05 10:39:09 h16 pacemaker-execd[14019]:  notice: executing - rsc:prm_xen_v09 action:stop call_id:166
> (obviously)
> Mar 05 10:40:11 h16 libvirtd[15490]: internal error: Failed to shutdown domain '20' with libxenlight
> (more nonsense)
> Mar 05 10:44:04 h16 VirtualDomain(prm_xen_v09)[17306]: INFO: Issuing forced shutdown (destroy) request for domain v09.
> (eventually)
> Mar 05 10:44:07 h16 pacemaker-controld[14022]:  notice: Result of stop operation for prm_xen_v09 on h16: ok
> Mar 05 10:44:07 h16 pacemaker-execd[14019]:  notice: executing - rsc:prm_xen_v09 action:start call_id:168
> 
>>
>> If a probe returned 0 and it **shouldn't** have done so, then either the
>> monitor operation needs to be redesigned, or resource-discovery=never (or
>> resource-discovery=exclusive) can be used to prevent the probe from
>> happening where it should not.
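
For reference, such a per-node constraint could look roughly like this in
crm shell syntax (an untested sketch; I believe current crmsh accepts the
resource-discovery option in location constraints, otherwise the same
attribute can be set directly on the rsc_location element in the CIB):

    # score 0: no placement preference for h16, but never probe
    # prm_xen_v09 there
    location no-probe-v09-on-h16 prm_xen_v09 \
        resource-discovery=never 0: h16

With resource-discovery=exclusive, probes instead run only on the nodes
named in such constraints.
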
> 
> Well, the situation here is using virtlockd with indirect locking in a cluster where the cluster itself provides the shared filesystem used for locking.
> 
> Then the obvious ordering is:
> 1) Provide shared filesystem (mount it)
> 2) start virtlockd (to put the lock files in a shared place)
> 3) run libvirtd (using virtlockd)
> 4) Manage VMs using libvirt
> 
> Unfortunately probes (expecting to use libvirt) are being run even before 1), and I don't know why they return success then.

That is what you need to investigate.

A probe needs to answer "is the resource active *now*?". If probing a
resource is impossible until some other resources are active, something is
really wrong with the design. Either the resource is active or it is not.
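
The ordering 1) through 4) itself can be expressed with ordinary
constraints; it is only the probes that ignore them. A rough crm shell
sketch (untested; the clone names are purely hypothetical placeholders,
only prm_xen_v09 is taken from your logs):

    order fs-then-virtlockd       Mandatory: cln_fs_lock   cln_virtlockd
    order virtlockd-then-libvirtd Mandatory: cln_virtlockd cln_libvirtd
    order libvirtd-then-v09       Mandatory: cln_libvirtd  prm_xen_v09
    colocation v09-with-libvirtd  inf: prm_xen_v09 cln_libvirtd

None of that delays the initial probe of prm_xen_v09, though: it runs as
soon as the node rejoins, so it has to report a sane result on its own.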

> (Other VMs were probed as "not running")
> 
>>
>> If a probe returned 0 and it **should** have done so, but the stop
>> operation on the other node wasn't reflected in the CIB (so that the
>> resource still appeared to be active there), then that's odd.
> 
> Well, when reviewing the logs, the cluster may actually have had v09 still running on h16 even though the node was stopped.
> So the problem was in stopping, not in starting, but I still doubt that the probe at that time was reliable.
> 
>>
>> A bug is certainly possible, though we can't say without more detail :)
> 
> I see what you mean.
> 
> Regards,
> Ulrich
> 
>>
>> On Sun, Mar 7, 2021 at 11:10 PM Ulrich Windl <
>> Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>
>>>>>> Reid Wahl <nwahl at redhat.com> wrote on 05.03.2021 at 21:22 in
>>> message
>>> <CAPiuu991O08DnaVkm9bc8N9BK-+NH9e0_F25o6DdiS5WZWGSsQ at mail.gmail.com>:
>>>> On Fri, Mar 5, 2021 at 10:13 AM Ken Gaillot <kgaillot at redhat.com> wrote:
>>>>
>>>>> On Fri, 2021-03-05 at 11:39 +0100, Ulrich Windl wrote:
>>>>>> Hi!
>>>>>>
>>>>>> I'm unsure what actually causes a problem I see (a resource was
>>>>>> "detected running" when it actually was not), but I'm sure some probe
>>>>>> started on cluster node start cannot provide a useful result until
>>>>>> some other resource has been started. AFAIK there is no way to make a
>>>>>> probe obey ordering or colocation constraints, so the only work-around
>>>>>> seems to be a delay. However I'm unsure whether probes can actually
>>>>>> be delayed.
>>>>>>
>>>>>> Ideas?
>>>>>
>>>>> Ordered probes are a thorny problem that we've never been able to come
>>>>> up with a general solution for. We do order certain probes where we
>>>>> have enough information to know it's safe. The problem is that it is
>>>>> very easy to introduce ordering loops.
>>>>>
>>>>> I don't remember if there are any workarounds.
>>>>>
>>>>
>>>> Maybe as a workaround:
>>>>   - Add an ocf:pacemaker:attribute resource after-and-with rsc1
>>>>   - Then configure a location rule for rsc2 with resource-discovery=never
>>>> and score=-INFINITY with expression (in pseudocode) "attribute is not set
>>>> to active value"
>>>>
>>>> I haven't tested but that might cause rsc2's probe to wait until rsc1 is
>>>> active.
>>>>
>>>> And of course, use the usual constraints/rules to ensure rsc2's probe
>>> only
>>>> runs on rsc1's node.
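
A rough crm shell sketch of that workaround (untested; all resource and
attribute names are placeholders, assuming "rsc1" here is the libvirtd
clone and "rsc2" the VM):

    # set node attribute "virt-stack-ready" to 1 while libvirtd is
    # active on a node, and back to 0 when it stops
    primitive virt-stack-ready ocf:pacemaker:attribute \
        params name=virt-stack-ready active_value=1 inactive_value=0
    clone cln_virt_stack_ready virt-stack-ready
    order libvirtd-then-flag Mandatory: cln_libvirtd cln_virt_stack_ready
    colocation flag-with-libvirtd inf: cln_virt_stack_ready cln_libvirtd

    # ban the VM, and with resource-discovery=never also its probe,
    # from any node where the attribute is not (yet) 1
    location v09-wait-for-virt-stack prm_xen_v09 \
        resource-discovery=never \
        rule -inf: not_defined virt-stack-ready or virt-stack-ready ne 1
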
>>>>
>>>>
>>>>>> Apart from that, I wonder whether some probe/monitor return code like
>>>>>> OCF_NOT_READY would make sense if the operation detects that it
>>>>>> cannot return a current status (so both "running" and "stopped" would
>>>>>> be as inadequate as "starting" and "stopping" would be,
>>>>>> notwithstanding the fact that the latter two do not exist).
>>>>>
>>>>
>>>> This seems logically reasonable, independent of any implementation
>>>> complexity and considerations of what we would do with that return code.
>>>
>>> Thanks for the proposal!
>>> The actual problem I was facing was that the cluster claimed some resource
>>> would be running on two nodes at the same time, when actually one node had
>>> been stopped properly (with all the resources). The bad state in the CIB
>>> was most likely due to a software bug in pacemaker, but probes on
>>> re-starting the node seemed not to prevent pacemaker from doing a really
>>> wrong "recovery action".
>>> My hope was that probes might update the CIB before some stupid action is
>>> being done. Maybe it's just another software bug...
>>>
>>> Regards,
>>> Ulrich
>>>
>>>>
>>>>
>>>>>> Regards,
>>>>>> Ulrich
>>>>> --
>>>>> Ken Gaillot <kgaillot at redhat.com>
>>>>>
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Reid Wahl, RHCA
>>>> Senior Software Maintenance Engineer, Red Hat
>>>> CEE - Platform Support Delivery - ClusterHA
>>>
>>>
>>>
>>>
>>
>> -- 
>> Regards,
>>
>> Reid Wahl, RHCA
>> Senior Software Maintenance Engineer, Red Hat
>> CEE - Platform Support Delivery - ClusterHA
> 
> 
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 



More information about the Users mailing list