[ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: constrain or delay "probes"?

Ken Gaillot kgaillot at redhat.com
Mon Mar 8 11:47:14 EST 2021


On Mon, 2021-03-08 at 09:57 +0100, Ulrich Windl wrote:
> > > > Reid Wahl <nwahl at redhat.com> wrote on 08.03.2021 at 08:42 in
> > > > message
> 
> <CAPiuu9_V0-3k9k-Z8+z5u5t8bMh3sL3PzzdOLH9g8XCdmfqDow at mail.gmail.com>:
> > Did the "active on too many nodes" message happen right after a
> > probe? If
> > so, then it does sound like the probe returned code 0.
> 
> Events were like this (I greatly condensed the logs):
> (DC h16 being stopped)
> Mar 05 09:53:45 h16 pacemaker-schedulerd[7189]:  notice:  *
> Migrate    prm_xen_v09              ( h16 -> h18 )
> Mar 05 09:54:23 h16 pacemaker-controld[7190]:  notice: Initiating
> migrate_to operation prm_xen_v09_migrate_to_0 locally on h16
> Mar 05 09:54:24 h16 libvirtd[8531]: internal error: Failed to send
> migration data to destination host
> Mar 05 09:54:24 h16 VirtualDomain(prm_xen_v09)[1834]: ERROR: v09:
> live migration to h18 failed: 1
> Mar 05 09:54:24 h16 pacemaker-controld[7190]:  notice: Transition
> 1000 action 125 (prm_xen_v09_migrate_to_0 on h16): expected 'ok' but
> got 'error'
> Mar 05 09:54:47 h16 pacemaker-schedulerd[7189]:  error: Resource
> prm_xen_v09 is active on 2 nodes (attempting recovery)

When a live migration fails, the state of the resource on both the
source and target nodes is now unknown. Since it's unknown, Pacemaker
has to assume it could be running on both. At some point, I'll probably
edit this message to say "may be active" instead of "is active".

To put it another way, the recovery from a failed migration is the same
as the recovery from multiple-active.

> (not really active on two nodes; DC recovers on h18 where v09
> probably isn't running, but should stop on h16 first)

recovery = stop + start
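
For reference, this recovery policy is controlled by the
multiple-active resource meta-attribute, and "stop everywhere, then
start in one place" (stop_start) is its default. A rough sketch in crm
shell syntax (untested; check "crm resource help meta" for the exact
syntax of your crmsh version):

    # show the current definition of the resource from the logs above
    crm configure show prm_xen_v09
    # e.g. leave the resource unmanaged instead of doing the
    # stop+start recovery:
    crm resource meta prm_xen_v09 set multiple-active block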

> Mar 05 09:54:47 h16 pacemaker-schedulerd[7189]:  notice:  *
> Recover    prm_xen_v09              (             h18 )
> Mar 05 09:54:47 h16 VirtualDomain(prm_xen_v09)[2068]: INFO: Issuing
> graceful shutdown request for domain v09.
> Mar 05 09:55:12 h16 pacemaker-execd[7187]:  notice: prm_xen_v09 stop
> (call 297, PID 2035) exited with status 0 (execution time 25101ms,
> queue time 0ms)
> Mar 05 09:55:12 h16 pacemaker-controld[7190]:  notice: Result of stop
> operation for prm_xen_v09 on h16: ok

the stop is done on h16 first

> Mar 05 09:55:14 h16 pacemaker-controld[7190]:  notice: Transition
> 1001 aborted by operation prm_xen_v09_start_0 'modify' on h18: Event
> failed
> Mar 05 09:55:14 h16 pacemaker-controld[7190]:  notice: Transition
> 1001 action 117 (prm_xen_v09_start_0 on h18): expected 'ok' but got
> 'error'

but then the start on h18 returned an error

> Mar 05 09:55:15 h16 pacemaker-schedulerd[7189]:  warning: Unexpected
> result (error: v09: live migration to h18 failed: 1) was recorded for
> migrate_to of prm_xen_v09 on h16 at Mar  5 09:54:23 2021

note this is historical, not new

> Mar 05 09:55:15 h18 pacemaker-execd[7129]:  notice: prm_xen_v09 stop
> (call 262, PID 46737) exited with status 0 (execution time 309ms,
> queue time 0ms)

we stop on h18 to get back to a known state after the failed start
there

> (DC shut down)
> Mar 05 09:55:20 h16 pacemakerd[7183]:  notice: Shutdown complete
> Mar 05 09:55:20 h16 systemd[1]: Stopped Corosync Cluster Engine.
> 
> (node starting after being stopped)
> Mar 05 10:38:50 h16 systemd[1]: Starting Shared-storage based fencing
> daemon...
> Mar 05 10:38:50 h16 systemd[1]: Starting Corosync Cluster Engine...
> Mar 05 10:38:59 h16 pacemaker-controld[14022]:  notice: Quorum
> acquired
> Mar 05 10:39:00 h16 pacemaker-controld[14022]:  notice: State
> transition S_PENDING -> S_NOT_DC
> (this probe probably reported nonsense)
> Mar 05 10:39:02 h16 pacemaker-controld[14022]:  notice: Result of
> probe operation for prm_xen_v09 on h16: ok
> (DC noticed)
> Mar 05 10:39:02 h18 pacemaker-controld[7132]:  notice: Transition 5
> action 58 (prm_xen_v09_monitor_0 on h16): expected 'not running' but
> got 'ok'
> (from now on probes should be more reliable)
> Mar 05 10:39:07 h16 systemd[1]: Started Virtualization daemon.
> (there is nothing to stop)
> Mar 05 10:39:09 h16 pacemaker-execd[14019]:  notice: executing -
> rsc:prm_xen_v09 action:stop call_id:166
> (obviously)
> Mar 05 10:40:11 h16 libvirtd[15490]: internal error: Failed to
> shutdown domain '20' with libxenlight
> (more nonsense)
> Mar 05 10:44:04 h16 VirtualDomain(prm_xen_v09)[17306]: INFO: Issuing
> forced shutdown (destroy) request for domain v09.
> (eventually)
> Mar 05 10:44:07 h16 pacemaker-controld[14022]:  notice: Result of
> stop operation for prm_xen_v09 on h16: ok
> Mar 05 10:44:07 h16 pacemaker-execd[14019]:  notice: executing -
> rsc:prm_xen_v09 action:start call_id:168
> 
> > 
> > If a probe returned 0 and it **shouldn't** have done so, then
> > either the
> > monitor operation needs to be redesigned, or resource-
> > discovery=never (or
> > resource-discovery=exclusive) can be used to prevent the probe from
> > happening where it should not.
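
For completeness, Reid's resource-discovery suggestion would look
something like this in crm shell syntax (untested; the resource and
node names below are placeholders, not from your configuration):

    # never probe (and never run) rsc2 on nodeA
    crm configure location no-probe-rsc2-nodeA rsc2 \
        resource-discovery=never -inf: nodeA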
> 
> Well, the situation here is using virtlockd with indirect locking in
> a cluster, where the cluster itself provides the shared filesystem
> used for locking.
> 
> Then the obvious ordering is:
> 1) Provide shared filesystem (mount it)
> 2) start virtlockd (to put the lock files in a shared place)
> 3) run libvirtd (using virtlockd)
> 4) Manage VMs using libvirt
> 
> Unfortunately probes (expecting to use libvirt) are being run even
> before 1), and I don't know why they return success then.
> (Other VMs were probed as "not running")

Unfortunately yes, the probes will not be ordered. But as long as the
probe results are correct, the cluster should still converge on the
right state. If the probe returned success when it shouldn't have, that
sounds like an RA issue.
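
To illustrate what I mean by "correct probe results": the agent's
monitor should check its own prerequisites before it claims anything
about the domain. A rough, untested sketch (the function and variable
names below are made up, and this is not how the real VirtualDomain
agent is written; ocf_is_probe and the OCF_* return codes come from
the resource-agents ocf-shellfuncs library):

    monitor_vm() {
        # If we cannot even talk to libvirtd, don't guess.
        if ! virsh list >/dev/null 2>&1; then
            # For a probe, "not running" is the sane answer; for a
            # regular recurring monitor it is an error.
            ocf_is_probe && return $OCF_NOT_RUNNING
            return $OCF_ERR_GENERIC
        fi
        if virsh domstate "$DOMAIN_NAME" 2>/dev/null | grep -q running
        then
            return $OCF_SUCCESS
        fi
        return $OCF_NOT_RUNNING
    }

In your logs the probe ran before libvirtd was up and still reported
"ok"; a guard like the above would at least have turned that into
"not running".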


> > If a probe returned 0 and it **should** have done so, but the stop
> > operation on the other node wasn't reflected in the CIB (so that
> > the
> > resource still appeared to be active there), then that's odd.
> 
> Well, when reviewing the logs, the cluster may actually have had v09
> running on h16 even though the node had been stopped.
>
> So the problem was with stopping, not starting, but I still doubt that
> the probe at that time was reliable.
> 
> > 
> > A bug is certainly possible, though we can't say without more
> > detail :)
> 
> I see what you mean.
> 
> Regards,
> Ulrich
> 
> > 
> > On Sun, Mar 7, 2021 at 11:10 PM Ulrich Windl <
> > Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > 
> > > > > > Reid Wahl <nwahl at redhat.com> wrote on 05.03.2021 at 21:22 in
> > > > > > message
> > > <CAPiuu991O08DnaVkm9bc8N9BK-+NH9e0_F25o6DdiS5WZWGSsQ at mail.gmail.com>:
> > > > On Fri, Mar 5, 2021 at 10:13 AM Ken Gaillot <
> > > > kgaillot at redhat.com> wrote:
> > > > 
> > > > > On Fri, 2021-03-05 at 11:39 +0100, Ulrich Windl wrote:
> > > > > > Hi!
> > > > > > 
> > > > > > I'm unsure what actually causes a problem I see (a resource
> > > > > > was
> > > > > > "detected running" when it actually was not), but I'm sure
> > > > > > some probe
> > > > > > started on cluster node start cannot provide a useful
> > > > > > result until
> > > > > > some other resource has been started. AFAIK there is no way
> > > > > > to make a
> > > > > > probe obey ordering or colocation constraints, so the only
> > > > > > work-around
> > > > > > seems to be a delay. However I'm unsure whether probes can
> > > > > > actually
> > > > > > be delayed.
> > > > > > 
> > > > > > Ideas?
> > > > > 
> > > > > Ordered probes are a thorny problem that we've never been
> > > > > able to come
> > > > > up with a general solution for. We do order certain probes
> > > > > where we
> > > > > have enough information to know it's safe. The problem is
> > > > > that it is
> > > > > very easy to introduce ordering loops.
> > > > > 
> > > > > I don't remember if there are any workarounds.
> > > > > 
> > > > 
> > > > Maybe as a workaround:
> > > >   - Add an ocf:pacemaker:attribute resource after-and-with rsc1
> > > >   - Then configure a location rule for rsc2 with resource-
> > > > discovery=never
> > > > and score=-INFINITY with expression (in pseudocode) "attribute
> > > > is not set
> > > > to active value"
> > > > 
> > > > I haven't tested it, but that might cause rsc2's probe to wait
> > > > until rsc1 is
> > > > active.
> > > > 
> > > > And of course, use the usual constraints/rules to ensure rsc2's
> > > > probe only runs on rsc1's node.
> > > > 
> > > > 
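Spelling out Reid's attribute idea as a sketch, in crm shell syntax
(untested; every name below is a placeholder, and the resource-discovery
and rule syntax is from memory, so double-check it against your crmsh):

    # rsc1 = the prerequisite, rsc2 = the resource whose probe must wait

    # 1) a node attribute that is only set while rsc1 is active there
    crm configure primitive probe-gate ocf:pacemaker:attribute \
        params name=probe-gate active_value=1 inactive_value=0
    crm configure colocation probe-gate-with-rsc1 inf: probe-gate rsc1
    crm configure order rsc1-then-probe-gate Mandatory: rsc1 probe-gate

    # 2) never probe (or place) rsc2 where that attribute is not 1 yet
    crm configure location rsc2-needs-gate rsc2 resource-discovery=never \
        rule -inf: not_defined probe-gate or probe-gate ne 1
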
> > > > > > Apart from that, I wonder whether some probe/monitor return
> > > > > > code like OCF_NOT_READY would make sense if the operation
> > > > > > detects that it cannot determine a current status (so both
> > > > > > "running" and "stopped" would be as inadequate as "starting"
> > > > > > and "stopping" would be, despite the fact that the latter
> > > > > > two do not exist).
> > > > 
> > > > This seems logically reasonable, independent of any
> > > > implementation
> > > > complexity and considerations of what we would do with that
> > > > return code.
> > > 
> > > Thanks for the proposal!
> > > The actual problem I was facing was that the cluster claimed some
> > > resource
> > > was running on two nodes at the same time, when actually one
> > > node had
> > > been stopped properly (with all the resources). The bad state in
> > > the CIB
> > > was most likely due to a software bug in pacemaker, but the probes
> > > run when re-starting the node did not seem to prevent pacemaker
> > > from doing a really wrong "recovery action".
> > > My hope was that probes might update the CIB before some stupid
> > > action is
> > > being done. Maybe it's just another software bug...
> > > 
> > > Regards,
> > > Ulrich
> > > 
> > > > 
> > > > 
> > > > > > Regards,
> > > > > > Ulrich
> > > > > 
> > > > > --
> > > > > Ken Gaillot <kgaillot at redhat.com>
> > > > > 
> > > > 
> > > > --
> > > > Regards,
> > > > 
> > > > Reid Wahl, RHCA
> > > > Senior Software Maintenance Engineer, Red Hat
> > > > CEE - Platform Support Delivery - ClusterHA
> > > 
> > > 
> > > 
> > > 
> > 
> > -- 
> > Regards,
> > 
> > Reid Wahl, RHCA
> > Senior Software Maintenance Engineer, Red Hat
> > CEE - Platform Support Delivery - ClusterHA
> 
> 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


