[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Q: constrain or delay "probes"?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Mar 9 02:48:08 EST 2021


>>> Ken Gaillot <kgaillot at redhat.com> wrote on 08.03.2021 at 17:47 in
message
<76793e7b39e2194d328821c7ac9a5d3b82778d5e.camel at redhat.com>:
> On Mon, 2021‑03‑08 at 09:57 +0100, Ulrich Windl wrote:
>> > > > Reid Wahl <nwahl at redhat.com> wrote on 08.03.2021 at 08:42 in
>> > > > message
>> 
>> <CAPiuu9_V0‑3k9k‑Z8+z5u5t8bMh3sL3PzzdOLH9g8XCdmfqDow at mail.gmail.com>:
>> > Did the "active on too many nodes" message happen right after a
>> > probe? If
>> > so, then it does sound like the probe returned code 0.
>> 
>> Events were like this (I greatly condensed the logs):
>> (DC h16 being stopped)
>> Mar 05 09:53:45 h16 pacemaker‑schedulerd[7189]:  notice:  *
>> Migrate    prm_xen_v09              ( h16 ‑> h18 )
>> Mar 05 09:54:23 h16 pacemaker‑controld[7190]:  notice: Initiating
>> migrate_to operation prm_xen_v09_migrate_to_0 locally on h16
>> Mar 05 09:54:24 h16 libvirtd[8531]: internal error: Failed to send
>> migration data to destination host
>> Mar 05 09:54:24 h16 VirtualDomain(prm_xen_v09)[1834]: ERROR: v09:
>> live migration to h18 failed: 1
>> Mar 05 09:54:24 h16 pacemaker‑controld[7190]:  notice: Transition
>> 1000 action 125 (prm_xen_v09_migrate_to_0 on h16): expected 'ok' but
>> got 'error'
>> Mar 05 09:54:47 h16 pacemaker‑schedulerd[7189]:  error: Resource
>> prm_xen_v09 is active on 2 nodes (attempting recovery)
> 
> When a live migration fails, the state of the resource on both the
> source and target nodes is now unknown. Since it's unknown, Pacemaker
> has to assume it could be running on both. At some point, I'll probably
> edit this message to say "may be active" instead of "is active".

Hi Ken!

My initial guess was that when a live migration fails, the VM is usually still
operable on the source system.
Of course, to be sure, the cluster probably should probe both source and
destination after a failed migration, and only THEN decide what to do.
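
Something like this should at least force fresh probes of the resource on both
nodes, so the scheduler works from the current state instead of assumptions
(untested sketch; --refresh is the Pacemaker 2.x option, older releases used
--cleanup for the same effect):

  # wipe the recorded state of prm_xen_v09 so it is re-probed on both
  # nodes involved in the failed migration
  crm_resource --refresh --resource prm_xen_v09 --node h16
  crm_resource --refresh --resource prm_xen_v09 --node h18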

> 
> To put it another way, the recovery from a failed migration is the same
> as the recovery from multiple-active.

It seems the cluster sometimes gets it wrong, specifically when the DC was shut
down right after a failed migration.

> 
>> (not really active on two nodes; DC recovers on h18 where v09
>> probably isn't running, but should stop on h16 first)
> 
> recovery = stop + start

That seems OK for a single resource, but after a failed migration the really
clean recovery would be to stop all presumed instances first and then start one.
However, re-probing the resource first might show that one instance is actually
the only one running, so no recovery would be necessary at all.
That would be the better approach IMHO.
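
If one does not trust that automatic stop+start, the multiple-active meta
attribute can at least make the cluster leave the resource alone for the admin
to sort out (sketch in crm shell syntax, if I read the crmsh help correctly;
stop_start is the default):

  # How Pacemaker reacts when it believes a resource is active on more
  # than one node:
  #   multiple-active=stop_start  (default) stop all instances, start one
  #   multiple-active=stop_only   stop all instances, do not restart
  #   multiple-active=block       set the resource unmanaged, leave it to the admin
  crm resource meta prm_xen_v09 set multiple-active block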

> 
>> Mar 05 09:54:47 h16 pacemaker‑schedulerd[7189]:  notice:  *
>> Recover    prm_xen_v09              (             h18 )
>> Mar 05 09:54:47 h16 VirtualDomain(prm_xen_v09)[2068]: INFO: Issuing
>> graceful shutdown request for domain v09.
>> Mar 05 09:55:12 h16 pacemaker‑execd[7187]:  notice: prm_xen_v09 stop
>> (call 297, PID 2035) exited with status 0 (execution time 25101ms,
>> queue time 0ms)
>> Mar 05 09:55:12 h16 pacemaker‑controld[7190]:  notice: Result of stop
>> operation for prm_xen_v09 on h16: ok
> 
> the stop is done on h16 first
> 
>> Mar 05 09:55:14 h16 pacemaker‑controld[7190]:  notice: Transition
>> 1001 aborted by operation prm_xen_v09_start_0 'modify' on h18: Event
>> failed
>> Mar 05 09:55:14 h16 pacemaker‑controld[7190]:  notice: Transition
>> 1001 action 117 (prm_xen_v09_start_0 on h18): expected 'ok' but got
>> 'error'
> 
> but then the start on h18 returned an error

Yes, configuration error on h18 (shared disk was missing).

> 
>> Mar 05 09:55:15 h16 pacemaker‑schedulerd[7189]:  warning: Unexpected
>> result (error: v09: live migration to h18 failed: 1) was recorded for
>> migrate_to of prm_xen_v09 on h16 at Mar  5 09:54:23 2021
> 
> note this is historical, not new
> 
>> Mar 05 09:55:15 h18 pacemaker‑execd[7129]:  notice: prm_xen_v09 stop
>> (call 262, PID 46737) exited with status 0 (execution time 309ms,
>> queue time 0ms)
> 
> we stop on h18 to get back to a known state after the failed start
> there

Here a probe might have shown that there was nothing to stop.
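
For comparison, crm_resource --locate only reports where the CIB believes the
resource is running; a fresh probe (e.g. forced via --refresh as sketched
above) actually asks the RA on each node:

  # shows the recorded location(s) of the resource, not the real state
  crm_resource --locate --resource prm_xen_v09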

BTW: There was also an "interesting" SNMP glitch during the migration:

Mar 05 09:54:18 h16 kernel: device vif15.0 left promiscuous mode
Mar 05 09:54:18 h16 kernel: br0: port 5(vif15.0) entered disabled state
Mar 05 09:54:18 h16 snmpd[6194]: ioctl 35123 returned -1
Mar 05 09:54:18 h16 snmpd[6194]: ioctl 35123 returned -1
Mar 05 09:54:18 h16 snmpd[6194]: IfIndex of an interface changed. Such
interfaces will appear multiple times in IF-MIB.
Mar 05 09:54:18 h16 snmpd[6194]: ioctl 35111 returned -1
Mar 05 09:54:18 h16 snmpd[6194]: ioctl 35091 returned -1
Mar 05 09:54:18 h16 snmpd[6194]: ioctl 35105 returned -1
Mar 05 09:54:18 h16 root[1571]: /etc/xen/scripts/block: remove
XENBUS_PATH=backend/vbd/15/51744
Mar 05 09:54:18 h16 kernel: br1: port 4(vif15.1) entered disabled state
Mar 05 09:54:18 h16 kernel: device vif15.1 left promiscuous mode

> 
>> (DC shut down)
>> Mar 05 09:55:20 h16 pacemakerd[7183]:  notice: Shutdown complete
>> Mar 05 09:55:20 h16 systemd[1]: Stopped Corosync Cluster Engine.
>> 
>> (node starting after being stopped)
>> Mar 05 10:38:50 h16 systemd[1]: Starting Shared‑storage based fencing
>> daemon...
>> Mar 05 10:38:50 h16 systemd[1]: Starting Corosync Cluster Engine...
>> Mar 05 10:38:59 h16 pacemaker‑controld[14022]:  notice: Quorum
>> acquired
>> Mar 05 10:39:00 h16 pacemaker‑controld[14022]:  notice: State
>> transition S_PENDING ‑> S_NOT_DC
>> (this probe probably reported nonsense)
>> Mar 05 10:39:02 h16 pacemaker‑controld[14022]:  notice: Result of
>> probe operation for prm_xen_v09 on h16: ok
>> (DC noticed)
>> Mar 05 10:39:02 h18 pacemaker‑controld[7132]:  notice: Transition 5
>> action 58 (prm_xen_v09_monitor_0 on h16): expected 'not running' but
>> got 'ok'
>> (from now on probes should be more reliable)
>> Mar 05 10:39:07 h16 systemd[1]: Started Virtualization daemon.
>> (there is nothing to stop)
>> Mar 05 10:39:09 h16 pacemaker‑execd[14019]:  notice: executing ‑
>> rsc:prm_xen_v09 action:stop call_id:166
>> (obviously)
>> Mar 05 10:40:11 h16 libvirtd[15490]: internal error: Failed to
>> shutdown domain '20' with libxenlight
>> (more nonsense)
>> Mar 05 10:44:04 h16 VirtualDomain(prm_xen_v09)[17306]: INFO: Issuing
>> forced shutdown (destroy) request for domain v09.
>> (eventually)
>> Mar 05 10:44:07 h16 pacemaker‑controld[14022]:  notice: Result of
>> stop operation for prm_xen_v09 on h16: ok
>> Mar 05 10:44:07 h16 pacemaker‑execd[14019]:  notice: executing ‑
>> rsc:prm_xen_v09 action:start call_id:168
>> 
>> > 
>> > If a probe returned 0 and it **shouldn't** have done so, then
>> > either the
>> > monitor operation needs to be redesigned, or resource-
>> > discovery=never (or
>> > resource-discovery=exclusive) can be used to prevent the probe from
>> > happening where it should not.
>> 
>> Well, the situation here is using virtlockd with indirect locking in
>> a cluster when the cluster provided the shared filesystem used for
>> locking.
>> 
>> Then the obvious ordering is:
>> 1) Provide shared filesystem (mount it)
>> 2) start virtlockd (to put the lock files in a shared place)
>> 3) run libvirtd (using virtlockd)
>> 4) Manage VMs using libvirt
>> 
>> Unfortunately probes (expecting to use libvirt) are being run even
>> before 1), and I don't know why they return success then.
>> (Other VMs were probed as "not running")
> 
> Unfortunately yes, the probes will not be ordered. But with correct
> probe results, it should get to a good point. If the probe returned
> success when it shouldn't, that sounds like an RA issue.
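
Right. For reference, in crm shell the ordering from my list above would look
roughly like the sketch below (resource names invented for the example, h19
standing in for a node where the VM must never run). As you say, it does not
help with probes; at best resource-discovery=never keeps probes away from
nodes the VM is banned from anyway:

  # shared filesystem -> virtlockd -> libvirtd -> VM
  order o_virt_stack Mandatory: prm_fs_shared prm_virtlockd prm_libvirtd prm_xen_v09
  # probes ignore the ordering above; resource-discovery only helps on
  # nodes where the VM is not allowed to run at all
  location l_v09_no_probe prm_xen_v09 resource-discovery=never -inf: h19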
> 
> 
>> > If a probe returned 0 and it **should** have done so, but the stop
>> > operation on the other node wasn't reflected in the CIB (so that
>> > the
>> > resource still appeared to be active there), then that's odd.
>> 
>> Well, when reviewing the logs, the cluster may actually have v09
>> running on h16 even though the node was stopped.
>>
>> So the problem was on stopping, not starting, but still I doubt the
>> probe at that time is quite reliable.
>> 
>> > 
>> > A bug is certainly possible, though we can't say without more
>> > detail :)
>> 
>> I see what you mean.
>> 
>> Regards,
>> Ulrich
[...]



