[ClusterLabs] Xen, SLES15, libvirt live-migration and a fencing loop
Ken Gaillot
kgaillot at redhat.com
Wed Feb 16 15:37:41 EST 2022
On Wed, 2022-02-16 at 10:07 +0100, Ulrich Windl wrote:
> Hi!
>
> I just wanted to report a problem we had after upgrading the last
> node of our three-node Xen cluster to SLES15 SP3:
> When the cluster tried to balance resources at night by live-migrating
> VMs, live migration to the newly upgraded node (h19) failed (for
> unknown reasons, see below).
> Unfortunately this ended in a fencing loop for h19.
>
> The reasons for the fencing loop were (IMHO):
> The failed migration status was not cleared by fencing node h19, so
> the cluster tried to "fix" it ("resource active on two nodes") when
> h19 came back online after fencing.
> In addition, the VirtualDomain RA returned inconsistent results,
> causing another fencing.
>
> How it all began:
> Feb 15 20:07:31 h16 VirtualDomain (prm_xen_v13)[30174]: ERROR: v13:
> live migration to h19 failed: 1
> Feb 15 20:07:31 h16 pacemaker-execd[7259]: notice:
> prm_xen_v13_migrate_to_0[30123] error output [ error: internal error:
> libxenlight state driver is not active ]
> Feb 15 20:07:32 h18 VirtualDomain (prm_xen_v15)[43671]: ERROR: v15:
> live migration to h19 failed: 1
>
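The "libxenlight state driver is not active" error typically means libvirtd on the migration target cannot reach its Xen (libxl) backend, which fits a host that was not booted under the Xen hypervisor. A quick sanity check on the target node might look like this (a sketch using standard Xen/libvirt tooling, not something taken from the logs above):

  # Is the host actually running under the Xen hypervisor?
  xl info

  # Can libvirt open the libxl driver and see the domains?
  virsh --connect xen:///system version
  virsh --connect xen:///system list --all

If the host booted a plain (non-Xen) kernel, "xl info" fails and libvirt reports the libxl state driver as inactive, as in the error above.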
> Feb 15 20:08:03 h16 pacemaker-schedulerd[7263]: error: Resource
> prm_xen_v15 is active on 2 nodes (attempting recovery)
> Feb 15 20:08:03 h16 pacemaker-schedulerd[7263]: error: Resource
> prm_xen_v13 is active on 2 nodes (attempting recovery)
>
> Feb 15 20:08:03 h16 pacemaker-controld[7266]: notice: Initiating
> stop operation prm_xen_v15_stop_0 on h19
> Feb 15 20:08:03 h16 pacemaker-controld[7266]: notice: Initiating
> stop operation prm_xen_v15_stop_0 on h18
> Feb 15 20:08:03 h16 pacemaker-controld[7266]: notice: Initiating
> stop operation prm_xen_v13_stop_0 on h19
> Feb 15 20:08:03 h16 pacemaker-controld[7266]: notice: Initiating
> stop operation prm_xen_v13_stop_0 locally on h16
>
> Feb 15 20:08:03 h16 VirtualDomain (prm_xen_v13)[30732]: INFO: Issuing
> graceful shutdown request for domain v13.
> Feb 15 20:08:24 h16 pacemaker-execd[7259]: notice: prm_xen_v13 stop
> (call 331, PID 30666) exited with status 0 (execution time 20705ms,
> queue time 0ms)
> Feb 15 20:11:54 h16 pacemaker-schedulerd[7263]: warning: Unexpected
> result (error: forced stop failed) was recorded for stop of
> prm_xen_v15 on h19 at Feb 15 20:08:03 2022
>
> Feb 15 20:11:55 h16 pacemaker-schedulerd[7263]: warning: Scheduling
> Node h19 for STONITH
>
> OK, so far, so good.
>
> After fencing:
> Feb 15 20:16:01 h16 pacemaker-controld[7266]: notice: Node h19 state
> is now member
I would expect that around this point, the DC would get the fencing
notification and clear h19's status section. There would be a log like
"Peer h19 was terminated". (If h19 was the DC, the other nodes would
instead have a log with "our leader" in it and would clear the status
section.)
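If it is unclear whether that happened, the DC's logs and the node's part of the status section can be checked directly. A minimal sketch with standard Pacemaker CLI tools (the log path is the usual Pacemaker 2.x default and may differ per distribution):

  # Did the DC record the fencing result?
  grep "was terminated" /var/log/pacemaker/pacemaker.log

  # Are there still entries for h19 in the CIB status section?
  cibadmin --query --scope status | grep -A 3 'node_state.*uname="h19"'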
It's possible the VM rebooted before the fencing notification was
received. That could lead to an interesting corner case that might
cause this issue, since the failures would still appear in the status
section (basically, since the fencing notification hasn't been
received, the cluster has to assume the node is leaving and rejoining
due to its own problems and hasn't actually been fenced yet).
I've heard of people putting a delay in VMs' boot sequence to help
avoid such issues.
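One way to do that, sketched here as an assumption rather than a recipe from this thread, is to delay cluster start on the fenced node rather than its whole boot, e.g. with a systemd drop-in (the 60-second value is an arbitrary placeholder):

  # /etc/systemd/system/corosync.service.d/boot-delay.conf
  [Service]
  # Give the surviving nodes time to process the fencing notification
  # before this node rejoins the cluster.
  ExecStartPre=/bin/sleep 60

followed by "systemctl daemon-reload". Since pacemaker.service starts after corosync, delaying corosync holds back the whole cluster stack on that node.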
>
> Feb 15 20:16:13 h16 pacemaker-execd[7259]: notice: prm_xen_v13 start
> (call 356, PID 35368) exited with status 0 (execution time 7059ms,
> queue time 0ms)
> Feb 15 20:16:13 h16 pacemaker-execd[7259]: notice: prm_xen_v15 start
> (call 355, PID 35367) exited with status 0 (execution time 7505ms,
> queue time 0ms)
>
> Well, everything OK now? No, because the old error status pops up:
> Feb 15 20:16:13 h16 pacemaker-schedulerd[7263]: warning: Unexpected
> result (error: v13: live migration to h19 failed: 1) was recorded for
> migrate_to of prm_xen_v13 on h16 at ...
> Feb 15 20:16:13 h16 pacemaker-schedulerd[7263]: warning: Unexpected
> result (error: v15: live migration to h19 failed: 1) was recorded for
> migrate_to of prm_xen_v15 on h18 at ...
> Feb 15 20:16:13 h16 pacemaker-schedulerd[7263]: error: Resource
> prm_xen_v15 is active on 2 nodes (attempting recovery)
> Feb 15 20:16:13 h16 pacemaker-schedulerd[7263]: error: Resource
> prm_xen_v13 is active on 2 nodes (attempting recovery)
> This was where the nonsense started.
>
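Those stale records are visible before the cluster acts on them; a quick way to spot the situation (a sketch with standard Pacemaker tools, resource names taken from this thread):

  # One-shot status including fail counts and failed actions
  crm_mon --one-shot --failcounts

  # Where does the cluster believe each VM is running?
  crm_resource --locate --resource prm_xen_v13
  crm_resource --locate --resource prm_xen_v15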
> Feb 15 20:16:14 h16 pacemaker-controld[7266]: notice: Initiating
> stop operation prm_xen_v13_stop_0 locally on h16
> Feb 15 20:16:14 h16 pacemaker-controld[7266]: notice: Initiating
> stop operation prm_xen_v13_stop_0 on h19
> Feb 15 20:16:14 h16 pacemaker-controld[7266]: notice: Initiating
> stop operation prm_xen_v15_stop_0 on h19
> Feb 15 20:16:14 h16 pacemaker-controld[7266]: notice: Initiating
> stop operation prm_xen_v15_stop_0 locally on h16
>
> Feb 15 20:16:16 h16 pacemaker-controld[7266]: notice: Transition 70
> action 192 (prm_xen_v13_stop_0 on h19): expected 'ok' but got 'error'
> Feb 15 20:16:16 h16 pacemaker-controld[7266]: notice: Transition 70
> action 163 (prm_xen_v15_stop_0 on h19): expected 'ok' but got 'error'
>
> (That error remained to be examined, but the whole "recovery" plan
> was unnecessary!)
>
> Well, OK, you know: a failed stop causes fencing, so the story
> repeats from "After fencing" above.
>
> On the (most likely) cause of the failed migration: for some reason,
> the GRUB boot menu had lost the default "Xen hypervisor" boot entry
> and thus booted the non-Xen kernel.
> But even when booting correctly again, the cluster would still try to
> "recover" from the false "is active on 2 nodes", so the true fix was
> a manual "cleanup" of the migration entries.
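For completeness, a sketch of what both fixes would typically look like on SLES 15 (the GRUB menu entry title is a placeholder, this assumes GRUB_DEFAULT=saved in /etc/default/grub, and the cleanup commands use the resource names from this thread):

  # Make the Xen hypervisor entry the default again
  grub2-set-default "SLES 15-SP3, with Xen hypervisor"   # title is a placeholder
  grub2-editenv list                                      # verify saved_entry

  # Clear the stale migrate_to failure records so the scheduler stops
  # "recovering" from "active on 2 nodes"
  crm resource cleanup prm_xen_v13
  crm resource cleanup prm_xen_v15
  # low-level equivalent: crm_resource --cleanup --resource prm_xen_v13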
>
> Pacemaker "version 2.0.5+20201202.ba59be712-4.13.1-
> 2.0.5+20201202.ba59be712" (pacemaker-2.0.5+20201202.ba59be712-
> 4.13.1.x86_64)
> libvirt 7.1.0, resource-agents-4.8.0+git30.d0077df0-8.17.1.x86_64
>
> Regards,
> Ulrich
>
>
>
--
Ken Gaillot <kgaillot at redhat.com>