[ClusterLabs] Xen, SLES15, libvirt live-migration and a fencing loop

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed Feb 16 04:07:25 EST 2022


Hi!

I just wanted to report a problem we had after upgrading the last node of our three-node Xen cluster to SLES15 SP3:
When the cluster tried to balance resources at night by live-migrating VMs, live migration to the newly upgraded node (h19) failed (for reasons that were unknown at the time, see below).
Unfortunately, this ended in a fencing loop for h19.

The reasons for the fencing loop were (IMHO):
The failed migration status was not cleared by fencing node h19, so the cluster tried to "fix" it ("resource active on two nodes") when h19 came back online after fencing.
In addition, the VirtualDomain RA returned inconsistent results, causing another fencing.
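For reference, the stale failure records that drive this behaviour can be inspected with the standard Pacemaker tools; a minimal sketch (illustrative commands, not what we actually ran):

# one-shot status including fail counts and the recorded failed actions
crm_mon -1 --failcounts
# same, but also showing the per-resource operation history (migrate_to etc.)
crm_mon -1 --operations
# query the fail count of a single resource on a single node, e.g.
crm_failcount --query --resource prm_xen_v13 --node h16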

How it all began:
Feb 15 20:07:31 h16 VirtualDomain (prm_xen_v13)[30174]: ERROR: v13: live migration to h19 failed: 1
Feb 15 20:07:31 h16 pacemaker-execd[7259]: notice: prm_xen_v13_migrate_to_0[30123] error output [ error: internal error: libxenlight state driver is not active ]
Feb 15 20:07:32 h18 VirtualDomain (prm_xen_v15)[43671]: ERROR: v15: live migration to h19 failed: 1
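As it later turned out (see below), the "libxenlight state driver is not active" error means libvirt's libxl driver could not start, which is what you get when the node did not actually boot the Xen hypervisor. A few hedged checks that would have shown this on h19 (illustrative only, assuming the usual Xen and libvirt tools are installed):

# prints "xen" only when the kernel runs under the Xen hypervisor
cat /sys/hypervisor/type
# works only in a Xen dom0; fails on a plainly booted kernel
xl info
# talking to the libxl driver explicitly reproduces the error on a non-Xen boot
virsh -c xen:///system version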

Feb 15 20:08:03 h16 pacemaker-schedulerd[7263]: error: Resource prm_xen_v15 is active on 2 nodes (attempting recovery)
Feb 15 20:08:03 h16 pacemaker-schedulerd[7263]: error: Resource prm_xen_v13 is active on 2 nodes (attempting recovery)

Feb 15 20:08:03 h16 pacemaker-controld[7266]:  notice: Initiating stop operation prm_xen_v15_stop_0 on h19
Feb 15 20:08:03 h16 pacemaker-controld[7266]:  notice: Initiating stop operation prm_xen_v15_stop_0 on h18
Feb 15 20:08:03 h16 pacemaker-controld[7266]:  notice: Initiating stop operation prm_xen_v13_stop_0 on h19
Feb 15 20:08:03 h16 pacemaker-controld[7266]:  notice: Initiating stop operation prm_xen_v13_stop_0 locally on h16

Feb 15 20:08:03 h16 VirtualDomain (prm_xen_v13)[30732]: INFO: Issuing graceful shutdown request for domain v13.
Feb 15 20:08:24 h16 pacemaker-execd[7259]: notice: prm_xen_v13 stop (call 331, PID 30666) exited with status 0 (execution time 20705ms, queue time 0ms)
Feb 15 20:11:54 h16 pacemaker-schedulerd[7263]:  warning: Unexpected result (error: forced stop failed) was recorded for stop of prm_xen_v15 on h19 at Feb 15 20:08:03 2022

Feb 15 20:11:55 h16 pacemaker-schedulerd[7263]:  warning: Scheduling Node h19 for STONITH

OK, so far, so good.

After fencing:
Feb 15 20:16:01 h16 pacemaker-controld[7266]:  notice: Node h19 state is now member

Feb 15 20:16:13 h16 pacemaker-execd[7259]:  notice: prm_xen_v13 start (call 356, PID 35368) exited with status 0 (execution time 7059ms, queue time 0ms)
Feb 15 20:16:13 h16 pacemaker-execd[7259]:  notice: prm_xen_v15 start (call 355, PID 35367) exited with status 0 (execution time 7505ms, queue time 0ms)

Well, everything OK now? No, because the old error status pops up:
Feb 15 20:16:13 h16 pacemaker-schedulerd[7263]:  warning: Unexpected result (error: v13: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_v13 on h16 at ...
Feb 15 20:16:13 h16 pacemaker-schedulerd[7263]:  warning: Unexpected result (error: v15: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_v15 on h18 at ...
Feb 15 20:16:13 h16 pacemaker-schedulerd[7263]:  error: Resource prm_xen_v15 is active on 2 nodes (attempting recovery)
Feb 15 20:16:13 h16 pacemaker-schedulerd[7263]:  error: Resource prm_xen_v13 is active on 2 nodes (attempting recovery)

This was where the nonsense started.
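Before trusting the "active on 2 nodes" verdict, one could have checked where the domains were really running; a small sketch (to be run on each of h16, h18 and h19, purely illustrative):

# list the domains the local libvirt/libxl driver knows about
virsh -c xen:///system list --all
# or ask the Xen toolstack directly in dom0
xl list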

Feb 15 20:16:14 h16 pacemaker-controld[7266]:  notice: Initiating stop operation prm_xen_v13_stop_0 locally on h16
Feb 15 20:16:14 h16 pacemaker-controld[7266]:  notice: Initiating stop operation prm_xen_v13_stop_0 on h19
Feb 15 20:16:14 h16 pacemaker-controld[7266]:  notice: Initiating stop operation prm_xen_v15_stop_0 on h19
Feb 15 20:16:14 h16 pacemaker-controld[7266]:  notice: Initiating stop operation prm_xen_v15_stop_0 locally on h16

Feb 15 20:16:16 h16 pacemaker-controld[7266]:  notice: Transition 70 action 192 (prm_xen_v13_stop_0 on h19): expected 'ok' but got 'error'
Feb 15 20:16:16 h16 pacemaker-controld[7266]:  notice: Transition 70 action 163 (prm_xen_v15_stop_0 on h19): expected 'ok' but got 'error'

(That error still needs to be examined, but the whole "recovery" plan was unnecessary!)

Well, OK, you know how it goes: a failed stop causes fencing, so continue reading again at "After fencing" above.

On the (most likely) cause of the failed migration: for some reason, the GRUB boot menu had lost the default "Xen hypervisor" boot entry, so the node booted the non-Xen kernel.
But even when booting correctly, the cluster would still try to "recover" from the false "is active on 2 nodes", so the real fix was a manual "cleanup" of the failed migration entries (see the sketch below).
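For the record, a minimal sketch of the two fixes, assuming SLES's grub2 tools and crmsh; the GRUB menu entry title below is only a placeholder and has to match the local grub.cfg:

# make the Xen hypervisor entry the default again (title must match grub.cfg!)
grub2-set-default 'SLES 15-SP3, with Xen hypervisor'
grub2-mkconfig -o /boot/grub2/grub.cfg
# clear the stale migrate_to failure records so the cluster stops "recovering"
crm resource cleanup prm_xen_v13
crm resource cleanup prm_xen_v15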

Pacemaker "version 2.0.5+20201202.ba59be712-4.13.1-2.0.5+20201202.ba59be712" (pacemaker-2.0.5+20201202.ba59be712-4.13.1.x86_64)
libvirt 7.1.0, resource-agents-4.8.0+git30.d0077df0-8.17.1.x86_64

Regards,
Ulrich




