[ClusterLabs] Failed migration causing fencing loop

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Mar 31 03:03:50 EDT 2022


Hi!

I just wanted to point out one thing that hit us with SLES15 SP3:
A failed live VM migration that caused node fencing resulted in a fencing loop, for two reasons:

1) Pacemaker thinks that even _after_ fencing there is some migration to "clean up". It treats the situation as if the VM were running on both nodes, and thus (50% chance?) tries to stop the VM on the node that just booted after being fenced. That's stupid, but it wouldn't be fatal IF it weren't for...

2) The stop operation for the VM (which actually isn't running) fails, causing a node fence. So the loop is complete.

Some details (many unrelated messages left out):

Mar 30 16:06:14 h16 libvirtd[13637]: internal error: libxenlight failed to restore domain 'v15'

Mar 30 16:06:15 h19 pacemaker-schedulerd[7350]:  warning: Unexpected result (error: v15: live migration to h16 failed: 1) was recorded for migrate_to of prm_xen_v15 on h18 at Mar 30 16:06:13 2022

Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]:  warning: Unexpected result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 16:13:36 2022
Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]:  warning: Cluster node h18 will be fenced: prm_libvirtd:0 failed there

Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]:  warning: Unexpected result (error: v15: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_v15 on h16 at Mar 29 23:58:40 2022
Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]:  error: Resource prm_xen_v15 is active on 2 nodes (attempting recovery)

Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]:  notice:  * Restart    prm_xen_v15              (             h18 )

Mar 30 16:19:04 h18 VirtualDomain(prm_xen_v15)[8768]: INFO: Virtual domain v15 currently has no state, retrying.
Mar 30 16:19:05 h18 VirtualDomain(prm_xen_v15)[8787]: INFO: Virtual domain v15 currently has no state, retrying.
Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8822]: ERROR: Virtual domain v15 has no state during stop operation, bailing out.
Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8836]: INFO: Issuing forced shutdown (destroy) request for domain v15.
Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8860]: ERROR: forced stop failed

Mar 30 16:19:07 h19 pacemaker-controld[7351]:  notice: Transition 124 action 115 (prm_xen_v15_stop_0 on h18): expected 'ok' but got 'error'

Note: Our cluster nodes start pacemaker during boot. Yesterday I was present when the problem happened. Because we had another boot loop some time ago, I wrote a systemd service that counts boots; if too many happen within a short time, pacemaker is disabled on that node. As it is set now, the counter is reset once the node has been up for at least 15 minutes; if the node fails more than 4 times to stay up that long, pacemaker is disabled. If someone wants to try that or give feedback, drop me a line and I can provide the RPM (boot-loop-handler-0.0.5-0.0.noarch)...
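The core of that boot-counting idea could be sketched roughly like this (a hypothetical illustration only; the function names, file paths, and thresholds here are my own placeholders, not the actual contents of the boot-loop-handler RPM):

```shell
#!/bin/sh
# Sketch of a boot-loop guard: count boots, and disable pacemaker if the
# node reboots too often without staying up long enough.

# record_boot COUNT_FILE MAX_FAILS
# Called early in boot; increments the counter and prints "disable" when
# the limit is exceeded (the real service would then run something like
# `systemctl disable pacemaker` on that node).
record_boot() {
    count_file=$1
    max_fails=$2
    count=$(cat "$count_file" 2>/dev/null || echo 0)
    count=$((count + 1))
    echo "$count" > "$count_file"
    if [ "$count" -gt "$max_fails" ]; then
        echo disable
    fi
}

# reset_counter COUNT_FILE
# Run (e.g. from a systemd timer) once the node has been up for the
# configured grace period, so only rapid boot loops trip the limit.
reset_counter() {
    echo 0 > "$1"
}
```

With MAX_FAILS=4, the fifth rapid boot would trip the limit, matching the "more than 4 times" behaviour described above.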

Regards,
Ulrich
