[ClusterLabs] Failed migration causing fencing loop
Gao,Yan
ygao at suse.com
Thu Mar 31 05:18:39 EDT 2022
On 2022/3/31 9:03, Ulrich Windl wrote:
> Hi!
>
> I just wanted to point out one thing that hit us with SLES15 SP3:
> A failed live VM migration that caused node fencing resulted in a fencing loop, for two reasons:
>
> 1) Pacemaker thinks that even _after_ fencing there is some migration to "clean up". Pacemaker treats the situation as if the VM is running on both nodes, thus (50% chance?) trying to stop the VM on the node that just booted after fencing. That's stupid but shouldn't be fatal IF there weren't...
>
> 2) The stop operation of the VM (that actually isn't running) fails,
AFAICT it could not connect to the hypervisor. The logic in the RA is
somewhat arguable, though: the probe (monitor) of the VM returned "not
running", yet the stop right after that returned failure...
OTOH, the point about pacemaker is that the stop of the resource on the
fenced and rejoined node is not really necessary. There have been
discussions about this here, and we are trying to figure out a solution
for it:
https://github.com/ClusterLabs/pacemaker/pull/2146#discussion_r828204919
For now it requires the administrator's intervention if the situation happens:
1) Fix the access to the hypervisor before the fenced node rejoins.
2) Manually clean up the resource, which tells pacemaker it can safely
forget the historical migrate_to failure (see the example command below).
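
For example, using the resource and node names from the logs in Ulrich's
mail (adjust to your own configuration; crmsh's "crm resource cleanup"
does the same thing):

    crm_resource --cleanup --resource prm_xen_v15 --node h18

This clears the recorded migrate_to failure from the resource's
operation history, so the scheduler no longer considers the VM as
possibly active on two nodes.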
Regards,
Yan
> causing a node fence. So the loop is complete.
>
> Some details (many unrelated messages left out):
>
> Mar 30 16:06:14 h16 libvirtd[13637]: internal error: libxenlight failed to restore domain 'v15'
>
> Mar 30 16:06:15 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (error: v15: live migration to h16 failed: 1) was recorded for migrate_to of prm_xen_v15 on h18 at Mar 30 16:06:13 2022
>
> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 16:13:36 2022
> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 16:13:36 2022
> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Cluster node h18 will be fenced: prm_libvirtd:0 failed there
>
> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: warning: Unexpected result (error: v15: live migration to h18 failed: 1) was recorded for migrate_to of prm_xen_v15 on h16 at Mar 29 23:58:40 2022
> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: error: Resource prm_xen_v15 is active on 2 nodes (attempting recovery)
>
> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: notice: * Restart prm_xen_v15 ( h18 )
>
> Mar 30 16:19:04 h18 VirtualDomain(prm_xen_v15)[8768]: INFO: Virtual domain v15 currently has no state, retrying.
> Mar 30 16:19:05 h18 VirtualDomain(prm_xen_v15)[8787]: INFO: Virtual domain v15 currently has no state, retrying.
> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8822]: ERROR: Virtual domain v15 has no state during stop operation, bailing out.
> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8836]: INFO: Issuing forced shutdown (destroy) request for domain v15.
> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8860]: ERROR: forced stop failed
>
> Mar 30 16:19:07 h19 pacemaker-controld[7351]: notice: Transition 124 action 115 (prm_xen_v15_stop_0 on h18): expected 'ok' but got 'error'
>
> Note: Our cluster nodes start pacemaker during boot. Yesterday I was there when the problem happened. But as we had another boot loop some time ago, I wrote a systemd service that counts boots, and if too many happen within a short time, pacemaker is disabled on that node. As it is set now, the counter is reset if the node stays up for at least 15 minutes; if the node fails to do so more than 4 times, pacemaker will be disabled. If someone wants to try that or give feedback, drop me a line, so I could provide the RPM (boot-loop-handler-0.0.5-0.0.noarch)...
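>
> A rough sketch of the idea (the actual package uses a proper systemd unit; paths, names and thresholds here are only illustrative):
>
>     #!/bin/sh
>     # boot-loop-check: run once at boot, e.g. from a systemd oneshot unit.
>     COUNT_FILE=/var/lib/boot-loop-handler/count
>     MAX_FAILS=4        # disable pacemaker after this many short-lived boots
>     STABLE_SECS=900    # 15 minutes of uptime counts as a "good" boot
>
>     mkdir -p "$(dirname "$COUNT_FILE")"
>     count=$(cat "$COUNT_FILE" 2>/dev/null || echo 0)
>     count=$((count + 1))
>     echo "$count" > "$COUNT_FILE"
>
>     if [ "$count" -gt "$MAX_FAILS" ]; then
>         systemctl disable --now pacemaker
>         exit 0
>     fi
>
>     # Reset the counter once the node has stayed up long enough
>     # (a separate timer unit would be cleaner than a background sleep).
>     ( sleep "$STABLE_SECS" && echo 0 > "$COUNT_FILE" ) &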
>
> Regards,
> Ulrich
>
>
>