[ClusterLabs] Failed migration causing fencing loop
Gao,Yan
ygao at suse.com
Wed May 25 06:37:32 EDT 2022
Hi Ulrich,
On 2022/3/31 11:18, Gao,Yan via Users wrote:
> On 2022/3/31 9:03, Ulrich Windl wrote:
>> Hi!
>>
>> I just wanted to point out one thing that hit us with SLES15 SP3:
>> A failed live VM migration caused node fencing and resulted in a
>> fencing loop, for two reasons:
>>
>> 1) Pacemaker thinks that even _after_ fencing there is some migration
>> to "clean up". Pacemaker treats the situation as if the VM is running
>> on both nodes, thus (50% chance?) trying to stop the VM on the node
>> that just booted after fencing. That's stupid, but it shouldn't be
>> fatal IF there weren't...
>>
>> 2) The stop operation of the VM (that actually isn't running) fails,
>
> AFAICT it could not connect to the hypervisor, but the logic in the RA
> is kind of arguable: the probe (monitor) of the VM returned "not
> running", yet the stop right after that returned failure...
>
> OTOH, the point about pacemaker is that the stop of the resource on the
> fenced and rejoined node is not really necessary. There have been
> discussions about this here, and we are trying to figure out a solution
> for it:
>
> https://github.com/ClusterLabs/pacemaker/pull/2146#discussion_r828204919
FYI, this issue has been addressed with:
https://github.com/ClusterLabs/pacemaker/pull/2705
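If it is the change I think it is (a minimal sketch, assuming the fix
adds a stop_unexpected value for the multiple-active meta-attribute,
expected in Pacemaker 2.1.3; the resource name is taken from Ulrich's
logs below), one would opt in with:

crm_resource --resource prm_xen_v15 --meta \
    --set-parameter multiple-active --parameter-value stop_unexpected

With that, only the copy of the resource on the node where it appeared
unexpectedly is stopped, instead of a full restart on both nodes.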
Regards,
Yan
>
> For now it requires administrator intervention if the situation happens:
> 1) Fix the access to hypervisor before the fenced node rejoins.
> 2) Manually clean up the resource, which tells pacemaker it can safely
> forget the historical migrate_to failure (see the example below).
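For step 2, a minimal example, with the resource and node names taken
from the logs below (on crmsh-based setups such as SLES, "crm resource
cleanup prm_xen_v15 h18" should work as well):

crm_resource --cleanup --resource prm_xen_v15 --node h18

Omitting --node clears the failure history on all nodes.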
>
> Regards,
> Yan
>
>> causing a node fence. So the loop is complete.
>>
>> Some details (many unrelated messages left out):
>>
>> Mar 30 16:06:14 h16 libvirtd[13637]: internal error: libxenlight
>> failed to restore domain 'v15'
>>
>> Mar 30 16:06:15 h19 pacemaker-schedulerd[7350]: warning: Unexpected
>> result (error: v15: live migration to h16 failed: 1) was recorded for
>> migrate_to of prm_xen_v15 on h18 at Mar 30 16:06:13 2022
>>
>> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Unexpected
>> result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at
>> Mar 30 16:13:36 2022
>> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Unexpected
>> result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at
>> Mar 30 16:13:36 2022
>> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]: warning: Cluster node
>> h18 will be fenced: prm_libvirtd:0 failed there
>>
>> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: warning: Unexpected
>> result (error: v15: live migration to h18 failed: 1) was recorded for
>> migrate_to of prm_xen_v15 on h16 at Mar 29 23:58:40 2022
>> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: error: Resource
>> prm_xen_v15 is active on 2 nodes (attempting recovery)
>>
>> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]: notice: * Restart
>> prm_xen_v15 ( h18 )
>>
>> Mar 30 16:19:04 h18 VirtualDomain(prm_xen_v15)[8768]: INFO: Virtual
>> domain v15 currently has no state, retrying.
>> Mar 30 16:19:05 h18 VirtualDomain(prm_xen_v15)[8787]: INFO: Virtual
>> domain v15 currently has no state, retrying.
>> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8822]: ERROR: Virtual
>> domain v15 has no state during stop operation, bailing out.
>> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8836]: INFO: Issuing
>> forced shutdown (destroy) request for domain v15.
>> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8860]: ERROR: forced
>> stop failed
>>
>> Mar 30 16:19:07 h19 pacemaker-controld[7351]: notice: Transition 124
>> action 115 (prm_xen_v15_stop_0 on h18): expected 'ok' but got 'error'
>>
>> Note: Our cluster nodes start pacemaker during boot. Yesterday I was
>> there when the problem happened. But as we had another boot loop some
>> time ago, I wrote a systemd service that counts boots, and if too many
>> happen within a short time, pacemaker will be disabled on that node.
>> As it is set now, the counter is reset if the node is up for at least
>> 15 minutes; if it fails to do so more than 4 times, pacemaker will be
>> disabled. If someone wants to try that or give feedback, drop me a
>> line, so I can provide the RPM (boot-loop-handler-0.0.5-0.0.noarch)...
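For reference, a minimal sketch of such a boot-loop counter, run from a
systemd service at every boot (this only mirrors the behaviour described
above; the paths, numbers and unit wiring are my assumptions, not
Ulrich's actual RPM):

#!/bin/sh
# Count boots; disable pacemaker after too many quick reboot cycles.
COUNT_FILE=/var/lib/boot-loop-handler/count
MAX_BOOTS=4       # give up after more than 4 short-lived boots
UPTIME_OK=900     # 15 minutes of uptime resets the counter

mkdir -p "$(dirname "$COUNT_FILE")"
count=$(($(cat "$COUNT_FILE" 2>/dev/null || echo 0) + 1))
echo "$count" > "$COUNT_FILE"

if [ "$count" -gt "$MAX_BOOTS" ]; then
    # Too many quick reboots: keep this node from rejoining the
    # cluster and restarting the fencing loop.
    systemctl disable --now pacemaker
    exit 0
fi

# Stayed up long enough: consider the loop broken.
sleep "$UPTIME_OK"
echo 0 > "$COUNT_FILE"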
>>
>> Regards,
>> Ulrich