[ClusterLabs] Failed migration causing fencing loop

Gao,Yan ygao at suse.com
Wed May 25 06:37:32 EDT 2022


Hi Ulrich,

On 2022/3/31 11:18, Gao,Yan via Users wrote:
> On 2022/3/31 9:03, Ulrich Windl wrote:
>> Hi!
>>
>> I just wanted to point out one thing that hit us with SLES15 SP3:
>> A failed live VM migration that caused node fencing resulted in a 
>> fencing loop, for two reasons:
>>
>> 1) Pacemaker thinks that even _after_ fencing there is some migration 
>> to "clean up". Pacemaker treats the situation as if the VM were running 
>> on both nodes, thus (50% chance?) trying to stop the VM on the node 
>> that just booted after fencing. That's stupid, but it shouldn't be 
>> fatal if it weren't for the fact that...
>>
>> 2) The stop operation of the VM (that actually isn't running) fails,
> 
> AFAICT it could not connect to the hypervisor, but the RA's logic is 
> kind of arguable: the probe (monitor) of the VM returned "not 
> running", yet the stop right after it returned a failure...
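
(For illustration only -- this is not the actual VirtualDomain RA code, 
just the rough sequence visible in the logs below, assuming libvirtd on 
the rejoined node is unreachable:)

    virsh domstate v15   # fails -> the probe reports "not running"
    virsh destroy v15    # the stop's forced-shutdown fallback fails too,
                         # so the stop returns an error and the node is
                         # fenced again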
> 
> OTOH, the point about pacemaker is that the stop of the resource on 
> the fenced and rejoined node is not really necessary. There have been 
> discussions about this here, and we are trying to figure out a 
> solution for it:
> 
> https://github.com/ClusterLabs/pacemaker/pull/2146#discussion_r828204919

FYI, this issue has been addressed with:
https://github.com/ClusterLabs/pacemaker/pull/2705

Regards,
   Yan

> 
> For now it requires the administrator's intervention if the situation 
> happens:
> 1) Fix the access to the hypervisor before the fenced node rejoins.
> 2) Manually clean up the resource, which tells pacemaker it can safely 
> forget the historical migrate_to failure (see the example below).
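
For example (assuming the resource and node names from the logs below, 
prm_xen_v15 and h18; a sketch rather than a prescription):

    # after restoring access to the hypervisor on the rejoined node
    crm_resource --cleanup --resource prm_xen_v15 --node h18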
> 
> Regards,
>    Yan
> 
>> causing a node fence. So the loop is complete.
>>
>> Some details (many unrelated messages left out):
>>
>> Mar 30 16:06:14 h16 libvirtd[13637]: internal error: libxenlight 
>> failed to restore domain 'v15'
>>
>> Mar 30 16:06:15 h19 pacemaker-schedulerd[7350]:  warning: Unexpected 
>> result (error: v15: live migration to h16 failed: 1) was recorded for 
>> migrate_to of prm_xen_v15 on h18 at Mar 30 16:06:13 2022
>>
>> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]:  warning: Unexpected 
>> result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at 
>> Mar 30 16:13:36 2022
>> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]:  warning: Unexpected 
>> result (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at 
>> Mar 30 16:13:36 2022
>> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]:  warning: Cluster node 
>> h18 will be fenced: prm_libvirtd:0 failed there
>>
>> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]:  warning: Unexpected 
>> result (error: v15: live migration to h18 failed: 1) was recorded for 
>> migrate_to of prm_xen_v15 on h16 at Mar 29 23:58:40 2022
>> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]:  error: Resource 
>> prm_xen_v15 is active on 2 nodes (attempting recovery)
>>
>> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]:  notice:  * Restart    
>> prm_xen_v15              (             h18 )
>>
>> Mar 30 16:19:04 h18 VirtualDomain(prm_xen_v15)[8768]: INFO: Virtual 
>> domain v15 currently has no state, retrying.
>> Mar 30 16:19:05 h18 VirtualDomain(prm_xen_v15)[8787]: INFO: Virtual 
>> domain v15 currently has no state, retrying.
>> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8822]: ERROR: Virtual 
>> domain v15 has no state during stop operation, bailing out.
>> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8836]: INFO: Issuing 
>> forced shutdown (destroy) request for domain v15.
>> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8860]: ERROR: forced 
>> stop failed
>>
>> Mar 30 16:19:07 h19 pacemaker-controld[7351]:  notice: Transition 124 
>> action 115 (prm_xen_v15_stop_0 on h18): expected 'ok' but got 'error'
>>
>> Note: Our cluster nodes start pacemaker during boot. Yesterday I was 
>> there when the problem happened. But as we had another boot loop some 
>> time ago, I wrote a systemd service that counts boots, and if too many 
>> happen within a short time, pacemaker will be disabled on that node. 
>> As it is set now, the counter is reset if the node is up for at least 
>> 15 minutes; if it fails to do so more than 4 times, pacemaker will be 
>> disabled. If someone wants to try that or give feedback, drop me a 
>> line, so I can provide the RPM (boot-loop-handler-0.0.5-0.0.noarch)...
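
The counting logic could look roughly like this (a hypothetical sketch 
of the idea, not the actual boot-loop-handler package; paths and 
thresholds are placeholders):

    #!/bin/sh
    # Run from a oneshot systemd service early during boot.
    COUNT_FILE=/var/lib/boot-loop-handler/count
    MAX_FAILS=4        # give up after more than 4 short-lived boots
    STABLE_SECS=900    # 15 minutes of uptime resets the counter

    mkdir -p "${COUNT_FILE%/*}"
    count=$(( $(cat "$COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
    echo "$count" > "$COUNT_FILE"

    if [ "$count" -gt "$MAX_FAILS" ]; then
        # too many short-lived boots: keep pacemaker from starting again
        systemctl disable --now pacemaker.service
        exit 0
    fi

    # reset the counter once the node has stayed up long enough
    # (a systemd timer would be more robust than a background sleep)
    ( sleep "$STABLE_SECS" && echo 0 > "$COUNT_FILE" ) &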
>>
>> Regards,
>> Ulrich