[ClusterLabs] Antw: [EXT] Re: Failed migration causing fencing loop

Thu Mar 31 07:02:41 EDT 2022

>>> "Gao,Yan" <ygao at suse.com> wrote on 31.03.2022 at 11:18 in message
<67785c2f-f875-cb16-608b-77d63d9b02c4 at suse.com>:
> On 2022/3/31 9:03, Ulrich Windl wrote:
>> Hi!
>> 
>> I just wanted to point out one thing that hit us with SLES15 SP3:
>> A failed live VM migration that caused node fencing resulted in a fencing 
> loop, for two reasons:
>> 
>> 1) Pacemaker thinks that even _after_ fencing there is some migration to 
> "clean up". Pacemaker treats the situation as if the VM is running on both 
> nodes, thus (50% chance?) trying to stop the VM on the node that just booted 
> after fencing. That's stupid but shouldn't be fatal IF there weren't...
>> 
>> 2) The stop operation of the VM (that actually isn't running) fails,
> 
> AFAICT it could not connect to the hypervisor, but the logic in the RA 
> is arguable: the probe (monitor) of the VM returned "not 
> running", yet the stop right after that returned failure...
> 
> OTOH, the point about pacemaker is that the stop of the resource on the 
> fenced and rejoined node is not really necessary. There have been 
> discussions about this here and we are trying to figure out a solution 
> for it:
> 
> https://github.com/ClusterLabs/pacemaker/pull/2146#discussion_r828204919 
> 
> For now it requires the administrator's intervention if the situation happens:
> 1) Fix the access to hypervisor before the fenced node rejoins.

Thanks for the explanation!

Unfortunately this can be tricky if libvirtd is involved (as it is here):
libvirtd uses locking (virtlockd), which in turn needs a cluster-wide filesystem for locks across the nodes.
When that filesystem is provided by the cluster, it's hard to delay node joining until the filesystem, virtlockd and libvirtd are running.

(The issue has been discussed before: it does not make sense to run probes when those probes need other resources to detect the status.
With just a Boolean status return, the best all those probes could say is "not running". Ideally a third status like "please try again at some later time"
would be needed, or probes should follow the dependencies of their resources (which may open another can of worms).)
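
To make the "third status" idea concrete, here is a minimal sketch. It is not how Pacemaker or any resource agent works today; ProbeStatus, probe(), check_running and prerequisites are hypothetical names made up for illustration. The assumption is that a probe can ask whether the things it depends on (here: the lock filesystem, virtlockd, libvirtd) are up before it tries to answer.

```python
from enum import Enum
from typing import Callable, Iterable


class ProbeStatus(Enum):
    # Hypothetical three-valued probe result (not an existing Pacemaker type).
    RUNNING = "running"
    NOT_RUNNING = "not running"
    TRY_AGAIN_LATER = "please try again at some later time"


def probe(check_running: Callable[[], bool],
          prerequisites: Iterable[Callable[[], bool]]) -> ProbeStatus:
    """Answer only if everything the probe itself depends on is available."""
    # If any prerequisite is down, "not running" would be a guess, so the
    # probe declines to answer instead of reporting a definite state.
    if not all(ready() for ready in prerequisites):
        return ProbeStatus.TRY_AGAIN_LATER
    return ProbeStatus.RUNNING if check_running() else ProbeStatus.NOT_RUNNING
```

With a prerequisite that reports "not ready", probe(lambda: False, [lambda: False]) returns ProbeStatus.TRY_AGAIN_LATER rather than claiming the VM is stopped; how the cluster would schedule the retry is exactly the open question.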

Regards,
Ulrich

> 2) Manually clean up the resource, which tells pacemaker it can safely 
> forget the historical migrate_to failure.
> 
> Regards,
>    Yan
> 
>> causing a node fence. So the loop is complete.
>> 
>> Some details (many unrelated messages left out):
>> 
>> Mar 30 16:06:14 h16 libvirtd[13637]: internal error: libxenlight failed to 
> restore domain 'v15'
>> 
>> Mar 30 16:06:15 h19 pacemaker-schedulerd[7350]:  warning: Unexpected result 
> (error: v15: live migration to h16 failed: 1) was recorded for migrate_to of 
> prm_xen_v15 on h18 at Mar 30 16:06:13 2022
>> 
>> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]:  warning: Unexpected result 
> (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 
> 16:13:36 2022
>> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]:  warning: Unexpected result 
> (OCF_TIMEOUT) was recorded for stop of prm_libvirtd:0 on h18 at Mar 30 
> 16:13:36 2022
>> Mar 30 16:13:37 h19 pacemaker-schedulerd[7350]:  warning: Cluster node h18 
> will be fenced: prm_libvirtd:0 failed there
>> 
>> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]:  warning: Unexpected result 
> (error: v15: live migration to h18 failed: 1) was recorded for migrate_to of 
> prm_xen_v15 on h16 at Mar 29 23:58:40 2022
>> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]:  error: Resource prm_xen_v15 
> is active on 2 nodes (attempting recovery)
>> 
>> Mar 30 16:19:00 h19 pacemaker-schedulerd[7350]:  notice:  * Restart    
> prm_xen_v15              (             h18 )
>> 
>> Mar 30 16:19:04 h18 VirtualDomain(prm_xen_v15)[8768]: INFO: Virtual domain 
> v15 currently has no state, retrying.
>> Mar 30 16:19:05 h18 VirtualDomain(prm_xen_v15)[8787]: INFO: Virtual domain 
> v15 currently has no state, retrying.
>> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8822]: ERROR: Virtual domain 
> v15 has no state during stop operation, bailing out.
>> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8836]: INFO: Issuing forced 
> shutdown (destroy) request for domain v15.
>> Mar 30 16:19:07 h18 VirtualDomain(prm_xen_v15)[8860]: ERROR: forced stop 
> failed
>> 
>> Mar 30 16:19:07 h19 pacemaker-controld[7351]:  notice: Transition 124 action 
> 115 (prm_xen_v15_stop_0 on h18): expected 'ok' but got 'error'
>> 
>> Note: Our cluster nodes start pacemaker during boot. Yesterday I was there 
> when the problem happened. But as we had another boot loop some time ago I 
> wrote a systemd service that counts boots, and if too many happen within a 
> short time, pacemaker will be disabled on that node. As it is set now, the 
> counter is reset if the node is up for at least 15 minutes; if it fails more 
> than 4 times to do so, pacemaker will be disabled. If someone wants to try 
> that or give feedback, drop me a line, so I could provide the RPM 
> (boot-loop-handler-0.0.5-0.0.noarch)...
>> 
>> Regards,
>> Ulrich