[ClusterLabs] Instable SLES15 SP3 kernel

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Apr 4 09:13:12 EDT 2022


>>> "Gao,Yan" <ygao at suse.com> schrieb am 04.04.2022 um 11:58 in Nachricht
<0d0f2b5e-3238-22df-4105-31e5a640d924 at suse.com>:
> On 2022/4/4 8:58, Ulrich Windl wrote:
>>>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 04.04.2022 at 06:39 in
>> message <e351f140-fe35-6b4d-16ce-008aee0d1679 at gmail.com>:
>>> On 31.03.2022 14:02, Ulrich Windl wrote:
>>>>>>> "Gao,Yan" <ygao at suse.com> schrieb am 31.03.2022 um 11:18 in Nachricht
>>>> <67785c2f‑f875‑cb16‑608b‑77d63d9b02c4 at suse.com>:
>>>>> On 2022/3/31 9:03, Ulrich Windl wrote:
>>>>>> Hi!
>>>>>>
>>>>>> I just wanted to point out one thing that hit us with SLES15 SP3:
>>>>>> Some failed live VM migration causing node fencing resulted in a
>>>>>> fencing loop, because of two reasons:
>>>>>>
>>>>>> 1) Pacemaker thinks that even _after_ fencing there is some migration
>>>>>> to "clean up". Pacemaker treats the situation as if the VM is running
>>>>>> on both nodes, thus (50% chance?) trying to stop the VM on the node
>>>>>> that just booted after fencing. That's stupid but shouldn't be fatal
>>>>>> IF there weren't...
>>>>>>
>>>>>> 2) The stop operation of the VM (that actually isn't running) fails,
>>>>>
>>>>> AFAICT it could not connect to the hypervisor, but the logic in the RA
>>>>> is kind of questionable: the probe (monitor) of the VM returned "not
>>>>> running", yet the stop right after that returned failure...
>>>>>
>>>>> OTOH, the point about pacemaker is that the stop of the resource on the
>>>>> fenced and rejoined node is not really necessary. There have been
>>>>> discussions about this here and we are trying to figure out a solution
>>>>> for it:
>>>>>
>>>>> https://github.com/ClusterLabs/pacemaker/pull/2146#discussion_r828204919
>>>>>
>>>>> For now it requires the administrator's intervention if the situation happens:
>>>>> 1) Fix the access to hypervisor before the fenced node rejoins.
>>>>
>>>> Thanks for the explanation!
>>>>
>>>> Unfortunately this can be tricky if libvirtd is involved (as it is here):
>>>> libvirtd uses locking (virtlockd), which in turn needs a cluster-wide
>>>> filesystem for locks across the nodes.
>>>> When that filesystem is provided by the cluster, it's hard to delay node
>>>> joining until the filesystem, virtlockd and libvirtd are running.
>>>>
>>>
>>> So do not use a filesystem provided by the same cluster. Use a separate
>>> filesystem mounted outside of the cluster, like a separate highly available NFS.
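
For reference, such a setup would look roughly like the sketch below. The NFS
server name, mount point and lock directory are only placeholders, and for the
Xen/libxl driver the same settings would go into libxl.conf / libxl-lockd.conf
rather than the qemu files:
---
# /etc/fstab: keep the virtlockd lockspace on an NFS share that is mounted
# outside the cluster (server and paths are examples only)
nfsserver:/export/virtlockd  /var/lib/libvirt/lockd  nfs  defaults,_netdev  0 0

# /etc/libvirt/qemu.conf: enable the lockd lock manager
lock_manager = "lockd"

# /etc/libvirt/qemu-lockd.conf: put the file lockspace on the shared mount
file_lockspace_dir = "/var/lib/libvirt/lockd/files"
---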
>> 
>> Hi!
>> 
>> Having a second cluster just to provide VM locking seems like big overkill.
>> Actually I absolutely regret that I ever followed the advice to use libvirt
>> and VirtualDomain, as it seems to have no real benefit for Xen and PVMs.
>> As a matter of fact, after more than 10 years of using Xen PVMs in a cluster
>> we will move to VMware, as SLES15 SP3 is the most unstable SLES I have ever
>> seen (I started with SLES 8).
>> SUSE support seems unable to either fix the memory corruption or to provide
>> a kernel that does not have it (it seems SP2 did not have it).
> 
> Sounds like there's a certain kernel issue related to Xen? Probably ask
> SUSE support to raise the priority of the ticket?

Hi!

Actually it's sufficient either to use rear to create a recovery image or to
copy a large file from OCFS2 to trigger the bug.
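(Nothing exotic is needed; copying one large file off the OCFS2 mount is
enough, e.g. something like the following, where the path is only an example.)
---
# copying a single big file from the OCFS2 filesystem is enough to trigger it
cp /ocfs2/xen/images/big-disk-image.raw /tmp/
---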
Unfortunately support isn't really making progress it seems (we have a PTF
kernel, but that isn't any better).

To prevent kernel panics and lots of failing VMs I'm running this script as a
cron job:
---
# cat /etc/crontabs/reboot-before-panic.sh
#!/usr/bin/sh
# Detect RAM corruption. If detected, log a message and reboot
# to prevent a kernel panic

# cron jobs need a PATH
PATH=/sbin:/usr/sbin:/usr/bin:/bin
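# Note: with -g/--grep, journalctl exits non-zero when no entries match,
# so its exit status alone can drive the test below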
if journalctl -b -g 'Code: Bad RIP value|BUG: Bad rss-counter state mm:' >/dev/null
then
    MSG='RAM corruption detected, starting pro-active reboot'
    logger -t reboot-before-panic -p local0.notice "$MSG"
    shutdown -r +1 "$MSG"
fi
if journalctl -b -k | grep -q 'kernel: OCFS2: File system is now read-only\.'
then
    MSG='OCFS2 problem detected, stopping cluster node, then reboot'
    logger -t reboot-before-panic -p local0.notice "$MSG"
    crm cluster stop
    shutdown -r +1 "$MSG"
fi
---
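
For completeness, the cron entry to run it would look roughly like this (the
5-minute interval and the cron.d file name are just examples):
---
# /etc/cron.d/reboot-before-panic (interval is only an example)
*/5 * * * * root /usr/bin/sh /etc/crontabs/reboot-before-panic.sh
---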

Regards,
Ulrich

> 
> Regards,
>    Yan
> 
> 
>> 
>> Regards,
>> Ulrich
>> 
>> 