[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Node fenced for unknown reason

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Apr 19 02:36:02 EDT 2021

>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 16.04.2021 at 06:56 in
message <4deaf7d9-1d38-cb03-e908-1d4f2c58cf16 at gmail.com>:
> On 15.04.2021 23:09, Steffen Vinther Sørensen wrote:
>> On Thu, Apr 15, 2021 at 3:39 PM Klaus Wenninger <kwenning at redhat.com> wrote:
>>> On 4/15/21 3:26 PM, Ulrich Windl wrote:
>>>>>>> Steffen Vinther Sørensen <svinther at gmail.com> wrote on 15.04.2021
>>>>>>> at 14:56 in message
>>>>>>> <CALhdMBiXZoYF-Gxg82oNT4MGFm6Q-_imCeUVHyPgWKy41JjFSg at mail.gmail.com>:
>>>>> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
>>>>> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>>>>>>>> Steffen Vinther Sørensen <svinther at gmail.com> wrote on 15.04.2021
>>>>>>>>> at 13:10 in message
>>>>>>>>> <CALhdMBhMQRwmgoWEWuiGMDr7HfVOTTKvW8=NQMs2P2e9p8y9Jw at mail.gmail.com>:
>>>>>>> Hi there,
>>>>>>> In this 3 node cluster, node03 been offline for a while, and being
>>>>>>> brought up to service. Then a migration of a VirtualDomain is being
>>>>>>> attempted, and node02 is then fenced.
>>>>>>> Provided is logs from all 2 nodes, and the 'pcs config' as well as a
>>>>>>> bzcatted pe-warn. Anyone with an idea of why the node was fenced ? Is
>>>>>>> it because of the failed ipmi monitor warning ?
>>>>>> After a short glance, it looks as if the network traffic used for VM
>>>>>> migration killed the corosync (or other) communication.
>>>>> May I ask which part makes you think so?
>>>> The fact that I saw no reason for an intended fencing.
>>> And it looks like node02 was cut off from all
>>> network communication - both corosync & ipmi.
>>> It may really be the networking load, although I would
>>> rather bet on something more systematic, like a
>>> MAC/IP conflict with the VM or something.
>>> I see you have libvirtd under cluster control.
>>> Maybe bringing up the network topology destroys the
>>> connection between the nodes.
>>> Has the cluster been working with all 3 nodes before?
>>> Klaus
>> Hi Klaus
>> Yes, it has been working before with all 3 nodes and migrations back
>> and forth, but a few more VirtualDomains have been deployed since the
>> last migration test.
>> It happens very fast, almost immediately after the migration starts.
>> Could it be that some timeout values should be adjusted?
>> I just don't have any idea where to start looking, as to me there is
>> nothing obviously suspicious in the logs.
> I would look at performance stats; maybe node02 was overloaded and
> could not answer in time. Although standard sar stats are collected
> every 15 minutes, which is usually too coarse for this.

It's probably too slow, but recent monit can check this:
       Link saturation

       You can check the network link saturation. Monit then computes the
       utilisation based on the current transfer rate vs. link capacity. This
       test may only be used within a check network service entry in the
       control file.


        IF SATURATION operator value% THEN action

       operator is a choice of "<", ">", "!=", "==" in C notation, "gt",
       "lt", "eq", "ne" in shell notation and "greater", "less", "equal",
       "notequal" in human readable form (if not specified, default is [...])

       action is a choice of "ALERT", "RESTART", "START", "STOP", "EXEC" or
       [...]

       NOTE: this test depends on the availability of the speed attribute,
       and not all interface types have this attribute. See the LINK SPEED
       test.


        check network eth0 with interface eth0
              if saturation > 90% then alert
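Outside monit, the same computation can be done by sampling the kernel's byte counters directly; a minimal Python sketch follows (the interface name and the tx-only measurement are assumptions, and the speed-attribute caveat above applies here just the same):

```python
import time

def saturation_pct(bytes_before, bytes_after, interval_s, speed_mbit):
    """Link utilisation in percent: observed bit rate vs. link capacity."""
    rate_bit = (bytes_after - bytes_before) * 8 / interval_s
    return 100.0 * rate_bit / (speed_mbit * 1_000_000)

def read_counter(iface, direction):
    """Read the rx_bytes/tx_bytes counter from sysfs ('rx' or 'tx')."""
    with open(f"/sys/class/net/{iface}/statistics/{direction}_bytes") as f:
        return int(f.read())

def measure(iface="eth0", interval_s=1.0):
    """Sample the interface twice and report tx saturation.
    Note: /sys/class/net/<iface>/speed is absent or -1 on some
    interface types (bridges, bonds, lo) - same limitation as monit."""
    with open(f"/sys/class/net/{iface}/speed") as f:
        speed_mbit = int(f.read())
    before = read_counter(iface, "tx")
    time.sleep(interval_s)
    after = read_counter(iface, "tx")
    return saturation_pct(before, after, interval_s, speed_mbit)
```

The same helpers work for the receive direction by sampling "rx" instead of "tx".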


> Migration could stress network. Talk with your network support, any
> errors around this time?
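Before involving network support, the error and drop counters on each node's interfaces are worth a quick look (`ip -s link` shows the same numbers); a minimal sketch parsing /proc/net/dev, assuming the standard column layout from proc(5):

```python
def parse_proc_net_dev(text):
    """Parse /proc/net/dev text into
    {iface: (rx_errs, rx_drop, tx_errs, tx_drop)}."""
    stats = {}
    for line in text.splitlines()[2:]:          # skip the two header lines
        iface, data = line.split(":", 1)
        f = data.split()
        # rx: bytes packets errs drop fifo frame compressed multicast
        # tx: bytes packets errs drop fifo colls carrier compressed
        stats[iface.strip()] = (int(f[2]), int(f[3]), int(f[10]), int(f[11]))
    return stats

def link_errors(path="/proc/net/dev"):
    """Read the live counters; nonzero errs/drop around the fencing
    timestamp would point at the link rather than at pacemaker."""
    with open(path) as fh:
        return parse_proc_net_dev(fh.read())
```

These counters are cumulative since boot, so comparing snapshots taken before and after a test migration is more telling than a single reading.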

More information about the Users mailing list