[ClusterLabs] Antw: Re: Antw: [EXT] Node fenced for unknown reason

Steffen Vinther Sørensen svinther at gmail.com
Fri Apr 16 02:09:06 EDT 2021


On Fri, Apr 16, 2021 at 6:56 AM Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>
> On 15.04.2021 23:09, Steffen Vinther Sørensen wrote:
> > On Thu, Apr 15, 2021 at 3:39 PM Klaus Wenninger <kwenning at redhat.com> wrote:
> >>
> >> On 4/15/21 3:26 PM, Ulrich Windl wrote:
> >>>>>> Steffen Vinther Sørensen <svinther at gmail.com> wrote on 15.04.2021 at
> >>> 14:56 in message
> >>> <CALhdMBiXZoYF-Gxg82oNT4MGFm6Q-_imCeUVHyPgWKy41JjFSg at mail.gmail.com>:
> >>>> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
> >>>> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >>>>>>>> Steffen Vinther Sørensen <svinther at gmail.com> wrote on 15.04.2021 at
> >>>>> 13:10 in message
> >>>>> <CALhdMBhMQRwmgoWEWuiGMDr7HfVOTTKvW8=NQMs2P2e9p8y9Jw at mail.gmail.com>:
> >>>>>> Hi there,
> >>>>>>
> >>>>>> In this 3-node cluster, node03 has been offline for a while and is
> >>>>>> being brought back into service. Then a migration of a VirtualDomain
> >>>>>> is attempted, and node02 is fenced.
> >>>>>>
> >>>>>> Provided are logs from 2 of the nodes, the 'pcs config', and a
> >>>>>> bzcatted pe-warn. Does anyone have an idea why the node was fenced?
> >>>>>> Is it because of the failed ipmi monitor warning?
> >>>>> After a short glance it looks as if the network traffic used for VM
> >>> migration
> >>>>> killed the corosync (or other) communication.
> >>>>>
> >>>> May I ask what part is making you think so?
> >>> The fact that I saw no reason for an intentional fencing.
> >> And it looks like node02 is being cut off from all
> >> network communication - both corosync & ipmi.
> >> It may really be the networking load, although I would
> >> rather bet on something more systematic like a
> >> MAC/IP conflict with the VM or something similar.
> >> I see you have libvirtd under cluster control.
> >> Maybe bringing up the network topology breaks the
> >> connection between the nodes.
> >> Has the cluster been working with the 3 nodes before?
> >>
> >>
> >> Klaus
> >
> > Hi Klaus
> >
> > Yes, it has been working before with all 3 nodes and migrations back
> > and forth, but a few more VirtualDomains have been deployed since the
> > last migration test.
> >
> > It happens very fast, almost immediately after the migration starts.
> > Could it be that some timeout values should be adjusted?
> > I just don't have any idea where to start looking, as to me there is
> > nothing obviously suspicious in the logs.
> >
>
>
> I would look at performance stats; maybe node02 was overloaded and
> could not answer in time. Although standard sar stats are collected
> every 15 minutes, which is usually too coarse for this.
>
> Migration could stress the network. Talk with your network support: any
> errors around this time?

Checking e-mails and syslogs from the network equipment, I see no
network errors around that time.
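
On the hosts themselves, the NIC and bond error counters can be checked
with something like this (a sketch only; bond0 and eno1 are placeholder
interface names, adjust to the actual naming):

    # error/drop counters for the bond as the kernel sees it
    ip -s link show bond0

    # per-slave counters from the NIC driver (repeat for each slave)
    ethtool -S eno1 | grep -Ei 'err|drop|fifo'

    # bonding state and link-failure count per slave
    cat /proc/net/bonding/bond0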

Last night I tried to bring up node02, the one that was fenced earlier,
with 'pcs cluster start', and initiated a migration. The same thing
happened: node03 was fenced almost immediately.
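
For reference, this is roughly how the fence history and the corosync
membership can be inspected after such an event (standard pacemaker /
corosync tools, nothing specific to this cluster):

    # recent fencing actions, their targets and origins
    stonith_admin --history '*' --verbose

    # corosync ring/link status as seen from the local node
    corosync-cfgtool -s

    # current quorum and membership information
    corosync-quorumtool -s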

Then I brought node03 back up and left it for the night. This morning I
did several migrations successfully. So it might be something that
needs more time to come up, maybe the cluster-managed libvirtd network
components.
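
If the cluster-managed libvirtd (or its networks) really does need time
to settle, one option might be explicit constraints so the VirtualDomain
resources only run where libvirtd is already up - a sketch only, the
resource names below (libvirtd-clone, vm-guest01) are placeholders and
not taken from my 'pcs config':

    # start the VM only after libvirtd has started on that node
    pcs constraint order start libvirtd-clone then vm-guest01

    # and only run the VM where libvirtd is running
    pcs constraint colocation add vm-guest01 with libvirtd-clone INFINITY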


I have Prometheus scraping node_exporter on all 3 nodes, so I can dig
into the network traffic around the incidents. For the 2 failing
incidents, the traffic rises upon migration to a stable 250 Mb/s or
600 Mb/s for a couple of minutes.
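
The numbers come from node_exporter's interface counters, using roughly
this kind of query (bond0 and the Prometheus URL are placeholders for my
actual setup):

    # transmit rate over the bond in bit/s, averaged over 1m, per node
    curl -sG 'http://prometheus.example:9090/api/v1/query' \
      --data-urlencode \
      'query=rate(node_network_transmit_bytes_total{device="bond0"}[1m]) * 8'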

For successful migrations, the network traffic always goes up to
1000 Mb/s, which is the max for a single connection (the nodes have
4x1000 Mb NICs bonded), and there is otherwise very low traffic around
any of the incidents.
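
Related to my earlier question about timeouts: the token timeout that
corosync is actually running with can be read like this, and raising it
would go into the totem section of corosync.conf (10000 below is just an
example value, not something I have applied):

    # effective token timeout (ms) according to the running corosync
    corosync-cmapctl runtime.config.totem.token

    # where a higher value would go (sketch):
    #   /etc/corosync/corosync.conf
    #   totem {
    #       token: 10000
    #   }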

/Steffen


