[ClusterLabs] Antw: [EXT] Cluster breaks after pcs unstandby node

Steffen Vinther Sørensen svinther at gmail.com
Mon Jan 4 13:55:58 EST 2021


On Mon, Jan 4, 2021 at 4:22 PM Ulrich Windl
<Ulrich.Windl at rz.uni-regensburg.de> wrote:
>
> >>> Steffen Vinther Sørensen <svinther at gmail.com> wrote on 04.01.2021 at
> 16:08 in
> message
> <CALhdMBjMMHRF3ENE+=uHty7Lb9vku0o1a6+izpm3zpiM=rEPXA at mail.gmail.com>:
> > Hi all,
> > I am trying to stabilize a 3-node CentOS7 cluster for production
> > use, with VirtualDomain and GFS2 resources. However, the following
> > sequence ends up with node1 fenced and some VirtualDomains in FAILED
> > state.
> >
> > ------------------
> > pcs cluster standby node2
> > # everything is live migrated to the other 2 nodes
> >
> > pcs cluster stop node2
> > pcs cluster start node2
> > pcs cluster unstandby node2
> > # node2 becomes part of the cluster again; since resource
> > stickiness is >0, no resources are migrated at this point.
> >
> > # time of logs is 13:58:07
> > pcs cluster standby node3
> >
> > # node1 gets fenced after a short while
> >
> > # time of log 14:16:02 and repeats every 15 mins
> > node3 log ?
> > ------------------
> >
> >
> > I looked through the logs but I have no clue what is going wrong;
> > hoping someone may be able to provide a hint.
>
> Next time, also indicate which node is the DC; it helps to pick the right log ;-)
> Quite a lot of systemd and SSH connections IMHO...
> At 13:58:27 node1 (kvm03-node01.avigol-gcs.dk) seems gone.
> Thus (I guess):
> Jan 04 13:59:03 kvm03-node02 stonith-ng[28902]:   notice: Requesting that
> kvm03-node03.avigol-gcs.dk perform 'reboot' action targeting
> kvm03-node01.avigol-gcs.dk
> Jan 04 13:59:22 kvm03-node02 crmd[28906]:   notice: Peer
> kvm03-node01.avigol-gcs.dk was terminated (reboot) by
> kvm03-node03.avigol-gcs.dk on behalf of stonith-api.33494: OK
>
> Maybe your network was flooded during the migration of the VMs?:
> Jan 04 13:58:33 kvm03-node03 corosync[37794]:  [TOTEM ] Retransmit List: 1 2
>
> You can limit the number of simultaneous migrations, BTW.
>
> Jan 04 13:58:33 kvm03-node03 cib[37814]:   notice: Node
> kvm03-node01.avigol-gcs.dk state is now member
> Jan 04 13:58:33 kvm03-node03 cib[37814]:   notice: Node
> kvm03-node01.avigol-gcs.dk state is now lost
> Jan 04 13:58:33 kvm03-node03 crmd[37819]:  warning: No reason to expect node 1
> to be down
>
> The above explains fencing.
>
> Regards,
> Ulrich
>
>
> >
> > Please find attached
> >
> > output of 'pcs config'
> > logs from all 3 nodes
> > the bzcats of pe-error-24.bz2 and pe-error-25.bz2 from node3
>
>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

Setting the cluster property migration-limit=2 seemed to help.
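For the archives, this is the change that helped. A sketch of setting and verifying the property with pcs (exact subcommand output varies between pcs versions; run against a live cluster):

```shell
# Cap the number of simultaneous live migrations cluster-wide.
# migration-limit is a standard Pacemaker cluster property; the
# default (-1) allows unlimited parallel migrations per node.
pcs property set migration-limit=2

# Verify the property is now set:
pcs property show migration-limit
```

With the limit in place, putting a node into standby drains its VMs two at a time instead of all at once, which avoids saturating the network and losing corosync traffic.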

Thank you for the advice, and sorry for the excessive logs.
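For anyone reading this later: the DC (Designated Controller) that Ulrich asked about can be identified without digging through logs, e.g.:

```shell
# Ask the controller directly which node is the DC:
crmadmin -D

# The status summary also names it near the top:
pcs status | grep "Current DC"
```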

/Steffen

