[ClusterLabs] Antw: [EXT] Cluster breaks after pcs unstandby node
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Jan 4 10:22:47 EST 2021
>>> Steffen Vinther Sørensen <svinther at gmail.com> wrote on 04.01.2021 at
16:08 in message
<CALhdMBjMMHRF3ENE+=uHty7Lb9vku0o1a6+izpm3zpiM=rEPXA at mail.gmail.com>:
> Hi all,
> I am trying to stabilize a 3-node CentOS7 cluster for production use,
> with VirtualDomain and GFS2 resources. However, the following use case
> ends up with node1 fenced and some VirtualDomain resources in FAILED
> state.
>
> ------------------
> pcs standby node2
> # everything is live migrated to the other 2 nodes
>
> pcs stop node2
> pcs start node2
> pcs unstandby node2
> # node2 becomes part of the cluster again; since resource stickiness
> # is >0, no resources are migrated at this point.
>
> # time of logs is 13:58:07
> pcs standby node3
>
> # node1 gets fenced after a short while
>
> # time of log entry 14:16:02, repeating every 15 mins (node3 log?)
> ------------------
>
>
> I looked through the logs but have no clue what is going wrong;
> hopefully someone can provide a hint.
Next time, also indicate which node is the DC; it helps to pick the right log ;-)
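For example, the "Current DC" line in the cluster status output shows it:
  pcs status | grep "Current DC"
  crm_mon -1 | grep "Current DC"   # equivalent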
Quite a lot of systemd and SSH connections IMHO...
At 13:58:27 node1 (kvm03-node01.avigol-gcs.dk) seems gone.
Thus (I guess):
Jan 04 13:59:03 kvm03-node02 stonith-ng[28902]: notice: Requesting that
kvm03-node03.avigol-gcs.dk perform 'reboot' action targeting
kvm03-node01.avigol-gcs.dk
Jan 04 13:59:22 kvm03-node02 crmd[28906]: notice: Peer
kvm03-node01.avigol-gcs.dk was terminated (reboot) by
kvm03-node03.avigol-gcs.dk on behalf of stonith-api.33494: OK
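If your Pacemaker version supports it, the fence history can also be
checked directly, e.g.:
  stonith_admin --history '*'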
Maybe your network was flooded during the migration of the VMs?
Jan 04 13:58:33 kvm03-node03 corosync[37794]: [TOTEM ] Retransmit List: 1 2
You can limit the number of simultaneous migrations, BTW.
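For example via the migration-limit cluster property (the value below is
just an illustration; it caps parallel live migrations per node):
  pcs property set migration-limit=2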
Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node
kvm03-node01.avigol-gcs.dk state is now member
Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node
kvm03-node01.avigol-gcs.dk state is now lost
Jan 04 13:58:33 kvm03-node03 crmd[37819]: warning: No reason to expect node 1
to be down
The above explains fencing.
Regards,
Ulrich
>
> Please find attached:
>
> output of 'pcs config'
> logs from all 3 nodes
> the bzcat output of pe-error-24.bz2 and pe-error-25.bz2 from node3