[ClusterLabs] Antw: [EXT] Cluster breaks after pcs unstandby node
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Jan 14 02:26:38 EST 2021
Hi!
I'm using SLES, but I think your configuration is missing many colocations (IMHO,
every ordering should have a corresponding colocation).
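As an illustration only (the resource names below are guesses derived from your
logs, not your actual IDs), this is the kind of pairing I mean:

  pcs constraint order start dlm-clone then start sharedfs01-clone
  pcs constraint colocation add sharedfs01-clone with dlm-clone INFINITY

The ordering alone only says "start B after A"; the colocation additionally keeps
B on a node where A is actually running.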
From the logs of node1, this looks odd to me:
attrd[11024]: error: Connection to the CPG API failed: Library error (2)
After
systemd[1]: Unit pacemaker.service entered failed state.
it's expected that the node be fenced.
However, this is not fencing IMHO:
Jan 04 13:59:04 kvm03-node01 systemd-logind[5456]: Power key pressed.
Jan 04 13:59:04 kvm03-node01 systemd-logind[5456]: Powering Off...
The main question is what makes the cluster think the node is lost:
Jan 04 13:58:27 kvm03-node01 corosync[10995]: [TOTEM ] A processor failed,
forming new configuration.
Jan 04 13:58:27 kvm03-node02 corosync[28814]: [TOTEM ] A processor failed,
forming new configuration.
The answer seems to be node3:
Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: Initiating monitor
operation ipmi-fencing-node02_monitor_60000 on kvm03-node02.avigol-gcs.dk
Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: Initiating monitor
operation ipmi-fencing-node03_monitor_60000 on kvm03-node01.avigol-gcs.dk
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [TOTEM ] A new membership
(172.31.0.31:1044) was formed. Members
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0
received
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0
received
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0
received
Jan 04 13:58:27 kvm03-node03 corosync[37794]: [TOTEM ] A processor failed,
forming new configuration.
Before:
Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node
kvm03-node02.avigol-gcs.dk state is now lost
Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node
kvm03-node02.avigol-gcs.dk state is now lost
No idea why, but then:
Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node
kvm03-node02.avigol-gcs.dk state is now lost
Why "shutdown" and not "fencing"?
(A side note on "pe-input-497.bz2": you may want to limit the number of policy
engine files being kept; here I use 100 as the limit.)
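The standard Pacemaker properties for that are pe-input-series-max,
pe-warn-series-max and pe-error-series-max; with pcs that should look roughly
like this (double-check the syntax, I'm not using pcs here):

  pcs property set pe-input-series-max=100
  pcs property set pe-warn-series-max=100
  pcs property set pe-error-series-max=100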
Node2 then seems to have rejoined before being fenced:
Jan 04 13:57:21 kvm03-node03 crmd[37819]: notice: State transition S_IDLE ->
S_POLICY_ENGINE
Node3 seems unavailable, moving resources to node2:
Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: State transition S_IDLE ->
S_POLICY_ENGINE
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Move ipmi-fencing-node02 ( kvm03-node03.avigol-gcs.dk -> kvm03-node02.avigol-gcs.dk )
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Move ipmi-fencing-node03 ( kvm03-node03.avigol-gcs.dk -> kvm03-node01.avigol-gcs.dk )
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Stop dlm:2 ( kvm03-node03.avigol-gcs.dk ) due to node availability
Then node1 seems gone:
Jan 04 13:58:27 kvm03-node03 corosync[37794]: [TOTEM ] A processor failed,
forming new configuration.
Then suddenly node1 is there again:
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Stonith/shutdown of
kvm03-node01.avigol-gcs.dk not matched
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Transition aborted: Node
failure
Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node
kvm03-node01.avigol-gcs.dk state is now member
Jan 04 13:58:33 kvm03-node03 attrd[37817]: notice: Node
kvm03-node01.avigol-gcs.dk state is now member
Jan 04 13:58:33 kvm03-node03 dlm_controld[39252]: 5452 cpg_mcast_joined retry
300 plock
Jan 04 13:58:33 kvm03-node03 stonith-ng[37815]: notice: Node
kvm03-node01.avigol-gcs.dk state is now member
And it's lost again:
Jan 04 13:58:33 kvm03-node03 attrd[37817]: notice: Node
kvm03-node01.avigol-gcs.dk state is now lost
Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node
kvm03-node01.avigol-gcs.dk state is now lost
Jan 04 13:58:33 kvm03-node03 crmd[37819]: warning: No reason to expect node 1
to be down
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Stonith/shutdown of
kvm03-node01.avigol-gcs.dk not matched
Then it seems only node1 can fence node1, but communication with node1 is
lost:
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node02
can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node03
can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node01
can fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node02
can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node03
can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node01
can fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
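The "static-list" here comes from pcmk_host_list, i.e. each ipmi-fencing-nodeX
device is only allowed to fence its own node. That is fine in itself, but it is
worth checking where those fence devices are allowed to run; a common setup is to
keep each device away from the node it fences, e.g. (a sketch only, with device
and node names taken from your logs; whether such constraints already exist in
your config I cannot tell):

  pcs stonith show ipmi-fencing-node01
  pcs constraint location ipmi-fencing-node01 avoids kvm03-node01.avigol-gcs.dk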
No surprise then:
Jan 04 13:59:22 kvm03-node03 VirtualDomain(highk32)[25015]: ERROR: highk32:
live migration to kvm03-node02.avigol-gcs.dk failed: 1
At Jan 04 13:59:23 node3 seems down.
Jan 04 13:59:23 kvm03-node03 pengine[37818]: notice: * Stop dlm:1 ( kvm03-node03.avigol-gcs.dk ) due to node availability
This will trigger fencing of node3:
Jan 04 14:00:56 kvm03-node03 VirtualDomain(highk35)[27057]: ERROR: forced stop
failed
Jan 04 14:00:56 kvm03-node03 pengine[37818]: notice: * Stop dlm:1 ( kvm03-node03.avigol-gcs.dk ) due to node availability
At
Jan 04 14:00:58 kvm03-node03 Filesystem(sharedfs01)[27209]: INFO: Trying to
unmount /usr/local/sharedfs01
it seems there are VMs running that the cluster did not know about.
-- The virtual machine qemu-16-centos7-virt-builder-docker-demo01 with its leader PID 1029 has been shut down.
Jan 04 14:00:58 kvm03-node03 kernel: br40: port 7(vnet8) entered disabled
state
So I see multiple issues with this configuration.
I suggest starting with one VM configured and then running tests; if successful,
add one or more VMs and repeat the testing. If a test is not successful, find out
what went wrong, try to fix it, and repeat the test.
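A rough cycle for such a test, with your node names (depending on the pcs
version this is "pcs cluster standby" or "pcs node standby"):

  pcs cluster standby kvm03-node02.avigol-gcs.dk
  # verify the single VM live-migrates cleanly and check the logs
  pcs cluster unstandby kvm03-node02.avigol-gcs.dk
  # repeat for node3 and node1 before adding the next VM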
Sorry, I don't have a better answer for you.
Regards,
Ulrich
>>> Steffen Vinther Sørensen <svinther at gmail.com> wrote on 04.01.2021 at
16:08 in message
<CALhdMBjMMHRF3ENE+=uHty7Lb9vku0o1a6+izpm3zpiM=rEPXA at mail.gmail.com>:
> Hi all,
> I am trying to stabilize a 3-node CentOS7 cluster for production
> usage, with VirtualDomains and GFS2 resources. However, the following use
> case ends up with node1 fenced and some VirtualDomains in FAILED
> state.
>
> ------------------
> pcs standby node2
> # everything is live migrated to the other 2 nodes
>
> pcs stop node2
> pcs start node2
> pcs unstandby node2
> # node2 is becoming part of the cluster again, since resource
> stickiness is >0 no resources are migrated at this point.
>
> # time of logs is 13:58:07
> pcs standby node 3
>
> # node1 gets fenced after a short while
>
> # time of log 14:16:02 and repeats every 15 mins
> node3 log ?
> ------------------
>
>
> I looked through the logs but I got no clue what is going wrong,
> hoping someone may be able to provide a hint.
>
> Please find attached
>
> output of 'pcs config'
> logs from all 3 nodes
> the bzcats of pe-error-24.bz2 and pe-error-25.bz2 from node3