[ClusterLabs] Antw: [EXT] Cluster breaks after pcs unstandby node
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Jan 14 02:26:38 EST 2021
Hi!
I'm using SLES, but I think your configuration is missing many colocations (IMHO,
every ordering should have a corresponding colocation).
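As an illustration only (the resource names below are guesses derived from your
logs, not your actual IDs), this is the kind of pairing I mean:

  pcs constraint order start dlm-clone then start sharedfs01-clone
  pcs constraint colocation add sharedfs01-clone with dlm-clone INFINITY

The ordering alone only says "start B after A"; the colocation additionally keeps
B on a node where A is actually running.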
From the logs of node1, this looks odd to me:
attrd[11024]: error: Connection to the CPG API failed: Library error (2)
After
systemd[1]: Unit pacemaker.service entered failed state.
it's expected that the node be fenced.
However, this is not fencing IMHO:
Jan 04 13:59:04 kvm03-node01 systemd-logind[5456]: Power key pressed.
Jan 04 13:59:04 kvm03-node01 systemd-logind[5456]: Powering Off...
The main question is what makes the cluster think the node is lost:
Jan 04 13:58:27 kvm03-node01 corosync[10995]: [TOTEM ] A processor failed,
forming new configuration.
Jan 04 13:58:27 kvm03-node02 corosync[28814]: [TOTEM ] A processor failed,
forming new configuration.
The answer seems to be node3:
Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: Initiating monitor
operation ipmi-fencing-node02_monitor_60000 on kvm03-node02.avigol-gcs.dk
Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: Initiating monitor
operation ipmi-fencing-node03_monitor_60000 on kvm03-node01.avigol-gcs.dk
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [TOTEM ] A new membership
(172.31.0.31:1044) was formed. Members
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0
received
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0
received
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0
received
Jan 04 13:58:27 kvm03-node03 corosync[37794]: [TOTEM ] A processor failed,
forming new configuration.
Before:
Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node
kvm03-node02.avigol-gcs.dk state is now lost
Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node
kvm03-node02.avigol-gcs.dk state is now lost
No idea why, but then:
Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node
kvm03-node02.avigol-gcs.dk state is now lost
Why "shutdown" and not "fencing"?
(A side note on "pe-input-497.bz2": you may want to limit the number of policy
engine files being kept; here I use 100 as the limit.)
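The standard Pacemaker properties for that are pe-input-series-max,
pe-warn-series-max and pe-error-series-max; with pcs that should look roughly
like this (double-check the syntax, I'm not using pcs here):

  pcs property set pe-input-series-max=100
  pcs property set pe-warn-series-max=100
  pcs property set pe-error-series-max=100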
Node2 then seems to have rejoined before being fenced:
Jan 04 13:57:21 kvm03-node03 crmd[37819]: notice: State transition S_IDLE ->
S_POLICY_ENGINE
Node3 seems unavailable, moving resources to node2:
Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: State transition S_IDLE ->
S_POLICY_ENGINE
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Move ipmi-fencing-node02 ( kvm03-node03.avigol-gcs.dk -> kvm03-node02.avigol-gcs.dk )
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Move ipmi-fencing-node03 ( kvm03-node03.avigol-gcs.dk -> kvm03-node01.avigol-gcs.dk )
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Stop dlm:2 ( kvm03-node03.avigol-gcs.dk ) due to node availability
Then node1 seems gone:
Jan 04 13:58:27 kvm03-node03 corosync[37794]: [TOTEM ] A processor failed,
forming new configuration.
Then suddenly node1 is there again:
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Stonith/shutdown of
kvm03-node01.avigol-gcs.dk not matched
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Transition aborted: Node
failure
Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node
kvm03-node01.avigol-gcs.dk state is now member
Jan 04 13:58:33 kvm03-node03 attrd[37817]: notice: Node
kvm03-node01.avigol-gcs.dk state is now member
Jan 04 13:58:33 kvm03-node03 dlm_controld[39252]: 5452 cpg_mcast_joined retry
300 plock
Jan 04 13:58:33 kvm03-node03 stonith-ng[37815]: notice: Node
kvm03-node01.avigol-gcs.dk state is now member
And it's lost again:
Jan 04 13:58:33 kvm03-node03 attrd[37817]: notice: Node
kvm03-node01.avigol-gcs.dk state is now lost
Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node
kvm03-node01.avigol-gcs.dk state is now lost
Jan 04 13:58:33 kvm03-node03 crmd[37819]: warning: No reason to expect node 1
to be down
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Stonith/shutdown of
kvm03-node01.avigol-gcs.dk not matched
Then it seems only node1 can fence node1, but communication with node1 is
lost:
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node02
can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node03
can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node01
can fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node02
can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node03
can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node01
can fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
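The "static-list" here comes from pcmk_host_list, i.e. each ipmi-fencing-nodeX
device is only allowed to fence its own node. That is fine in itself, but it is
worth checking where those fence devices are allowed to run; a common setup is to
keep each device away from the node it fences, e.g. (a sketch only, with device
and node names taken from your logs; whether such constraints already exist in
your config I cannot tell):

  pcs stonith show ipmi-fencing-node01
  pcs constraint location ipmi-fencing-node01 avoids kvm03-node01.avigol-gcs.dk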
No surprise then:
Jan 04 13:59:22 kvm03-node03 VirtualDomain(highk32)[25015]: ERROR: highk32:
live migration to kvm03-node02.avigol-gcs.dk failed: 1
At Jan 04 13:59:23 node3 seems down.
Jan 04 13:59:23 kvm03-node03 pengine[37818]: notice: * Stop dlm:1 ( kvm03-node03.avigol-gcs.dk ) due to node availability
This will trigger fencing of node3:
Jan 04 14:00:56 kvm03-node03 VirtualDomain(highk35)[27057]: ERROR: forced stop
failed
Jan 04 14:00:56 kvm03-node03 pengine[37818]: notice: * Stop dlm:1 ( kvm03-node03.avigol-gcs.dk ) due to node availability
At
Jan 04 14:00:58 kvm03-node03 Filesystem(sharedfs01)[27209]: INFO: Trying to
unmount /usr/local/sharedfs01
it seems there are VMs running that the cluster did not know about.
-- The virtual machine qemu-16-centos7-virt-builder-docker-demo01 with its leader PID 1029 has been shut down.
Jan 04 14:00:58 kvm03-node03 kernel: br40: port 7(vnet8) entered disabled
state
So I see multiple issues with this configuration.
I suggest starting with one VM configured and then running tests; if successful,
add one or more VMs and repeat the testing. If a test is not successful, find out
what went wrong, try to fix it, and repeat the test.
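A rough cycle for such a test, with your node names (depending on the pcs
version this is "pcs cluster standby" or "pcs node standby"):

  pcs cluster standby kvm03-node02.avigol-gcs.dk
  # verify the single VM live-migrates cleanly and check the logs
  pcs cluster unstandby kvm03-node02.avigol-gcs.dk
  # repeat for node3 and node1 before adding the next VM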
Sorry, I don't have a better answer for you.
Regards,
Ulrich
>>> Steffen Vinther Sørensen <svinther at gmail.com> wrote on 04.01.2021 at
16:08 in message
<CALhdMBjMMHRF3ENE+=uHty7Lb9vku0o1a6+izpm3zpiM=rEPXA at mail.gmail.com>:
> Hi all,
> I am trying to stabilize a 3-node CentOS7 cluster for production
> usage, with VirtualDomains and GFS2 resources. However, the following use
> case ends up with node1 fenced and some VirtualDomains in FAILED
> state.
>
> ------------------
> pcs standby node2
> # everything is live migrated to the other 2 nodes
>
> pcs stop node2
> pcs start node2
> pcs unstandby node2
> # node2 is becoming part of the cluster again, since resource
> stickiness is >0 no resources are migrated at this point.
>
> # time of logs is 13:58:07
> pcs standby node 3
>
> # node1 gets fenced after a short while
>
> # time of log 14:16:02 and repeats every 15 mins
> node3 log ?
> ------------------
>
>
> I looked through the logs but I got no clue what is going wrong,
> hoping someone may be able to provide a hint.
>
> Please find attached
>
> output of 'pcs config'
> logs from all 3 nodes
> the bzcats of pe-error-24.bz2 and pe-error-25.bz2 from node3