[ClusterLabs] The 2 servers of the cluster randomly reboot almost together

Sebastien BASTARD sebastien at domalys.com
Thu Feb 17 04:37:54 EST 2022


Hello Corosync team!

We currently have a Proxmox cluster with 2 servers (at different providers,
in different cities) and another server, in our company, running a qdevice.

Schematic:

(A) Proxmox Server A (Provider One) ---------- (B) Proxmox Server B (Provider Two)
                 |                                              |
                 \----------------------------------------------/
                                        |
                 (C) Qdevice on a Debian server (in our company)


Between each server, we have approximately 50 ms of latency.

Between servers A and B, each virtual server is synchronized every 5
minutes, so if one server stops working, the second server starts the same
virtual server.

We don't need high availability; we can wait 5 minutes without services.
After this delay, the virtual machine must start on the other server if the
first server no longer works.

With the corosync default configuration, fencing occurred on the servers
randomly (on average every 4-5 days), so we modified the configuration as
follows (the lines in bold are our modification):

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: serverA
    nodeid: 1
    quorum_votes: 1
    ring0_addr: xx.xx.xx.xx
  }
  node {
    name: serverB
    nodeid: 3
    quorum_votes: 1
    ring0_addr: xx.xx.xx.xx
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: xx.xx.xx.xx
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster
  config_version: 24
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  *token_retransmits_before_loss_const: 40*
  *token: 30000*

}
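
(For reference: if I read the corosync.conf(5) man page correctly, the
consensus timeout defaults to 1.2 * token when it is not set explicitly,
which matches the << waiting 36000ms for consensus >> in the logs below:
30000 ms * 1.2 = 36000 ms. Written out explicitly, the totem timeouts we
are effectively running would look like this; just a sketch, these are the
values corosync already computes from our settings:)

totem {
  token: 30000
  # implicit default: 1.2 * token = 36000 ms, as reported in the logs
  consensus: 36000
}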



With this configuration, the servers still get fenced, but on average every
15 days.

Our current problem is that when fencing occurs on one server, the second
server shows the same behaviour some minutes later... and every time.

I tested the cluster by cutting the power of server A, and everything
worked fine: server B started server A's virtual machines.

But in real life, when a server can't talk to the other main server, it
seems that both servers believe they are isolated from the others.

So, after a lot of tests, I don't know the best way to get a cluster that
works correctly.

Currently, the cluster stops working more often than the servers have a
real problem.

Maybe my configuration is not right, or something else?

So, I need your help =)

*Here are the kernel logs from the reboot of server A (output of the
command << cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):*

...
Feb 16 09:55:00 serverA corosync[2762]:   [KNET  ] host: host: 3 (passive)
best link: 0 (pri: 1)
Feb 16 09:55:00 serverA corosync[2762]:   [KNET  ] host: host: 3 has no
active links
Feb 16 09:55:22 serverA corosync[2762]:   [TOTEM ] Token has not been
received in 22500 ms
Feb 16 09:55:30 serverA corosync[2762]:   [TOTEM ] A processor failed,
forming new configuration: token timed out (30000ms), waiting 36000ms for
consensus.
Feb 16 09:55:38 serverA corosync[2762]:   [KNET  ] rx: host: 3 link: 0 is up
Feb 16 09:55:38 serverA corosync[2762]:   [KNET  ] host: host: 3 (passive)
best link: 0 (pri: 1)
Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired -
disable watchdog updates
*Reboot*
....


*Here are the kernel logs from the reboot of server B (output of the
command << cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):*

Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is
down
Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)
best link: 0 (pri: 1)
Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] host: host: 1 has no
active links
Feb 16 09:48:57 serverB corosync[2728]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 16 09:48:57 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)
best link: 0 (pri: 1)
Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is
down
Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)
best link: 0 (pri: 1)
Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] host: host: 1 has no
active links
Feb 16 09:54:12 serverB corosync[2728]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 16 09:54:12 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)
best link: 0 (pri: 1)
Feb 16 09:55:22 serverB corosync[2728]:   [TOTEM ] Token has not been
received in 22500 ms
Feb 16 09:55:30 serverB corosync[2728]:   [TOTEM ] A processor failed,
forming new configuration: token timed out (30000ms), waiting 36000ms for
consensus.
Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is
down
Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)
best link: 0 (pri: 1)
Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] host: host: 1 has no
active links
Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired -
disable watchdog updates
*Reboot*


Do you have an idea why, when fencing occurs on one server, the other
server shows the same behaviour?

Thanks for your help.

Best regards.

Seb.
