[ClusterLabs] The 2 servers of the cluster randomly reboot almost together

Strahil Nikolov hunter86_bg at yahoo.com
Thu Feb 17 17:02:51 EST 2022


Token timeout -> network issue ?
Just run a continuous ping (with timestamps) and log it to a file (from each host to the other host + the qdevice IP).
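A minimal sketch of such a logger (the host IPs and log paths below are placeholders; `ping -O` is the iputils flag that reports missed replies as they happen):

```shell
# log_ping: continuously ping a host and append every output line to a log
# file, prefixed with an ISO timestamp. IPs and paths are placeholders.
log_ping() {
  ping -O "$1" 2>&1 | while IFS= read -r line; do
    printf '%s %s\n' "$(date -Is)" "$line"
  done >> "$2"
}

# On serverA, for example:
# log_ping 192.0.2.20 /var/log/ping-serverB.log &
# log_ping 192.0.2.30 /var/log/ping-qdevice.log &
```

Gaps or "no answer yet" lines in these files at the moment of a fence would point to the network rather than to corosync itself.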
Best Regards,
Strahil Nikolov
 
 
On Thu, Feb 17, 2022 at 11:38, Sebastien BASTARD <sebastien at domalys.com> wrote:

Hello Corosync team!
We currently have a Proxmox cluster with two servers (at different providers, in different cities) and a third server, in our company, running the qdevice.

   Schematic:    (A) Proxmox Server A (Provider One) ---------------------- (B) Proxmox Server B (Provider Two)
                 |                                                          |
                 \----------------------------------------------------------/
                                               |
                  (C) Qdevice on Debian server (in the company)  
Between any two of these servers we have approximately 50 ms of latency.

Between servers A and B, each virtual server is synchronized every 5 minutes, so if one server stops working, the second server starts the same virtual server.
We don't need high availability: we can wait 5 minutes without services. After this delay, the virtual machine must start on the other server if the first one no longer works.
With the corosync default configuration, fencing occurred on the servers randomly (on average every 4-5 days), so we modified the configuration with this (the last two totem settings are our modification):


logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: serverA
    nodeid: 1
    quorum_votes: 1
    ring0_addr: xx.xx.xx.xx
  }
  node {
    name: serverB
    nodeid: 3
    quorum_votes: 1
    ring0_addr: xx.xx.xx.xx
  }
}
quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: xx.xx.xx.xx
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}
totem {
  cluster_name: cluster
  config_version: 24
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token_retransmits_before_loss_const: 40
  token: 30000
}
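For what it's worth, the timings in the logs below follow from these settings: corosync derives the consensus timeout as 1.2 × token when it is not set explicitly (per corosync.conf(5)), and the "Token has not been received" warning fires by default at 75% of the token timeout:

```shell
# Timeouts derived from totem.token above (all values in ms).
token=30000
consensus=$(( token * 12 / 10 ))   # default: 1.2 * token -> 36000
warn=$(( token * 75 / 100 ))       # default warning threshold: 0.75 * token -> 22500
echo "token=${token} consensus=${consensus} warn=${warn}"
```

These are exactly the 30000 ms, 36000 ms, and 22500 ms values that appear in the log excerpts below.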


With this configuration the servers still get fenced, but on average every 15 days.
Our current problem is that when fencing occurs on one server, the other server shows the same behaviour a few minutes later... and it happens every time.
I tested the cluster by cutting the power of server A, and everything worked fine: server B started server A's virtual machines.
But in real life, when one server can't talk to the other, it seems that both servers believe they are isolated from the rest.
So, after a lot of tests, I don't know the best way to build a cluster that works correctly.

Currently, the cluster stops working far more often than the servers have a real problem.
Maybe my configuration is wrong, or something else?
So, I need your help =)
Here are the daemon logs around the reboot of server A (output of << grep -E 'watchdog|corosync' /var/log/daemon.log >>):
...
Feb 16 09:55:00 serverA corosync[2762]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 09:55:00 serverA corosync[2762]:   [KNET  ] host: host: 3 has no active links
Feb 16 09:55:22 serverA corosync[2762]:   [TOTEM ] Token has not been received in 22500 ms 
Feb 16 09:55:30 serverA corosync[2762]:   [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Feb 16 09:55:38 serverA corosync[2762]:   [KNET  ] rx: host: 3 link: 0 is up
Feb 16 09:55:38 serverA corosync[2762]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired - disable watchdog updates
Reboot....


Here are the daemon logs around the reboot of server B (output of << grep -E 'watchdog|corosync' /var/log/daemon.log >>):

Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is down
Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] host: host: 1 has no active links
Feb 16 09:48:57 serverB corosync[2728]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 16 09:48:57 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is down
Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] host: host: 1 has no active links
Feb 16 09:54:12 serverB corosync[2728]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 16 09:54:12 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:55:22 serverB corosync[2728]:   [TOTEM ] Token has not been received in 22500 ms 
Feb 16 09:55:30 serverB corosync[2728]:   [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is down
Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] host: host: 1 has no active links
Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired - disable watchdog updates
Reboot
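To correlate the two hosts' events, one option is to interleave copies of both daemon logs chronologically (a sketch; the file names are placeholders for local copies, and the month sort relies on GNU sort's -M):

```shell
# merge_logs: chronologically interleave corosync/watchdog lines from the
# given daemon.log copies (sort keys: month name, day of month, time).
merge_logs() {
  grep -hE 'watchdog|corosync' "$@" | sort -k1,1M -k2,2n -k3,3
}

# e.g.: merge_logs serverA-daemon.log serverB-daemon.log | less
```

In the excerpts above this makes the pattern easy to see: server B's link to server A flaps repeatedly before 09:55, then both hosts lose the token at 09:55:22 and both watchdogs expire at 09:55:55.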


Do you have an idea why, when fencing occurs on one server, the other server shows the same behavior?
Thanks for your help.

Best regards.  

Seb.
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
  