[ClusterLabs] Antw: [EXT] The 2 servers of the cluster randomly reboot almost together

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Feb 17 06:21:50 EST 2022


Hi!

It seems your problem is the network. Maybe check the connectivity between all nodes (and quorum device).
Some time ago I wrote a simple script that can log ups and downs (you'll have to adjust it for non-LAN traffic); maybe it helps:
----
#!/bin/bash
# Test host status (up/down) via ping (ICMP echo)
#$Id: up-down-test.sh,v 1.2 2018/03/07 15:17:32 windl Exp $

# Written for SLES 11 SP3 by Ulrich Windl
TESTHOST="${1:-localhost}"   # host to ping
SDELAY="${2:-300}"           # seconds between checks
IFACE_OPT="${3:+-I$3}"       # optional: interface to ping from (-I <iface>)
STATE=0                      # last known state (0 = down, 1 = up)
WHEN=$(date +%s)             # when that state was entered

# add time stamp to message and echo it
log_time()
{
    typeset t="$1"; shift
    echo "$@ $t ($(date -d@"$t" -u +%F_%T))"
}

trap 'log_time $(date +%s) "---EXIT"' EXIT
log_time $(date +%s) "---START"
# check reachability every $SDELAY seconds, forever
while sleep "$SDELAY"
do
    # three quick pings; adjust count/interval for non-LAN links
    if ping -c3 -i0.33 $IFACE_OPT -n -q "$TESTHOST" >/dev/null; then
        _STATE=1
    else
        _STATE=0
    fi
    if [ $STATE -ne $_STATE ]; then
        # state changed: log the old state and how long it lasted
        _WHEN=$(date +%s)
        ((DELTA = $_WHEN - $WHEN))
        log_time $_WHEN "$STATE ($DELTA) -> $_STATE"
        STATE="$_STATE"
        WHEN="$_WHEN"
    fi
done
------

The script takes up to three parameters: host_to_test delay_between_checks_in_seconds [interface_to_use]
Without parameters it checks localhost every 5 minutes.
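For example, to watch both the peer node and the qdevice host from one cluster node (the addresses, interface and log paths below are placeholders; adjust them to your setup):
----
# run in the background on serverA; 192.0.2.11, 192.0.2.12 and eth0 are
# placeholders for the peer node, the qdevice host and the local interface
./up-down-test.sh 192.0.2.11 60 eth0 >> /var/log/up-down-peer.log 2>&1 &
./up-down-test.sh 192.0.2.12 60 eth0 >> /var/log/up-down-qdevice.log 2>&1 &
----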

Obviously your cluster cannot have higher availability than your network.
First you need to get an impression of how reliable your network is. Then, maybe, tune the cluster parameters.
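As a starting point you can also ask corosync itself how it sees the links and the quorum; these commands are from corosync 3.x (as shipped with Proxmox VE):
----
# show the knet link status as corosync sees it
corosync-cfgtool -s
# show quorum membership, votes and qdevice state
corosync-quorumtool -s
----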

Regards,
Ulrich

>>> Sebastien BASTARD <sebastien at domalys.com> wrote on 17.02.2022 at 10:37 in
message
<CAAjZqdwouBRBcEa83Yogi4S9mPppirEXc5T3apFmteBsK+66WQ at mail.gmail.com>:
> Hello Corosync team!
> 
> We currently have a Proxmox cluster with 2 servers (at different providers,
> in different cities) and another server, in our company, running a qdevice.
> 
> Schematic:
> 
> (A) Proxmox Server A (Provider One) ---------- (B) Proxmox Server B (Provider Two)
>                  |                                              |
>                  \----------------------------------------------/
>                                        |
>                     (C) Qdevice on Debian server (in the company)
> 
> 
> Between each server, we have approximately 50 ms of latency.
> 
> Between servers A and B, each virtual server is synchronized every 5
> minutes, so if a server stops working, the second server starts the same
> virtual server.
> 
> We don't need High Availability. We can wait 5 minutes without services.
> After this delay, the virtual machine must start on the other server if the
> first server no longer works.
> 
> With the corosync default configuration, fencing occurred randomly on the
> servers (on average every 4-5 days), so we modified the configuration as
> follows (the two token lines at the end are our modification):
> 
> logging {
>   debug: off
>   to_syslog: yes
> }
> 
> nodelist {
>   node {
>     name: serverA
>     nodeid: 1
>     quorum_votes: 1
>     ring0_addr: xx.xx.xx.xx
>   }
>   node {
>     name: serverB
>     nodeid: 3
>     quorum_votes: 1
>     ring0_addr: xx.xx.xx.xx
>   }
> }
> 
> quorum {
>   device {
>     model: net
>     net {
>       algorithm: ffsplit
>       host: xx.xx.xx.xx
>       tls: on
>     }
>     votes: 1
>   }
>   provider: corosync_votequorum
> }
> 
> totem {
>   cluster_name: cluster
>   config_version: 24
>   interface {
>     linknumber: 0
>   }
>   ip_version: ipv4-6
>   link_mode: passive
>   secauth: on
>   version: 2
>   token_retransmits_before_loss_const: 40
>   token: 30000
> 
> }
> 
> 
> 
> With this configuration, the servers still get fenced, but on average every
> 15 days.
> 
> Our current problem is that when fencing occurs on one server, the second
> server shows the same behaviour a few minutes later ... and this every time.
> 
> I tested the cluster by cutting the power of server A, and everything
> worked fine. Server B started the virtual machines of server A.
> 
> But in real life, when a server can't talk to the other main server, it
> seems that both servers believe they are isolated from the others.
> 
> So, after a lot of tests, I don't know the best way to set up a
> cluster that works correctly.
> 
> Currently, the cluster stops working more often than the servers have a
> real problem.
> 
> Maybe my configuration is not good, or is it something else?
> 
> So, I need your help =)
> 
> Here are the daemon logs around the reboot of server A (output of the
> command << cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):
> 
> ...
> Feb 16 09:55:00 serverA corosync[2762]:   [KNET  ] host: host: 3 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:55:00 serverA corosync[2762]:   [KNET  ] host: host: 3 has no
> active links
> Feb 16 09:55:22 serverA corosync[2762]:   [TOTEM ] Token has not been
> received in 22500 ms
> Feb 16 09:55:30 serverA corosync[2762]:   [TOTEM ] A processor failed,
> forming new configuration: token timed out (30000ms), waiting 36000ms for
> consensus.
> Feb 16 09:55:38 serverA corosync[2762]:   [KNET  ] rx: host: 3 link: 0 is up
> Feb 16 09:55:38 serverA corosync[2762]:   [KNET  ] host: host: 3 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired -
> disable watchdog updates
> [Reboot]
> ....
> 
> 
> Here are the daemon logs around the reboot of server B (output of the
> command << cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):
> 
> Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is
> down
> Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] host: host: 1 has no
> active links
> Feb 16 09:48:57 serverB corosync[2728]:   [KNET  ] rx: host: 1 link: 0 is up
> Feb 16 09:48:57 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is
> down
> Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] host: host: 1 has no
> active links
> Feb 16 09:54:12 serverB corosync[2728]:   [KNET  ] rx: host: 1 link: 0 is up
> Feb 16 09:54:12 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:55:22 serverB corosync[2728]:   [TOTEM ] Token has not been
> received in 22500 ms
> Feb 16 09:55:30 serverB corosync[2728]:   [TOTEM ] A processor failed,
> forming new configuration: token timed out (30000ms), waiting 36000ms for
> consensus.
> Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is
> down
> Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] host: host: 1 has no
> active links
> Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired -
> disable watchdog updates
> [Reboot]
> 
> 
> Do you have an idea why, when fencing occurs on one server, the other
> server shows the same behavior?
> 
> Thanks for your help.
> 
> Best regards.
> 
> Seb.




