[ClusterLabs] Antw: [EXT] The 2 servers of the cluster randomly reboot almost together
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Feb 17 06:21:50 EST 2022
Hi!
It seems your problem is the network. Maybe check the connectivity between all nodes (and the quorum device).
Some time ago I wrote a simple script that can log ups and downs (you'll have to adjust it for non-LAN traffic); maybe it helps:
----
#!/bin/bash
# Test host status (up, down) via ping (ICMP echo)
#$Id: up-down-test.sh,v 1.2 2018/03/07 15:17:32 windl Exp $
# Written for SLES 11 SP3 by Ulrich Windl
TESTHOST="${1:-localhost}"      # host to probe
SDELAY="${2:-300}"              # seconds to wait between probes
IFACE_OPT="${3:+-I$3}"          # optional interface, expands to "-I<iface>"
STATE=0                         # last known state: 0=down, 1=up
WHEN=$(date +%s)

# Add the epoch time plus a readable UTC stamp to the message and echo it
log_time()
{
    typeset t="$1"; shift
    echo "$@ $t ($(date -d@"$t" -u +%F_%T))"
}

trap 'log_time $(date +%s) "---EXIT"' EXIT
log_time $(date +%s) "---START"
while sleep "$SDELAY"
do
    # $IFACE_OPT is intentionally unquoted: it expands to nothing
    # when no interface was given
    if ping -c3 -i0.33 $IFACE_OPT -n -q "$TESTHOST" >/dev/null; then
        _STATE=1
    else
        _STATE=0
    fi
    if [ "$STATE" -ne "$_STATE" ]; then
        _WHEN=$(date +%s)
        ((DELTA = _WHEN - WHEN))    # seconds spent in the previous state
        log_time "$_WHEN" "$STATE ($DELTA) -> $_STATE"
        STATE="$_STATE"
        WHEN="$_WHEN"
    fi
done
----
The script expects up to three parameters: host_to_test delay_between_checks_in_seconds [interface_to_use]
Without parameters it checks localhost every 5 minutes (300 s).
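For example, a hypothetical invocation watching a qdevice host every 60 seconds (the address, interface, and log path are placeholders), and a small demonstration of the timestamp format log_time produces (GNU date assumed):

```shell
# Hypothetical invocation (address, interface, and log file are placeholders):
#   ./up-down-test.sh 192.0.2.10 60 eth0 >>/var/log/qnet-updown.log 2>&1 &
# Each state change carries the epoch time plus a readable UTC stamp,
# in the format log_time produces:
t=1645086950
echo "---START $t ($(date -d@"$t" -u +%F_%T))"
# -> ---START 1645086950 (2022-02-17_08:35:50)
```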
Obviously your cluster cannot have higher availability than your network.
First you need to get an impression of how reliable your network is. Then, maybe, tune the cluster parameters.
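To quantify the reliability, you can simply extract the loss percentage from a ping summary; a sketch (the summary line below is canned sample data in the format iputils ping prints, not a real measurement):

```shell
# Sketch: pull the packet-loss percentage out of a ping summary line so it
# can be logged or compared against a threshold. The sample line is canned
# data for illustration; in practice feed it from:
#   ping -c100 -i0.5 -n -q "$TESTHOST" | tail -2
summary='100 packets transmitted, 97 received, 3% packet loss, time 49599ms'
loss=$(printf '%s\n' "$summary" | sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p')
echo "$loss"   # -> 3
```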
Regards,
Ulrich
>>> Sebastien BASTARD <sebastien at domalys.com> wrote on 2022-02-17 at 10:37 in
message
<CAAjZqdwouBRBcEa83Yogi4S9mPppirEXc5T3apFmteBsK+66WQ at mail.gmail.com>:
> Hello Corosync team!
>
> We currently have a Proxmox cluster with 2 servers (at different providers
> in different cities) and another server, in our company, running a qdevice.
>
> Schematic:
>
> (A) Proxmox Server A (Provider One) ------------ (B) Proxmox Server B (Provider Two)
>          \                                              /
>           \--------------------------------------------/
>                                 |
>                (C) Qdevice on Debian server (in the company)
>
>
> Between each server, we have approximately 50 ms of latency.
>
> Between servers A and B, each virtual server is synchronized every 5
> minutes, so if one server stops working, the other server starts the same
> virtual server.
>
> We don't need high availability. We can wait 5 minutes without services.
> After this delay, the virtual machines must start on the other server if
> the first server no longer works.
>
> With the corosync default configuration, fencing occurred on the servers
> randomly (on average every 4-5 days), so we modified the configuration as
> follows (bold text is our modification):
>
> logging {
> debug: off
> to_syslog: yes
> }
>
> nodelist {
> node {
> name: serverA
> nodeid: 1
> quorum_votes: 1
> ring0_addr: xx.xx.xx.xx
> }
> node {
> name: serverB
> nodeid: 3
> quorum_votes: 1
> ring0_addr: xx.xx.xx.xx
> }
> }
>
> quorum {
> device {
> model: net
> net {
> algorithm: ffsplit
> host: xx.xx.xx.xx
> tls: on
> }
> votes: 1
> }
> provider: corosync_votequorum
> }
>
> totem {
> cluster_name: cluster
> config_version: 24
> interface {
> linknumber: 0
> }
> ip_version: ipv4-6
> link_mode: passive
> secauth: on
> version: 2
> *token_retransmits_before_loss_const: 40*
> *token: 30000*
>
> }
>
>
>
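Note that when consensus is not set explicitly, corosync derives it as 1.2 × token (see corosync.conf(5)), which is exactly the 36000 ms consensus wait that appears in the logs below:

```shell
# corosync.conf(5): consensus defaults to 1.2 * token when unset.
# With token: 30000 this yields the 36000 ms seen in the log messages:
token=30000
consensus=$((token * 12 / 10))
echo "$consensus"   # -> 36000
```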
> With this configuration, the servers still get fenced, but on average only
> every 15 days.
>
> Our current problem is that when fencing occurs on one server, the second
> server shows the same behaviour some minutes later... and every time.
>
> I tested the cluster by cutting the power of server A, and everything
> worked fine. Server B started the virtual machines of server A.
>
> But in real life, when a server can't talk to the other main server, it
> seems that both servers believe they are isolated from the other.
>
> So, after a lot of tests, I don't know the best way to build a cluster
> that works correctly...
>
> Currently, the cluster stops working more often than the servers have a
> real problem.
>
> Maybe my configuration is not right, or something else?
>
> So, I need your help =)
>
> *Here are the kernel logs from the reboot of server A (output of the
> command << grep -E 'watchdog|corosync' /var/log/daemon.log >>):*
>
> ...
> Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 has no
> active links
> Feb 16 09:55:22 serverA corosync[2762]: [TOTEM ] Token has not been
> received in 22500 ms
> Feb 16 09:55:30 serverA corosync[2762]: [TOTEM ] A processor failed,
> forming new configuration: token timed out (30000ms), waiting 36000ms for
> consensus.
> Feb 16 09:55:38 serverA corosync[2762]: [KNET ] rx: host: 3 link: 0 is up
> Feb 16 09:55:38 serverA corosync[2762]: [KNET ] host: host: 3 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired -
> disable watchdog updates
> *Reboot*
> ....
>
>
> *Here are the kernel logs from the reboot of server B (output of the
> command << grep -E 'watchdog|corosync' /var/log/daemon.log >>):*
>
> Feb 16 09:48:42 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is
> down
> Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 has no
> active links
> Feb 16 09:48:57 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
> Feb 16 09:48:57 serverB corosync[2728]: [KNET ] host: host: 1 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:53:56 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is
> down
> Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 has no
> active links
> Feb 16 09:54:12 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
> Feb 16 09:54:12 serverB corosync[2728]: [KNET ] host: host: 1 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:55:22 serverB corosync[2728]: [TOTEM ] Token has not been
> received in 22500 ms
> Feb 16 09:55:30 serverB corosync[2728]: [TOTEM ] A processor failed,
> forming new configuration: token timed out (30000ms), waiting 36000ms for
> consensus.
> Feb 16 09:55:35 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is
> down
> Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 (passive)
> best link: 0 (pri: 1)
> Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 has no
> active links
> Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired -
> disable watchdog updates
> *Reboot*
>
>
> Do you have an idea why, when fencing occurs on one server, the other
> server shows the same behavior?
>
> Thanks for your help.
>
> Best regards.
>
> Seb.