[ClusterLabs] Antw: [EXT] The 2 servers of the cluster randomly reboot almost together

Thu Feb 17 10:28:33 EST 2022

Thank you, Ulrich, for your script!

I launched it with a 10-second delay:

   - on Server A, to ping Server B
   - on Server B, to ping Server A
   - on the QDevice, to ping Server A and Server B

I currently can't ping the QDevice from Servers A and B, because it is behind a
firewall that only allows port 5403.
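
Since ICMP to the QDevice is blocked, a TCP connect test on port 5403 might stand in for ping in Ulrich's script. A minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils `timeout` command are available; the function name `tcp_up` and the host name `qdevice.example` are made up for illustration:

```sh
# tcp_up HOST PORT [TIMEOUT_SECONDS]
# Succeeds (exit status 0) if a TCP connection to HOST:PORT can be opened
# within the timeout (default: 3 seconds); fails otherwise.
tcp_up() {
    # HOST and PORT are passed to the inner bash as $0 and $1.
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example call (hypothetical host name):
#   tcp_up qdevice.example 5403 && echo up || echo down
```

This only shows that something accepts connections on the port; it does not check that the qdevice service itself is healthy.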

I will check the results tomorrow.
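
To read the results, the transition lines the script prints could be totalled per state. A rough sketch with awk, assuming the line format produced by `log_time` in the quoted script ("old_state (seconds) -> new_state epoch (UTC time)"); the function name `summarize_updown` and the log file name are hypothetical:

```sh
# summarize_updown [LOGFILE...]
# Totals the seconds spent in each state, as reported by the transition lines.
summarize_updown() {
    awk '
        # Transition lines look like: "<old> (<seconds>) -> <new> <epoch> (<utc>)"
        $3 == "->" {
            secs = $2
            gsub(/[()]/, "", secs)
            n++
            # The script starts with STATE=0 as a placeholder, so a first
            # transition out of state 0 is not counted as a real outage.
            if (n == 1 && $1 == 0) next
            if ($1 == 0) { down += secs; outages++ }
            else         { up += secs }
        }
        END { printf "outages=%d down=%ds up=%ds\n", outages, down, up }
    ' "$@"
}

# Example call (hypothetical file name):
#   summarize_updown up-down-serverB.log
```

Because the script only checks every SDELAY seconds, the totals are accurate to about one check interval at best.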

Best regards.

On Thu, 17 Feb 2022 at 12:22, Ulrich Windl <
Ulrich.Windl at rz.uni-regensburg.de> wrote:

> Hi!
>
> It seems your problem is the network. Maybe check the connectivity between
> all nodes (and quorum device).
> Some time ago I wrote a simple script that can log ups and downs (you'll
> have to adjust it for non-LAN traffic); maybe it helps:
> ----
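> #!/bin/bash
> # Test Host Status (Up, Down) via ping (ICMP Echo)
> #$Id: up-down-test.sh,v 1.2 2018/03/07 15:17:32 windl Exp $
>
> # Written for SLES 11 SP3 by Ulrich Windl
> TESTHOST="${1:-localhost}"   # host to ping (default: localhost)
> SDELAY="${2:-300}"           # seconds between checks (default: 300)
> IFACE_OPT="${3:+-I$3}"       # optional "-I<interface>" option for ping
> STATE=0                      # last known state: 0 = down, 1 = up
> WHEN=$(date +%s)             # epoch time of the last state change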
>
> # add time stamp to message and echo it
> log_time()
> {
>     typeset t="$1"; shift
>     echo "$@ $t ($(date -d@"$t" -u +%F_%T))"
> }
>
> trap 'log_time $(date +%s) "---EXIT"' EXIT
> log_time $(date +%s) "---START"
> while sleep "$SDELAY"
> do
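>     # three quick pings; the host counts as up if at least one is answered
>     if ping -c3 -i0.33 $IFACE_OPT -n -q "$TESTHOST" >/dev/null; then
>         _STATE=1
>     else
>         _STATE=0
>     fi
>     # log only state changes, with the seconds spent in the previous state
>     if [ "$STATE" -ne "$_STATE" ]; then
>         _WHEN=$(date +%s)
>         ((DELTA = _WHEN - WHEN))
>         log_time "$_WHEN" "$STATE ($DELTA) -> $_STATE"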
>         STATE="$_STATE"
>         WHEN="$_WHEN"
>     fi
> done
> ------
>
> The script expects three parameters: host_to_test
> delay_between_checks_in_seconds [interface_to_use]
> Without parameters it checks localhost every 5 minutes.
>
> Obviously your cluster cannot have higher availability than your network.
> First you need to get an impression of how reliable your network is. Then,
> maybe, tune the cluster parameters.
>
> Regards,
> Ulrich
>
> >>> Sebastien BASTARD <sebastien at domalys.com> wrote on 17.02.2022 at
> 10:37 in message
> <CAAjZqdwouBRBcEa83Yogi4S9mPppirEXc5T3apFmteBsK+66WQ at mail.gmail.com>:
> > Hello Corosync team!
> >
> > We currently have a Proxmox cluster with 2 servers (at different providers
> > and in different cities) and another server, in our company, with a qdevice.
> >
> > Schematic:
> >
> > (A) Proxmox Server A (Provider One) ------ (B) Proxmox Server B (Provider Two)
> >                  |                                        |
> >                  \----------------------------------------/
> >                                       |
> >                 (C) Qdevice on Debian server (in the company)
> >
> >
> > Between each pair of servers, we have approximately 50 ms of latency.
> >
> > Between servers A and B, each virtual server is synchronized every 5
> > minutes, so if a server stops working, the second server starts the same
> > virtual server.
> >
> > We don't need High Availability. We can wait 5 minutes without services.
> > After this delay, the virtual machine must start on the other server if the
> > first server is no longer working.
> >
> > With the default corosync configuration, fencing occurred on the servers
> > randomly (on average every 4 to 5 days), so we modified the configuration
> > as follows (our modifications are marked with asterisks):
> >
> > logging {
> >   debug: off
> >   to_syslog: yes
> > }
> >
> > nodelist {
> >   node {
> >     name: serverA
> >     nodeid: 1
> >     quorum_votes: 1
> >     ring0_addr: xx.xx.xx.xx
> >   }
> >   node {
> >     name: serverB
> >     nodeid: 3
> >     quorum_votes: 1
> >     ring0_addr: xx.xx.xx.xx
> >   }
> > }
> >
> > quorum {
> >   device {
> >     model: net
> >     net {
> >       algorithm: ffsplit
> >       host: xx.xx.xx.xx
> >       tls: on
> >     }
> >     votes: 1
> >   }
> >   provider: corosync_votequorum
> > }
> >
> > totem {
> >   cluster_name: cluster
> >   config_version: 24
> >   interface {
> >     linknumber: 0
> >   }
> >   ip_version: ipv4-6
> >   link_mode: passive
> >   secauth: on
> >   version: 2
> >   *token_retransmits_before_loss_const: 40*
> >   *token: 30000*
> >
> > }
> >
> >
> >
> > With this configuration, the servers are still fenced, but on average only
> > every 15 days.
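
The timings reported in the logs further down follow from `token: 30000`. A small sketch of that arithmetic, assuming (as the logged values suggest, and as far as I understand the corosync defaults) that the "Token has not been received" warning is issued at 75% of the token timeout and that the consensus timeout defaults to 1.2 times the token timeout; the function name `totem_timings` is made up:

```sh
# totem_timings TOKEN_MS
# Prints the warning threshold, the consensus timeout, and their sum with the
# token timeout, all in milliseconds. The 75% and 1.2x factors are assumptions.
totem_timings() {
    token_ms=$1
    warn_ms=$(( token_ms * 75 / 100 ))
    consensus_ms=$(( token_ms * 12 / 10 ))
    echo "warn=${warn_ms} consensus=${consensus_ms} total=$(( token_ms + consensus_ms ))"
}

totem_timings 30000
```

For 30000 this gives warn=22500 and consensus=36000, matching the logged "22500 ms" and "36000ms", so token timeout plus consensus wait adds up to 66 seconds.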
> >
> > Our current problem is that when fencing occurs on one server, the second
> > server shows the same behaviour some minutes later, every time.
> >
> > I tested the cluster by cutting the power to server A, and everything worked
> > well: server B started the virtual machines of server A.
> >
> > But in real life, when one server can't talk to the other main server, it
> > seems that both servers believe they are isolated from each other.
> >
> > So, after a lot of tests, I don't know the best way to get a cluster that
> > works correctly.
> >
> > Currently, the cluster stops working more often than the servers have a
> > real problem.
> >
> > Maybe my configuration is wrong, or is it something else?
> >
> > So, I need your help =)
> >
> > *Here are the logs from the reboot of server A (output of the command
> > cat /var/log/daemon.log | grep -E 'watchdog|corosync'):*
> >
> > ...
> > Feb 16 09:55:00 serverA corosync[2762]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:00 serverA corosync[2762]:   [KNET  ] host: host: 3 has no active links
> > Feb 16 09:55:22 serverA corosync[2762]:   [TOTEM ] Token has not been received in 22500 ms
> > Feb 16 09:55:30 serverA corosync[2762]:   [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
> > Feb 16 09:55:38 serverA corosync[2762]:   [KNET  ] rx: host: 3 link: 0 is up
> > Feb 16 09:55:38 serverA corosync[2762]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired - disable watchdog updates
> > *Reboot*
> > ....
> >
> >
> > *Here are the logs from the reboot of server B (output of the command
> > cat /var/log/daemon.log | grep -E 'watchdog|corosync'):*
> >
> > Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is down
> > Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] host: host: 1 has no active links
> > Feb 16 09:48:57 serverB corosync[2728]:   [KNET  ] rx: host: 1 link: 0 is up
> > Feb 16 09:48:57 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is down
> > Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] host: host: 1 has no active links
> > Feb 16 09:54:12 serverB corosync[2728]:   [KNET  ] rx: host: 1 link: 0 is up
> > Feb 16 09:54:12 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:22 serverB corosync[2728]:   [TOTEM ] Token has not been received in 22500 ms
> > Feb 16 09:55:30 serverB corosync[2728]:   [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
> > Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is down
> > Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] host: host: 1 has no active links
> > Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired - disable watchdog updates
> > *Reboot*
> >
> >
> > Do you have any idea why, when fencing occurs on one server, the other
> > server shows the same behavior?
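
One thing the timestamps in both logs have in common can be checked with a little arithmetic: the token timeout is logged at 09:55:30 with a 36000 ms consensus wait, and watchdog-mux reports the expiry at 09:55:55. A sketch using only those logged values; the helper names `to_secs` and `watchdog_margin` are hypothetical, and this compares timestamps rather than diagnosing the cause:

```sh
# to_secs HH:MM:SS
# Converts a time of day to seconds since midnight. Leading zeros are
# stripped so that "09" is not read as an invalid octal number.
to_secs() (
    IFS=:
    set -- $1
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
)

# watchdog_margin TOKEN_TIMEOUT_AT CONSENSUS_MS WATCHDOG_AT
# Prints the seconds between the end of the consensus wait and the watchdog
# expiry; a negative value means the watchdog expiry was logged first.
watchdog_margin() {
    consensus_end=$(( $(to_secs "$1") + $2 / 1000 ))
    echo $(( $(to_secs "$3") - consensus_end ))
}

watchdog_margin 09:55:30 36000 09:55:55
```

With the logged values this prints -11: on both servers the watchdog expiry is logged 11 seconds before the 36-second consensus wait would have ended.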
> >
> > Thanks for your help.
> >
> > Best regards.
> >
> > Seb.

-- 

Sébastien BASTARD
*R&D Engineer* | Domalys • Créateurs d’autonomie

| phone : +33 5 49 83 00 08
| site : www.domalys.com
| email : sebastien at domalys.com
| address : 58 Rue du Vercors 86240 Fontaine-Le-Comte
