[ClusterLabs] Antw: [EXT] The 2 servers of the cluster randomly reboot almost together
Sebastien BASTARD
sebastien at domalys.com
Thu Feb 17 10:28:33 EST 2022
Thank you, Ulrich, for your script!
I launched it with a 10-second delay:
- on Server A, to ping Server B
- on Server B, to ping Server A
- on the QDevice, to ping Server A and Server B
I currently can't ping the QDevice from Server A and Server B, because it is
behind a firewall that only allows port 5403.
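
Since ICMP is blocked there, one possible workaround is to test the TCP port instead of pinging. Below is a minimal sketch, assuming bash (for its /dev/tcp redirection) and timeout(1) from GNU coreutils are available; the function name probe_tcp is made up for illustration and is not part of Ulrich's script.

```shell
#!/bin/bash
# probe_tcp HOST PORT [TIMEOUT_S]   (hypothetical helper, not from the original script)
# Returns 0 if a TCP connection to HOST:PORT can be opened within TIMEOUT_S
# seconds (default 3), non-zero otherwise.
# Assumes bash's /dev/tcp pseudo-device and timeout(1) from GNU coreutils.
probe_tcp() {
    local host="$1" port="$2" timeout_s="${3:-3}"
    # The child bash opens the connection on fd 3 and exits immediately;
    # timeout(1) kills it if the connect attempt hangs (e.g. packets dropped).
    timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example (commented out; xx.xx.xx.xx stands for the QDevice address):
# probe_tcp xx.xx.xx.xx 5403 && echo "port reachable" || echo "port not reachable"
```

In the script, the ping test could then be swapped for something like `if probe_tcp "$TESTHOST" 5403; then`. A successful connect only shows that the port is reachable through the firewall, not that the qdevice service itself is healthy.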
Tomorrow, I will see the results.
Best regards.
On Thu, 17 Feb 2022 at 12:22, Ulrich Windl <
Ulrich.Windl at rz.uni-regensburg.de> wrote:
> Hi!
>
> It seems your problem is the network. Maybe check the connectivity between
> all nodes (and quorum device).
> Some time ago I wrote a simple script that can log ups and downs (you'll
> have to adjust for non-LAN traffic); maybe it helps:
> ----
> #!/bin/bash
> # Test Host Status (Up, Down) via ping (ICMP Echo)
> #$Id: up-down-test.sh,v 1.2 2018/03/07 15:17:32 windl Exp $
>
> # Written for SLES 11 SP3 by Ulrich Windl
> TESTHOST="${1:-localhost}"   # host to ping
> SDELAY="${2:-300}"           # seconds between checks
> IFACE_OPT="${3:+-I$3}"       # optional interface to ping from
> STATE=0                      # last known state: 0 = down, 1 = up
> WHEN=$(date +%s)             # time of the last state change (epoch seconds)
>
> # add time stamp to message and echo it
> log_time()
> {
>     typeset t="$1"; shift
>     echo "$@ $t ($(date -d@"$t" -u +%F_%T))"
> }
>
> trap 'log_time $(date +%s) "---EXIT"' EXIT
> log_time $(date +%s) "---START"
> while sleep "$SDELAY"
> do
>     # three quick pings; the host counts as up if ping exits successfully
>     if ping -c3 -i0.33 $IFACE_OPT -n -q "$TESTHOST" >/dev/null; then
>         _STATE=1
>     else
>         _STATE=0
>     fi
>     # log only state changes, with the seconds spent in the previous state
>     if [ "$STATE" -ne "$_STATE" ]; then
>         _WHEN=$(date +%s)
>         DELTA=$((_WHEN - WHEN))
>         log_time "$_WHEN" "$STATE ($DELTA) -> $_STATE"
>         STATE="$_STATE"
>         WHEN="$_WHEN"
>     fi
> done
> ------
>
> The script expects three parameters: host_to_test
> delay_between_checks_in_seconds [interface_to_use]
> Without parameters it checks localhost every 5 minutes.
>
> Obviously your cluster cannot have higher availability than your network.
> First you need to get an impression of how reliable your network is. Then,
> maybe, tune the cluster parameters.
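
To get that impression from the script's output, the transition lines can be totalled. This is only a sketch, assuming the output was saved to a file and that each transition line has the shape "OLD (SECONDS) -> NEW EPOCH (UTC_TIME)" as printed by log_time; the function name summarize_updown and the file name in the usage comment are made up for illustration.

```shell
#!/bin/bash
# summarize_updown [FILE...]   (hypothetical helper; reads stdin if no file is given)
# Totals the seconds spent in each state, based on the transition lines
# written by the up-down test script:
#   OLD (SECONDS) -> NEW EPOCH (UTC_TIME)
# Note: the script starts with STATE=0, so the first "0 (N) -> 1" line counts
# the time since script start as "down" even if the host was up all along.
summarize_updown() {
    awk '
        $3 == "->" && $2 ~ /^\([0-9]+\)$/ {
            secs = substr($2, 2, length($2) - 2)    # strip the parentheses
            if ($1 == 1 && $4 == 0) { up += secs; drops++ }
            if ($1 == 0 && $4 == 1) { down += secs }
        }
        END { printf "drops=%d up_s=%d down_s=%d\n", drops, up, down }
    ' "$@"
}

# Example (commented out; the log file name is an assumption):
# summarize_updown /tmp/up-down-serverB.log
```

Here "drops" counts the up-to-down transitions, which is probably the figure to compare with the number of fencing events.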
>
> Regards,
> Ulrich
>
> >>> Sebastien BASTARD <sebastien at domalys.com> wrote on 17.02.2022 at 10:37
> in message
> <CAAjZqdwouBRBcEa83Yogi4S9mPppirEXc5T3apFmteBsK+66WQ at mail.gmail.com>:
> > Hello CoroSync's team !
> >
> > We currently have a Proxmox cluster with 2 servers (at different providers
> > and in different cities) and another server, in our company, running the
> > QDevice.
> >
> > Schematic :
> >
> > (A) Proxmox Server A (Provider One) ------ (B) Proxmox Server B (Provider Two)
> >                  |                                        |
> >                  \----------------------------------------/
> >                                       |
> >              (C) Qdevice on Debian server (in the company)
> >
> >
> > Between each pair of servers, we have approximately 50 ms of latency.
> >
> > Between servers A and B, each virtual server is synchronized every 5
> > minutes, so if one server stops working, the other server starts the same
> > virtual server.
> >
> > We don't need high availability. We can wait 5 minutes without services.
> > After this delay, the virtual machine must start on the other server if the
> > first server is no longer working.
> >
> > With the default corosync configuration, fencing occurred on the servers
> > at random (on average every 4 to 5 days), so we modified the configuration
> > as follows (bold text is our modification):
> >
> > logging {
> >   debug: off
> >   to_syslog: yes
> > }
> >
> > nodelist {
> >   node {
> >     name: serverA
> >     nodeid: 1
> >     quorum_votes: 1
> >     ring0_addr: xx.xx.xx.xx
> >   }
> >   node {
> >     name: serverB
> >     nodeid: 3
> >     quorum_votes: 1
> >     ring0_addr: xx.xx.xx.xx
> >   }
> > }
> >
> > quorum {
> >   device {
> >     model: net
> >     net {
> >       algorithm: ffsplit
> >       host: xx.xx.xx.xx
> >       tls: on
> >     }
> >     votes: 1
> >   }
> >   provider: corosync_votequorum
> > }
> >
> > totem {
> >   cluster_name: cluster
> >   config_version: 24
> >   interface {
> >     linknumber: 0
> >   }
> >   ip_version: ipv4-6
> >   link_mode: passive
> >   secauth: on
> >   version: 2
> >   *token_retransmits_before_loss_const: 40*
> >   *token: 30000*
> > }
> >
> >
> >
> > With this configuration, fencing still occurs, but on average only every
> > 15 days.
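
For reference, the 22500 ms and 36000 ms figures in the corosync log excerpts in this message seem to follow directly from token: 30000, if token_warning and consensus are left at the defaults documented in corosync.conf(5) (75 % of token and 1.2 times token). A minimal sketch of that arithmetic; the function name derived_timeouts is made up for illustration:

```shell
#!/bin/bash
# derived_timeouts TOKEN_MS   (hypothetical helper, for illustration only)
# Prints the token warning threshold and the consensus timeout that corosync
# would derive from TOKEN_MS, assuming token_warning and consensus are not set
# explicitly (documented defaults: 75 % of token, 1.2 * token).
derived_timeouts() {
    local token_ms="$1"
    local warning_ms=$(( token_ms * 75 / 100 ))
    local consensus_ms=$(( token_ms * 12 / 10 ))
    echo "token_warning_ms=$warning_ms consensus_ms=$consensus_ms"
}

derived_timeouts 30000   # prints: token_warning_ms=22500 consensus_ms=36000
```

So with this setting, corosync appears to wait 30 s for the token and then up to a further 36 s for consensus before a new membership is formed.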
> >
> > Our current problem is that when fencing occurs on one server, the second
> > server shows the same behaviour a few minutes later, every time.
> >
> > I tested the cluster by cutting the power of server A, and everything
> > worked well: server B started the virtual machines of server A.
> >
> > But in real life, when one server can't talk to the other main server, it
> > seems that both servers believe they are isolated from each other.
> >
> > So, after a lot of tests, I don't know the best way to get a cluster that
> > works correctly.
> >
> > Currently, the cluster stops working more often than the servers have a
> > real problem.
> >
> > Maybe my configuration is not good, or is it something else?
> >
> > So, I need your help =)
> >
> > *Here are the logs from the reboot of server A (output of the command
> > line << cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):*
> >
> > ...
> > Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 has no active links
> > Feb 16 09:55:22 serverA corosync[2762]: [TOTEM ] Token has not been received in 22500 ms
> > Feb 16 09:55:30 serverA corosync[2762]: [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
> > Feb 16 09:55:38 serverA corosync[2762]: [KNET ] rx: host: 3 link: 0 is up
> > Feb 16 09:55:38 serverA corosync[2762]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired - disable watchdog updates
> > *Reboot*
> > ....
> >
> >
> > *Here are the logs from the reboot of server B (output of the command
> > line << cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):*
> >
> > Feb 16 09:48:42 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
> > Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
> > Feb 16 09:48:57 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
> > Feb 16 09:48:57 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:53:56 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
> > Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
> > Feb 16 09:54:12 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
> > Feb 16 09:54:12 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:22 serverB corosync[2728]: [TOTEM ] Token has not been received in 22500 ms
> > Feb 16 09:55:30 serverB corosync[2728]: [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
> > Feb 16 09:55:35 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
> > Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> > Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
> > Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired - disable watchdog updates
> > *Reboot*
> >
> >
> > Do you have an idea why, when fencing occurs on one server, the other
> > server shows the same behavior?
> >
> > Thanks for your help.
> >
> > Best regards.
> >
> > Seb.
>
>
>
--
Sébastien BASTARD
*R&D Engineer* | Domalys • Créateurs d’autonomie
| phone : +33 5 49 83 00 08
| site : www.domalys.com
| email : sebastien at domalys.com
| address : 58 Rue du Vercors 86240 Fontaine-Le-Comte