<div dir="ltr">Thank you, Ulrich, for your script!<br><div><br></div><div>I launched it with a 10-second delay:</div><div><ul><li>on Server A, to ping Server B</li><li>on Server B, to ping Server A</li><li>on the QDevice, to ping Server A and Server B</li></ul><div>I currently can't ping the QDevice from Servers A and B, because it is behind a firewall that only allows port 5403.</div></div><div><br></div><div>I will check the results tomorrow.</div><div><br></div><div>Best regards.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 17 Feb 2022 at 12:22, Ulrich Windl <<a href="mailto:Ulrich.Windl@rz.uni-regensburg.de">Ulrich.Windl@rz.uni-regensburg.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi!<br>
<br>
It seems your problem is the network. Maybe check the connectivity between all nodes (and quorum device).<br>
Some time ago I wrote a simple script that can log ups and downs (you'll have to adjust it for non-LAN traffic); maybe it helps:<br>
----<br>
#!/bin/bash<br>
# Test Host Status (Up, Down) via ping (ICMP Echo)<br>
#$Id: up-down-test.sh,v 1.2 2018/03/07 15:17:32 windl Exp $<br>
<br>
# Written for SLES 11 SP3 by Ulrich Windl<br>
TESTHOST="${1:-localhost}"<br>
SDELAY="${2:-300}"<br>
IFACE_OPT="${3:+-I$3}"<br>
STATE=0<br>
WHEN=$(date +%s)<br>
<br>
# add time stamp to message and echo it<br>
log_time()<br>
{<br>
typeset t="$1"; shift<br>
echo "$@ $t ($(date -d@"$t" -u +%F_%T))"<br>
}<br>
<br>
trap 'log_time $(date +%s) "---EXIT"' EXIT<br>
log_time $(date +%s) "---START"<br>
while sleep "$SDELAY"<br>
do<br>
if ping -c3 -i0.33 $IFACE_OPT -n -q "$TESTHOST" >/dev/null; then<br>
_STATE=1<br>
else<br>
_STATE=0<br>
fi<br>
if [ $STATE -ne $_STATE ]; then<br>
_WHEN=$(date +%s)<br>
((DELTA = $_WHEN - $WHEN))<br>
log_time $_WHEN "$STATE ($DELTA) -> $_STATE"<br>
STATE="$_STATE"<br>
WHEN="$_WHEN"<br>
fi<br>
done<br>
------<br>
<br>
The script takes up to three parameters (the third is optional): host_to_test delay_between_checks_in_seconds [interface_to_use]<br>
Without parameters it checks localhost every 5 minutes.<br>
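<br>
To read the resulting log more easily, the transitions can be totalled. The following is a minimal sketch, not part of the original script: the function name summarize_updown is hypothetical, and it assumes every transition line has the form "OLD (SECONDS) -> NEW EPOCH (DATE)" as printed by log_time above. The first transition after each ---START is skipped, on the assumption that it only reflects STATE starting at 0 rather than a real outage.<br>
<pre>
```shell
#!/bin/bash
# Summarize the output of up-down-test.sh: count outages and total downtime.
# Hypothetical helper; assumes transition lines look like
#   OLD (SECONDS) -> NEW EPOCH (DATE)
summarize_updown() {
    awk '
        /^---START/ { first = 1; next }        # a new run of the script begins
        $3 == "->" {
            if (first) { first = 0; next }     # skip: STATE is initialised to 0
            secs = $2
            gsub(/[()]/, "", secs)             # "(300)" becomes "300"
            if ($1 == 0 && $4 == 1) {          # host came back after being down
                outages++
                down += secs
            }
        }
        END { printf "outages=%d downtime=%ds\n", outages, down }
    ' "$@"
}
```
</pre>
Called as "summarize_updown serverB.log" (a hypothetical file holding the script's output), it prints one line with the number of completed outages and the seconds spent down.<br>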
<br>
Obviously your cluster cannot have higher availability than your network.<br>
First you need to get an impression of how reliable your network is. Then, maybe, tune the cluster parameters.<br>
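<br>
For a host that does not answer ICMP, such as the QDevice behind a firewall that only allows port 5403 (mentioned in the reply at the top), a TCP connect could stand in for ping. This is a minimal sketch, assuming bash with /dev/tcp support and the coreutils timeout command; tcp_up is a hypothetical helper and not part of the script above.<br>
<pre>
```shell
#!/bin/bash
# tcp_up HOST PORT [TIMEOUT_SECONDS]
# Succeeds (exit status 0) if a TCP connection to HOST:PORT opens in time.
# Hypothetical helper; relies on bash's /dev/tcp and coreutils "timeout".
tcp_up() {
    local host="$1" port="$2" limit="${3:-3}"
    # The child bash opens the connection; it closes again when the child exits.
    timeout "$limit" bash -c 'exec 3>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```
</pre>
In the loop of the script, the ping command could then be swapped for something like tcp_up "$TESTHOST" 5403 when the target is the QDevice. This only shows that the port accepts connections, not that the service behind it is healthy.<br>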
<br>
Regards,<br>
Ulrich<br>
<br>
>>> Sebastien BASTARD <<a href="mailto:sebastien@domalys.com" target="_blank">sebastien@domalys.com</a>> wrote on 17.02.2022 at 10:37 in<br>
message<br>
<<a href="mailto:CAAjZqdwouBRBcEa83Yogi4S9mPppirEXc5T3apFmteBsK%2B66WQ@mail.gmail.com" target="_blank">CAAjZqdwouBRBcEa83Yogi4S9mPppirEXc5T3apFmteBsK+66WQ@mail.gmail.com</a>>:<br>
> Hello Corosync team!<br>
> <br>
> We currently have a Proxmox cluster with 2 servers (at different providers<br>
> and in different cities) and another server, in our company, running a QDevice.<br>
> <br>
> Schematic :<br>
> <br>
> (A) Proxmox Server A (Provider One) ---------------------- (B) Proxmox Server B (Provider Two)<br>
> (A) ---------------------- (C)<br>
> (B) ---------------------- (C)<br>
> (C) Qdevice on Debian server (in the company)<br>
> <br>
> <br>
> Between each server, we have approximately 50 ms of latency.<br>
> <br>
> Between servers A and B, each virtual server is synchronized every 5<br>
> minutes, so if one server stops working, the other server starts the same<br>
> virtual server.<br>
> <br>
> We don't need High Availability. We can tolerate 5 minutes without service.<br>
> After this delay, the virtual machine must start on the other server if the<br>
> first server no longer works.<br>
> <br>
> With the default corosync configuration, the servers are fenced at random<br>
> intervals (every 4 to 5 days on average), so we modified the configuration as<br>
> follows (the lines marked with asterisks are our modification):<br>
> <br>
> logging {<br>
> debug: off<br>
> to_syslog: yes<br>
> }<br>
> <br>
> nodelist {<br>
> node {<br>
> name: serverA<br>
> nodeid: 1<br>
> quorum_votes: 1<br>
> ring0_addr: xx.xx.xx.xx<br>
> }<br>
> node {<br>
> name: serverB<br>
> nodeid: 3<br>
> quorum_votes: 1<br>
> ring0_addr: xx.xx.xx.xx<br>
> }<br>
> }<br>
> <br>
> quorum {<br>
> device {<br>
> model: net<br>
> net {<br>
> algorithm: ffsplit<br>
> host: xx.xx.xx.xx<br>
> tls: on<br>
> }<br>
> votes: 1<br>
> }<br>
> provider: corosync_votequorum<br>
> }<br>
> <br>
> totem {<br>
> cluster_name: cluster<br>
> config_version: 24<br>
> interface {<br>
> linknumber: 0<br>
> }<br>
> ip_version: ipv4-6<br>
> link_mode: passive<br>
> secauth: on<br>
> version: 2<br>
> *token_retransmits_before_loss_const: 40*<br>
> *token: 30000*<br>
> <br>
> }<br>
> <br>
> <br>
> <br>
> With this configuration, the servers are still fenced, but only about every<br>
> 15 days on average.<br>
> <br>
> Our current problem is that whenever one server is fenced, the second<br>
> server is fenced as well a few minutes later, every time.<br>
> <br>
> I tested the cluster by cutting the power to server A, and everything worked<br>
> well. Server B started the virtual machines of server A.<br>
> <br>
> But in real life, when one main server can't talk to the other, it<br>
> seems that both servers believe they are isolated from each other.<br>
> <br>
> So, after a lot of tests, I don't know what the best way is to get a<br>
> cluster that works correctly.<br>
> <br>
> Currently, the cluster stops working more often than the servers have a real<br>
> problem.<br>
> <br>
> Maybe my configuration is wrong, or is it something else?<br>
> <br>
> So, I need your help =)<br>
> <br>
> *Here are the daemon logs from the reboot of server A (output of the command<br>
> << cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):*<br>
> <br>
> ...<br>
> Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 (passive)<br>
> best link: 0 (pri: 1)<br>
> Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 has no<br>
> active links<br>
> Feb 16 09:55:22 serverA corosync[2762]: [TOTEM ] Token has not been<br>
> received in 22500 ms<br>
> Feb 16 09:55:30 serverA corosync[2762]: [TOTEM ] A processor failed,<br>
> forming new configuration: token timed out (30000ms), waiting 36000ms for<br>
> consensus.<br>
> Feb 16 09:55:38 serverA corosync[2762]: [KNET ] rx: host: 3 link: 0 is up<br>
> Feb 16 09:55:38 serverA corosync[2762]: [KNET ] host: host: 3 (passive)<br>
> best link: 0 (pri: 1)<br>
> Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired -<br>
> disable watchdog updates<br>
> *Reboot*<br>
> ....<br>
> <br>
> <br>
> *Here are the daemon logs from the reboot of server B (output of the<br>
> command << cat /var/log/daemon.log | grep -E 'watchdog|corosync' >>):*<br>
> <br>
> Feb 16 09:48:42 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is<br>
> down<br>
> Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 (passive)<br>
> best link: 0 (pri: 1)<br>
> Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 has no<br>
> active links<br>
> Feb 16 09:48:57 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up<br>
> Feb 16 09:48:57 serverB corosync[2728]: [KNET ] host: host: 1 (passive)<br>
> best link: 0 (pri: 1)<br>
> Feb 16 09:53:56 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is<br>
> down<br>
> Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 (passive)<br>
> best link: 0 (pri: 1)<br>
> Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 has no<br>
> active links<br>
> Feb 16 09:54:12 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up<br>
> Feb 16 09:54:12 serverB corosync[2728]: [KNET ] host: host: 1 (passive)<br>
> best link: 0 (pri: 1)<br>
> Feb 16 09:55:22 serverB corosync[2728]: [TOTEM ] Token has not been<br>
> received in 22500 ms<br>
> Feb 16 09:55:30 serverB corosync[2728]: [TOTEM ] A processor failed,<br>
> forming new configuration: token timed out (30000ms), waiting 36000ms for<br>
> consensus.<br>
> Feb 16 09:55:35 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is<br>
> down<br>
> Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 (passive)<br>
> best link: 0 (pri: 1)<br>
> Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 has no<br>
> active links<br>
> Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired -<br>
> disable watchdog updates<br>
> *Reboot*<br>
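<br>
The timer values in both logs follow from "token: 30000". Below is a small worked calculation, assuming the defaults documented in corosync.conf(5) (token_warning at 75 percent of token, consensus at 1.2 times token), since neither is set in the posted configuration; the function names are only illustrative.<br>
<pre>
```shell
#!/bin/bash
# Derive the timer values seen in the logs from the configured token timeout.
# Assumed defaults (not set in the posted corosync.conf):
#   token_warning = 75 percent of token, consensus = 1.2 * token
token_warning_ms() { echo $(( $1 * 75 / 100 )); }
consensus_ms()     { echo $(( $1 * 12 / 10 )); }

token=30000   # "token: 30000" from the totem section
echo "token warning after: $(token_warning_ms "$token") ms"
echo "consensus wait:      $(consensus_ms "$token") ms"
echo "token + consensus:   $(( token + $(consensus_ms "$token") )) ms"
```
</pre>
The first two results match "Token has not been received in 22500 ms" and "waiting 36000ms for consensus". In the server A log the consensus wait starts at 09:55:30 and the watchdog expires at 09:55:55, 25 seconds later, which is before the 36-second wait would have run out.<br>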
> <br>
> <br>
> Do you have any idea why, when one server is fenced, the other server<br>
> behaves the same way?<br>
> <br>
> Thanks for your help.<br>
> <br>
> Best regards.<br>
> <br>
> Seb.<br>
<br>
<br>
<br>
_______________________________________________<br>
Manage your subscription:<br>
<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
<br>
ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><span style="color:rgb(239,125,0)">Sébastien BASTARD</span><br>
<b>R&D Engineer</b> | Domalys • Créateurs d’autonomie<br>
phone: +33 5 49 83 00 08<br>
site: <a href="http://www.domalys.com" style="text-decoration:none" target="_blank">www.domalys.com</a><br>
email: <a href="mailto:sebastien@domalys.com" style="text-decoration:none" target="_blank">sebastien@domalys.com</a><br>
address: 58 Rue du Vercors 86240 Fontaine-Le-Comte</div>