<div dir="ltr">Thank you, Ulrich, for your script!<br><div><br></div><div>I launched it with a 10-second delay:</div><div><ul><li>on Server A, to ping Server B</li><li>on Server B, to ping Server A</li><li>on the QDevice, to ping Server A and Server B</li></ul><div>I currently can't ping the QDevice from Servers A and B, because it is behind a firewall that only allows port 5403.</div></div><div><br></div><div>I will check the results tomorrow.</div><div><br></div><div>Best regards.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 17 Feb 2022 at 12:22, Ulrich Windl <<a href="mailto:Ulrich.Windl@rz.uni-regensburg.de">Ulrich.Windl@rz.uni-regensburg.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi!<br>

<br>

It seems your problem is the network. Maybe check the connectivity between all nodes (and quorum device).<br>

Some time ago I wrote a simple script that can log ups and downs (you'll have to adjust for non-LAN traffic); maybe it helps:<br>

----<br>

# Test Host Status (Up, Down) via ping (ICMP Echo)<br>

#$Id: up-down-test.sh,v 1.2 2018/03/07 15:17:32 windl Exp $<br>

<br>

# Written for SLES 11 SP3 by Ulrich Windl<br>

TESTHOST="${1:-localhost}"<br>

SDELAY="${2:-300}"<br>

IFACE_OPT="${3:+-I$3}"<br>

STATE=0<br>

WHEN=$(date +%s)<br>

<br>

# add time stamp to message and echo it<br>

log_time()<br>

{<br>

    typeset t="$1"; shift<br>

    echo "$@ $t ($(date -d@"$t" -u +%F_%T))"<br>

}<br>

<br>

trap 'log_time $(date +%s) "---EXIT"' EXIT<br>

log_time $(date +%s) "---START"<br>

while sleep "$SDELAY"<br>

do<br>

    if ping -c3 -i0.33 $IFACE_OPT -n -q "$TESTHOST" >/dev/null; then<br>

        _STATE=1<br>

    else<br>

        _STATE=0<br>

    fi<br>

    if [ $STATE -ne $_STATE ]; then<br>

        _WHEN=$(date +%s)<br>

        ((DELTA = $_WHEN - $WHEN))<br>

        log_time $_WHEN "$STATE ($DELTA) -> $_STATE"<br>

        STATE="$_STATE"<br>

        WHEN="$_WHEN"<br>

    fi<br>

done<br>

------<br>

<br>

The script takes up to three parameters: host_to_test delay_between_checks_in_seconds [interface_to_use]<br>

Without parameters it checks localhost every 5 minutes.<br>
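
<br>

As a rough sketch of how the resulting log could be evaluated (this is not part of the original script): each state change is logged as "old_state (seconds_in_old_state) -> new_state epoch (timestamp)", so the seconds spent up and down can be summed with awk. The function name summarize_updown and the log file name below are hypothetical. Note that the script starts with STATE=0, so the first "0 -> 1" line counts the startup interval as down time.<br>

```shell
# Summarize up-down-test.sh output read from stdin.
# Transition lines look like: "<old> (<seconds>) -> <new> <epoch> (<timestamp>)"
# Prints: up=<seconds> down=<seconds> transitions=<count>
summarize_updown()
{
    awk '
        # only transition lines have "->" as their third field;
        # the ---START and ---EXIT lines are skipped
        $3 == "->" {
            secs = $2
            gsub(/[()]/, "", secs)      # "(600)" -> "600"
            if ($1 == 1) up += secs; else down += secs
            n++
        }
        END { printf "up=%d down=%d transitions=%d\n", up, down, n }
    '
}
```

For example, after running the script with its output redirected to a file such as serverB.log, the summary would be obtained with: summarize_updown &lt; serverB.log<br>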

<br>

Obviously your cluster cannot have higher availability than your network.<br>

First you need to get an impression of how reliable your network is. Then, maybe, tune the cluster parameters.<br>
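
<br>

For orientation when tuning: the 22500 ms and 36000 ms figures in the logs quoted below are consistent with token: 30000, assuming the corosync defaults described in corosync.conf(5) (token_warning at 75% of the token timeout, consensus at 1.2 times the token timeout). A minimal sketch of that arithmetic, with the hypothetical helper name token_timeouts:<br>

```shell
# Derive the timeouts corosync reports from a given token value (in ms).
# Assumes defaults: token_warning = 75 (% of token), consensus = 1.2 * token.
token_timeouts()
{
    echo "warning=$(( $1 * 75 / 100 )) consensus=$(( $1 * 12 / 10 ))"
}

token_timeouts 30000    # prints: warning=22500 consensus=36000
```

This only illustrates how the logged values relate to the token setting; it does not suggest particular values to use.<br>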

<br>

Regards,<br>

Ulrich<br>

<br>

>>> Sebastien BASTARD <<a href="mailto:sebastien@domalys.com" target="_blank">sebastien@domalys.com</a>> wrote on 17.02.2022 at 10:37 in<br>

message<br>

<<a href="mailto:CAAjZqdwouBRBcEa83Yogi4S9mPppirEXc5T3apFmteBsK%2B66WQ@mail.gmail.com" target="_blank">CAAjZqdwouBRBcEa83Yogi4S9mPppirEXc5T3apFmteBsK+66WQ@mail.gmail.com</a>>:<br>

> Hello Corosync team!<br>

> <br>

> We currently have a proxmox cluster with 2 servers (at different providers<br>

> and different cities) and another server, in our company, with qdevice.<br>

> <br>

> Schematic :<br>

> <br>

> (A) Proxmox Server A (Provider One) ---------------------- (B) Proxmox Server B (Provider Two)<br>

>                  |                                                          |<br>

>                  \----------------------------------------------------------/<br>

>                                                |<br>

>                   (C) Qdevice on Debian server (in the company)<br>

> <br>

> <br>

> Between each server, we have approximately 50 ms of latency.<br>

> <br>

> Between servers A and B, each virtual server is synchronized every 5<br>

> minutes, so if a server stops working, the second server starts the same<br>

> virtual server.<br>

> <br>

> We don't need High Availability. We can wait 5 minutes without services.<br>

> After this delay, the virtual machine must start on another server if the<br>

> first server does not work anymore.<br>

> <br>

> With the corosync default configuration, fencing occurs on the servers<br>

> randomly (every 4 to 5 days on average), so we modified the configuration as<br>

> follows (bold text is our modification):<br>

> <br>

> logging {<br>

>   debug: off<br>

>   to_syslog: yes<br>

> }<br>

> <br>

> nodelist {<br>

>   node {<br>

>     name: serverA<br>

>     nodeid: 1<br>

>     quorum_votes: 1<br>

>     ring0_addr: xx.xx.xx.xx<br>

>   }<br>

>   node {<br>

>     name: serverB<br>

>     nodeid: 3<br>

>     quorum_votes: 1<br>

>     ring0_addr: xx.xx.xx.xx<br>

>   }<br>

> }<br>

> <br>

> quorum {<br>

>   device {<br>

>     model: net<br>

>     net {<br>

>       algorithm: ffsplit<br>

>       host: xx.xx.xx.xx<br>

>       tls: on<br>

>     }<br>

>     votes: 1<br>

>   }<br>

>   provider: corosync_votequorum<br>

> }<br>

> <br>

> totem {<br>

>   cluster_name: cluster<br>

>   config_version: 24<br>

>   interface {<br>

>     linknumber: 0<br>

>   }<br>

>   ip_version: ipv4-6<br>

>   link_mode: passive<br>

>   secauth: on<br>

>   version: 2<br>

>   *token_retransmits_before_loss_const: 40*<br>

>   *token: 30000*<br>

> <br>

> }<br>

> <br>

> <br>

> <br>

> With this configuration, the servers still get fenced, but only about every<br>

> 15 days on average.<br>

> <br>

> Our current problem is that when fencing occurs on one server, the second<br>

> server shows the same behaviour some minutes later, every time.<br>

> <br>

> I tested the cluster with a cut off power of the server A, and all worked<br>

> great. Server B starts the virtual machines of server A.<br>

> <br>

> But in real life, when a server can't talk to the other main server, it<br>

> seems that both servers believe they are isolated from each other.<br>

> <br>

> So, after a lot of tests, I don't know the best way to get a<br>

> cluster that works correctly.<br>

> <br>

> Currently, the cluster stops working more often than the servers have a real<br>

> problem.<br>

> <br>

> Maybe my configuration is not good, or is it something else?<br>

> <br>

> So, I need your help =)<br>

> <br>

> *Here are the logs from the reboot of server A (output of the command<br>

> "cat /var/log/daemon.log | grep -E 'watchdog|corosync'"):*<br>

> <br>

> ...<br>

> Feb 16 09:55:00 serverA corosync[2762]:   [KNET  ] host: host: 3 (passive)<br>

> best link: 0 (pri: 1)<br>

> Feb 16 09:55:00 serverA corosync[2762]:   [KNET  ] host: host: 3 has no<br>

> active links<br>

> Feb 16 09:55:22 serverA corosync[2762]:   [TOTEM ] Token has not been<br>

> received in 22500 ms<br>

> Feb 16 09:55:30 serverA corosync[2762]:   [TOTEM ] A processor failed,<br>

> forming new configuration: token timed out (30000ms), waiting 36000ms for<br>

> consensus.<br>

> Feb 16 09:55:38 serverA corosync[2762]:   [KNET  ] rx: host: 3 link: 0 is up<br>

> Feb 16 09:55:38 serverA corosync[2762]:   [KNET  ] host: host: 3 (passive)<br>

> best link: 0 (pri: 1)<br>

> Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired -<br>

> disable watchdog updates<br>

> *Reboot*<br>

> ....<br>

> <br>

> <br>

> *Here are the logs from the reboot of server B (output of the command<br>

> "cat /var/log/daemon.log | grep -E 'watchdog|corosync'"):*<br>

> <br>

> Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is<br>

> down<br>

> Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)<br>

> best link: 0 (pri: 1)<br>

> Feb 16 09:48:42 serverB corosync[2728]:   [KNET  ] host: host: 1 has no<br>

> active links<br>

> Feb 16 09:48:57 serverB corosync[2728]:   [KNET  ] rx: host: 1 link: 0 is up<br>

> Feb 16 09:48:57 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)<br>

> best link: 0 (pri: 1)<br>

> Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is<br>

> down<br>

> Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)<br>

> best link: 0 (pri: 1)<br>

> Feb 16 09:53:56 serverB corosync[2728]:   [KNET  ] host: host: 1 has no<br>

> active links<br>

> Feb 16 09:54:12 serverB corosync[2728]:   [KNET  ] rx: host: 1 link: 0 is up<br>

> Feb 16 09:54:12 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)<br>

> best link: 0 (pri: 1)<br>

> Feb 16 09:55:22 serverB corosync[2728]:   [TOTEM ] Token has not been<br>

> received in 22500 ms<br>

> Feb 16 09:55:30 serverB corosync[2728]:   [TOTEM ] A processor failed,<br>

> forming new configuration: token timed out (30000ms), waiting 36000ms for<br>

> consensus.<br>

> Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] link: host: 1 link: 0 is<br>

> down<br>

> Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] host: host: 1 (passive)<br>

> best link: 0 (pri: 1)<br>

> Feb 16 09:55:35 serverB corosync[2728]:   [KNET  ] host: host: 1 has no<br>

> active links<br>

> Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired -<br>

> disable watchdog updates<br>

> *Reboot*<br>

> <br>

> <br>

> Do you have any idea why, when fencing occurs on one server, the other server<br>

> shows the same behavior?<br>

> <br>

> Thanks for your help.<br>

> <br>

> Best regards.<br>

> <br>

> Seb.<br>

<br>

<br>

<br>


</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature">Sébastien BASTARD<br>R&D Engineer | Domalys<br><a href="mailto:sebastien@domalys.com" style="text-decoration:none" target="_blank">sebastien@domalys.com</a></div>