<div dir="ltr">Hello CoroSync's team !<div><br></div><div>We currently have a proxmox cluster with 2 servers (at different
providers and different cities) and another server, in our company, with qdevice.<br>
<br>
<div class="gmail-bbCodeBlock gmail-bbCodeBlock--screenLimited gmail-bbCodeBlock--code">
<div class="gmail-bbCodeBlock-title">
Schematic :
</div>
<div class="gmail-bbCodeBlock-content" dir="ltr">
<pre class="gmail-bbCodeCode" dir="ltr"><code>(A) Proxmox Server A (Provider One) ---------------------- (B) Proxmox Server B (Provider Two)
| |
\----------------------------------------------------------/
|
(C) Qdevice on Debian server (in the company) </code></pre>
</div>
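
In case it helps, this is how we check the cluster and qdevice state on the nodes (all standard Proxmox/corosync tools; output omitted here):

pvecm status                 # Proxmox cluster/quorum overview
corosync-quorumtool -s       # votequorum view, including the Qdevice vote
corosync-qdevice-tool -s     # qdevice daemon status on the node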

Between each server, we have approximately 50 ms of latency.

Between servers A and B, each virtual machine is synchronized every 5 minutes, so if one server stops working, the other server starts the same virtual machine.

We don't need High Availability in the strict sense: we can wait 5 minutes without services. After this delay, the virtual machine must start on the other server if the first one is no longer working.
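
For context, a sketch of how one such replication job is defined, assuming the built-in Proxmox storage replication (pvesr) is what we use; the VM ID 100 is just an example:

pvesr create-local-job 100-0 serverB --schedule "*/5"   # replicate VM 100 to serverB every 5 minutes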

With the default corosync configuration, fencing occurred on the servers seemingly at random (on average every 4-5 days), so we modified the configuration as follows (the two lines marked "# our modification" at the end of the totem section are our changes):

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: serverA
    nodeid: 1
    quorum_votes: 1
    ring0_addr: xx.xx.xx.xx
  }
  node {
    name: serverB
    nodeid: 3
    quorum_votes: 1
    ring0_addr: xx.xx.xx.xx
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: xx.xx.xx.xx
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}

totem {
  cluster_name: cluster
  config_version: 24
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
  token_retransmits_before_loss_const: 40   # our modification
  token: 30000                              # our modification
}
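
To double-check that these values are actually in effect, we can read the runtime keys with corosync-cmapctl (ships with corosync 3.x; note that consensus defaults to 1.2 * token, which matches the 36000 ms seen in the logs below):

corosync-cmapctl | grep -E 'runtime\.config\.totem\.(token|consensus)'
# expected, given the config above:
# runtime.config.totem.token (u32) = 30000
# runtime.config.totem.consensus (u32) = 36000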

With this configuration the servers still fence themselves, but on average every 15 days.

Our current problem is that when fencing occurs on one server, the second server shows the same behaviour a few minutes later, and this happens every time.

I tested the cluster by cutting the power of server A, and everything worked fine: server B started server A's virtual machines.

But in real life, when a server cannot talk to the other main server, it seems that both servers believe they are isolated from the others.

So, after a lot of tests, I don't know what the best way is to get a cluster that works correctly.

Currently, the cluster stops working more often than the servers have a real problem.

Maybe my configuration is not good, or something else is wrong?

So, I need your help =)

Here are the daemon logs around the reboot of server A (output of "cat /var/log/daemon.log | grep -E 'watchdog|corosync'"):

...
Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 09:55:00 serverA corosync[2762]: [KNET ] host: host: 3 has no active links
Feb 16 09:55:22 serverA corosync[2762]: [TOTEM ] Token has not been received in 22500 ms
Feb 16 09:55:30 serverA corosync[2762]: [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Feb 16 09:55:38 serverA corosync[2762]: [KNET ] rx: host: 3 link: 0 is up
Feb 16 09:55:38 serverA corosync[2762]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 16 09:55:55 serverA watchdog-mux[1890]: client watchdog expired - disable watchdog updates

[Reboot]
...

Here are the daemon logs around the reboot of server B (same command):

Feb 16 09:48:42 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:48:42 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
Feb 16 09:48:57 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
Feb 16 09:48:57 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:53:56 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:53:56 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
Feb 16 09:54:12 serverB corosync[2728]: [KNET ] rx: host: 1 link: 0 is up
Feb 16 09:54:12 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:55:22 serverB corosync[2728]: [TOTEM ] Token has not been received in 22500 ms
Feb 16 09:55:30 serverB corosync[2728]: [TOTEM ] A processor failed, forming new configuration: token timed out (30000ms), waiting 36000ms for consensus.
Feb 16 09:55:35 serverB corosync[2728]: [KNET ] link: host: 1 link: 0 is down
Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 16 09:55:35 serverB corosync[2728]: [KNET ] host: host: 1 has no active links
Feb 16 09:55:55 serverB watchdog-mux[2280]: client watchdog expired - disable watchdog updates

[Reboot]
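
One thing we notice when we line up the timestamps (not sure if it is relevant, and we may be misreading the timers):

09:55:22  token not received for 22500 ms (0.75 * token) on both nodes
09:55:30  token timed out (30000 ms); consensus wait of 36000 ms starts
09:55:55  watchdog-mux fires on both nodes

token (30000 ms) + consensus (36000 ms) = 66000 ms, which is longer than the 60 s client timeout that watchdog-mux seems to use. If we understand this correctly, the watchdog can expire before corosync has even finished forming a new configuration.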

Do you have an idea why, when fencing occurs on one server, the other server shows the same behaviour?

Thanks for your help.
Best regards,

Seb.