<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-ligatures:standardcontextual;
mso-fareast-language:EN-US;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0cm;
margin-right:0cm;
margin-bottom:0cm;
margin-left:36.0pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-ligatures:standardcontextual;
mso-fareast-language:EN-US;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:2.0cm 42.5pt 2.0cm 3.0cm;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:562259808;
mso-list-type:hybrid;
mso-list-template-ids:1263730210 68747279 68747289 68747291 68747279 68747289 68747291 68747279 68747289 68747291;}
@list l0:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1
{mso-list-id:984747524;
mso-list-type:hybrid;
mso-list-template-ids:216809102 68747279 68747289 68747291 68747279 68747289 68747291 68747279 68747289 68747291;}
@list l1:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l1:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l1:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l1:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l1:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l1:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l1:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l1:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
ol
{margin-bottom:0cm;}
ul
{margin-bottom:0cm;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=RU link="#0563C1" vlink="#954F72" style='word-wrap:break-word'><div class=WordSection1><p class=MsoNormal><span lang=EN-US>Hi All,<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>I am trying to build an application-specific 2-node failover cluster using Ubuntu 22, Pacemaker 2.1.2 + Corosync 3.1.6 and DRBD 9.2.9, with the knet transport.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>For certain reasons I can’t use a third node, so I have to use qnetd + qdevice 3.0.1.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>The main goal is to protect a custom app which is not cluster-aware by itself. It is quite stateful, can’t store its state outside memory and takes some time to converge with the other parts of the system, so the best scenario is “failover is a restart with the same config”, but each unnecessary restart is painful. So, once a failover is done, the app must remain on the backup node until that node fails or an admin pushes it back; this works well with the stickiness parameter.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>So, the goal is to detect a failure of the serving node ASAP and restart the app ASAP on the other node, using DRBD-synced config/data. ASAP means within 5-7 sec, not 30 or more.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>I have tried different combinations of timings and finally got an acceptable result: within 5 sec in the best case. But! 
The behavior is very unstable.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>My setup is simple: two nodes as VMs and one more VM as the arbiter (qnetd). The VMs run under Proxmox and are connected through an external Ethernet switch, to get closer to reality, where the node VMs would be located on different physical hosts in one rack.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>Then the configuration was adjusted for faster failure detection and failover.<o:p></o:p></span></p><ol style='margin-top:0cm' start=1 type=1><li class=MsoListParagraph style='margin-left:0cm;mso-list:l1 level1 lfo1'><span lang=EN-US>In Corosync, I left the token at its default of 1000ms but added “heartbeat_failures_allowed: 3”; this made Corosync catch a node failure within about 200ms (4x50ms heartbeats).<o:p></o:p></span></li><li class=MsoListParagraph style='margin-left:0cm;mso-list:l1 level1 lfo1'><span lang=EN-US>Both qnetd and qdevice were run with net_heartbeat_interval_min=200 to allow experimenting with faster heartbeats and detection.<o:p></o:p></span></li><li class=MsoListParagraph style='margin-left:0cm;mso-list:l1 level1 lfo1'><span lang=EN-US>Also, quorum.device.net has timeout: 500, sync_timeout: 3000, algo: LMS.<o:p></o:p></span></li></ol><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>The test is to issue “date +%M:%S.%N &amp;&amp; qm stop 201” and then check the logs for the timestamp at which the app started on the “backup” host. Then, when the stopped host boots again, the test is to check the logs to confirm that the app was not restarted.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>Sometimes the switchover works like a charm, but sometimes it may be delayed by tens of seconds. 
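<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>As a rough cross-check of the timers listed above, here is a small Python sketch. It is only illustrative: the 50ms heartbeat figure is the one quoted in the list, the 600/400/250ms values are those printed in the qnetd and qdevice logs in this message, and the ratios it prints are observations from these logs, not documented Corosync formulas. All function and variable names are hypothetical.<o:p></o:p></span></p><pre>
```python
# Timing sketch for the setup described in this message.
# Assumption: every constant below is copied from the post or its logs;
# the Python names are made up for illustration only.

HEARTBEAT_MS = 50               # per-heartbeat figure quoted in the post
HEARTBEAT_FAILURES_ALLOWED = 3  # heartbeat_failures_allowed from the post
QDEVICE_TIMEOUT_MS = 500        # "timeout: 500" from the post

# Values printed by qnetd/qdevice in the logs of this message.
OBSERVED_MS = {
    "qdevice heartbeat interval": 400,
    "qdevice connect timeout": 400,
    "cast vote timer": 250,
    "qnetd idle disconnect": 600,
}


def corosync_detection_ms(heartbeat_ms: int, failures_allowed: int) -> int:
    """Detection time as estimated in the post: (allowed + 1) heartbeats."""
    return (failures_allowed + 1) * heartbeat_ms


def ratios_to_timeout(observed: dict, timeout_ms: int) -> dict:
    """Express each observed timer as a fraction of the qdevice timeout."""
    return {name: value / timeout_ms for name, value in observed.items()}


if __name__ == "__main__":
    detection = corosync_detection_ms(HEARTBEAT_MS, HEARTBEAT_FAILURES_ALLOWED)
    print("corosync detection:", detection, "ms")
    for name, ratio in ratios_to_timeout(OBSERVED_MS, QDEVICE_TIMEOUT_MS).items():
        print(f"{name}: {ratio:.1f} x timeout")
```
</pre><p class=MsoNormal><span lang=EN-US>If these ratios hold in general (an assumption), lowering the timeout shrinks every qdevice/qnetd timer together, so a short stall on the link is enough to trip the idle disconnect.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>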
<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>Sometimes, when the primary host boots up again, the secondary holds quorum and keeps the app running; sometimes quorum is lost first (and Pacemaker stops the app) and then regained, and Pacemaker starts the app again, so an unwanted restart happens.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>My investigation shows the following difference between the “good” and “bad” cases:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>Good case: all the logs are clear and reasonable.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>Bad case: qnetd loses its connection to the second node just after it detects the loss of the connection to the “failed” node, and it may then take tens of seconds to restore it. All this time qdevice keeps trying to connect to qnetd and fails:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>Example: host 192.168.100.1 is the one that fails, 192.168.100.2 is the failover target:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>From qnetd:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:39 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.1:60686 doesn't sent any message during 600ms. 
Disconnecting<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:39 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.1:60686 (init_received 1, cluster bsc-test-cluster, node_id 1) disconnect<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms: Client 0x55a6fc6785b0 (cluster bsc-test-cluster, node_id 1) disconnect<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms: server going down 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>&gt;&gt;&gt; This disconnect is unexpected; in the normal scenario the connection persists<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:32790 doesn't sent any message during 600ms. Disconnecting<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:32790 (init_received 1, cluster bsc-test-cluster, node_id 2) disconnect<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms: Client 0x55a6fc6363d0 (cluster bsc-test-cluster, node_id 2) disconnect<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms: server going down 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: New client connected<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: cluster name = bsc-test-cluster<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: tls started = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: tls peer certificate verified = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: node_id = 
2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: pointer = 0x55a6fc6363d0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: addr_str = ::ffff:192.168.100.2:57736<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: ring id = (2.801)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: cluster dump:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: client = ::ffff:192.168.100.2:57736, node_id = 2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent initial node list.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: msg seq num = 99<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Node list:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: 0 node_id = 1, data_center_id = 0, node_state = not set<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: 1 node_id = 2, data_center_id = 0, node_state = not set<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: cluster bsc-test-cluster config_list has 2 nodes<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote is No change<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent membership node list.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter 
corosync-qnetd[6338]: msg seq num = 100<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: ring id = (2.801)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: heuristics = Undefined<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Node list:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: 0 node_id = 2, data_center_id = 0, node_state = not set<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: membership list from node 2 partition (2.801)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-util: partition (2.801) (0x55a6fc67e110) has 1 nodes<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1 partition. 
This is votequorum's problem, not ours<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote is ACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent quorum node list.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: msg seq num = 101<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: quorate = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Node list:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: 0 node_id = 1, data_center_id = 0, node_state = dead<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: 1 node_id = 2, data_center_id = 0, node_state = member<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: quorum node list from node 2 partition (2.801)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-util: partition (2.801) (0x55a6fc697e70) has 1 nodes<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1 partition. 
This is votequorum's problem, not ours<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote is ACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent quorum node list.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: msg seq num = 102<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: quorate = 1<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Node list:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: 0 node_id = 1, data_center_id = 0, node_state = dead<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: 1 node_id = 2, data_center_id = 0, node_state = member<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: quorum node list from node 2 partition (2.801)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-util: partition (2.801) (0x55a6fc669dc0) has 1 nodes<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1 partition. 
This is votequorum's problem, not ours<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote is ACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:58 arbiter corosync-qnetd[6338]: Client closed connection<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:58 arbiter corosync-qnetd[6338]: Client ::ffff:192.168.100.2:57674 (init_received 0, cluster bsc-test-cluster, node_id 0) disconnect<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>&gt;&gt;&gt; At this point the resource starts on the backup host<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>From qdevice:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 node2 corosync-qdevice[781]: Server closed connection<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 node2 corosync-qdevice[781]: algo-lms: disconnected. quorate = 1, WFA = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 node2 corosync-qdevice[781]: algo-lms: disconnected. 
reason = 22, WFA = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 node2 corosync-qdevice[781]: Algorithm result vote is NACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting NACK.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 node2 corosync-qdevice[781]: Sleeping for 161 ms before reconnect<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:40 node2 corosync-qdevice[781]: Trying connect to qnetd server arbiter:5403 (timeout = 400ms)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Connect timeout<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. quorate = 1, WFA = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. 
reason = 27, WFA = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is NACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting NACK.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd server arbiter:5403 (timeout = 400ms)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Votequorum nodelist notify callback:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Ring_id = (2.801)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Node list (size = 1):<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: 0 nodeid = 2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to pause cast vote timer and result vote is No change<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer is now paused.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: worker: qdevice_heuristics_worker_cmd_process_exec: Received exec command with seq_no "24" and timeout "1500"<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Received heuristics exec result command with seq_no "24" and result "Disabled"<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Votequorum heuristics exec result callback:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: seq_number = 24, exec_result = Disabled<o:p></o:p></span></p><p 
class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to not send list, result vote is No change and heuristics is Undefined<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer is no longer paused.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Not scheduling heuristics timer because mode is not enabled<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Votequorum quorum notify callback:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Quorate = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Node list (size = 3):<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: 0 nodeid = 1, state = 2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: 1 nodeid = 2, state = 1<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: 2 nodeid = 0, state = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: quorum_notify. quorate = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to not send list and result vote is No change<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Connect timeout<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. quorate = 0, WFA = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. 
reason = 27, WFA = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is NACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting NACK.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd server arbiter:5403 (timeout = 400ms)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>>>> At this point quorum reported lost<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Connect timeout<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. quorate = 0, WFA = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected. reason = 27, WFA = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is NACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting NACK.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>>>> This failure pattern repeats 31 times<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd server arbiter:5403 (timeout = 400ms)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:42 node2 corosync-qdevice[781]: Connect timeout<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:42 node2 corosync-qdevice[781]: algo-lms: disconnected. quorate = 0, WFA = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:42 node2 corosync-qdevice[781]: algo-lms: disconnected. 
reason = 27, WFA = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:42 node2 corosync-qdevice[781]: Algorithm result vote is NACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:42 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting NACK.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>>>> End of pattern repeat, continue<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Trying connect to qnetd server arbiter:5403 (timeout = 400ms)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Sending preinit msg to qnetd<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Received preinit reply msg<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Received init reply msg<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Scheduling send of heartbeat every 400ms<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Executing after-connect heuristics.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: worker: qdevice_heuristics_worker_cmd_process_exec: Received exec command with seq_no "25" and timeout "250"<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Received heuristics exec result command with seq_no "25" and result "Disabled"<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm decided to send config node list, send membership node list, send quorum node list, heuristics is Undefined and result vote is Wait for reply<o:p></o:p></span></p><p class=MsoNormal><span 
lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Sending set option seq = 98, HB(0) = 0ms, KAP Tie-breaker(1) = Enabled<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Sending config node list seq = 99<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Node list:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: 0 node_id = 1, data_center_id = 0, node_state = not set<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: 1 node_id = 2, data_center_id = 0, node_state = not set<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Sending membership node list seq = 100, ringid = (2.801), heuristics = Undefined.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Node list:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: 0 node_id = 2, data_center_id = 0, node_state = not set<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Sending quorum node list seq = 101, quorate = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Node list:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: 0 node_id = 1, data_center_id = 0, node_state = dead<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: 1 node_id = 2, data_center_id = 0, node_state = member<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer is now stopped.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Received set option reply seq(1) = 98, 
HB(0) = 0ms, KAP Tie-breaker(1) = Enabled<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Received initial config node list reply<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: seq = 99<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: vote = No change<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: ring id = (2.801)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is No change<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Received membership node list reply<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: seq = 100<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: vote = ACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: ring id = (2.801)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is ACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer is now scheduled every 250ms voting ACK.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Received quorum node list reply<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: seq = 101<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: vote = ACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: ring id = (2.801)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 
node2 corosync-qdevice[781]: Algorithm result vote is ACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting ACK.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Votequorum quorum notify callback:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Quorate = 1<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Node list (size = 3):<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: 0 nodeid = 1, state = 2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: 1 nodeid = 2, state = 1<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: 2 nodeid = 0, state = 0<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: algo-lms: quorum_notify. 
quorate = 1<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm decided to send list and result vote is No change<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Sending quorum node list seq = 102, quorate = 1<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Node list:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: 0 node_id = 1, data_center_id = 0, node_state = dead<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: 1 node_id = 2, data_center_id = 0, node_state = member<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Received quorum node list reply<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: seq = 102<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: vote = ACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: ring id = (2.801)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is ACK<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer remains scheduled every 250ms voting ACK.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>&gt;&gt;&gt; Here everything becomes OK and the resource is started on node2<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>Also, I made a Wireshark capture and found a great mess in TCP; it seems that the connection between qdevice and qnetd really stalls for some time and packets are not delivered.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;
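</o:p></span></p><p class=MsoNormal><span lang=EN-US>To put a number on the outage, the timestamps in the qdevice log above can be compared directly. Below is a minimal Python sketch, assuming syslog-style lines like the ones quoted; the function names are hypothetical, the year is an arbitrary placeholder because the log lines carry none, and the sample contains only two lines copied from the log above.<o:p></o:p></span></p><pre>
```python
from datetime import datetime

# Two lines copied from the qdevice log quoted in this message.
SAMPLE_LOG = """\
May 01 23:30:41 node2 corosync-qdevice[781]: Quorate = 0
May 01 23:30:56 node2 corosync-qdevice[781]: Quorate = 1
"""


def parse_timestamp(line: str, year: int = 2024) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog-style line.

    Assumption: the log has no year, so a placeholder year is supplied.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def quorum_outage_seconds(log_text: str) -> float:
    """Seconds between the first 'Quorate = 0' and the next 'Quorate = 1'."""
    lost_at = None
    for line in log_text.splitlines():
        if "Quorate = 0" in line and lost_at is None:
            lost_at = parse_timestamp(line)
        elif "Quorate = 1" in line and lost_at is not None:
            return (parse_timestamp(line) - lost_at).total_seconds()
    raise ValueError("no complete quorum loss/recovery pair found")


if __name__ == "__main__":
    print(quorum_outage_seconds(SAMPLE_LOG))
```
</pre><p class=MsoNormal><span lang=EN-US>For the two sample lines this gives 15 seconds without quorum, which is far outside the 5-7 sec target.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;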
</o:p></span></p><p class=MsoNormal><span lang=EN-US>My guess is that this coincides with Corosync’s sync activity, and I suspect that Corosync prevents any other traffic on the interface it uses for its rings.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>When I switch qnetd and qdevice to a different interface, it seems to work fine.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>So, the question is: does Corosync really temporarily block all other traffic on the interface it uses? Or is it just a coincidence? If it does block it, is there a way to manage that?<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>Thank you for any suggestions on that!<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>Sincerely,<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US>Alex<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p>&nbsp;</o:p></span></p></div></body></html>