<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Aug 7, 2019 at 1:00 PM Klaus Wenninger &lt;<a href="mailto:kwenning@redhat.com">kwenning@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<div class="gmail-m_-5926390740668612657moz-cite-prefix">On 8/7/19 12:26 PM, Momcilo Medic
wrote:<br>
</div>
<blockquote type="cite">
            <div dir="ltr"> We have a three-node cluster that is set up to stop
              resources on lost quorum.<br>
              Failure handling (network going down) works properly, but
              recovery doesn't seem to work.<br>
</div>
</blockquote>
<tt>What do you mean by 'network going down'?</tt><tt><br>
</tt><tt>Loss of link? Does the IP persist on the interface</tt><tt><br>
        </tt><tt>in that case?</tt><tt><br></tt></div></blockquote><div><br></div><div>Yes, we simulate a faulty cable by turning the switch ports down and back up.<br>In that case, the IP address does not persist on the interface.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><tt>
        </tt><tt>That there are issues reconnecting the CPG-API</tt><tt><br>
</tt><tt>sounds strange to me. Already the fact that</tt><tt><br>
</tt><tt>something has to be reconnected. I got it</tt><tt><br>
</tt><tt>that your nodes were persistently up during the</tt><tt><br>
</tt><tt>network-disconnection. Although I would have</tt><tt><br>
</tt><tt>expected fencing to kick in at least on those</tt><tt><br>
</tt><tt>which are part of the non-quorate cluster-partition.</tt><tt><br>
</tt><tt>Maybe a few words more on your scenario</tt><tt><br>
        </tt><tt>(fencing-setup e.g.) would help to understand what</tt><tt><br>
        </tt><tt>is going on.</tt><tt><br></tt></div></blockquote><div><br></div><div>We don't use any fencing mechanisms; we rely on quorum to run the services.<br>In more detail, we run a three-node, hyperconverged Linbit LINSTOR storage cluster.<br>That is, we run the clustered storage on the virtualization hypervisors themselves.<br><br>We use pcs to run the linstor-controller service in high-availability mode.<br>The no-quorum policy is to stop the resources.<br><br>In such a hyperconverged setup, we can't fence a node without impact.<br>Network instability may cause the primary node to stop being primary.<br>In that case, we don't want the running VMs to go down with the ship, as they were not affected.<br><br>However, we would like that service to become highly available again once the network is restored, without manual intervention.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><tt>
</tt><tt><br>
</tt><tt>Klaus</tt><br>
<blockquote type="cite">
<div dir="ltr"><br>
          What happens is that the services crash when we re-enable the network
          connection.<br>
<br>
From journal:<br>
<br>
```<br>
...<br>
Jul 12 00:27:32 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
corosync[9069]: corosync: totemsrp.c:1328:
memb_consensus_agreed: Assertion `token_memb_entries >= 1'
failed.<br>
Jul 12 00:27:33 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
attrd[9104]: error: Connection to the CPG API failed: Library
error (2)<br>
Jul 12 00:27:33 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
stonith-ng[9100]: error: Connection to the CPG API failed:
Library error (2)<br>
Jul 12 00:27:33 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
systemd[1]: corosync.service: Main process exited, code=dumped,
status=6/ABRT<br>
Jul 12 00:27:33 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
cib[9098]: error: Connection to the CPG API failed: Library
error (2)<br>
Jul 12 00:27:33 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
systemd[1]: corosync.service: Failed with result 'core-dump'.<br>
Jul 12 00:27:33 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
pacemakerd[9087]: error: Connection to the CPG API failed:
Library error (2)<br>
Jul 12 00:27:33 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
systemd[1]: pacemaker.service: Main process exited, code=exited,
status=107/n/a<br>
Jul 12 00:27:33 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
systemd[1]: pacemaker.service: Failed with result 'exit-code'.<br>
Jul 12 00:27:33 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
systemd[1]: Stopped Pacemaker High Availability Cluster Manager.<br>
Jul 12 00:27:33 <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
lrmd[9102]: warning: new_event_notification (9102-9107-7): Bad
file descriptor (9)<br>
...<br>
```<br>
Pacemaker's log shows no relevant info.<br>
<br>
This is from corosync's log:<br>
<br>
```<br>
Jul 12 00:27:33 [9107] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
crmd: info: qb_ipcs_us_withdraw: withdrawing server
sockets<br>
Jul 12 00:27:33 [9104] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
attrd: error: pcmk_cpg_dispatch: Connection to the CPG
API failed: Library error (2)<br>
Jul 12 00:27:33 [9100] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
stonith-ng: error: pcmk_cpg_dispatch: Connection to the
CPG API failed: Library error (2)<br>
Jul 12 00:27:33 [9098] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
cib: error: pcmk_cpg_dispatch: Connection to the CPG
API failed: Library error (2)<br>
Jul 12 00:27:33 [9087] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
pacemakerd: error: pcmk_cpg_dispatch: Connection to the
CPG API failed: Library error (2)<br>
Jul 12 00:27:33 [9104] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
attrd: info: qb_ipcs_us_withdraw: withdrawing server
sockets<br>
Jul 12 00:27:33 [9087] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
pacemakerd: info: crm_xml_cleanup: Cleaning up memory
from libxml2<br>
Jul 12 00:27:33 [9107] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
crmd: info: crm_xml_cleanup: Cleaning up memory from
libxml2<br>
Jul 12 00:27:33 [9100] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
stonith-ng: info: qb_ipcs_us_withdraw: withdrawing server
sockets<br>
Jul 12 00:27:33 [9104] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
attrd: info: crm_xml_cleanup: Cleaning up memory
from libxml2<br>
Jul 12 00:27:33 [9098] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
cib: info: qb_ipcs_us_withdraw: withdrawing server
sockets<br>
Jul 12 00:27:33 [9100] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
stonith-ng: info: crm_xml_cleanup: Cleaning up memory
from libxml2<br>
Jul 12 00:27:33 [9098] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
cib: info: qb_ipcs_us_withdraw: withdrawing server
sockets<br>
Jul 12 00:27:33 [9098] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
cib: info: qb_ipcs_us_withdraw: withdrawing server
sockets<br>
Jul 12 00:27:33 [9098] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
cib: info: crm_xml_cleanup: Cleaning up memory from
libxml2<br>
Jul 12 00:27:33 [9102] <a href="http://itaftestkvmls02.dc.itaf.eu" target="_blank">itaftestkvmls02.dc.itaf.eu</a>
lrmd: warning: qb_ipcs_event_sendv: new_event_notification
(9102-9107-7): Bad file descriptor (9)<br>
```<br>
<br>
Please let me know if you need any further info, I'll be more
than happy to provide it.<br>
<br>
This is always reproducible in our environment:<br>
Ubuntu 18.04.2<br>
corosync 2.4.3-0ubuntu1.1<br>
pcs 0.9.164-1<br>
<div>pacemaker 1.1.18-0ubuntu1.1</div>
<div><br>
</div>
<div>Kind regards,</div>
<div>Momo.<br>
</div>
</div>
<br>
</blockquote>
<br>
</div>
</blockquote></div></div>
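<div>For anyone triaging a similar journal excerpt, here is a minimal Python sketch (not part of corosync or pacemaker) that picks out the assertion failure and counts which daemons reported the CPG API error. The function name <tt>summarize</tt>, the regular expression, and the <tt>SAMPLE</tt> list are illustrative assumptions; the sample lines are copied from the journal quoted above and joined onto single lines.</div>
<pre>
```python
import re
from collections import Counter

# Hypothetical helper: matches "daemon[pid]:" in a journal line, e.g. "attrd[9104]:".
DAEMON_RE = re.compile(r"\b([\w.-]+)\[(\d+)\]:")


def summarize(lines):
    """Return (assertion lines, Counter of daemons that lost the CPG connection)."""
    assertions = []
    cpg_failures = Counter()
    for line in lines:
        match = DAEMON_RE.search(line)
        daemon = match.group(1) if match else "unknown"
        # The corosync abort shows up as a failed C assertion.
        if "Assertion" in line and "failed" in line:
            assertions.append(line.strip())
        # Each pacemaker daemon logs this once corosync has gone away.
        if "Connection to the CPG API failed" in line:
            cpg_failures[daemon] += 1
    return assertions, cpg_failures


# Sample lines taken from the journal excerpt in this thread.
SAMPLE = [
    "Jul 12 00:27:32 itaftestkvmls02.dc.itaf.eu corosync[9069]: corosync: "
    "totemsrp.c:1328: memb_consensus_agreed: Assertion `token_memb_entries >= 1' failed.",
    "Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu attrd[9104]: error: "
    "Connection to the CPG API failed: Library error (2)",
    "Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu stonith-ng[9100]: error: "
    "Connection to the CPG API failed: Library error (2)",
    "Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu cib[9098]: error: "
    "Connection to the CPG API failed: Library error (2)",
    "Jul 12 00:27:33 itaftestkvmls02.dc.itaf.eu pacemakerd[9087]: error: "
    "Connection to the CPG API failed: Library error (2)",
]

if __name__ == "__main__":
    found_assertions, failures = summarize(SAMPLE)
    print("assertions:", len(found_assertions))
    for name, count in sorted(failures.items()):
        print(name, count)
```
</pre>
<div>On the sample, this separates the single corosync assertion from the follow-on errors, which makes it easier to see that the pacemaker daemons only fail after corosync itself has aborted.</div>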