<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri",sans-serif;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:670106989;
mso-list-type:hybrid;
mso-list-template-ids:51526028 67698703 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level2
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level3
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level4
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level5
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level6
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
@list l0:level7
{mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level8
{mso-level-number-format:alpha-lower;
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
@list l0:level9
{mso-level-number-format:roman-lower;
mso-level-tab-stop:none;
mso-level-number-position:right;
text-indent:-9.0pt;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link="#0563C1" vlink="#954F72"><div class=WordSection1><p class=MsoNormal>Hello,<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>We have an issue with a two-node cluster where both nodes were put into standby without the resources being stopped first, so the resources still had target-role=Started. When the two nodes were rebooted, the corosync and pacemaker services started on the first node that came up and all the resources tried to start, which should not have happened (standby persists through reboots by default). <o:p></o:p></p><p class=MsoNormal>On closer inspection, we found that the node calculated a different node ID than it usually has. It entered the cluster with the same hostname but without the information saved under its previous ID, so it did not remember that it was in standby and tried to bring the resources up. I believe the issue is a consequence of two factors. First, the network interface that ring0 uses was in the state ‘setup-in-progress’ for some reason when corosync and pacemaker started. Why that was the case is still unknown. The corosync systemd unit should wait until network-online.target is reached, but that can mean various things and does not guarantee that a particular interface is up. In our case, we use a dedicated network interface with a 169.x.x.x address to connect to the other node. Other interfaces were up, which probably explains why the target was reached.<o:p></o:p></p><p class=MsoNormal>Normally, the nodeid calculated by corosync is something like 704514049, which corresponds to 169.254.8.1, the IP address of the ring0 interface (with the top bit cleared, since we set clear_node_high_bit: yes). <o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>In the failing case, that did not happen: the node got a nodeid of 2130706433, which converts to 127.0.0.1. 
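<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>The arithmetic behind those two numbers can be checked with a minimal Python sketch. It assumes (this matches the values in this report, but I have not verified it against the corosync source) that the auto-generated nodeid is the 32-bit integer form of the ring0 IPv4 address, with the top bit dropped when clear_node_high_bit is yes. The function names are illustrative only and are not part of corosync.<o:p></o:p></p><pre>
```python
import ipaddress

HIGH_BIT = 2**31  # value of the top bit in a 32-bit number


def ip_to_nodeid(ip, clear_high_bit=True):
    """Return the 32-bit integer form of an IPv4 address.

    Assumption: with clear_high_bit=True the top bit is dropped, which is
    what clear_node_high_bit: yes appears to do for the values in this report.
    """
    value = int(ipaddress.IPv4Address(ip))
    return value % HIGH_BIT if clear_high_bit else value


def nodeid_to_candidate_ips(nodeid):
    """Return the IPv4 addresses that could have produced this nodeid.

    Dropping the top bit loses information, so a nodeid below 2**31
    maps back to two possible addresses.
    """
    candidates = [nodeid]
    if nodeid in range(HIGH_BIT):
        candidates.append(nodeid + HIGH_BIT)
    return [str(ipaddress.IPv4Address(n)) for n in candidates]


print(ip_to_nodeid("169.254.8.1"))         # 704514049
print(ip_to_nodeid("127.0.0.1"))           # 2130706433
print(nodeid_to_candidate_ips(704514049))  # ['41.254.8.1', '169.254.8.1']
```
</pre><p class=MsoNormal>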
<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>On start, the following notable messages were logged:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><b>corosync[3965]: [TOTEM ] The network interface is down.<o:p></o:p></b></p><p class=MsoNormal><b>[TOTEM ] A new membership (127.0.0.1:4) was formed. Members joined: 2130706433<o:p></o:p></b></p><p class=MsoNormal>….<o:p></o:p></p><p class=MsoNormal><b>crmd[3978]: notice: Deleting unknown node 704514049/cbsta-mq1 which has conflicting uname with 2130706433<o:p></o:p></b></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>I believe the notice above is the point where the saved state of the correct node was lost. It indicates that the node that maps to the 169 address is being deleted and replaced with the node ID that maps to 127.0.0.1.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Then all the resources try to start on this node, which should not have happened (the node should have been in standby).<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>The pengine files confirm that the nodes were in standby, but the newly joined node ID did not carry that setting, so the resources started because their target-role had been Started before all of this happened.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Shortly afterwards, the interface we use for ring0 (eth-ha0) came up:<o:p></o:p></p><p class=MsoNormal><b>eth-ha0: link becomes ready<o:p></o:p></b></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>After that, the corosync service starts going down:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><b>2021-09-16T00:43:20.022106+01:00 cbsta-mq1 attrd[3976]: notice: crm_update_peer_proc: Node cbsta-mq1[2130706433] - state is now lost (was member)<o:p></o:p></b></p><p class=MsoNormal><b>2021-09-16T00:43:20.022255+01:00 cbsta-mq1 cib[3973]: notice: 
crm_update_peer_proc: Node cbsta-mq1[2130706433] - state is now lost (was member)<o:p></o:p></b></p><p class=MsoNormal><b>2021-09-16T00:43:20.022373+01:00 cbsta-mq1 attrd[3976]: notice: Removing all cbsta-mq1 attributes for attrd_peer_change_cb<o:p></o:p></b></p><p class=MsoNormal><b>2021-09-16T00:43:20.022524+01:00 cbsta-mq1 cib[3973]: notice: Removing cbsta-mq1/2130706433 from the membership list<o:p></o:p></b></p><p class=MsoNormal><b>2021-09-16T00:43:20.022639+01:00 cbsta-mq1 attrd[3976]: notice: Lost attribute writer cbsta-mq1<o:p></o:p></b></p><p class=MsoNormal><b>2021-09-16T00:43:20.022743+01:00 cbsta-mq1 cib[3973]: notice: Purged 1 peers with id=2130706433 and/or uname=cbsta-mq1 from the membership cache<o:p></o:p></b></p><p class=MsoNormal><b>2021-09-16T00:43:20.022857+01:00 cbsta-mq1 attrd[3976]: notice: Removing cbsta-mq1/2130706433 from the membership list<o:p></o:p></b></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>The service then restarts, and this time it gets the correct node ID (the one that maps to the 169 address). <o:p></o:p></p><p class=MsoNormal><b><o:p> </o:p></b></p><p class=MsoNormal><b>2021-09-16T00:43:20.369715+01:00 cbsta-mq1 corosync[12434]: [TOTEM ] A new membership (169.254.8.1:12) was formed. Members joined: 704514049<o:p></o:p></b></p><p class=MsoNormal><b>2021-09-16T00:43:20.369830+01:00 cbsta-mq1 corosync[12434]: [QUORUM] Members[1]: 704514049<o:p></o:p></b></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>It then tries to start the resources again, apparently because the delete shown earlier discarded the previous node information. <o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>The root issues appear to be:<o:p></o:p></p><ol style='margin-top:0in' start=1 type=1><li class=MsoListParagraph style='margin-left:0in;mso-list:l0 level1 lfo1'>The eth-ha0 interface (ring0) was not completely up when corosync started. 
I may be able to do something to ensure the interface is completely up…<o:p></o:p></li><li class=MsoListParagraph style='margin-left:0in;mso-list:l0 level1 lfo1'>I believe our corosync.conf may need to be tuned (see below).<o:p></o:p></li><li class=MsoListParagraph style='margin-left:0in;mso-list:l0 level1 lfo1'>I believe we may need to adjust our /etc/hosts, as the hostname from uname -n maps back to 127.0.0.1, which I suspect is not what works best with corosync. <o:p></o:p></li></ol><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>The following is our corosync.conf:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><b>totem {<o:p></o:p></b></p><p class=MsoNormal><b> version: 2<o:p></o:p></b></p><p class=MsoNormal><b> cluster_name: cluster2<o:p></o:p></b></p><p class=MsoNormal><b> clear_node_high_bit: yes<o:p></o:p></b></p><p class=MsoNormal><b> crypto_hash: sha1<o:p></o:p></b></p><p class=MsoNormal><b> crypto_cipher: aes256<o:p></o:p></b></p><p class=MsoNormal><b> rrp_mode: active<o:p></o:p></b></p><p class=MsoNormal><b> wait_time: 150<o:p></o:p></b></p><p class=MsoNormal><b># transport: udp<o:p></o:p></b></p><p class=MsoNormal><b> interface {<o:p></o:p></b></p><p class=MsoNormal><b> ringnumber: 0<o:p></o:p></b></p><p class=MsoNormal><b> bindnetaddr: 169.254.3.0<o:p></o:p></b></p><p class=MsoNormal><b> mcastaddr: 239.255.1.2<o:p></o:p></b></p><p class=MsoNormal><b> mcastport: 5405<o:p></o:p></b></p><p class=MsoNormal><b> }<o:p></o:p></b></p><p class=MsoNormal><b> interface {<o:p></o:p></b></p><p class=MsoNormal><b> ringnumber: 1<o:p></o:p></b></p><p class=MsoNormal><b> bindnetaddr: 172.31.0.0<o:p></o:p></b></p><p class=MsoNormal><b> mcastaddr: 239.255.2.2<o:p></o:p></b></p><p class=MsoNormal><b> mcastport: 5407<o:p></o:p></b></p><p class=MsoNormal><b> }<o:p></o:p></b></p><p class=MsoNormal><b>}<o:p></o:p></b></p><p class=MsoNormal><b><o:p> </o:p></b></p><p class=MsoNormal><b>logging {<o:p></o:p></b></p><p class=MsoNormal><b> 
fileline: on<o:p></o:p></b></p><p class=MsoNormal><b> to_stderr: no<o:p></o:p></b></p><p class=MsoNormal><b> to_logfile: yes<o:p></o:p></b></p><p class=MsoNormal><b> logfile: /var/log/cluster/corosync.log<o:p></o:p></b></p><p class=MsoNormal><b> to_syslog: yes<o:p></o:p></b></p><p class=MsoNormal><b> debug: on<o:p></o:p></b></p><p class=MsoNormal><b> timestamp: on<o:p></o:p></b></p><p class=MsoNormal><b> logger_subsys {<o:p></o:p></b></p><p class=MsoNormal><b> subsys: QUORUM<o:p></o:p></b></p><p class=MsoNormal><b> debug: on<o:p></o:p></b></p><p class=MsoNormal><b> }<o:p></o:p></b></p><p class=MsoNormal><b>}<o:p></o:p></b></p><p class=MsoNormal><b><o:p> </o:p></b></p><p class=MsoNormal><b>quorum {<o:p></o:p></b></p><p class=MsoNormal><b> # Enable and configure quorum subsystem (default: off)<o:p></o:p></b></p><p class=MsoNormal><b> # see also corosync.conf.5 and votequorum.5<o:p></o:p></b></p><p class=MsoNormal><b> provider: corosync_votequorum<o:p></o:p></b></p><p class=MsoNormal><b> expected_votes: 1<o:p></o:p></b></p><p class=MsoNormal><b> two_node: 0<o:p></o:p></b></p><p class=MsoNormal><b>}<o:p></o:p></b></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Note that we don’t have a nodelist section. I believe corosync relies on bindnetaddr and derives the node ID from the IP address it finds in that range. 
<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>I am wondering whether we should add something like this:<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><b>nodelist {<o:p></o:p></b></p><p class=MsoNormal><b> node {<o:p></o:p></b></p><p class=MsoNormal><b> ring0_addr: m660b-qproc4-HA<o:p></o:p></b></p><p class=MsoNormal><b> nodeid: 1<o:p></o:p></b></p><p class=MsoNormal><b> }<o:p></o:p></b></p><p class=MsoNormal><b> node {<o:p></o:p></b></p><p class=MsoNormal><b> ring0_addr: m660b-qproc3-HA<o:p></o:p></b></p><p class=MsoNormal><b> nodeid: 2<o:p></o:p></b></p><p class=MsoNormal><b> }<o:p></o:p></b></p><p class=MsoNormal><b>}<o:p></o:p></b></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Here the hostnames map to the 169.x.x.x addresses of each node in the cluster. <o:p></o:p></p><p class=MsoNormal>I think that will (a) ensure the node ID is a stable value (always 1 or 2, not calculated by corosync) and (b) map our ring addresses to the 169 addresses. Is that right?<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Finally, am I correct that the hostnames listed in the nodelist above should be set in /etc/hosts to point to the 169 address of each host, NOT to a hostname that resolves to 127.0.0.1?<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Any guidance on these issues, and more generally on how to keep the cluster from calculating a node ID from the 127.0.0.1 address (which makes it lose its “usual” configuration), would be appreciated. 
In most cases, the eth-ha0 interface is up by the time corosync starts, but in the cases where it is not (which happens at random), what I described above happens.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Thank you.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><div style='mso-element:para-border-div;border:none;border-top:solid #CCCCCC 1.0pt;padding:4.0pt 0in 0in 0in'><p class=MsoNormal style='margin-bottom:3.0pt;border:none;padding:0in'><b><span style='color:#444444'>Greg Neitzert</span></b><span style='color:#444444'> | Lead Software Engineer | </span><span style='color:#444444'>RTC Software Engineering 2B - Middleware</span><span style='font-size:9.0pt;color:#444444'> <o:p></o:p></span></p><p class=MsoNormal style='margin-bottom:3.0pt;border:none;padding:0in'><span style='font-size:9.0pt;color:#444444'>Unisys | Ph: 612-486-9662 | Cell: 605-929-9118 | </span><span style='font-size:9.0pt'><a href="mailto:Greg.Neitzert@unisys.com"><span style='color:blue'>Greg.Neitzert@unisys.com</span></a><span style='color:#444444'> <o:p></o:p></span></span></p></div><p class=MsoNormal style='margin-bottom:3.0pt'><span style='font-size:9.0pt;color:#444444'>Home Based – Sioux Falls, SD USA<o:p></o:p></span></p><p class=MsoNormal><o:p> </o:p></p></div></body></html>