<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p><tt>Hi Everyone.</tt></p>
<p><tt>I have a 16-node asynchronous cluster configured with the Corosync
redundant ring feature.</tt></p>
<p><tt>Each node has two identically connected and configured NICs. One NIC
is connected to the public network,</tt></p>
<p><tt>the other to our private VLAN. When I checked the status of the
Corosync rings, I found:</tt><tt><br>
</tt></p>
<p><tt># corosync-cfgtool -s</tt><tt><br>
</tt><tt>Printing ring status.</tt><tt><br>
</tt><tt>Local node ID 1</tt><tt><br>
</tt><tt>RING ID 0</tt><tt><br>
</tt><tt> id = 192.168.1.54</tt><tt><br>
</tt><tt> status = Marking ringid 0 interface 192.168.1.54
FAULTY</tt><tt><br>
</tt><tt>RING ID 1</tt><tt><br>
</tt><tt> id = 111.11.11.1</tt><tt><br>
</tt><tt> status = ring 1 active with no faults</tt></p>
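<p><tt>For checking this on all 16 nodes, a minimal sketch like the one
below could pick the faulty rings out of the status text. It assumes the
"corosync-cfgtool -s" output format shown above, with each status on a
single line; the name faulty_rings and the SAMPLE constant are only
illustrative.</tt></p>
<pre>
```python
import re

# Sample text copied from the output above (status joined onto one line).
SAMPLE = """Printing ring status.
Local node ID 1
RING ID 0
 id = 192.168.1.54
 status = Marking ringid 0 interface 192.168.1.54 FAULTY
RING ID 1
 id = 111.11.11.1
 status = ring 1 active with no faults
"""


def faulty_rings(text):
    """Return {ring_id: interface_address} for rings whose status contains FAULTY."""
    faulty = {}
    ring = None
    addr = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"RING ID (\d+)$", line)
        if m:
            # A new ring section starts; forget the previous address.
            ring, addr = int(m.group(1)), None
        elif line.startswith("id") and "=" in line:
            addr = line.split("=", 1)[1].strip()
        elif line.startswith("status") and "FAULTY" in line and ring is not None:
            faulty[ring] = addr
    return faulty


if __name__ == "__main__":
    print(faulty_rings(SAMPLE))
```
</pre>
<p><tt>On a real node the text would come from running the command itself,
for example through subprocess, instead of the hard-coded sample.</tt></p>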
<p><tt>After some digging I </tt><tt><span
id="result_box" class="short_text" lang="en"><span class="">found</span></span>
that if I re-enable the failed ring with the command:</tt></p>
<p><tt> # corosync-cfgtool -r</tt></p>
<p><tt>RING ID 0 is marked as "active" for a few minutes, but
after that it is marked as faulty again, permanently.</tt></p>
<p><tt>The log contains no useful information, just a single message:</tt></p>
<p><tt>corosync[21740]: [TOTEM ] Marking ringid 0 interface
192.168.1.54 FAULTY</tt></p>
<p><tt>There is no message like:</tt></p>
<p><tt>[TOTEM ] Automatically recovered ring 1</tt></p>
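<p><tt>To see how long the ring stays up between a re-enable and the next
FAULTY mark, a rough sketch like the following could pull the ring events
out of the log. It assumes the two message formats quoted above; the names
ring_events, FAULTY_RE and RECOVERED_RE are hypothetical.</tt></p>
<pre>
```python
import re

# Patterns modelled on the two log messages quoted above.
FAULTY_RE = re.compile(r"\[TOTEM\s*\] Marking ringid (\d+) interface (\S+) FAULTY")
RECOVERED_RE = re.compile(r"\[TOTEM\s*\] Automatically recovered ring (\d+)")


def ring_events(lines):
    """Return a list of (kind, ring_id, address) tuples in log order.

    kind is "faulty" or "recovered"; address is None for recoveries.
    """
    events = []
    for line in lines:
        m = FAULTY_RE.search(line)
        if m:
            events.append(("faulty", int(m.group(1)), m.group(2)))
            continue
        m = RECOVERED_RE.search(line)
        if m:
            events.append(("recovered", int(m.group(1)), None))
    return events


if __name__ == "__main__":
    sample = [
        "corosync[21740]: [TOTEM ] Marking ringid 0 interface 192.168.1.54 FAULTY",
        "[TOTEM ] Automatically recovered ring 1",
    ]
    for event in ring_events(sample):
        print(event)
```
</pre>
<p><tt>The same function could be fed the lines of
/var/log/cluster/corosync.log, the logfile set in my configuration.</tt></p>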
<p><tt><br>
</tt></p>
<p><tt>My corosync.conf looks like:</tt><tt><br>
</tt>
<tt><br>
</tt><tt>compatibility: whitetank
</tt><tt><br>
</tt>
<tt><br>
</tt><tt>totem {
</tt><tt><br>
</tt><tt> version: 2
</tt><tt><br>
</tt><tt> secauth: on
</tt><tt><br>
</tt><tt> threads: 4
</tt><tt><br>
</tt><tt> rrp_mode: passive</tt><tt><br>
</tt>
<tt><br>
</tt><tt> interface {
</tt><tt><br>
</tt>
<tt><br>
</tt><tt> member {
</tt><tt><br>
</tt><tt> memberaddr: PRIVATE_IP_1
</tt><tt><br>
</tt><tt> }
</tt><tt><br>
</tt>
<tt><br>
</tt><tt>...
</tt><tt><br>
</tt>
<tt><br>
</tt><tt> member {
</tt><tt><br>
</tt><tt> memberaddr: PRIVATE_IP_16</tt><tt><br>
</tt><tt> }
</tt><tt><br>
</tt>
<tt><br>
</tt><tt> ringnumber: 0
</tt><tt><br>
</tt><tt> bindnetaddr: PRIVATE_NET_ADDR
</tt><tt><br>
</tt><tt> mcastaddr: 226.0.0.1
</tt><tt><br>
</tt><tt> mcastport: 5505</tt><tt><br>
</tt><tt> ttl: 1
</tt><tt><br>
</tt><tt> }
</tt><tt><br>
</tt>
<tt><br>
</tt><tt> interface {
</tt><tt><br>
</tt>
<tt><br>
</tt><tt> member {
</tt><tt><br>
</tt><tt> memberaddr: PUBLIC_IP_1
</tt><tt><br>
</tt><tt> }
</tt><tt><br>
</tt><tt>...
</tt><tt><br>
</tt>
<tt><br>
</tt><tt> member {
</tt><tt><br>
</tt><tt> memberaddr: PUBLIC_IP_16</tt><tt><br>
</tt><tt> }
</tt><tt><br>
</tt>
<tt><br>
</tt><tt> ringnumber: 1
</tt><tt><br>
</tt><tt> bindnetaddr: PUBLIC_NET_ADDR
</tt><tt><br>
</tt><tt> mcastaddr: 224.0.0.1
</tt><tt><br>
</tt><tt> mcastport: 5405
</tt><tt><br>
</tt><tt> ttl: 1
</tt><tt><br>
</tt><tt> }
</tt><tt><br>
</tt>
<tt><br>
</tt><tt> transport: udpu</tt><tt><br>
</tt><tt>}</tt></p>
<p><tt>logging {</tt><tt><br>
</tt><tt> to_stderr: no</tt><tt><br>
</tt><tt> to_logfile: yes</tt><tt><br>
</tt><tt> logfile: /var/log/cluster/corosync.log</tt><tt><br>
</tt><tt> logfile_priority: info</tt><tt><br>
</tt><tt> to_syslog: yes</tt><tt><br>
</tt><tt> syslog_priority: warning</tt><tt><br>
</tt><tt> debug: on</tt><tt><br>
</tt><tt> timestamp: on</tt><tt><br>
</tt><tt>}</tt></p>
<p><tt>I tried changing rrp_mode and the mcastaddr/mcastport for
ringnumber: 0, but the result was the same.</tt></p>
<p><tt>I checked multicast/unicast operation using the omping utility
and didn't find any issues.</tt><tt><br>
</tt></p>
<p><tt>Also, no errors were found on the network equipment for our
private VLAN.</tt><tt><br>
</tt></p>
<p><tt>Why did Corosync decide to permanently disable ring 0? How
can I debug the issue?</tt><tt><br>
</tt></p>
<p><tt>Other properties:</tt><tt><br>
</tt></p>
<p><tt>Corosync Cluster Engine, version '1.4.7'</tt><tt><br>
</tt></p>
<tt>Pacemaker properties:
</tt><tt><br>
</tt><tt> cluster-infrastructure: cman
</tt><tt><br>
</tt><tt> cluster-recheck-interval: 5min
</tt><tt><br>
</tt><tt> dc-version: 1.1.14-8.el6-70404b0
</tt><tt><br>
</tt><tt> expected-quorum-votes: 3
</tt><tt><br>
</tt><tt> have-watchdog: false
</tt><tt><br>
</tt><tt> last-lrm-refresh: 1484068350
</tt><tt><br>
</tt><tt> maintenance-mode: false
</tt><tt><br>
</tt><tt> no-quorum-policy: ignore
</tt><tt><br>
</tt><tt> pe-error-series-max: 1000
</tt><tt><br>
</tt><tt> pe-input-series-max: 1000
</tt><tt><br>
</tt><tt> pe-warn-series-max: 1000
</tt><tt><br>
</tt><tt> stonith-action: reboot
</tt><tt><br>
</tt><tt> stonith-enabled: false
</tt><tt><br>
</tt><tt> symmetric-cluster: false
</tt><tt><br>
</tt><tt>
</tt><tt><br>
</tt><tt>Thank you.</tt><br>
<pre class="moz-signature" cols="72">--
Regards Denis Gribkov</pre>
</body>
</html>