[ClusterLabs] Re: Corosync ring marked as FAULTY

Wed Feb 22 02:37:10 EST 2017

Is "ttl 1" a good idea for a public network?

>>> Denis Gribkov <dun at itsts.net> wrote on 21.02.2017 at 18:26 in message
<4f5543c4-b80c-659d-ed5e-7a99e1482ced at itsts.net>:
> Hi Everyone.
> 
> I have a 16-node asynchronous cluster configured with the Corosync
> redundant ring feature.
> 
> Each node has 2 similarly connected and configured NICs: one NIC is
> connected to the public network, the other to our private VLAN. When I
> checked the operability of the Corosync rings I found:
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
>          id      = 192.168.1.54
>          status  = Marking ringid 0 interface 192.168.1.54 FAULTY
> RING ID 1
>          id      = 111.11.11.1
>          status  = ring 1 active with no faults
> 
> After some digging I found that if I re-enable the failed ring with the
> command:
> 
> # corosync-cfgtool -r
> 
> RING ID 0 is marked as "active" for a few minutes, but after that it is
> marked as faulty permanently.
> 
> The log has no useful info, just a single message:
> 
> corosync[21740]:   [TOTEM ] Marking ringid 0 interface 192.168.1.54 FAULTY
> 
> And there is no message like:
> 
> [TOTEM ] Automatically recovered ring 1
> 
> 
> My corosync.conf looks like:
> 
> compatibility: whitetank
> 
> totem {
>          version: 2
>          secauth: on
>          threads: 4
>          rrp_mode: passive
> 
>          interface {
> 
>                  member {
>                          memberaddr: PRIVATE_IP_1
>                  }
> 
> ...
> 
>                  member {
>                          memberaddr: PRIVATE_IP_16
>                  }
> 
>                  ringnumber: 0
>                  bindnetaddr: PRIVATE_NET_ADDR
>                  mcastaddr: 226.0.0.1
>                  mcastport: 5505
>                  ttl: 1
>          }
> 
>         interface {
> 
>                  member {
>                          memberaddr: PUBLIC_IP_1
>                  }
> ...
> 
>                  member {
>                          memberaddr: PUBLIC_IP_16
>                  }
> 
>                  ringnumber: 1
>                  bindnetaddr: PUBLIC_NET_ADDR
>                  mcastaddr: 224.0.0.1
>                  mcastport: 5405
>                  ttl: 1
>          }
> 
>          transport: udpu
> }
> 
> logging {
>          to_stderr: no
>          to_logfile: yes
>          logfile: /var/log/cluster/corosync.log
>          logfile_priority: info
>          to_syslog: yes
>          syslog_priority: warning
>          debug: on
>          timestamp: on
> }
> 
> I tried changing rrp_mode and the mcastaddr/mcastport for ringnumber: 0,
> but the result was similar.
> 
> I checked multicast/unicast operability using the omping utility and
> didn't find any issues.
> 
> Also, no errors were found on the network equipment for our private VLAN.
> 
> Why did Corosync decide to permanently disable the second ring? How can I
> debug the issue?
> 
> Other properties:
> 
> Corosync Cluster Engine, version '1.4.7'
> 
> Pacemaker properties:
>   cluster-infrastructure: cman
>   cluster-recheck-interval: 5min
>   dc-version: 1.1.14-8.el6-70404b0
>   expected-quorum-votes: 3
>   have-watchdog: false
>   last-lrm-refresh: 1484068350
>   maintenance-mode: false
>   no-quorum-policy: ignore
>   pe-error-series-max: 1000
>   pe-input-series-max: 1000
>   pe-warn-series-max: 1000
>   stonith-action: reboot
>   stonith-enabled: false
>   symmetric-cluster: false
> 
> Thank you.
> 
> -- 
> Regards Denis Gribkov
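
Since the ring only goes FAULTY a few minutes after "corosync-cfgtool -r", it may help to record the ring status periodically and correlate the time of the failure with other events on the private VLAN. Below is a minimal sketch in Python that parses the "corosync-cfgtool -s" output quoted above and reports which rings are faulty. The function names (parse_ring_status, faulty_rings) are hypothetical, and the parser assumes the output format is exactly the one shown in the quoted message (Corosync 1.4.7); other versions may print it differently.

```python
import re


def parse_ring_status(text):
    """Parse `corosync-cfgtool -s` output into {ring_id: {...}}.

    Assumes the layout shown in the quoted message: a "RING ID n" line
    followed by indented "id = ..." and "status = ..." lines.
    """
    rings = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"RING ID (\d+)$", line)
        if match:
            current = int(match.group(1))
            rings[current] = {"id": None, "status": None, "faulty": False}
            continue
        if current is None or "=" not in line:
            continue  # header lines such as "Printing ring status."
        key, value = (part.strip() for part in line.split("=", 1))
        if key in ("id", "status"):
            rings[current][key] = value
        if key == "status":
            # Case-sensitive on purpose: "no faults" must not match.
            rings[current]["faulty"] = "FAULTY" in value
    return rings


def faulty_rings(text):
    """Return the sorted list of ring IDs whose status contains FAULTY."""
    return sorted(r for r, info in parse_ring_status(text).items() if info["faulty"])


# Sample input copied from the quoted message.
SAMPLE = """\
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.1.54
        status  = Marking ringid 0 interface 192.168.1.54 FAULTY
RING ID 1
        id      = 111.11.11.1
        status  = ring 1 active with no faults
"""

if __name__ == "__main__":
    print(faulty_rings(SAMPLE))
```

On a cluster node the text could come from running "corosync-cfgtool -s" (for example via subprocess.run with capture_output=True) from cron on every node, logging a timestamp whenever faulty_rings() returns a non-empty list. That would at least show whether all 16 nodes mark ring 0 faulty at the same moment or only some of them.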