[ClusterLabs] Re: Corosync ring marked as FAULTY
bliu
bliu at suse.com
Wed Feb 22 03:02:58 EST 2017
Hi Denis,

Could you try running tcpdump with the filter "udp port 5505" on the
private network to see whether any packets arrive there?
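For example (eth1 is only an assumed name here; substitute the NIC that
carries the private ring):

# tcpdump -ni eth1 udp port 5505   # eth1: assumed private NIC name

If corosync traffic from the other nodes shows up there, the network
path is fine and the fault is more likely on the corosync side.
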
On 02/22/2017 03:47 PM, Denis Gribkov wrote:
>
> In our case this does not create problems, since all nodes are located
> in a few networks served by a single router.
>
> Also, unlike private ring 0, no errors are detected on public ring 1.
>
> I suspect this error could be related to the private VLAN settings, but
> unfortunately I have no good idea how to track down the issue.
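>
> The only extra check I can think of so far is looking at the error and
> drop counters on the private NIC, for example (eth1 is a placeholder
> for the private interface):
>
> # ip -s link show eth1   # -s prints RX/TX error/drop counters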
>
> On 22/02/17 09:37, Ulrich Windl wrote:
>> Is "ttl 1" a good idea for a public network?
>>
>>>>> Denis Gribkov <dun at itsts.net> wrote on 21.02.2017 at 18:26 in message
>> <4f5543c4-b80c-659d-ed5e-7a99e1482ced at itsts.net>:
>>> Hi Everyone.
>>>
>>> I have a 16-node asymmetric cluster configured with the Corosync
>>> redundant ring feature.
>>>
>>> Each node has two similarly connected/configured NICs: one connected to
>>> the public network, the other to our private VLAN. When I checked the
>>> operability of the Corosync rings, I found:
>>>
>>> # corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>>         id      = 192.168.1.54
>>>         status  = Marking ringid 0 interface 192.168.1.54 FAULTY
>>> RING ID 1
>>>         id      = 111.11.11.1
>>>         status  = ring 1 active with no faults
>>>
>>> After some digging I found that if I re-enable the failed ring with the
>>> command:
>>>
>>> # corosync-cfgtool -r
>>>
>>> RING ID 0 is marked "active" for a few minutes, but after that it is
>>> permanently marked FAULTY again.
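>>>
>>> (A check of this form can be left running to catch the moment the ring
>>> flips back; "watch" is assumed to be available here, any shell loop
>>> would do:)
>>>
>>> # watch -n 5 corosync-cfgtool -s   # re-print ring status every 5 s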
>>>
>>> The log has no useful info, just a single message:
>>>
>>> corosync[21740]: [TOTEM ] Marking ringid 0 interface 192.168.1.54 FAULTY
>>>
>>> And there is no message like:
>>>
>>> [TOTEM ] Automatically recovered ring 1
>>>
>>>
>>> My corosync.conf looks like:
>>>
>>> compatibility: whitetank
>>>
>>> totem {
>>>         version: 2
>>>         secauth: on
>>>         threads: 4
>>>         rrp_mode: passive
>>>
>>>         interface {
>>>                 member {
>>>                         memberaddr: PRIVATE_IP_1
>>>                 }
>>>
>>>                 ...
>>>
>>>                 member {
>>>                         memberaddr: PRIVATE_IP_16
>>>                 }
>>>
>>>                 ringnumber: 0
>>>                 bindnetaddr: PRIVATE_NET_ADDR
>>>                 mcastaddr: 226.0.0.1
>>>                 mcastport: 5505
>>>                 ttl: 1
>>>         }
>>>
>>>         interface {
>>>                 member {
>>>                         memberaddr: PUBLIC_IP_1
>>>                 }
>>>
>>>                 ...
>>>
>>>                 member {
>>>                         memberaddr: PUBLIC_IP_16
>>>                 }
>>>
>>>                 ringnumber: 1
>>>                 bindnetaddr: PUBLIC_NET_ADDR
>>>                 mcastaddr: 224.0.0.1
>>>                 mcastport: 5405
>>>                 ttl: 1
>>>         }
>>>
>>>         transport: udpu
>>> }
>>>
>>> logging {
>>>         to_stderr: no
>>>         to_logfile: yes
>>>         logfile: /var/log/cluster/corosync.log
>>>         logfile_priority: info
>>>         to_syslog: yes
>>>         syslog_priority: warning
>>>         debug: on
>>>         timestamp: on
>>> }
>>>
>>> I tried changing rrp_mode and the mcastaddr/mcastport for ringnumber: 0,
>>> but the result was similar.
>>>
>>> I checked multicast/unicast connectivity using the omping utility and
>>> didn't find any issues.
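>>>
>>> The check was roughly of this form, run on all nodes in parallel
>>> (node1..node3 are placeholders for the 16 cluster hosts):
>>>
>>> # omping -m 226.0.0.1 -p 5505 node1 node2 node3   # node names: placeholders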
>>>
>>> Also, no errors were found on the network equipment of our private VLAN.
>>>
>>> Why did Corosync decide to permanently disable the second ring? How can
>>> I debug the issue?
>>>
>>> Other properties:
>>>
>>> Corosync Cluster Engine, version '1.4.7'
>>>
>>> Pacemaker properties:
>>> cluster-infrastructure: cman
>>> cluster-recheck-interval: 5min
>>> dc-version: 1.1.14-8.el6-70404b0
>>> expected-quorum-votes: 3
>>> have-watchdog: false
>>> last-lrm-refresh: 1484068350
>>> maintenance-mode: false
>>> no-quorum-policy: ignore
>>> pe-error-series-max: 1000
>>> pe-input-series-max: 1000
>>> pe-warn-series-max: 1000
>>> stonith-action: reboot
>>> stonith-enabled: false
>>> symmetric-cluster: false
>>>
>>> Thank you.
>>>
>>> --
>>> Regards Denis Gribkov
>
> --
> Regards Denis Gribkov