[ClusterLabs] wireshark cannot recognize corosync packets

Sun Mar 19 23:57:50 EDT 2017

The config file I sent you may be wrong.  Since all nodes are virtual machines,
they may have been re-deployed before I got the config file. But I'm sure I
checked the config files were consistent and downloaded the logs before re-deploy.

>> For now I guess reason can be one of:
>> - ifdown on one of other nodes which made whole membership broken
I checked my colleague's operations, and found this may be right.

Unfortunately, our production was cancelled last week, and all environments 
were destroyed. I have no resources to help you to diagnose the problem any more.
And I have to stop the work on pacemaker and corosync.
I'm really sorry I have not helped to resolve the corosync segfault.

Thanks very much for all the help from you, the maintainers, and the community.

At 2017-03-17 22:30:01, "Jan Friesse" <jfriesse at redhat.com> wrote:
>> I have checked all the config files are the same, except bindnetaddr.
>> So I'm sending only logs.
>
>I'm not sure the config files match the log files: the config file
>contains nodes 200.201.162.(52|53|54), but the log files contain IPs
>200.201.162.(52|53|55).
>
>Can you confirm that the node with IP 200.201.162.54 exists and that it
>shouldn't be 200.201.162.55 (or that 200.201.162.55 shouldn't have IP
>200.201.162.54)?
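A check like the one Honza describes can be scripted. The following is only a
minimal sketch: it assumes the config and the logs are available as plain
text, and it simply collects every IPv4 address that appears in each. The
function names are hypothetical, the config fragment is an illustrative
nodelist, and the log text is a placeholder, not real corosync output.

```python
import re

# Matches dotted-quad strings; it does not validate octet ranges.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def ipv4_addresses(text):
    """Return the set of IPv4-looking strings found in text."""
    return set(IPV4_RE.findall(text))


def compare_addresses(config_text, log_text):
    """Return (only_in_config, only_in_log) as sorted lists."""
    in_config = ipv4_addresses(config_text)
    in_log = ipv4_addresses(log_text)
    return sorted(in_config - in_log), sorted(in_log - in_config)


# Illustrative nodelist fragment (assumption, not the poster's actual file).
config_text = """
nodelist {
    node {
        ring0_addr: 200.201.162.52
    }
    node {
        ring0_addr: 200.201.162.53
    }
    node {
        ring0_addr: 200.201.162.54
    }
}
"""

# Placeholder text standing in for the logs; only the addresses matter here.
log_text = "addresses seen: 200.201.162.52, 200.201.162.53, 200.201.162.55"

only_in_config, only_in_log = compare_addresses(config_text, log_text)
print("only in config:", only_in_config)
print("only in logs:  ", only_in_log)
```

On a real corosync.conf the bindnetaddr value would also be picked up, so it
may need to be filtered out before comparing.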
>
>Honza
>
>>
>> At 2017-03-16 15:54, "Jan Friesse" <jfriesse at redhat.com> wrote:
>>
>>> corosync.conf and debug logs are in attachment.
>>
>> Thanks for them. They look really interesting. As can be seen
>>
>> Mar 14 11:37:28 [57827] node-132.acloud.vt corosync debug   [TOTEM ]
>> timer_function_orf_token_timeout The token was lost in the
>>   OPERATIONAL state.
>>
>> corosync correctly detected the token loss. Also
>>
>> Mar 14 11:44:41 [57827] node-132.acloud.vt corosync debug   [TOTEM ]
>> memb_state_gather_enter entering GATHER state from 11(merge during join).
>>
>> says it correctly detected the merge. But after that it gets weird:
>> Mar 14 11:44:54 [57827] node-132.acloud.vt corosync debug   [TOTEM ]
>> memb_state_gather_enter entering GATHER state from 0(consensus timeout).
>> Mar 14 11:45:06 [57827] node-132.acloud.vt corosync debug   [TOTEM ]
>> memb_state_gather_enter entering GATHER state from 0(consensus timeout).
>> ...
>> Mar 14 12:55:47 [154709] node-132.acloud.vt corosync debug   [TOTEM ]
>> memb_state_gather_enter entering GATHER state from 0(consensus timeout)
>>
>> So even after the two other nodes merged, something still prevents
>> corosync from reaching consensus.
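To see how long a node keeps cycling through GATHER, the debug log can be
summarized by reason. This is a rough sketch, not an official tool: it assumes
each log entry is on a single unwrapped line in the format quoted in this
thread, and the function name is hypothetical. The sample lines are the
excerpts quoted here, with the mail wrapping undone.

```python
import re
from collections import Counter

# Timestamp at line start, then the numeric code and the reason text
# from "entering GATHER state from N(reason)".
GATHER_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*"
    r"entering GATHER state from (\d+)\(([^)]*)\)"
)


def summarize_gather(lines):
    """Count GATHER entries per reason; return (counts, first_ts, last_ts)."""
    reasons = Counter()
    first = last = None
    for line in lines:
        m = GATHER_RE.match(line)
        if not m:
            continue  # not a GATHER transition
        timestamp, _code, reason = m.groups()
        reasons[reason] += 1
        if first is None:
            first = timestamp
        last = timestamp
    return reasons, first, last


sample = [
    "Mar 14 11:37:28 [57827] node-132.acloud.vt corosync debug   [TOTEM ] "
    "timer_function_orf_token_timeout The token was lost in the OPERATIONAL state.",
    "Mar 14 11:44:41 [57827] node-132.acloud.vt corosync debug   [TOTEM ] "
    "memb_state_gather_enter entering GATHER state from 11(merge during join).",
    "Mar 14 11:44:54 [57827] node-132.acloud.vt corosync debug   [TOTEM ] "
    "memb_state_gather_enter entering GATHER state from 0(consensus timeout).",
    "Mar 14 11:45:06 [57827] node-132.acloud.vt corosync debug   [TOTEM ] "
    "memb_state_gather_enter entering GATHER state from 0(consensus timeout).",
    "Mar 14 12:55:47 [154709] node-132.acloud.vt corosync debug   [TOTEM ] "
    "memb_state_gather_enter entering GATHER state from 0(consensus timeout)",
]

counts, first_seen, last_seen = summarize_gather(sample)
for reason, n in counts.most_common():
    print(f"{n:4d}  {reason}")
print("first:", first_seen, " last:", last_seen)
```

A long run of "consensus timeout" entries between the first and last
timestamps, as in the excerpts here, is the pattern that suggests the
membership never settles.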
>>
>> Would it be possible to also attach the other nodes' logs/configs?
>>
>> For now I guess reason can be one of:
>> - ifdown on one of other nodes which made whole membership broken
>> - different node list in config between nodes
>> - "forget" node with node list containing one of the 200.201.162.x nodes
>>
>> Regards,
>>    Honza
>>>
>>> And two messages from kernel:
>>>
>>> 2017-03-14 11:37:20.097233 - info  e1000: eth0 NIC Link is Down
>>>
>>> 2017-03-14 11:44:41.032121 - info  e1000: eth0 NIC Link is Up 1000 Mbps
>>> Full Duplex, Flow Control: RX
>>>
>>>
>>> Thanks.
>>>
>>>
>>> On 2017/3/15 16:29, Jan Friesse wrote:
>>>>> Yesterday I found corosync took almost one hour to form a cluster(a
>>>>> failed node came back online).
>>>>
>>>> This for sure shouldn't happen (at least with default timeout settings).
>>>>
>>>>>
>>>>> So I captured some corosync packets, and opened the pcap file in
>>>>> wireshark.
>>>>>
>>>>> But wireshark only displayed raw udp, no totem.
>>>>>
>>>>> Wireshark version is 2.2.5. I'm sure it supports corosync totem.
>>>>>
>>>>> corosync is 2.4.0.
>>>>
>>>> Wireshark has a corosync dissector, but only for version 1.x. 2.x is
>>>> not supported yet.
>>>>
>>>>>
>>>>> And if corosync takes too long to form a cluster, how to diagnose it?
>>>>>
>>>>> I read the logs, but could not figure it out.
>>>>
>>>> Logs, especially when debug is enabled, usually have enough info. Can
>>>> you paste your config + logs?
>>>>
>>>> Regards,
>>>>    Honza
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list: Users at clusterlabs.org
>>>>> http://lists.clusterlabs.org/mailman/listinfo/users
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org