[ClusterLabs] corosync race condition when node leaves immediately after joining
Jan Friesse
jfriesse at redhat.com
Thu Oct 12 06:54:55 EDT 2017
Jonathan,
>
>
> On 12/10/17 07:48, Jan Friesse wrote:
>> Jonathan,
>> I believe the main "problem" is votequorum's ability to work during the
>> sync phase (votequorum is the only service with this ability; see
>> votequorum_overview.8, section VIRTUAL SYNCHRONY)...
>>
>>> Hi ClusterLabs,
>>>
>>> I'm seeing a race condition in corosync where votequorum can have
>>> incorrect membership info when a node joins the cluster then leaves very
>>> soon after.
>>>
>>> I'm on corosync-2.3.4 plus my patch
I finally noticed this ^^^: 2.3.4 is really old and, unless it is some
patched version, I wouldn't recommend using it. Can you give the
current needle a try?
>>> https://github.com/corosync/corosync/pull/248. That patch makes the
>>> problem readily reproducible but the bug was already present.
>>>
>>> Here's the scenario. I have two hosts, cluster1 and cluster2. The
>>> corosync.conf on cluster2 is:
>>>
>>> totem {
>>> version: 2
>>> cluster_name: test
>>> config_version: 2
>>> transport: udpu
>>> }
>>> nodelist {
>>> node {
>>> nodeid: 1
>>> ring0_addr: cluster1
>>> }
>>> node {
>>> nodeid: 2
>>> ring0_addr: cluster2
>>> }
>>> }
>>> quorum {
>>> provider: corosync_votequorum
>>> auto_tie_breaker: 1
>>> }
>>> logging {
>>> to_syslog: yes
>>> }
>>>
>>> The corosync.conf on cluster1 is the same except with
>>> "config_version: 1".
>>>
>>> I start corosync on cluster2. When I start corosync on cluster1, it
>>> joins and then immediately leaves due to the lower config_version.
>>> (Previously corosync on cluster2 would also exit but with
>>> https://github.com/corosync/corosync/pull/248 it remains alive.)
>>>
>>> But often at this point, cluster1's disappearance is not reflected in
>>> the votequorum info on cluster2:
>>
>> ... Is this permanent (= until a new node joins/leaves), or will it fix
>> itself over a (short) time? If it is permanent, it's a bug. If it
>> fixes itself, it's a result of votequorum not being virtually synchronous.
>
> Yes, it's permanent. After several minutes of waiting, votequorum still
> reports "total votes: 2" even though there's only one member.
That's bad. I've tried the following setup:
- Both nodes with current needle
- Your config
- Second node is just running corosync
- First node is running the following command:
while true; do
    corosync -f
    ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1
done
I've been running it for quite a while and I'm unable to reproduce the
bug. Sadly, I can't reproduce it even with 2.3.4. Do you think this
reproducer is correct?
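For what it's worth, a stricter check than grepping for "1" could compare
the "Total votes" value against the sum of the Votes column in the
membership table, since that mismatch is exactly the reported symptom.
A minimal sketch, assuming the corosync-quorumtool output layout quoted in
this thread (votes_consistent is a hypothetical helper name, not part of
corosync):

```shell
#!/bin/sh
# Hypothetical helper: reads `corosync-quorumtool` output on stdin.
# Exit status: 0 if "Total votes" equals the sum of the Votes column in
# the membership table, 1 if they differ, 2 if no "Total votes" line
# was found. Assumes the "Nodeid Votes Name" layout quoted in this thread.
votes_consistent() {
    awk '
        /^Total votes:/           { total = $3 }
        /^Membership information/ { in_members = 1; next }
        # Data rows of the membership table start with a numeric nodeid.
        in_members && $1 ~ /^[0-9]+$/ { sum += $2 }
        END {
            if (total == "") exit 2
            if (total + 0 == sum + 0) exit 0
            exit 1
        }
    '
}
```

It could stand in for the grep pipeline in the loop, for example:

    ssh node2 corosync-quorumtool | votes_consistent || exit 1

Layouts with extra columns or non-numeric vote values are not handled
here, so treat it as a starting point only.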
Honza
>
> Thanks,
> Jonathan
>
>>>
>>> Quorum information
>>> ------------------
>>> Date: Tue Oct 10 16:43:50 2017
>>> Quorum provider: corosync_votequorum
>>> Nodes: 1
>>> Node ID: 2
>>> Ring ID: 700
>>> Quorate: Yes
>>>
>>> Votequorum information
>>> ----------------------
>>> Expected votes: 2
>>> Highest expected: 2
>>> Total votes: 2
>>> Quorum: 2
>>> Flags: Quorate AutoTieBreaker
>>>
>>> Membership information
>>> ----------------------
>>> Nodeid Votes Name
>>> 2 1 cluster2 (local)
>>>
>>> The logs on cluster1 show:
>>>
>>> Oct 10 16:43:37 cluster1 corosync[15750]: [CMAP ] Received config
>>> version (2) is different than my config version (1)! Exiting
>>>
>>> The logs on cluster2 show:
>>>
>>> Oct 10 16:43:37 cluster2 corosync[5102]: [TOTEM ] A new membership
>>> (10.71.218.17:588) was formed. Members joined: 1
>>> Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] This node is
>>> within the primary component and will provide service.
>>> Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] Members[1]: 2
>>> Oct 10 16:43:37 cluster2 corosync[5102]: [TOTEM ] A new membership
>>> (10.71.218.18:592) was formed. Members left: 1
>>> Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] Members[1]: 2
>>> Oct 10 16:43:37 cluster2 corosync[5102]: [MAIN ] Completed
>>> service synchronization, ready to provide service.
>>>
>>> It looks like QUORUM has seen cluster1's arrival but not its departure!
>>>
>>> When it works as expected, the state is left consistent:
>>>
>>> Quorum information
>>> ------------------
>>> Date: Tue Oct 10 16:58:14 2017
>>> Quorum provider: corosync_votequorum
>>> Nodes: 1
>>> Node ID: 2
>>> Ring ID: 604
>>> Quorate: No
>>>
>>> Votequorum information
>>> ----------------------
>>> Expected votes: 2
>>> Highest expected: 2
>>> Total votes: 1
>>> Quorum: 2 Activity blocked
>>> Flags: AutoTieBreaker
>>>
>>> Membership information
>>> ----------------------
>>> Nodeid Votes Name
>>> 2 1 cluster2 (local)
>>>
>>> Logs on cluster1:
>>>
>>> Oct 10 16:58:01 cluster1 corosync[16430]: [CMAP ] Received config
>>> version (2) is different than my config version (1)! Exiting
>>>
>>> Logs on cluster2 are either:
>>>
>>> Oct 10 16:58:01 cluster2 corosync[17835]: [TOTEM ] A new
>>> membership (10.71.218.17:600) was formed. Members joined: 1
>>> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] This node is
>>> within the primary component and will provide service.
>>> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] Members[1]: 2
>>> Oct 10 16:58:01 cluster2 corosync[17835]: [CMAP ] Highest config
>>> version (2) and my config version (2)
>>> Oct 10 16:58:01 cluster2 corosync[17835]: [TOTEM ] A new
>>> membership (10.71.218.18:604) was formed. Members left: 1
>>> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] This node is
>>> within the non-primary component and will NOT provide any services.
>>> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] Members[1]: 2
>>> Oct 10 16:58:01 cluster2 corosync[17835]: [MAIN ] Completed
>>> service synchronization, ready to provide service.
>>>
>>> ... in which it looks like QUORUM has seen cluster1's arrival *and* its
>>> departure,
>>>
>>> or:
>>>
>>> Oct 10 16:59:03 cluster2 corosync[18841]: [TOTEM ] A new
>>> membership (10.71.218.17:632) was formed. Members joined: 1
>>> Oct 10 16:59:03 cluster2 corosync[18841]: [CMAP ] Highest config
>>> version (2) and my config version (2)
>>> Oct 10 16:59:03 cluster2 corosync[18841]: [TOTEM ] A new
>>> membership (10.71.218.18:636) was formed. Members left: 1
>>> Oct 10 16:59:03 cluster2 corosync[18841]: [QUORUM] Members[1]: 2
>>> Oct 10 16:59:03 cluster2 corosync[18841]: [MAIN ] Completed
>>> service synchronization, ready to provide service.
>>>
>>> ... in which it looks like QUORUM never noticed cluster1's brief
>>> presence.
>>>
>>> Any thoughts?
>>>
>>> Thanks,
>>> Jonathan
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> http://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org