[ClusterLabs] corosync race condition when node leaves immediately after joining

Jan Friesse jfriesse at redhat.com
Thu Oct 12 12:54:55 CEST 2017


Jonathan,

>
>
> On 12/10/17 07:48, Jan Friesse wrote:
>> Jonathan,
>> I believe the main "problem" is votequorum's ability to work during the
>> sync phase (votequorum is the only service with this ability, see
>> votequorum_overview.8 section VIRTUAL SYNCHRONY)...
>>
>>> Hi ClusterLabs,
>>>
>>> I'm seeing a race condition in corosync where votequorum can have
>>> incorrect membership info when a node joins the cluster then leaves very
>>> soon after.
>>>
>>> I'm on corosync-2.3.4 plus my patch

I only just noticed ^^^: 2.3.4 is really old, and unless it is some
patched version I wouldn't recommend using it. Can you give current
needle a try?

>>> https://github.com/corosync/corosync/pull/248. That patch makes the
>>> problem readily reproducible but the bug was already present.
>>>
>>> Here's the scenario. I have two hosts, cluster1 and cluster2. The
>>> corosync.conf on cluster2 is:
>>>
>>>      totem {
>>>        version: 2
>>>        cluster_name: test
>>>        config_version: 2
>>>        transport: udpu
>>>      }
>>>      nodelist {
>>>        node {
>>>          nodeid: 1
>>>          ring0_addr: cluster1
>>>        }
>>>        node {
>>>          nodeid: 2
>>>          ring0_addr: cluster2
>>>        }
>>>      }
>>>      quorum {
>>>        provider: corosync_votequorum
>>>        auto_tie_breaker: 1
>>>      }
>>>      logging {
>>>        to_syslog: yes
>>>      }
>>>
>>> The corosync.conf on cluster1 is the same except with
>>> "config_version: 1".
>>>
>>> I start corosync on cluster2. When I start corosync on cluster1, it
>>> joins and then immediately leaves due to the lower config_version.
>>> (Previously corosync on cluster2 would also exit but with
>>> https://github.com/corosync/corosync/pull/248 it remains alive.)
>>>
>>> But often at this point, cluster1's disappearance is not reflected in
>>> the votequorum info on cluster2:
>>
>> ... Is this permanent (= until a new node joins/leaves), or will it fix
>> itself over a (short) time? If this is permanent, it's a bug. If it
>> fixes itself, it's a result of votequorum not being virtually synchronous.
>
> Yes, it's permanent. After several minutes of waiting, votequorum still
> reports "total votes: 2" even though there's only one member.


That's bad. I've tried the following setup:

- Both nodes with current needle
- Your config
- The second node is just running corosync
- The first node is running the following command:
   while true; do corosync -f; ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1; done

I've been running it for quite a while and I'm unable to reproduce the
bug. Sadly, I can't reproduce it even with 2.3.4. Do you think that the
reproducer is correct?
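
The "grep Total | grep 1" check in the loop only looks at a single line.
A stricter check could compare "Total votes" against the sum of the votes
in the membership table. Below is a minimal sketch of that idea in Python.
It assumes the corosync-quorumtool output layout quoted further down in
this thread and a cluster without a quorum device. The function names
(parse_quorumtool, votes_consistent) and the shortened sample strings are
only illustrative, they are not part of corosync.

```python
import re


def parse_quorumtool(text):
    """Extract 'Total votes' and the membership table from quorumtool output.

    Only handles the layout quoted in this thread (assumption); other
    versions or a configured quorum device may print extra columns.
    """
    info = {"total_votes": None, "members": {}}
    in_membership = False
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Total votes:\s+(\d+)", line)
        if m:
            info["total_votes"] = int(m.group(1))
            continue
        if line.startswith("Membership information"):
            in_membership = True
            continue
        if in_membership:
            # Table rows look like: "2          1 cluster2 (local)"
            m = re.match(r"(\d+)\s+(\d+)\s+\S+", line)
            if m:
                info["members"][int(m.group(1))] = int(m.group(2))
    return info


def votes_consistent(info):
    """True if 'Total votes' equals the sum of the listed members' votes."""
    return info["total_votes"] == sum(info["members"].values())


# Shortened copies of the two outputs quoted in this thread.
STALE_SAMPLE = """
Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2

Membership information
----------------------
    Nodeid      Votes Name
         2          1 cluster2 (local)
"""

GOOD_SAMPLE = """
Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked

Membership information
----------------------
    Nodeid      Votes Name
         2          1 cluster2 (local)
"""

stale = parse_quorumtool(STALE_SAMPLE)
good = parse_quorumtool(GOOD_SAMPLE)
```

In the reproducer loop, the output of "ssh node2 corosync-quorumtool" could
be passed to parse_quorumtool() and the loop stopped as soon as
votes_consistent() returns False.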

Honza


>
> Thanks,
> Jonathan
>
>>>
>>>      Quorum information
>>>      ------------------
>>>      Date:             Tue Oct 10 16:43:50 2017
>>>      Quorum provider:  corosync_votequorum
>>>      Nodes:            1
>>>      Node ID:          2
>>>      Ring ID:          700
>>>      Quorate:          Yes
>>>
>>>      Votequorum information
>>>      ----------------------
>>>      Expected votes:   2
>>>      Highest expected: 2
>>>      Total votes:      2
>>>      Quorum:           2
>>>      Flags:            Quorate AutoTieBreaker
>>>
>>>      Membership information
>>>      ----------------------
>>>          Nodeid      Votes Name
>>>               2          1 cluster2 (local)
>>>
>>> The logs on cluster1 show:
>>>
>>>      Oct 10 16:43:37 cluster1 corosync[15750]:  [CMAP  ] Received config
>>> version (2) is different than my config version (1)! Exiting
>>>
>>> The logs on cluster2 show:
>>>
>>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership
>>> (10.71.218.17:588) was formed. Members joined: 1
>>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] This node is
>>> within the primary component and will provide service.
>>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
>>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership
>>> (10.71.218.18:592) was formed. Members left: 1
>>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
>>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [MAIN  ] Completed
>>> service synchronization, ready to provide service.
>>>
>>> It looks like QUORUM has seen cluster1's arrival but not its departure!
>>>
>>> When it works as expected, the state is left consistent:
>>>
>>>      Quorum information
>>>      ------------------
>>>      Date:             Tue Oct 10 16:58:14 2017
>>>      Quorum provider:  corosync_votequorum
>>>      Nodes:            1
>>>      Node ID:          2
>>>      Ring ID:          604
>>>      Quorate:          No
>>>
>>>      Votequorum information
>>>      ----------------------
>>>      Expected votes:   2
>>>      Highest expected: 2
>>>      Total votes:      1
>>>      Quorum:           2 Activity blocked
>>>      Flags:            AutoTieBreaker
>>>
>>>      Membership information
>>>      ----------------------
>>>          Nodeid      Votes Name
>>>               2          1 cluster2 (local)
>>>
>>> Logs on cluster1:
>>>
>>>      Oct 10 16:58:01 cluster1 corosync[16430]:  [CMAP  ] Received config
>>> version (2) is different than my config version (1)! Exiting
>>>
>>> Logs on cluster2 are either:
>>>
>>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new
>>> membership (10.71.218.17:600) was formed. Members joined: 1
>>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is
>>> within the primary component and will provide service.
>>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] Members[1]: 2
>>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [CMAP  ] Highest config
>>> version (2) and my config version (2)
>>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new
>>> membership (10.71.218.18:604) was formed. Members left: 1
>>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is
>>> within the non-primary component and will NOT provide any services.
>>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] Members[1]: 2
>>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [MAIN  ] Completed
>>> service synchronization, ready to provide service.
>>>
>>> ... in which it looks like QUORUM has seen cluster1's arrival *and* its
>>> departure,
>>>
>>> or:
>>>
>>>      Oct 10 16:59:03 cluster2 corosync[18841]:  [TOTEM ] A new
>>> membership (10.71.218.17:632) was formed. Members joined: 1
>>>      Oct 10 16:59:03 cluster2 corosync[18841]:  [CMAP  ] Highest config
>>> version (2) and my config version (2)
>>>      Oct 10 16:59:03 cluster2 corosync[18841]:  [TOTEM ] A new
>>> membership (10.71.218.18:636) was formed. Members left: 1
>>>      Oct 10 16:59:03 cluster2 corosync[18841]:  [QUORUM] Members[1]: 2
>>>      Oct 10 16:59:03 cluster2 corosync[18841]:  [MAIN  ] Completed
>>> service synchronization, ready to provide service.
>>>
>>> ... in which it looks like QUORUM never noticed cluster1's brief
>>> presence.
>>>
>>> Any thoughts?
>>>
>>> Thanks,
>>> Jonathan
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> http://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org



