[ClusterLabs] corosync race condition when node leaves immediately after joining

Jonathan Davies jonathan.davies at citrix.com
Thu Oct 12 05:45:18 EDT 2017



On 12/10/17 07:48, Jan Friesse wrote:
> Jonathan,
> I believe the main "problem" is votequorum's ability to work during the sync phase 
> (votequorum is the only service with this ability; see 
> votequorum_overview.8, section VIRTUAL SYNCHRONY)...
> 
>> Hi ClusterLabs,
>>
>> I'm seeing a race condition in corosync where votequorum can have
>> incorrect membership info when a node joins the cluster then leaves very
>> soon after.
>>
>> I'm on corosync-2.3.4 plus my patch
>> https://github.com/corosync/corosync/pull/248. That patch makes the
>> problem readily reproducible but the bug was already present.
>>
>> Here's the scenario. I have two hosts, cluster1 and cluster2. The
>> corosync.conf on cluster2 is:
>>
>>      totem {
>>        version: 2
>>        cluster_name: test
>>        config_version: 2
>>        transport: udpu
>>      }
>>      nodelist {
>>        node {
>>          nodeid: 1
>>          ring0_addr: cluster1
>>        }
>>        node {
>>          nodeid: 2
>>          ring0_addr: cluster2
>>        }
>>      }
>>      quorum {
>>        provider: corosync_votequorum
>>        auto_tie_breaker: 1
>>      }
>>      logging {
>>        to_syslog: yes
>>      }
>>
>> The corosync.conf on cluster1 is the same except with
>> "config_version: 1".
>>
>> I start corosync on cluster2. When I start corosync on cluster1, it
>> joins and then immediately leaves due to the lower config_version.
>> (Previously corosync on cluster2 would also exit but with
>> https://github.com/corosync/corosync/pull/248 it remains alive.)
>>
>> But often at this point, cluster1's disappearance is not reflected in
>> the votequorum info on cluster2:
> 
> ... Is this permanent (= until a new node joins or leaves), or will it fix 
> itself over a (short) time? If this is permanent, it's a bug. If it fixes 
> itself, it's a result of votequorum not being virtually synchronous.

Yes, it's permanent. After several minutes of waiting, votequorum still 
reports "total votes: 2" even though there's only one member.
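
For anyone wanting to detect this stale state automatically, here is a
minimal sketch. It assumes the status text has the same layout as the
output quoted below (I believe this is what "corosync-quorumtool -s"
prints, but treat that as an assumption). The names parse_quorum_status,
is_consistent, STALE and CONSISTENT are made up for illustration; the two
sample strings are trimmed copies of the outputs in this thread.

```python
import re

# Trimmed copy of the inconsistent status from this thread.
STALE = """
Nodes:            1
Total votes:      2
Quorum:           2

Membership information
----------------------
    Nodeid      Votes Name
         2          1 cluster2 (local)
"""

# Trimmed copy of the expected status from this thread.
CONSISTENT = """
Nodes:            1
Total votes:      1
Quorum:           2 Activity blocked

Membership information
----------------------
    Nodeid      Votes Name
         2          1 cluster2 (local)
"""


def parse_quorum_status(text):
    """Return (fields, members): 'Key: value' pairs and {nodeid: votes}."""
    fields = {}
    members = {}
    in_membership = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Membership information"):
            in_membership = True
            continue
        if in_membership:
            # Member rows look like: "<nodeid> <votes> <name> ..."
            m = re.match(r"^(\d+)\s+(\d+)\s+\S+", line)
            if m:
                members[int(m.group(1))] = int(m.group(2))
            continue
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields, members


def is_consistent(text):
    """True if the node count and total votes match the membership table."""
    fields, members = parse_quorum_status(text)
    return (int(fields["Nodes"]) == len(members)
            and int(fields["Total votes"]) == sum(members.values()))


if __name__ == "__main__":
    print("stale sample consistent?", is_consistent(STALE))
    print("expected sample consistent?", is_consistent(CONSISTENT))
```

This only compares the reported total against the sum of the listed
members' votes; it says nothing about why votequorum missed the departure.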

Thanks,
Jonathan

>>
>>      Quorum information
>>      ------------------
>>      Date:             Tue Oct 10 16:43:50 2017
>>      Quorum provider:  corosync_votequorum
>>      Nodes:            1
>>      Node ID:          2
>>      Ring ID:          700
>>      Quorate:          Yes
>>
>>      Votequorum information
>>      ----------------------
>>      Expected votes:   2
>>      Highest expected: 2
>>      Total votes:      2
>>      Quorum:           2
>>      Flags:            Quorate AutoTieBreaker
>>
>>      Membership information
>>      ----------------------
>>          Nodeid      Votes Name
>>               2          1 cluster2 (local)
>>
>> The logs on cluster1 show:
>>
>>      Oct 10 16:43:37 cluster1 corosync[15750]:  [CMAP  ] Received config version (2) is different than my config version (1)! Exiting
>>
>> The logs on cluster2 show:
>>
>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership (10.71.218.17:588) was formed. Members joined: 1
>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] This node is within the primary component and will provide service.
>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership (10.71.218.18:592) was formed. Members left: 1
>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
>>      Oct 10 16:43:37 cluster2 corosync[5102]:  [MAIN  ] Completed service synchronization, ready to provide service.
>>
>> It looks like QUORUM has seen cluster1's arrival but not its departure!
>>
>> When it works as expected, the state is left consistent:
>>
>>      Quorum information
>>      ------------------
>>      Date:             Tue Oct 10 16:58:14 2017
>>      Quorum provider:  corosync_votequorum
>>      Nodes:            1
>>      Node ID:          2
>>      Ring ID:          604
>>      Quorate:          No
>>
>>      Votequorum information
>>      ----------------------
>>      Expected votes:   2
>>      Highest expected: 2
>>      Total votes:      1
>>      Quorum:           2 Activity blocked
>>      Flags:            AutoTieBreaker
>>
>>      Membership information
>>      ----------------------
>>          Nodeid      Votes Name
>>               2          1 cluster2 (local)
>>
>> Logs on cluster1:
>>
>>      Oct 10 16:58:01 cluster1 corosync[16430]:  [CMAP  ] Received config version (2) is different than my config version (1)! Exiting
>>
>> Logs on cluster2 are either:
>>
>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new membership (10.71.218.17:600) was formed. Members joined: 1
>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is within the primary component and will provide service.
>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] Members[1]: 2
>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [CMAP  ] Highest config version (2) and my config version (2)
>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new membership (10.71.218.18:604) was formed. Members left: 1
>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] Members[1]: 2
>>      Oct 10 16:58:01 cluster2 corosync[17835]:  [MAIN  ] Completed service synchronization, ready to provide service.
>>
>> ... in which it looks like QUORUM has seen cluster1's arrival *and* its
>> departure,
>>
>> or:
>>
>>      Oct 10 16:59:03 cluster2 corosync[18841]:  [TOTEM ] A new membership (10.71.218.17:632) was formed. Members joined: 1
>>      Oct 10 16:59:03 cluster2 corosync[18841]:  [CMAP  ] Highest config version (2) and my config version (2)
>>      Oct 10 16:59:03 cluster2 corosync[18841]:  [TOTEM ] A new membership (10.71.218.18:636) was formed. Members left: 1
>>      Oct 10 16:59:03 cluster2 corosync[18841]:  [QUORUM] Members[1]: 2
>>      Oct 10 16:59:03 cluster2 corosync[18841]:  [MAIN  ] Completed service synchronization, ready to provide service.
>>
>> ... in which it looks like QUORUM never noticed cluster1's brief 
>> presence.
>>
>> Any thoughts?
>>
>> Thanks,
>> Jonathan
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
