[ClusterLabs] corosync race condition when node leaves immediately after joining
Jonathan Davies
jonathan.davies at citrix.com
Thu Oct 12 05:45:18 EDT 2017
On 12/10/17 07:48, Jan Friesse wrote:
> Jonathan,
> I believe the main "problem" is votequorum's ability to work during the
> sync phase (votequorum is the only service with this ability; see
> votequorum_overview.8, section VIRTUAL SYNCHRONY)...
>
>> Hi ClusterLabs,
>>
>> I'm seeing a race condition in corosync where votequorum can have
>> incorrect membership info when a node joins the cluster then leaves very
>> soon after.
>>
>> I'm on corosync-2.3.4 plus my patch
>> https://github.com/corosync/corosync/pull/248. That patch makes the
>> problem readily reproducible but the bug was already present.
>>
>> Here's the scenario. I have two hosts, cluster1 and cluster2. The
>> corosync.conf on cluster2 is:
>>
>> totem {
>>     version: 2
>>     cluster_name: test
>>     config_version: 2
>>     transport: udpu
>> }
>> nodelist {
>>     node {
>>         nodeid: 1
>>         ring0_addr: cluster1
>>     }
>>     node {
>>         nodeid: 2
>>         ring0_addr: cluster2
>>     }
>> }
>> quorum {
>>     provider: corosync_votequorum
>>     auto_tie_breaker: 1
>> }
>> logging {
>>     to_syslog: yes
>> }
>>
>> The corosync.conf on cluster1 is the same except with "config_version:
>> 1".
>>
>> I start corosync on cluster2. When I start corosync on cluster1, it
>> joins and then immediately leaves due to the lower config_version.
>> (Previously corosync on cluster2 would also exit but with
>> https://github.com/corosync/corosync/pull/248 it remains alive.)
>>
>> But often at this point, cluster1's disappearance is not reflected in
>> the votequorum info on cluster2:
>
> ... Is this permanent (i.e. until the next node join/leave), or will it
> fix itself over a (short) time? If it is permanent, it's a bug. If it
> fixes itself, it's a result of votequorum not being virtually synchronous.
Yes, it's permanent. After several minutes of waiting, votequorum still
reports "total votes: 2" even though there's only one member.
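
For anyone who wants to check for this state automatically, here is a minimal sketch that compares "Total votes" against the sum of the Votes column in the membership table. It assumes the quorum status text has the same layout as the output quoted below and that every member row is "nodeid votes name". The function names parse_quorum_status and is_consistent are made up for illustration and are not part of corosync. Output from setups that print extra columns (for example with a quorum device) would need a different row pattern.

```python
import re

# Sample status text in the layout quoted in this thread (stale case).
STALE = """\
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate AutoTieBreaker

Membership information
----------------------
Nodeid Votes Name
2 1 cluster2 (local)
"""

# Same layout, but with the vote count matching the membership list.
GOOD = STALE.replace("Total votes: 2", "Total votes: 1")


def parse_quorum_status(text):
    """Return (total_votes, [votes per listed member]) from status text."""
    total = None
    member_votes = []
    in_members = False
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Total votes:\s*(\d+)", line)
        if m:
            total = int(m.group(1))
            continue
        # The membership rows start after the "Nodeid Votes Name" header.
        if re.match(r"Nodeid\s+Votes\s+Name", line):
            in_members = True
            continue
        if in_members:
            row = re.match(r"(\d+)\s+(\d+)\s+\S+", line)
            if row:
                member_votes.append(int(row.group(2)))
    return total, member_votes


def is_consistent(text):
    """True if Total votes equals the sum of votes of the listed members."""
    total, member_votes = parse_quorum_status(text)
    if total is None:
        raise ValueError("no 'Total votes' line found")
    return total == sum(member_votes)


if __name__ == "__main__":
    print(is_consistent(STALE))  # False: 2 total votes, 1 member vote
    print(is_consistent(GOOD))   # True
```

Run against the stale output, the check fails because one listed member with one vote cannot account for two total votes.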
Thanks,
Jonathan
>>
>> Quorum information
>> ------------------
>> Date: Tue Oct 10 16:43:50 2017
>> Quorum provider: corosync_votequorum
>> Nodes: 1
>> Node ID: 2
>> Ring ID: 700
>> Quorate: Yes
>>
>> Votequorum information
>> ----------------------
>> Expected votes: 2
>> Highest expected: 2
>> Total votes: 2
>> Quorum: 2
>> Flags: Quorate AutoTieBreaker
>>
>> Membership information
>> ----------------------
>> Nodeid Votes Name
>> 2 1 cluster2 (local)
>>
>> The logs on cluster1 show:
>>
>> Oct 10 16:43:37 cluster1 corosync[15750]: [CMAP ] Received config
>> version (2) is different than my config version (1)! Exiting
>>
>> The logs on cluster2 show:
>>
>> Oct 10 16:43:37 cluster2 corosync[5102]: [TOTEM ] A new membership
>> (10.71.218.17:588) was formed. Members joined: 1
>> Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] This node is
>> within the primary component and will provide service.
>> Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] Members[1]: 2
>> Oct 10 16:43:37 cluster2 corosync[5102]: [TOTEM ] A new membership
>> (10.71.218.18:592) was formed. Members left: 1
>> Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] Members[1]: 2
>> Oct 10 16:43:37 cluster2 corosync[5102]: [MAIN ] Completed
>> service synchronization, ready to provide service.
>>
>> It looks like QUORUM has seen cluster1's arrival but not its departure!
>>
>> When it works as expected, the state is left consistent:
>>
>> Quorum information
>> ------------------
>> Date: Tue Oct 10 16:58:14 2017
>> Quorum provider: corosync_votequorum
>> Nodes: 1
>> Node ID: 2
>> Ring ID: 604
>> Quorate: No
>>
>> Votequorum information
>> ----------------------
>> Expected votes: 2
>> Highest expected: 2
>> Total votes: 1
>> Quorum: 2 Activity blocked
>> Flags: AutoTieBreaker
>>
>> Membership information
>> ----------------------
>> Nodeid Votes Name
>> 2 1 cluster2 (local)
>>
>> Logs on cluster1:
>>
>> Oct 10 16:58:01 cluster1 corosync[16430]: [CMAP ] Received config
>> version (2) is different than my config version (1)! Exiting
>>
>> Logs on cluster2 are either:
>>
>> Oct 10 16:58:01 cluster2 corosync[17835]: [TOTEM ] A new
>> membership (10.71.218.17:600) was formed. Members joined: 1
>> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] This node is
>> within the primary component and will provide service.
>> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] Members[1]: 2
>> Oct 10 16:58:01 cluster2 corosync[17835]: [CMAP ] Highest config
>> version (2) and my config version (2)
>> Oct 10 16:58:01 cluster2 corosync[17835]: [TOTEM ] A new
>> membership (10.71.218.18:604) was formed. Members left: 1
>> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] This node is
>> within the non-primary component and will NOT provide any services.
>> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] Members[1]: 2
>> Oct 10 16:58:01 cluster2 corosync[17835]: [MAIN ] Completed
>> service synchronization, ready to provide service.
>>
>> ... in which it looks like QUORUM has seen cluster1's arrival *and* its
>> departure,
>>
>> or:
>>
>> Oct 10 16:59:03 cluster2 corosync[18841]: [TOTEM ] A new
>> membership (10.71.218.17:632) was formed. Members joined: 1
>> Oct 10 16:59:03 cluster2 corosync[18841]: [CMAP ] Highest config
>> version (2) and my config version (2)
>> Oct 10 16:59:03 cluster2 corosync[18841]: [TOTEM ] A new
>> membership (10.71.218.18:636) was formed. Members left: 1
>> Oct 10 16:59:03 cluster2 corosync[18841]: [QUORUM] Members[1]: 2
>> Oct 10 16:59:03 cluster2 corosync[18841]: [MAIN ] Completed
>> service synchronization, ready to provide service.
>>
>> ... in which it looks like QUORUM never noticed cluster1's brief
>> presence.
>>
>> Any thoughts?
>>
>> Thanks,
>> Jonathan
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org