[ClusterLabs] corosync race condition when node leaves immediately after joining

Jan Friesse jfriesse at redhat.com
Thu Oct 12 02:48:52 EDT 2017


Jonathan,
I believe the main "problem" is votequorum's ability to work during the sync 
phase (votequorum is the only service with this ability; see the VIRTUAL 
SYNCHRONY section of votequorum_overview.8)...

> Hi ClusterLabs,
>
> I'm seeing a race condition in corosync where votequorum can have
> incorrect membership info when a node joins the cluster then leaves very
> soon after.
>
> I'm on corosync-2.3.4 plus my patch
> https://github.com/corosync/corosync/pull/248. That patch makes the
> problem readily reproducible but the bug was already present.
>
> Here's the scenario. I have two hosts, cluster1 and cluster2. The
> corosync.conf on cluster2 is:
>
>      totem {
>        version: 2
>        cluster_name: test
>        config_version: 2
>        transport: udpu
>      }
>      nodelist {
>        node {
>          nodeid: 1
>          ring0_addr: cluster1
>        }
>        node {
>          nodeid: 2
>          ring0_addr: cluster2
>        }
>      }
>      quorum {
>        provider: corosync_votequorum
>        auto_tie_breaker: 1
>      }
>      logging {
>        to_syslog: yes
>      }
>
> The corosync.conf on cluster1 is the same except with "config_version: 1".
>
> I start corosync on cluster2. When I start corosync on cluster1, it
> joins and then immediately leaves due to the lower config_version.
> (Previously corosync on cluster2 would also exit but with
> https://github.com/corosync/corosync/pull/248 it remains alive.)
>
> But often at this point, cluster1's disappearance is not reflected in
> the votequorum info on cluster2:

... Is this permanent (= lasting until another node joins or leaves), or 
will it fix itself after a (short) time? If it is permanent, it's a bug. If 
it fixes itself, it's a result of votequorum not being virtually synchronous.
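
One way to tell the two cases apart is to check whether "Total votes" 
agrees with the sum of the votes in the membership list, and to repeat that 
check for a while after cluster1 has exited. Below is a minimal Python 
sketch of such a check. It assumes the status text has the layout quoted 
below in this thread; the function names parse_quorum_status and 
votes_are_consistent are made up for illustration and are not part of 
corosync.

```python
import re


def parse_quorum_status(text):
    """Return (total_votes, {nodeid: votes}) parsed from quorum status text.

    Assumes the layout quoted in this thread: a "Total votes:" line and a
    "Membership information" table with "Nodeid Votes Name" columns.
    A leading "> " mail-quote prefix on each line is tolerated.
    """
    total_votes = None
    member_votes = {}
    in_membership = False
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()
        match = re.match(r"Total votes:\s+(\d+)", line)
        if match:
            total_votes = int(match.group(1))
            continue
        if line.startswith("Membership information"):
            in_membership = True
            continue
        if in_membership:
            # Data rows look like: "2          1 cluster2 (local)"
            row = re.match(r"(\d+)\s+(\d+)\s+\S+", line)
            if row:
                member_votes[int(row.group(1))] = int(row.group(2))
    return total_votes, member_votes


def votes_are_consistent(text):
    """True if "Total votes" equals the sum of the listed members' votes."""
    total_votes, member_votes = parse_quorum_status(text)
    return total_votes is not None and total_votes == sum(member_votes.values())
```

Feeding it the status text every few seconds would show whether the 
mismatch (Total votes 2 with a single one-vote member) goes away on its own 
or stays until the next membership change.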

Honza

>
>      Quorum information
>      ------------------
>      Date:             Tue Oct 10 16:43:50 2017
>      Quorum provider:  corosync_votequorum
>      Nodes:            1
>      Node ID:          2
>      Ring ID:          700
>      Quorate:          Yes
>
>      Votequorum information
>      ----------------------
>      Expected votes:   2
>      Highest expected: 2
>      Total votes:      2
>      Quorum:           2
>      Flags:            Quorate AutoTieBreaker
>
>      Membership information
>      ----------------------
>          Nodeid      Votes Name
>               2          1 cluster2 (local)
>
> The logs on cluster1 show:
>
>      Oct 10 16:43:37 cluster1 corosync[15750]:  [CMAP  ] Received config
> version (2) is different than my config version (1)! Exiting
>
> The logs on cluster2 show:
>
>      Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership
> (10.71.218.17:588) was formed. Members joined: 1
>      Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] This node is
> within the primary component and will provide service.
>      Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
>      Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership
> (10.71.218.18:592) was formed. Members left: 1
>      Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
>      Oct 10 16:43:37 cluster2 corosync[5102]:  [MAIN  ] Completed
> service synchronization, ready to provide service.
>
> It looks like QUORUM has seen cluster1's arrival but not its departure!
>
> When it works as expected, the state is left consistent:
>
>      Quorum information
>      ------------------
>      Date:             Tue Oct 10 16:58:14 2017
>      Quorum provider:  corosync_votequorum
>      Nodes:            1
>      Node ID:          2
>      Ring ID:          604
>      Quorate:          No
>
>      Votequorum information
>      ----------------------
>      Expected votes:   2
>      Highest expected: 2
>      Total votes:      1
>      Quorum:           2 Activity blocked
>      Flags:            AutoTieBreaker
>
>      Membership information
>      ----------------------
>          Nodeid      Votes Name
>               2          1 cluster2 (local)
>
> Logs on cluster1:
>
>      Oct 10 16:58:01 cluster1 corosync[16430]:  [CMAP  ] Received config
> version (2) is different than my config version (1)! Exiting
>
> Logs on cluster2 are either:
>
>      Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new
> membership (10.71.218.17:600) was formed. Members joined: 1
>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is
> within the primary component and will provide service.
>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] Members[1]: 2
>      Oct 10 16:58:01 cluster2 corosync[17835]:  [CMAP  ] Highest config
> version (2) and my config version (2)
>      Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new
> membership (10.71.218.18:604) was formed. Members left: 1
>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is
> within the non-primary component and will NOT provide any services.
>      Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] Members[1]: 2
>      Oct 10 16:58:01 cluster2 corosync[17835]:  [MAIN  ] Completed
> service synchronization, ready to provide service.
>
> ... in which it looks like QUORUM has seen cluster1's arrival *and* its
> departure,
>
> or:
>
>      Oct 10 16:59:03 cluster2 corosync[18841]:  [TOTEM ] A new
> membership (10.71.218.17:632) was formed. Members joined: 1
>      Oct 10 16:59:03 cluster2 corosync[18841]:  [CMAP  ] Highest config
> version (2) and my config version (2)
>      Oct 10 16:59:03 cluster2 corosync[18841]:  [TOTEM ] A new
> membership (10.71.218.18:636) was formed. Members left: 1
>      Oct 10 16:59:03 cluster2 corosync[18841]:  [QUORUM] Members[1]: 2
>      Oct 10 16:59:03 cluster2 corosync[18841]:  [MAIN  ] Completed
> service synchronization, ready to provide service.
>
> ... in which it looks like QUORUM never noticed cluster1's brief presence.
>
> Any thoughts?
>
> Thanks,
> Jonathan
>




