[ClusterLabs] corosync race condition when node leaves immediately after joining
Jan Friesse
jfriesse at redhat.com
Thu Oct 12 02:48:52 EDT 2017
Jonathan,
I believe the main "problem" is votequorum's ability to work during the
sync phase (votequorum is the only service with this ability; see
votequorum_overview.8, section VIRTUAL SYNCHRONY)...
> Hi ClusterLabs,
>
> I'm seeing a race condition in corosync where votequorum can have
> incorrect membership info when a node joins the cluster then leaves very
> soon after.
>
> I'm on corosync-2.3.4 plus my patch
> https://github.com/corosync/corosync/pull/248. That patch makes the
> problem readily reproducible but the bug was already present.
>
> Here's the scenario. I have two hosts, cluster1 and cluster2. The
> corosync.conf on cluster2 is:
>
> totem {
>     version: 2
>     cluster_name: test
>     config_version: 2
>     transport: udpu
> }
> nodelist {
>     node {
>         nodeid: 1
>         ring0_addr: cluster1
>     }
>     node {
>         nodeid: 2
>         ring0_addr: cluster2
>     }
> }
> quorum {
>     provider: corosync_votequorum
>     auto_tie_breaker: 1
> }
> logging {
>     to_syslog: yes
> }
>
> The corosync.conf on cluster1 is the same except with "config_version: 1".
>
> I start corosync on cluster2. When I start corosync on cluster1, it
> joins and then immediately leaves due to the lower config_version.
> (Previously corosync on cluster2 would also exit but with
> https://github.com/corosync/corosync/pull/248 it remains alive.)
>
> But often at this point, cluster1's disappearance is not reflected in
> the votequorum info on cluster2:
... Is this permanent (i.e. it persists until another node joins or
leaves), or does it fix itself after a (short) time? If it is permanent,
it's a bug. If it fixes itself, it's a result of votequorum not being
virtually synchronous.
Honza
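
One way to tell the two cases apart might be to sample the quorum status
repeatedly and watch whether "Total votes" ever comes back in line with the
membership list. The following is only a rough sketch, not something from
corosync itself: the function names (parse_quorum_status, votes_mismatch,
wait_until_consistent) are made up for illustration, and the parser assumes
the status text has the same layout as the output quoted in this thread
(no Qdevice column in the membership table).

```python
import re
import subprocess
import time


def parse_quorum_status(text):
    """Parse quorum status text (layout as quoted in this thread) into a dict.

    "Key: value" lines become string entries; the rows under the
    "Nodeid Votes Name" header become (nodeid, votes, name) tuples.
    """
    info = {"members": []}
    in_members = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Nodeid"):
            in_members = True
            continue
        if in_members:
            m = re.match(r"(\d+)\s+(\d+)\s+(\S+)", line)
            if m:
                info["members"].append(
                    (int(m.group(1)), int(m.group(2)), m.group(3)))
            continue
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info


def votes_mismatch(info):
    """True if "Total votes" differs from the sum of listed member votes."""
    listed = sum(votes for _, votes, _ in info["members"])
    return int(info["Total votes"]) != listed


def wait_until_consistent(interval=1.0, attempts=30):
    """Poll the status; return True once the vote counts agree.

    Assumption: "corosync-quorumtool -s" prints the status in the layout
    the parser expects. Samples without a "Total votes" line are skipped.
    """
    for _ in range(attempts):
        out = subprocess.run(["corosync-quorumtool", "-s"],
                             capture_output=True, text=True).stdout
        info = parse_quorum_status(out)
        if "Total votes" in info and not votes_mismatch(info):
            return True
        time.sleep(interval)
    return False
```

If wait_until_consistent() returns False even with a generous number of
attempts, that would point towards the "permanent" case; if it returns True
after a few samples, the state fixed itself.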
>
> Quorum information
> ------------------
> Date: Tue Oct 10 16:43:50 2017
> Quorum provider: corosync_votequorum
> Nodes: 1
> Node ID: 2
> Ring ID: 700
> Quorate: Yes
>
> Votequorum information
> ----------------------
> Expected votes: 2
> Highest expected: 2
> Total votes: 2
> Quorum: 2
> Flags: Quorate AutoTieBreaker
>
> Membership information
> ----------------------
> Nodeid Votes Name
> 2 1 cluster2 (local)
>
> The logs on cluster1 show:
>
> Oct 10 16:43:37 cluster1 corosync[15750]: [CMAP ] Received config
> version (2) is different than my config version (1)! Exiting
>
> The logs on cluster2 show:
>
> Oct 10 16:43:37 cluster2 corosync[5102]: [TOTEM ] A new membership
> (10.71.218.17:588) was formed. Members joined: 1
> Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] This node is
> within the primary component and will provide service.
> Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] Members[1]: 2
> Oct 10 16:43:37 cluster2 corosync[5102]: [TOTEM ] A new membership
> (10.71.218.18:592) was formed. Members left: 1
> Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] Members[1]: 2
> Oct 10 16:43:37 cluster2 corosync[5102]: [MAIN ] Completed
> service synchronization, ready to provide service.
>
> It looks like QUORUM has seen cluster1's arrival but not its departure!
>
> When it works as expected, the state is left consistent:
>
> Quorum information
> ------------------
> Date: Tue Oct 10 16:58:14 2017
> Quorum provider: corosync_votequorum
> Nodes: 1
> Node ID: 2
> Ring ID: 604
> Quorate: No
>
> Votequorum information
> ----------------------
> Expected votes: 2
> Highest expected: 2
> Total votes: 1
> Quorum: 2 Activity blocked
> Flags: AutoTieBreaker
>
> Membership information
> ----------------------
> Nodeid Votes Name
> 2 1 cluster2 (local)
>
> Logs on cluster1:
>
> Oct 10 16:58:01 cluster1 corosync[16430]: [CMAP ] Received config
> version (2) is different than my config version (1)! Exiting
>
> Logs on cluster2 are either:
>
> Oct 10 16:58:01 cluster2 corosync[17835]: [TOTEM ] A new
> membership (10.71.218.17:600) was formed. Members joined: 1
> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] This node is
> within the primary component and will provide service.
> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] Members[1]: 2
> Oct 10 16:58:01 cluster2 corosync[17835]: [CMAP ] Highest config
> version (2) and my config version (2)
> Oct 10 16:58:01 cluster2 corosync[17835]: [TOTEM ] A new
> membership (10.71.218.18:604) was formed. Members left: 1
> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] This node is
> within the non-primary component and will NOT provide any services.
> Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] Members[1]: 2
> Oct 10 16:58:01 cluster2 corosync[17835]: [MAIN ] Completed
> service synchronization, ready to provide service.
>
> ... in which it looks like QUORUM has seen cluster1's arrival *and* its
> departure,
>
> or:
>
> Oct 10 16:59:03 cluster2 corosync[18841]: [TOTEM ] A new
> membership (10.71.218.17:632) was formed. Members joined: 1
> Oct 10 16:59:03 cluster2 corosync[18841]: [CMAP ] Highest config
> version (2) and my config version (2)
> Oct 10 16:59:03 cluster2 corosync[18841]: [TOTEM ] A new
> membership (10.71.218.18:636) was formed. Members left: 1
> Oct 10 16:59:03 cluster2 corosync[18841]: [QUORUM] Members[1]: 2
> Oct 10 16:59:03 cluster2 corosync[18841]: [MAIN ] Completed
> service synchronization, ready to provide service.
>
> ... in which it looks like QUORUM never noticed cluster1's brief presence.
>
> Any thoughts?
>
> Thanks,
> Jonathan