[ClusterLabs] corosync race condition when node leaves immediately after joining

Jonathan Davies jonathan.davies at citrix.com
Wed Oct 11 12:37:00 EDT 2017


Hi ClusterLabs,

I'm seeing a race condition in corosync where votequorum can have 
incorrect membership info when a node joins the cluster then leaves very 
soon after.

I'm on corosync-2.3.4 plus my patch 
https://github.com/corosync/corosync/pull/248. That patch makes the 
problem readily reproducible but the bug was already present.

Here's the scenario. I have two hosts, cluster1 and cluster2. The 
corosync.conf on cluster2 is:

     totem {
       version: 2
       cluster_name: test
       config_version: 2
       transport: udpu
     }
     nodelist {
       node {
         nodeid: 1
         ring0_addr: cluster1
       }
       node {
         nodeid: 2
         ring0_addr: cluster2
       }
     }
     quorum {
       provider: corosync_votequorum
       auto_tie_breaker: 1
     }
     logging {
       to_syslog: yes
     }

The corosync.conf on cluster1 is the same except with "config_version: 1".

I start corosync on cluster2. When I start corosync on cluster1, it 
joins and then immediately leaves due to the lower config_version.
(Previously corosync on cluster2 would also exit but with 
https://github.com/corosync/corosync/pull/248 it remains alive.)
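To make the trigger explicit, here is a minimal Python sketch of the comparison described above. It is an illustration only, not corosync's actual logic: the helper names `config_version` and `nodes_expected_to_exit` are hypothetical, and the rule "a node whose version is lower than the highest one exits" is taken from the behaviour reported in this post.

```python
import re

def config_version(conf_text):
    """Return the config_version found in corosync.conf text, or None if absent."""
    m = re.search(r"^\s*config_version:\s*(\d+)\s*$", conf_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def nodes_expected_to_exit(versions):
    """Given {node_name: config_version}, list the nodes holding a version
    lower than the highest one (assumed rule, based on this report)."""
    highest = max(versions.values())
    return sorted(name for name, v in versions.items() if v < highest)

# Values from the scenario in this post.
print(nodes_expected_to_exit({"cluster1": 1, "cluster2": 2}))
```

With the two configurations above, only cluster1 is listed, which matches the "Exiting" message in its log.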

But often at this point, cluster1's disappearance is not reflected in 
the votequorum info on cluster2, which still reports two total votes and 
claims to be quorate even though only one node is listed:

     Quorum information
     ------------------
     Date:             Tue Oct 10 16:43:50 2017
     Quorum provider:  corosync_votequorum
     Nodes:            1
     Node ID:          2
     Ring ID:          700
     Quorate:          Yes

     Votequorum information
     ----------------------
     Expected votes:   2
     Highest expected: 2
     Total votes:      2
     Quorum:           2
     Flags:            Quorate AutoTieBreaker

     Membership information
     ----------------------
         Nodeid      Votes Name
              2          1 cluster2 (local)

The logs on cluster1 show:

     Oct 10 16:43:37 cluster1 corosync[15750]:  [CMAP  ] Received config version (2) is different than my config version (1)! Exiting

The logs on cluster2 show:

     Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership (10.71.218.17:588) was formed. Members joined: 1
     Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] This node is within the primary component and will provide service.
     Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
     Oct 10 16:43:37 cluster2 corosync[5102]:  [TOTEM ] A new membership (10.71.218.18:592) was formed. Members left: 1
     Oct 10 16:43:37 cluster2 corosync[5102]:  [QUORUM] Members[1]: 2
     Oct 10 16:43:37 cluster2 corosync[5102]:  [MAIN  ] Completed service synchronization, ready to provide service.

It looks like QUORUM has seen cluster1's arrival but not its departure!

When it works as expected, the state is left consistent:

     Quorum information
     ------------------
     Date:             Tue Oct 10 16:58:14 2017
     Quorum provider:  corosync_votequorum
     Nodes:            1
     Node ID:          2
     Ring ID:          604
     Quorate:          No

     Votequorum information
     ----------------------
     Expected votes:   2
     Highest expected: 2
     Total votes:      1
     Quorum:           2 Activity blocked
     Flags:            AutoTieBreaker

     Membership information
     ----------------------
         Nodeid      Votes Name
              2          1 cluster2 (local)

Logs on cluster1:

     Oct 10 16:58:01 cluster1 corosync[16430]:  [CMAP  ] Received config version (2) is different than my config version (1)! Exiting

Logs on cluster2 are either:

     Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new membership (10.71.218.17:600) was formed. Members joined: 1
     Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is within the primary component and will provide service.
     Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] Members[1]: 2
     Oct 10 16:58:01 cluster2 corosync[17835]:  [CMAP  ] Highest config version (2) and my config version (2)
     Oct 10 16:58:01 cluster2 corosync[17835]:  [TOTEM ] A new membership (10.71.218.18:604) was formed. Members left: 1
     Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
     Oct 10 16:58:01 cluster2 corosync[17835]:  [QUORUM] Members[1]: 2
     Oct 10 16:58:01 cluster2 corosync[17835]:  [MAIN  ] Completed service synchronization, ready to provide service.

... in which it looks like QUORUM has seen cluster1's arrival *and* its 
departure,

or:

     Oct 10 16:59:03 cluster2 corosync[18841]:  [TOTEM ] A new membership (10.71.218.17:632) was formed. Members joined: 1
     Oct 10 16:59:03 cluster2 corosync[18841]:  [CMAP  ] Highest config version (2) and my config version (2)
     Oct 10 16:59:03 cluster2 corosync[18841]:  [TOTEM ] A new membership (10.71.218.18:636) was formed. Members left: 1
     Oct 10 16:59:03 cluster2 corosync[18841]:  [QUORUM] Members[1]: 2
     Oct 10 16:59:03 cluster2 corosync[18841]:  [MAIN  ] Completed service synchronization, ready to provide service.

... in which it looks like QUORUM never noticed cluster1's brief presence.
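The three cluster2 log patterns above can be told apart by which QUORUM transition was logged last. Here is a small Python sketch, assuming only the message texts quoted in this post; the function name `classify_quorum_log` and the three labels it returns are hypothetical, chosen for illustration.

```python
GAINED = "within the primary component"
LOST = "within the non-primary component"

def classify_quorum_log(text):
    """Classify a cluster2 log excerpt by the last QUORUM transition it contains."""
    # Collapse whitespace so messages wrapped across lines still match.
    flat = " ".join(text.split())
    gained = flat.rfind(GAINED)
    lost = flat.rfind(LOST)
    if gained == -1 and lost == -1:
        return "never-noticed"        # QUORUM saw neither the join nor the leave
    if lost > gained:
        return "join-and-leave-seen"  # quorum gained, then given up again
    return "leave-missed"             # quorum gained and never given up

excerpt = """
[TOTEM ] A new membership (10.71.218.17:588) was formed. Members joined: 1
[QUORUM] This node is
within the primary component and will provide service.
[TOTEM ] A new membership (10.71.218.18:592) was formed. Members left: 1
[QUORUM] Members[1]: 2
"""
print(classify_quorum_log(excerpt))
```

Only the "leave-missed" pattern corresponds to the stale votequorum state; the other two leave cluster2 correctly non-quorate.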

Any thoughts?

Thanks,
Jonathan



