[ClusterLabs] corosync race condition when node leaves immediately after joining
Jonathan Davies
jonathan.davies at citrix.com
Wed Oct 11 12:37:00 EDT 2017
Hi ClusterLabs,
I'm seeing a race condition in corosync where votequorum can have
incorrect membership info when a node joins the cluster then leaves very
soon after.
I'm on corosync-2.3.4 plus my patch
https://github.com/corosync/corosync/pull/248. That patch makes the
problem readily reproducible but the bug was already present.
Here's the scenario. I have two hosts, cluster1 and cluster2. The
corosync.conf on cluster2 is:
    totem {
        version: 2
        cluster_name: test
        config_version: 2
        transport: udpu
    }

    nodelist {
        node {
            nodeid: 1
            ring0_addr: cluster1
        }
        node {
            nodeid: 2
            ring0_addr: cluster2
        }
    }

    quorum {
        provider: corosync_votequorum
        auto_tie_breaker: 1
    }

    logging {
        to_syslog: yes
    }
The corosync.conf on cluster1 is the same except with "config_version: 1".
I start corosync on cluster2. When I start corosync on cluster1, it
joins and then immediately leaves due to the lower config_version.
(Previously corosync on cluster2 would also exit but with
https://github.com/corosync/corosync/pull/248 it remains alive.)
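But often at this point, cluster1's disappearance is not reflected in
the votequorum info on cluster2, which still counts two total votes and
reports itself quorate although only one node is listed as a member: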
    Quorum information
    ------------------
    Date:             Tue Oct 10 16:43:50 2017
    Quorum provider:  corosync_votequorum
    Nodes:            1
    Node ID:          2
    Ring ID:          700
    Quorate:          Yes

    Votequorum information
    ----------------------
    Expected votes:   2
    Highest expected: 2
    Total votes:      2
    Quorum:           2
    Flags:            Quorate AutoTieBreaker

    Membership information
    ----------------------
    Nodeid  Votes  Name
         2      1  cluster2 (local)
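
One way to spot this state mechanically is to compare "Total votes" against the sum of the votes in the membership table. The following is only a rough sketch: it assumes the status text has the layout shown above (which looks like corosync-quorumtool output, though the command is not named here), that no quorum device contributes votes, and the function names are made up for illustration.

```python
import re


def parse_quorum_status(text):
    """Return (total_votes, [votes per listed member]) from status text.

    Assumes the layout shown above: a "Total votes: N" line, then a
    "Membership information" section whose rows start with
    "<nodeid> <votes> <name>".
    """
    total = None
    member_votes = []
    in_membership = False
    for line in text.splitlines():
        stripped = line.strip()
        m = re.match(r"Total votes:\s*(\d+)", stripped)
        if m:
            total = int(m.group(1))
        elif stripped.startswith("Membership information"):
            in_membership = True
        elif in_membership:
            # The header row and the dashed rule do not start with two
            # integers, so they are skipped by this pattern.
            row = re.match(r"(\d+)\s+(\d+)\s+\S+", stripped)
            if row:
                member_votes.append(int(row.group(2)))
    return total, member_votes


def is_consistent(text):
    """True if Total votes equals the sum of the listed members' votes.

    Assumption: no quorum device is configured, so every vote comes
    from a node that appears in the membership table.
    """
    total, member_votes = parse_quorum_status(text)
    return total is not None and total == sum(member_votes)
```

Fed the output above, is_consistent() returns False, because two total votes are reported while the only listed member holds one.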
The logs on cluster1 show:
    Oct 10 16:43:37 cluster1 corosync[15750]: [CMAP ] Received config version (2) is different than my config version (1)! Exiting
The logs on cluster2 show:
    Oct 10 16:43:37 cluster2 corosync[5102]: [TOTEM ] A new membership (10.71.218.17:588) was formed. Members joined: 1
    Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] This node is within the primary component and will provide service.
    Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] Members[1]: 2
    Oct 10 16:43:37 cluster2 corosync[5102]: [TOTEM ] A new membership (10.71.218.18:592) was formed. Members left: 1
    Oct 10 16:43:37 cluster2 corosync[5102]: [QUORUM] Members[1]: 2
    Oct 10 16:43:37 cluster2 corosync[5102]: [MAIN ] Completed service synchronization, ready to provide service.
It looks like QUORUM has seen cluster1's arrival but not its departure!
When it works as expected, the state is left consistent:
    Quorum information
    ------------------
    Date:             Tue Oct 10 16:58:14 2017
    Quorum provider:  corosync_votequorum
    Nodes:            1
    Node ID:          2
    Ring ID:          604
    Quorate:          No

    Votequorum information
    ----------------------
    Expected votes:   2
    Highest expected: 2
    Total votes:      1
    Quorum:           2 Activity blocked
    Flags:            AutoTieBreaker

    Membership information
    ----------------------
    Nodeid  Votes  Name
         2      1  cluster2 (local)
Logs on cluster1:
    Oct 10 16:58:01 cluster1 corosync[16430]: [CMAP ] Received config version (2) is different than my config version (1)! Exiting
Logs on cluster2 are either:
    Oct 10 16:58:01 cluster2 corosync[17835]: [TOTEM ] A new membership (10.71.218.17:600) was formed. Members joined: 1
    Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] This node is within the primary component and will provide service.
    Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] Members[1]: 2
    Oct 10 16:58:01 cluster2 corosync[17835]: [CMAP ] Highest config version (2) and my config version (2)
    Oct 10 16:58:01 cluster2 corosync[17835]: [TOTEM ] A new membership (10.71.218.18:604) was formed. Members left: 1
    Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
    Oct 10 16:58:01 cluster2 corosync[17835]: [QUORUM] Members[1]: 2
    Oct 10 16:58:01 cluster2 corosync[17835]: [MAIN ] Completed service synchronization, ready to provide service.
... in which it looks like QUORUM has seen cluster1's arrival *and* its
departure,
or:
    Oct 10 16:59:03 cluster2 corosync[18841]: [TOTEM ] A new membership (10.71.218.17:632) was formed. Members joined: 1
    Oct 10 16:59:03 cluster2 corosync[18841]: [CMAP ] Highest config version (2) and my config version (2)
    Oct 10 16:59:03 cluster2 corosync[18841]: [TOTEM ] A new membership (10.71.218.18:636) was formed. Members left: 1
    Oct 10 16:59:03 cluster2 corosync[18841]: [QUORUM] Members[1]: 2
    Oct 10 16:59:03 cluster2 corosync[18841]: [MAIN ] Completed service synchronization, ready to provide service.
... in which it looks like QUORUM never noticed cluster1's brief presence.
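
To tell the three log patterns apart without reading them by eye, something like the following could work. It is only a sketch of my reading of the logs, not a statement about how votequorum works internally: it assumes each log entry sits on a single line, as syslog would write it, and it treats a QUORUM "... component ..." message following a TOTEM membership change as a sign that QUORUM reacted to that change. The function name is hypothetical.

```python
def quorum_reactions(log_lines):
    """Report which TOTEM membership changes were followed by a QUORUM
    primary/non-primary component message.

    Returns a dict like {"joined": bool, "left": bool}.
    """
    seen = {"joined": False, "left": False}
    current = None  # most recent membership change seen so far
    for line in log_lines:
        if "[TOTEM" in line and "Members joined:" in line:
            current = "joined"
        elif "[TOTEM" in line and "Members left:" in line:
            current = "left"
        elif "[QUORUM]" in line and "component" in line and current:
            # Matches both "within the primary component" and
            # "within the non-primary component"; plain "Members[...]"
            # lines are ignored.
            seen[current] = True
    return seen
```

Under these assumptions, the failing run gives {"joined": True, "left": False}, while the two healthy runs give both True or both False respectively.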
Any thoughts?
Thanks,
Jonathan