[ClusterLabs] corosync continues even if node is removed from the cluster
Jonathan Davies
jonathan.davies at citrix.com
Fri Oct 20 08:39:35 EDT 2017
Hi ClusterLabs,
I have a query about safely removing a node from a corosync cluster.
When "corosync-cfgtool -R" is issued, it causes all nodes to reload
their config from corosync.conf. If I have removed a node from the
nodelist but corosync is still running on that node, it will receive the
reload signal but will try to continue as if nothing had happened. This
then causes various problems on all nodes.
A specific example:
I have a running cluster containing two nodes: 10.71.217.70 (nodeid=1)
and 10.71.217.71 (nodeid=2). When I remove node 1 from the nodelist in
corosync.conf on both nodes then issue "corosync-cfgtool -R" on
10.71.217.71, I see this on 10.71.217.70:
Quorum information
------------------
Date: Fri Oct 20 13:23:02 2017
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 1
Ring ID: 124
Quorate: Yes
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate AutoTieBreaker
Membership information
----------------------
    Nodeid      Votes Name
         1          1 cluster1 (local)
         2          1 10.71.217.71
and this on 10.71.217.71:
Quorum information
------------------
Date: Fri Oct 20 13:22:46 2017
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 2
Ring ID: 132
Quorate: No
Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2 Activity blocked
Flags:
Membership information
----------------------
    Nodeid      Votes Name
         2          1 10.71.217.71 (local)
Instead, I would expect corosync on node 1 to exit and node 2 to have
"expected votes: 1, total votes: 1, quorate: yes".
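To spell out the arithmetic behind that expectation, here is a minimal sketch assuming a plain majority rule (quorum = expected_votes / 2 + 1, integer division). The function names are hypothetical and are not taken from votequorum.c, and the sketch ignores options such as two_node, auto_tie_breaker and last_man_standing.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not corosync code: the smallest number of votes
 * that forms a strict majority of expected_votes.
 */
static unsigned int quorum_threshold(unsigned int expected_votes)
{
	return expected_votes / 2 + 1;
}

/*
 * Hypothetical helper: a partition is quorate when the votes it can see
 * reach the majority threshold.
 */
static bool is_quorate(unsigned int total_votes, unsigned int expected_votes)
{
	return total_votes >= quorum_threshold(expected_votes);
}
```

Under this rule, node 2 as shown above (expected votes 2, total votes 1) is below the threshold of 2 and is blocked, whereas with expected votes reduced to 1 the threshold would be 1 and the single remaining node would be quorate.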
I notice that there is already some logic in votequorum.c that detects
this condition, and it produces the following log messages on node 1:
debug [VOTEQ ] No nodelist defined or our node is not in the nodelist
crit [VOTEQ ] configuration error: nodelist or quorum.expected_votes must be configured!
crit [VOTEQ ] will continue with current runtime data
What is the rationale for continuing despite the obvious inconsistency?
Surely this is destined to cause problems...?
I find that I get my expected behaviour with the following patch:
diff --git a/exec/votequorum.c b/exec/votequorum.c
index 1a97c6d..4ff7ff2 100644
--- a/exec/votequorum.c
+++ b/exec/votequorum.c
@@ -1286,7 +1287,8 @@ static char *votequorum_readconfig(int runtime)
 				error = (char *)"configuration error: nodelist or quorum.expected_votes must be configured!";
 			} else {
 				log_printf(LOGSYS_LEVEL_CRIT, "configuration error: nodelist or quorum.expected_votes must be configured!");
-				log_printf(LOGSYS_LEVEL_CRIT, "will continue with current runtime data");
+				log_printf(LOGSYS_LEVEL_CRIT, "exiting...");
+				exit(1);
 			}
 			goto out;
 		}
Is there any reason why that would not be a good idea?
Thanks,
Jonathan