[ClusterLabs] corosync race condition when node leaves immediately after joining
Jonathan Davies
jonathan.davies at citrix.com
Fri Oct 13 10:05:33 EDT 2017
On 12/10/17 11:54, Jan Friesse wrote:
>>>> I'm on corosync-2.3.4 plus my patch
>
> Finally noticed ^^^ 2.3.4 is really old and, unless it is a patched
> version, I wouldn't recommend using it. Can you give current needle a
> try?
I was mistaken to think I was on 2.3.4. Actually I am on the version
from CentOS 7.4, which is 2.4.0 plus patches.
I will try to reproduce it with needle.
>>>> But often at this point, cluster1's disappearance is not reflected in
>>>> the votequorum info on cluster2:
>>>
>>> ... Is this permanent (i.e. until a new node joins/leaves), or will
>>> it fix itself over a (short) time? If it is permanent, it's a bug.
>>> If it fixes itself, it's a result of votequorum not being virtually
>>> synchronous.
>>
>> Yes, it's permanent. After several minutes of waiting, votequorum still
>> reports "total votes: 2" even though there's only one member.
>
>
> That's bad. I've tried the following setup:
>
> - Both nodes with current needle
> - Your config
> - Second node is just running corosync
> - First node is running the following command:
>   while true; do corosync -f; ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1; done
>
> I ran it for quite a while and I'm unable to reproduce the bug. Sadly
> I'm unable to reproduce it even with 2.3.4. Do you think the
> reproducer is correct?
Yes, that's similar enough to what I'm doing. The bug happens about 50%
of the time for me, so I trigger it manually rather than needing a loop.
I'm not sure why you can't reproduce it.
I've done a bit of digging and am getting closer to the root cause of
the race.
We rely on votequorum_sync_init being called twice -- once when node 1
joins (with member_list_entries=2) and once when node 1 leaves (with
member_list_entries=1). This matters because votequorum_sync_init marks
nodes as NODESTATE_DEAD if they are not in quorum_members[] -- so it
needs to have seen the node appear and then disappear -- and because
get_total_votes only counts votes from nodes in state NODESTATE_MEMBER.
When it goes wrong, I see that votequorum_sync_init is only called
*once* (with member_list_entries=1) -- after node 1 has joined and left.
So it never sees node 1 in member_list, and hence never marks it as
NODESTATE_DEAD. But message_handler_req_exec_votequorum_nodeinfo has
independently marked the node as NODESTATE_MEMBER, so get_total_votes
counts it and quorate is set to 1.
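To make the interaction concrete, here is a heavily simplified model of
the bookkeeping as I understand it. The NODESTATE_* names and function
roles follow corosync, but the data structures and bodies are
illustrative sketches, not the real code; in particular, the assumption
that sync_init only marks DEAD nodes that were in the *previously
recorded* quorum_members[] is my reading of the source:

```c
#include <assert.h>
#include <stddef.h>

enum nodestate { NODESTATE_MEMBER = 1, NODESTATE_DEAD = 2 };

struct cluster_node {
	unsigned int node_id;
	enum nodestate state;
	int votes;
};

#define MAX_NODES 8
static struct cluster_node node_db[MAX_NODES];
static size_t node_db_entries;

static unsigned int quorum_members[MAX_NODES];
static size_t quorum_members_entries;

static struct cluster_node *find_node(unsigned int node_id)
{
	for (size_t i = 0; i < node_db_entries; i++)
		if (node_db[i].node_id == node_id)
			return &node_db[i];
	return NULL;
}

/* Stands in for message_handler_req_exec_votequorum_nodeinfo: a
 * nodeinfo message marks its sender as MEMBER, independently of sync. */
static void nodeinfo_received(unsigned int node_id, int votes)
{
	struct cluster_node *n = find_node(node_id);
	if (!n) {
		node_db[node_db_entries] =
		    (struct cluster_node){ node_id, NODESTATE_MEMBER, votes };
		n = &node_db[node_db_entries++];
	}
	n->state = NODESTATE_MEMBER;
	n->votes = votes;
}

/* Stands in for votequorum_sync_init: a node is marked DEAD only if it
 * was in the previously recorded quorum_members[] and is absent from
 * the new member list -- a join+leave that sync never saw is missed. */
static void sync_init_model(const unsigned int *member_list, size_t entries)
{
	for (size_t i = 0; i < quorum_members_entries; i++) {
		int matched = 0;
		for (size_t j = 0; j < entries; j++)
			if (member_list[j] == quorum_members[i])
				matched = 1;
		if (!matched && find_node(quorum_members[i]))
			find_node(quorum_members[i])->state = NODESTATE_DEAD;
	}
	for (size_t j = 0; j < entries; j++)
		quorum_members[j] = member_list[j];
	quorum_members_entries = entries;
}

/* Stands in for get_total_votes: only MEMBER nodes contribute. */
static int get_total_votes(void)
{
	int total = 0;
	for (size_t i = 0; i < node_db_entries; i++)
		if (node_db[i].state == NODESTATE_MEMBER)
			total += node_db[i].votes;
	return total;
}
```

Replaying the failing ordering against this model (node 1's nodeinfo
arrives, but the only sync_init call sees the final one-member list)
leaves node 1 as MEMBER forever and the vote total stuck at 2.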
So why is votequorum_sync_init sometimes only called once? It looks like
it's all down to whether we manage to iterate through all the calls to
schedwrk_processor before entering the OPERATIONAL state. I haven't yet
looked into exactly what controls the timing of these two things.
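The ordering I suspect can be sketched as a toy model of the sync walk
(the three-service list and the interrupt point are assumptions for
illustration; this is not corosync's actual scheduler):

```c
#include <assert.h>
#include <stddef.h>

#define NSERVICES 3		/* cmap (0), cpg (1), votequorum (2) */

static size_t votequorum_sync_calls;
static size_t votequorum_last_entries;

/* Record only the votequorum calls; the other services don't matter
 * for this illustration. */
static void sync_init_stub(size_t service_idx, size_t member_list_entries)
{
	if (service_idx == 2) {
		votequorum_sync_calls++;
		votequorum_last_entries = member_list_entries;
	}
}

/* One pass of the sync walk: call each service's sync_init in order
 * with the current member list, unless a new membership forms first,
 * in which case the walk is abandoned and restarted with the new
 * list. */
static void run_sync(size_t member_list_entries, size_t interrupted_before)
{
	for (size_t i = 0; i < NSERVICES; i++) {
		if (i == interrupted_before)
			return;	/* membership changed mid-sync */
		sync_init_stub(i, member_list_entries);
	}
}
```

If the 2-member walk is interrupted after cmap and cpg but before
votequorum's turn, and the restarted 1-member walk completes, then
votequorum's sync_init runs exactly once and only ever sees
member_list_entries = 1, matching the failing log below.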
Adding the following patch helps me to demonstrate the problem more clearly:
diff --git a/exec/sync.c b/exec/sync.c
index e7b71bd..a2fb06d 100644
--- a/exec/sync.c
+++ b/exec/sync.c
@@ -544,6 +545,7 @@ static int schedwrk_processor (const void *context)
 	}
 	if (my_sync_callbacks_retrieve(my_service_list[my_processing_idx].service_id, NULL) != -1) {
+		log_printf(LOGSYS_LEVEL_NOTICE, "calling sync_init on service '%s' (%d) with my_member_list_entries = %d", my_service_list[my_processing_idx].name, my_processing_idx, my_member_list_entries);
 		my_service_list[my_processing_idx].sync_init (my_trans_list,
 			my_trans_list_entries, my_member_list,
 			my_member_list_entries,
diff --git a/exec/votequorum.c b/exec/votequorum.c
index d5f06c1..aab6c15 100644
--- a/exec/votequorum.c
+++ b/exec/votequorum.c
@@ -2336,6 +2353,8 @@ static void votequorum_sync_init (
 	int left_nodes;
 	struct cluster_node *node;
+	log_printf(LOGSYS_LEVEL_NOTICE, "votequorum_sync_init has %d member_list_entries", member_list_entries);
+
 	ENTER();
 	sync_in_progress = 1;
When it works correctly I see the following (selected log lines):

notice [TOTEM ] A new membership (10.71.218.17:2016) was formed. Members joined: 1
notice [SYNC ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 2
notice [SYNC ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 2
notice [SYNC ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 2
notice [VOTEQ ] votequorum_sync_init has 2 member_list_entries
notice [TOTEM ] A new membership (10.71.218.18:2020) was formed. Members left: 1
notice [SYNC ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 1
notice [SYNC ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 1
notice [SYNC ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
notice [VOTEQ ] votequorum_sync_init has 1 member_list_entries

-- Notice that votequorum_sync_init is called once with 2 members and
once with 1 member.
When it goes wrong I see the following (selected log lines):

notice [TOTEM ] A new membership (10.71.218.17:2004) was formed. Members joined: 1
notice [SYNC ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 2
notice [SYNC ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 2
notice [TOTEM ] A new membership (10.71.218.18:2008) was formed. Members left: 1
notice [SYNC ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 1
notice [SYNC ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 1
notice [SYNC ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
notice [VOTEQ ] votequorum_sync_init has 1 member_list_entries

-- Notice the value of my_member_list_entries in the different
sync_init calls, and that votequorum_sync_init is only called once.
Does this help explain the issue?
Thanks,
Jonathan