[ClusterLabs] corosync race condition when node leaves immediately after joining
Jonathan Davies
jonathan.davies at citrix.com
Fri Oct 13 10:05:33 EDT 2017
On 12/10/17 11:54, Jan Friesse wrote:
>>>> I'm on corosync-2.3.4 plus my patch
>
> Finally noticed ^^^ 2.3.4 is really old and, unless it is a patched
> version, I wouldn't recommend using it. Can you give current needle a
> try?
I was mistaken to think I was on 2.3.4. Actually I am on the version
from CentOS 7.4, which is 2.4.0 plus patches.
I will try to reproduce it with needle.
>>>> But often at this point, cluster1's disappearance is not reflected in
>>>> the votequorum info on cluster2:
>>>
>>> ... Is this permanent (i.e. until a new node joins/leaves), or will
>>> it fix itself over a (short) time? If it is permanent, it's a bug.
>>> If it fixes itself, it's a result of votequorum not being virtually
>>> synchronous.
>>
>> Yes, it's permanent. After several minutes of waiting, votequorum still
>> reports "total votes: 2" even though there's only one member.
>
>
> That's bad. I've tried the following setup:
>
> - Both nodes with current needle
> - Your config
> - Second node is just running corosync
> - First node is running the following command:
>   while true; do corosync -f; ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1; done
>
> I ran it for quite a while and I'm unable to reproduce the bug. Sadly
> I'm unable to reproduce it even with 2.3.4. Do you think the
> reproducer is correct?
Yes, that's similar enough to what I'm doing. The bug happens about 50%
of the time for me, so I trigger it manually rather than needing a loop.
I'm not sure why you can't reproduce it.
I've done a bit of digging and am getting closer to the root cause of
the race.
We rely on votequorum_sync_init being called twice -- once when node 1
joins (with member_list_entries=2) and once when node 1 leaves (with
member_list_entries=1). This matters because votequorum_sync_init marks
nodes as NODESTATE_DEAD if they are not in quorum_members[] -- so it
needs to have seen the node appear and then disappear -- and because
get_total_votes only counts votes from nodes in state NODESTATE_MEMBER.
When it goes wrong, I see that votequorum_sync_init is only called
*once* (with member_list_entries=1) -- after node 1 has joined and left.
So it never sees node 1 in member_list, and hence never marks it as
NODESTATE_DEAD. But message_handler_req_exec_votequorum_nodeinfo has
independently marked the node as NODESTATE_MEMBER, so get_total_votes
counts it and quorate is set to 1.
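To make the interaction concrete, here is a heavily simplified model of
the bookkeeping as I understand it. The NODESTATE_* names and function
roles follow corosync, but the data structures and bodies are
illustrative sketches, not the real code; in particular, the assumption
that sync_init only marks DEAD nodes that were in the *previously
recorded* quorum_members[] is my reading of the source:

```c
#include <assert.h>
#include <stddef.h>

enum nodestate { NODESTATE_MEMBER = 1, NODESTATE_DEAD = 2 };

struct cluster_node {
	unsigned int node_id;
	enum nodestate state;
	int votes;
};

#define MAX_NODES 8
static struct cluster_node node_db[MAX_NODES];
static size_t node_db_entries;

static unsigned int quorum_members[MAX_NODES];
static size_t quorum_members_entries;

static struct cluster_node *find_node(unsigned int node_id)
{
	for (size_t i = 0; i < node_db_entries; i++)
		if (node_db[i].node_id == node_id)
			return &node_db[i];
	return NULL;
}

/* Stands in for message_handler_req_exec_votequorum_nodeinfo: a
 * nodeinfo message marks its sender as MEMBER, independently of sync. */
static void nodeinfo_received(unsigned int node_id, int votes)
{
	struct cluster_node *n = find_node(node_id);
	if (!n) {
		node_db[node_db_entries] =
		    (struct cluster_node){ node_id, NODESTATE_MEMBER, votes };
		n = &node_db[node_db_entries++];
	}
	n->state = NODESTATE_MEMBER;
	n->votes = votes;
}

/* Stands in for votequorum_sync_init: a node is marked DEAD only if it
 * was in the previously recorded quorum_members[] and is absent from
 * the new member list -- a join+leave that sync never saw is missed. */
static void sync_init_model(const unsigned int *member_list, size_t entries)
{
	for (size_t i = 0; i < quorum_members_entries; i++) {
		int matched = 0;
		for (size_t j = 0; j < entries; j++)
			if (member_list[j] == quorum_members[i])
				matched = 1;
		if (!matched && find_node(quorum_members[i]))
			find_node(quorum_members[i])->state = NODESTATE_DEAD;
	}
	for (size_t j = 0; j < entries; j++)
		quorum_members[j] = member_list[j];
	quorum_members_entries = entries;
}

/* Stands in for get_total_votes: only MEMBER nodes contribute. */
static int get_total_votes(void)
{
	int total = 0;
	for (size_t i = 0; i < node_db_entries; i++)
		if (node_db[i].state == NODESTATE_MEMBER)
			total += node_db[i].votes;
	return total;
}
```

Replaying the failing ordering against this model (node 1's nodeinfo
arrives, but the only sync_init call sees the final one-member list)
leaves node 1 as MEMBER forever and the vote total stuck at 2.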
So why is votequorum_sync_init sometimes only called once? It looks like
it's all down to whether we manage to iterate through all the calls to
schedwrk_processor before entering the OPERATIONAL state. I haven't yet
looked into exactly what controls the timing of these two things.
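The ordering I suspect can be sketched as a toy model of the sync walk
(the three-service list and the interrupt point are assumptions for
illustration; this is not corosync's actual scheduler):

```c
#include <assert.h>
#include <stddef.h>

#define NSERVICES 3		/* cmap (0), cpg (1), votequorum (2) */

static size_t votequorum_sync_calls;
static size_t votequorum_last_entries;

/* Record only the votequorum calls; the other services don't matter
 * for this illustration. */
static void sync_init_stub(size_t service_idx, size_t member_list_entries)
{
	if (service_idx == 2) {
		votequorum_sync_calls++;
		votequorum_last_entries = member_list_entries;
	}
}

/* One pass of the sync walk: call each service's sync_init in order
 * with the current member list, unless a new membership forms first,
 * in which case the walk is abandoned and restarted with the new
 * list. */
static void run_sync(size_t member_list_entries, size_t interrupted_before)
{
	for (size_t i = 0; i < NSERVICES; i++) {
		if (i == interrupted_before)
			return;	/* membership changed mid-sync */
		sync_init_stub(i, member_list_entries);
	}
}
```

If the 2-member walk is interrupted after cmap and cpg but before
votequorum's turn, and the restarted 1-member walk completes, then
votequorum's sync_init runs exactly once and only ever sees
member_list_entries = 1, matching the failing log below.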
Adding the following patch helps me to demonstrate the problem more clearly:
diff --git a/exec/sync.c b/exec/sync.c
index e7b71bd..a2fb06d 100644
--- a/exec/sync.c
+++ b/exec/sync.c
@@ -544,6 +545,7 @@ static int schedwrk_processor (const void *context)
 	}
 	if (my_sync_callbacks_retrieve(my_service_list[my_processing_idx].service_id, NULL) != -1) {
+		log_printf(LOGSYS_LEVEL_NOTICE, "calling sync_init on service '%s' (%d) with my_member_list_entries = %d", my_service_list[my_processing_idx].name, my_processing_idx, my_member_list_entries);
 		my_service_list[my_processing_idx].sync_init (my_trans_list,
 			my_trans_list_entries, my_member_list,
 			my_member_list_entries,
diff --git a/exec/votequorum.c b/exec/votequorum.c
index d5f06c1..aab6c15 100644
--- a/exec/votequorum.c
+++ b/exec/votequorum.c
@@ -2336,6 +2353,8 @@ static void votequorum_sync_init (
 	int left_nodes;
 	struct cluster_node *node;
+	log_printf(LOGSYS_LEVEL_NOTICE, "votequorum_sync_init has %d member_list_entries", member_list_entries);
+
 	ENTER();
 	sync_in_progress = 1;
When it works correctly I see the following (selected log lines):

notice [TOTEM ] A new membership (10.71.218.17:2016) was formed. Members joined: 1
notice [SYNC ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 2
notice [SYNC ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 2
notice [SYNC ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 2
notice [VOTEQ ] votequorum_sync_init has 2 member_list_entries
notice [TOTEM ] A new membership (10.71.218.18:2020) was formed. Members left: 1
notice [SYNC ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 1
notice [SYNC ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 1
notice [SYNC ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
notice [VOTEQ ] votequorum_sync_init has 1 member_list_entries

-- Notice that votequorum_sync_init is called once with 2 members and
once with 1 member.
When it goes wrong I see the following (selected log lines):

notice [TOTEM ] A new membership (10.71.218.17:2004) was formed. Members joined: 1
notice [SYNC ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 2
notice [SYNC ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 2
notice [TOTEM ] A new membership (10.71.218.18:2008) was formed. Members left: 1
notice [SYNC ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 1
notice [SYNC ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 1
notice [SYNC ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
notice [VOTEQ ] votequorum_sync_init has 1 member_list_entries

-- Notice the value of my_member_list_entries in the different
sync_init calls, and that votequorum_sync_init is only called once.
Does this help explain the issue?
Thanks,
Jonathan