[ClusterLabs] corosync race condition when node leaves immediately after joining

Jan Friesse jfriesse at redhat.com
Fri Oct 13 16:24:58 UTC 2017


Jonathan Davies wrote:
>
>
> On 12/10/17 11:54, Jan Friesse wrote:
>>>>> I'm on corosync-2.3.4 plus my patch
>>
>> Finally noticed ^^^ 2.3.4 is really old and, as long as it is not
>> some patched version, I wouldn't recommend using it. Can you give
>> the current needle a try?
>
> I was mistaken to think I was on 2.3.4. Actually I am on the version
> from CentOS 7.4 which is 2.4.0+patches.

Ok, so this is basically like needle.

>
> I will try to reproduce it with needle.
>
>>>>> But often at this point, cluster1's disappearance is not reflected in
>>>>> the votequorum info on cluster2:
>>>>
>>>> ... Is this permanent (= until a new node joins/leaves), or will it
>>>> fix itself over a (short) time? If it is permanent, it's a bug. If
>>>> it fixes itself, it's a result of votequorum not being virtually
>>>> synchronous.
>>>
>>> Yes, it's permanent. After several minutes of waiting, votequorum still
>>> reports "total votes: 2" even though there's only one member.
>>
>>
>> That's bad. I've tried the following setup:
>>
>> - Both nodes with current needle
>> - Your config
>> - Second node is just running corosync
>> - First node is running the following command:
>>    while true; do corosync -f; ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1; done
>>
>> I've been running it for quite a while and I'm unable to reproduce
>> the bug. Sadly, I can't reproduce it even with 2.3.4. Do you think
>> the reproducer is correct?
>
> Yes, that's similar enough to what I'm doing. The bug happens about 50%
> of the time for me, so I do it manually rather than needing a loop. So
> I'm not sure why you can't reproduce it.

That's really strange. Chrissie is also unable to reproduce it, so we may 
be missing some part of the puzzle.

>
> I've done a bit of digging and am getting closer to the root cause of
> the race.
>
> We rely on having votequorum_sync_init called twice -- once when node 1
> joins (with member_list_entries=2) and once when node 1 leaves (with
> member_list_entries=1). This is important because votequorum_sync_init
> marks nodes as NODESTATE_DEAD if they are not in quorum_members[] -- so
> it needs to have seen the node appear then disappear. This is important
> because get_total_votes only counts votes from nodes in state
> NODESTATE_MEMBER.

So there are basically two problems.

Actually, the first (main) problem is that votequorum_sync_init is ever 
called when that node joins. It really shouldn't be. And the problem is 
simply that calling api->shutdown_request() is not enough. Can you try 
replacing it with exit(1) (for testing) and reproducing the problem? I'm 
pretty sure the problem will disappear.
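
I mean something like this at the spot where your patch currently 
requests the shutdown (a testing hack only, not a fix; I'm assuming the 
call site is plain C code in votequorum where api->shutdown_request() 
is invoked):

	/*
	 * Testing hack: api->shutdown_request() only *schedules* the
	 * shutdown, so corosync keeps running and still takes part in
	 * the next sync round.  exit(1) makes the node disappear
	 * immediately.  Needs #include <stdlib.h>.
	 */
	/* api->shutdown_request(); */
	exit(1);

If the problem goes away with exit(1), that confirms the extra sync 
round after the shutdown request is what triggers it.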

>
> When it goes wrong, I see that votequorum_sync_init is only called
> *once* (with member_list_entries=1) -- after node 1 has joined and left.
> So it never sees node 1 in member_list, hence never marks it as
> NODESTATE_DEAD. But message_handler_req_exec_votequorum_nodeinfo has
> independently marked the node as NODESTATE_MEMBER, hence get_total_votes
> counts it and quorate is set to 1.

This is the second problem and it also has to be fixed. You have to be 
pretty lucky to be able to reproduce it so often.

Anyway, here is my theory:
- node 1 calls api->shutdown_request() and continues processing (this is 
problem 1)
- both nodes get to a state where they should call votequorum_sync_init, 
but node 2 is not scheduled at just that moment (as I said, this takes 
quite a bit of luck)
- node 1 calls votequorum_sync_init and votequorum_sync_process (sending 
nodeinfo) and is then shut down
- node 2 gets the nodeinfo
- node 2 sees node 1 shut down
- node 2 calls votequorum_sync_init and votequorum_sync_process

If this theory is correct, we probably have to fix sync.c to have 2 or 
ideally 3-4 barriers instead of 1.
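
To illustrate why a barrier before the per-service sync callbacks would 
close this window, here is a toy, self-contained sketch (build with 
gcc -pthread). It has nothing to do with the real totem/sync code; the 
thread names, the sleep and the USE_BARRIER switch are all made up for 
illustration:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define USE_BARRIER 0	/* set to 1 to close the window in this toy */

static pthread_barrier_t sync_barrier;

static void *node1(void *arg)
{
	(void)arg;
#if USE_BARRIER
	/* wait until node2 has also entered the sync phase */
	pthread_barrier_wait(&sync_barrier);
#endif
	printf("node1: sync_init/sync_process with the 2-member view, then gone\n");
	return NULL;	/* node1 leaves the cluster here */
}

static void *node2(void *arg)
{
	(void)arg;
	usleep(100 * 1000);	/* node2 happens to be scheduled late */
#if USE_BARRIER
	pthread_barrier_wait(&sync_barrier);
#endif
	printf("node2: sync_init -- possibly already with the 1-member view\n");
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_barrier_init(&sync_barrier, NULL, 2);
	pthread_create(&t1, NULL, node1, NULL);
	pthread_create(&t2, NULL, node2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	pthread_barrier_destroy(&sync_barrier);
	return 0;
}

With USE_BARRIER=0, node1 can finish its "sync" and be gone before node2 
even starts; with USE_BARRIER=1 both have to enter the phase together, 
which is what an extra barrier in sync.c would enforce.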

>
> So why is votequorum_sync_init sometimes only called once? It looks like
> it's all down to whether we manage to iterate through all the calls to
> schedwrk_processor before entering the OPERATIONAL state. I haven't yet
> looked into exactly what controls the timing of these two things.
>
> Adding the following patch helps me to demonstrate the problem more
> clearly:
>
> diff --git a/exec/sync.c b/exec/sync.c
> index e7b71bd..a2fb06d 100644
> --- a/exec/sync.c
> +++ b/exec/sync.c
> @@ -544,6 +545,7 @@ static int schedwrk_processor (const void *context)
>                  }
>
>                  if (my_sync_callbacks_retrieve(my_service_list[my_processing_idx].service_id, NULL) != -1) {
> +                       log_printf(LOGSYS_LEVEL_NOTICE, "calling sync_init on service '%s' (%d) with my_member_list_entries = %d", my_service_list[my_processing_idx].name, my_processing_idx, my_member_list_entries);
>                          my_service_list[my_processing_idx].sync_init (my_trans_list,
>                                  my_trans_list_entries, my_member_list,
>                                  my_member_list_entries,
> diff --git a/exec/votequorum.c b/exec/votequorum.c
> index d5f06c1..aab6c15 100644
> --- a/exec/votequorum.c
> +++ b/exec/votequorum.c
> @@ -2336,6 +2353,8 @@ static void votequorum_sync_init (
>          int left_nodes;
>          struct cluster_node *node;
>
> +       log_printf(LOGSYS_LEVEL_NOTICE, "votequorum_sync_init has %d member_list_entries", member_list_entries);
> +
>          ENTER();
>
>          sync_in_progress = 1;
>
> When it works correctly I see the following (selected log lines):
>
> notice  [TOTEM ] A new membership (10.71.218.17:2016) was formed. Members joined: 1
> notice  [SYNC  ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 2
> notice  [SYNC  ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 2
> notice  [SYNC  ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 2
> notice  [VOTEQ ] votequorum_sync_init has 2 member_list_entries
> notice  [TOTEM ] A new membership (10.71.218.18:2020) was formed. Members left: 1
> notice  [SYNC  ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 1
> notice  [SYNC  ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 1
> notice  [SYNC  ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
> notice  [VOTEQ ] votequorum_sync_init has 1 member_list_entries
>
>   -- Notice that votequorum_sync_init is called once with 2 members and
> once with 1 member.
>
> When it goes wrong I see the following (selected log lines):
>
> notice  [TOTEM ] A new membership (10.71.218.17:2004) was formed. Members joined: 1
> notice  [SYNC  ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 2
> notice  [SYNC  ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 2
> notice  [TOTEM ] A new membership (10.71.218.18:2008) was formed. Members left: 1
> notice  [SYNC  ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 1
> notice  [SYNC  ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 1
> notice  [SYNC  ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
> notice  [VOTEQ ] votequorum_sync_init has 1 member_list_entries
>
>   -- Notice the value of my_member_list_entries in the different
> sync_init calls, and that votequorum_sync_init is only called once.
>
> Does this help explain the issue?

It's definitely helpful. I will try to think a little more deeply about 
what may be happening. Can you please add similar debug info to the 
other sync_ callbacks, as you did for sync_init, and send me the logs 
from both nodes both when everything works and when it does not?
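
I mean something along these lines next to the other callback 
invocations in exec/sync.c (untested, just analogous to your sync_init 
hunk; the exact call sites and index variables may differ in your tree):

+	log_printf(LOGSYS_LEVEL_NOTICE, "calling sync_process on service '%s' (%d)",
+		my_service_list[my_processing_idx].name, my_processing_idx);

and similarly before the sync_activate and sync_abort calls, so the logs 
from both nodes show exactly which callbacks ran, in which order, and 
with which membership.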

Honza

>
> Thanks,
> Jonathan




