[ClusterLabs] corosync race condition when node leaves immediately after joining

Thu Oct 12 08:35:08 EDT 2017

On 12/10/17 11:54, Jan Friesse wrote:
> Jonathan,
> 
>>
>>
>> On 12/10/17 07:48, Jan Friesse wrote:
>>> Jonathan,
>>> I believe the main "problem" is votequorum's ability to work during the
>>> sync phase (votequorum is the only service with this ability, see
>>> votequorum_overview.8 section VIRTUAL SYNCHRONY)...
>>>
>>>> Hi ClusterLabs,
>>>>
>>>> I'm seeing a race condition in corosync where votequorum can have
>>>> incorrect membership info when a node joins the cluster then leaves
>>>> very soon after.
>>>>
>>>> I'm on corosync-2.3.4 plus my patch
> 
> I finally noticed this ^^^. 2.3.4 is really old, and unless it is some
> patched version, I wouldn't recommend using it. Can you give current
> needle a try?
> 
>>>> https://github.com/corosync/corosync/pull/248. That patch makes the
>>>> problem readily reproducible but the bug was already present.
>>>>
>>>> Here's the scenario. I have two hosts, cluster1 and cluster2. The
>>>> corosync.conf on cluster2 is:
>>>>
>>>>      totem {
>>>>        version: 2
>>>>        cluster_name: test
>>>>        config_version: 2
>>>>        transport: udpu
>>>>      }
>>>>      nodelist {
>>>>        node {
>>>>          nodeid: 1
>>>>          ring0_addr: cluster1
>>>>        }
>>>>        node {
>>>>          nodeid: 2
>>>>          ring0_addr: cluster2
>>>>        }
>>>>      }
>>>>      quorum {
>>>>        provider: corosync_votequorum
>>>>        auto_tie_breaker: 1
>>>>      }
>>>>      logging {
>>>>        to_syslog: yes
>>>>      }
>>>>
>>>> The corosync.conf on cluster1 is the same except with
>>>> "config_version: 1".
>>>>
>>>> I start corosync on cluster2. When I start corosync on cluster1, it
>>>> joins and then immediately leaves due to the lower config_version.
>>>> (Previously corosync on cluster2 would also exit but with
>>>> https://github.com/corosync/corosync/pull/248 it remains alive.)
>>>>
>>>> But often at this point, cluster1's disappearance is not reflected in
>>>> the votequorum info on cluster2:
>>>
>>> ... Is this permanent (= until a new node joins or leaves), or will it
>>> fix itself over a (short) time? If it is permanent, it's a bug. If it
>>> fixes itself, it's a result of votequorum not being virtually synchronous.
>>
>> Yes, it's permanent. After several minutes of waiting, votequorum still
>> reports "total votes: 2" even though there's only one member.
> 
> 
> That's bad. I've tried the following setup:
> 
> - Both nodes with current needle
> - Your config
> - Second node is just running corosync
> - First node is running the following command:
>   while true;do corosync -f; ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1;done
> 
> I've been running it for quite a while and I'm unable to reproduce the
> bug. Sadly, I'm unable to reproduce it even with 2.3.4. Do you think
> that reproducer is correct?
> 

I can't reproduce it either.
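
One caveat on the check in that loop: "grep Total | grep 1" would also
match a total such as 10 or 21, so a stricter comparison may be worth
having. Below is a rough sketch, not a tested reproducer. It assumes the
corosync-quorumtool output contains a line of the form "Total votes: N";
the function names total_votes and check_total_votes are made up for
illustration.

```shell
#!/bin/sh
# Sketch only: assumes corosync-quorumtool prints a "Total votes: N" line.
# Both function names below are hypothetical helpers, not corosync tools.

# Read quorumtool output on stdin and print the "Total votes" value.
total_votes() {
    awk -F: '/^Total votes/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Succeed only if the reported total equals the expected count exactly.
check_total_votes() {
    expected=$1
    actual=$(total_votes)
    [ "$actual" = "$expected" ]
}
```

In the loop above, the ssh step would then become something like:
ssh node2 corosync-quorumtool | check_total_votes 1 || exit 1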

Chrissie