[ClusterLabs] corosync race condition when node leaves immediately after joining
Christine Caulfield
ccaulfie at redhat.com
Thu Oct 12 08:35:08 EDT 2017
On 12/10/17 11:54, Jan Friesse wrote:
> Jonathan,
>
>>
>>
>> On 12/10/17 07:48, Jan Friesse wrote:
>>> Jonathan,
>>> I believe the main "problem" is votequorum's ability to work during the
>>> sync phase (votequorum is the only service with this ability, see
>>> votequorum_overview.8 section VIRTUAL SYNCHRONY)...
>>>
>>>> Hi ClusterLabs,
>>>>
>>>> I'm seeing a race condition in corosync where votequorum can have
>>>> incorrect membership info when a node joins the cluster then leaves
>>>> very soon after.
>>>>
>>>> I'm on corosync-2.3.4 plus my patch
>
> Finally noticed ^^^ 2.3.4 is really old, and unless it is some
> patched version, I wouldn't recommend using it. Can you give the
> current needle a try?
>
>>>> https://github.com/corosync/corosync/pull/248. That patch makes the
>>>> problem readily reproducible but the bug was already present.
>>>>
>>>> Here's the scenario. I have two hosts, cluster1 and cluster2. The
>>>> corosync.conf on cluster2 is:
>>>>
>>>> totem {
>>>>     version: 2
>>>>     cluster_name: test
>>>>     config_version: 2
>>>>     transport: udpu
>>>> }
>>>> nodelist {
>>>>     node {
>>>>         nodeid: 1
>>>>         ring0_addr: cluster1
>>>>     }
>>>>     node {
>>>>         nodeid: 2
>>>>         ring0_addr: cluster2
>>>>     }
>>>> }
>>>> quorum {
>>>>     provider: corosync_votequorum
>>>>     auto_tie_breaker: 1
>>>> }
>>>> logging {
>>>>     to_syslog: yes
>>>> }
>>>>
>>>> The corosync.conf on cluster1 is the same except with
>>>> "config_version: 1".
>>>>
>>>> I start corosync on cluster2. When I start corosync on cluster1, it
>>>> joins and then immediately leaves due to the lower config_version.
>>>> (Previously corosync on cluster2 would also exit but with
>>>> https://github.com/corosync/corosync/pull/248 it remains alive.)
>>>>
>>>> But often at this point, cluster1's disappearance is not reflected in
>>>> the votequorum info on cluster2:
>>>
>>> ... Is this permanent (= until a new node joins/leaves), or will it fix
>>> itself over a (short) time? If this is permanent, it's a bug. If it
>>> fixes itself, it's a result of votequorum not being virtually synchronous.
>>
>> Yes, it's permanent. After several minutes of waiting, votequorum still
>> reports "total votes: 2" even though there's only one member.
>
>
> That's bad. I've tried the following setup:
>
> - Both nodes with current needle
> - Your config
> - Second node is just running corosync
> - First node is running the following command:
>   while true; do corosync -f; ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1; done
>
> I've been running it for quite a while and I'm unable to reproduce the
> bug. Sadly, I'm unable to reproduce the bug even with 2.3.4. Do you think
> that reproducer is correct?
>
I can't reproduce it either.
Chrissie
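
For anyone retrying the reproducer quoted above: `grep Total | grep 1` also
matches any "Total" line that merely contains the digit 1 (10, 11, ...).
That doesn't matter with two nodes, but a stricter check is easy to write.
Below is a minimal sketch, not a tested fix. The function names
total_votes and check_total_votes are made up for illustration, and it
assumes corosync-quorumtool prints a line of the form "Total votes: N".

```shell
# Print the value of the "Total votes:" line from corosync-quorumtool
# output read on stdin. Assumes the line looks like "Total votes:      2".
total_votes() {
    awk -F: '/^Total votes:/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# Succeed only if the reported total equals the expected count.
# $1 = expected total votes; quorumtool output is read from stdin.
check_total_votes() {
    expected=$1
    actual=$(total_votes)
    [ "$actual" = "$expected" ]
}
```

In the loop this would replace the two greps, e.g.
`ssh node2 corosync-quorumtool | check_total_votes 1 || exit 1`.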