[ClusterLabs] corosync race condition when node leaves immediately after joining

Jonathan Davies jonathan.davies at citrix.com
Thu Oct 19 12:05:13 EDT 2017



On 19/10/17 16:56, Jan Friesse wrote:
> Jonathan,
> 
>>
>>
>> On 18/10/17 16:18, Jan Friesse wrote:
>>> Jonathan,
>>>
>>>>
>>>> On 18/10/17 14:38, Jan Friesse wrote:
>>>>> Can you please try to remove
>>>>> "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c
>>>>> in the votequorum_exec_init_fn function (around line 2306) and let me
>>>>> know if problem persists?
>>>>
>>>> Wow! With that change, I'm pleased to say that I'm not able to
>>>> reproduce the problem at all!
>>>
>>> Sounds good.
>>>
>>>>
>>>> Is this a legitimate fix, or do we still need the call to
>>>> votequorum_exec_send_nodeinfo for other reasons?
>>>
>>> That is a good question. Calling votequorum_exec_send_nodeinfo should
>>> not be needed because it's called by sync_process only slightly later.
>>>
>>> But to mark this as a legitimate fix, I would like to find out why
>>> this is happening and whether it is legal or not. Basically, because
>>> I'm not able to reproduce the bug at all (and I was really trying,
>>> also with various usleeps/packet loss/...), I would like to have more
>>> information about notworking_cluster1.log. Because tracing doesn't
>>> work, we need to try blackbox. Could you please add
>>>
>>> icmap_set_string("runtime.blackbox.dump_flight_data", "yes");
>>>
>>> line before api->shutdown_request(); in cmap.c?
>>>
>>> It should trigger dumping the blackbox in /var/lib/corosync. When you
>>> reproduce the nonworking_cluster1, could you please either:
>>> - compress the file pointed to by the /var/lib/corosync/fdata symlink
>>> - or execute corosync-blackbox
>>> - or execute qb-blackbox "/var/lib/corosync/fdata"
>>>
>>> and send it?
>>
>> Attached, along with the "debug: trace" log from cluster2.
> 
> Thanks a lot for the logs. I'm - finally!!!! - able to reproduce the bug
> (with the 2 artificial pauses - included at the end of the mail). I'll
> try to fix the main bug (which may take some time, even though I have
> some idea of what is happening) and let you know.

Glad to hear that the logs are useful and you're able to reproduce the 
problem! I look forward to hearing what you come up with, and am happy 
to test out patches if that would help.

> Thanks a lot for all the logs and your super helpful cooperation,
>    Honza

Same to you!

Thanks,
Jonathan



