[ClusterLabs] corosync race condition when node leaves immediately after joining

Jonathan Davies jonathan.davies at citrix.com
Thu Oct 19 05:10:31 EDT 2017



On 18/10/17 16:18, Jan Friesse wrote:
> Jonathan,
> 
>>
>> On 18/10/17 14:38, Jan Friesse wrote:
>>> Can you please try to remove
>>> "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c
>>> in the votequorum_exec_init_fn function (around line 2306) and let me
>>> know if problem persists?
>>
>> Wow! With that change, I'm pleased to say that I'm not able to reproduce
>> the problem at all!
> 
> Sounds good.
> 
>>
>> Is this a legitimate fix, or do we still need the call to
>> votequorum_exec_send_nodeinfo for other reasons?
> 
> That is a good question. The call to votequorum_exec_send_nodeinfo should
> not be needed because it's called by sync_process only slightly later.
> 
> But to mark this as a legitimate fix, I would like to find out why this
> is happening and whether it is legal or not. Basically, because I'm not
> able to reproduce the bug at all (and I really tried, also with various
> usleeps/packet loss/...), I would like to have more information about
> notworking_cluster1.log. Because tracing doesn't work, we need to try
> the blackbox. Could you please add
> 
> icmap_set_string("runtime.blackbox.dump_flight_data", "yes");
> 
> line before api->shutdown_request(); in cmap.c?
> 
> It should trigger dumping the blackbox into /var/lib/corosync. When you
> reproduce the notworking_cluster1 case, could you please either:
> - compress the file pointed to by the /var/lib/corosync/fdata symlink
> - or execute corosync-blackbox
> - or execute qb-blackbox "/var/lib/corosync/fdata"
> 
> and send it?

Attached, along with the "debug: trace" log from cluster2.
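
For anyone repeating this, the first option in the list above (compressing the file behind the fdata symlink) can be sketched roughly as follows. This is only a minimal sketch: the compress_fdata helper and the fdata.gz output name are hypothetical and not part of corosync, and it assumes GNU readlink and gzip are available.

```shell
#!/bin/sh
# Hypothetical helper (not part of corosync): compress the flight-data file
# that the fdata symlink points to, leaving the original file in place.
compress_fdata() {
    link="${1:-/var/lib/corosync/fdata}"   # symlink path named in the thread
    out="${2:-fdata.gz}"                   # output name is an assumption

    # Resolve the symlink to the real dump file.
    target=$(readlink -f "$link") || return 1

    # Refuse to continue if nothing was dumped yet.
    [ -f "$target" ] || return 1

    # Write a compressed copy; -c keeps the original untouched.
    gzip -c "$target" > "$out"
}
```

Calling compress_fdata with no arguments would use the default symlink path; the other two options from the list are simply corosync-blackbox or qb-blackbox "/var/lib/corosync/fdata", run as given.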

Thanks,
Jonathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fdata-2017-10-19T10:05:12-17515.gz
Type: application/gzip
Size: 13344 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20171019/a8ee3945/attachment-0003.gz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: notworking_cluster1.log.gz
Type: application/gzip
Size: 1241 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20171019/a8ee3945/attachment-0004.gz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: notworking_cluster2.log.gz
Type: application/gzip
Size: 7879 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20171019/a8ee3945/attachment-0005.gz>

