[ClusterLabs] corosync race condition when node leaves immediately after joining

Wed Oct 18 11:18:30 EDT 2017

Jonathan,

>
> On 18/10/17 14:38, Jan Friesse wrote:
>> Can you please try to remove
>> "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c
>> in the votequorum_exec_init_fn function (around line 2306) and let me
>> know if problem persists?
>
> Wow! With that change, I'm pleased to say that I'm not able to reproduce
> the problem at all!

Sounds good.

>
> Is this a legitimate fix, or do we still need the call to
> votequorum_exec_send_nodeinfo for other reasons?

That is a good question. Calling votequorum_exec_send_nodeinfo should
not be needed because it's called by sync_process only slightly later.

But to mark this as a legitimate fix, I would like to find out why this
is happening and whether it is legal or not. Because I'm not able to
reproduce the bug at all (and I really tried, also with various
usleeps/packet loss/...), I would like to have more information about
notworking_cluster1.log. Because tracing doesn't work, we need to try
blackbox. Could you please add the

icmap_set_string("runtime.blackbox.dump_flight_data", "yes");

line before api->shutdown_request(); in cmap.c?

This should trigger a blackbox dump in /var/lib/corosync. When you
reproduce the nonworking_cluster1, could you please either:
- compress the file pointed to by the /var/lib/corosync/fdata symlink
- or execute corosync-blackbox
- or execute qb-blackbox "/var/lib/corosync/fdata"

and send it?
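
For the first option, something like the following might do. It is only a rough sketch: the collect_fdata helper name, its arguments, and the output file name are made up for illustration, and it assumes a readlink that supports -f (GNU coreutils) plus gzip being installed.

```shell
#!/bin/sh
# Sketch: compress the file that the fdata symlink points to, so that it
# can be attached to a mail. collect_fdata and its defaults are
# hypothetical names, not part of corosync.
collect_fdata() {
    dir="${1:-/var/lib/corosync}"    # directory holding the fdata symlink
    out="${2:-corosync-fdata.gz}"    # compressed copy to send

    # Resolve the symlink to the real flight-data file.
    target="$(readlink -f "$dir/fdata")" || return 1
    [ -f "$target" ] || return 1

    # Write a compressed copy and leave the original file untouched.
    gzip -c "$target" > "$out"
}
```

Calling collect_fdata with no arguments (most likely as root, since /var/lib/corosync is usually not world-readable) should leave corosync-fdata.gz in the current directory. The other two options need no helper: corosync-blackbox and qb-blackbox "/var/lib/corosync/fdata" print the decoded data, which can be redirected to a file.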

Thank you for your help,
   Honza

>
> Thanks,
> Jonathan