[ClusterLabs] corosync race condition when node leaves immediately after joining
Jan Friesse
jfriesse at redhat.com
Wed Oct 18 11:18:30 EDT 2017
Jonathan,
>
> On 18/10/17 14:38, Jan Friesse wrote:
>> Can you please try to remove
>> "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c
>> in the votequorum_exec_init_fn function (around line 2306) and let me
>> know if problem persists?
>
> Wow! With that change, I'm pleased to say that I'm not able to reproduce
> the problem at all!
Sounds good.
>
> Is this a legitimate fix, or do we still need the call to
> votequorum_exec_send_nodeinfo for other reasons?
That is a good question. Calling votequorum_exec_send_nodeinfo should
not be needed, because sync_process calls it only slightly later.
But before marking this as a legitimate fix, I would like to find out why
this is happening and whether it is legal or not. Because I'm not able
to reproduce the bug at all (and I really tried, also with various
usleeps/packet loss/...), I would like to have more information about
notworking_cluster1.log. Because tracing doesn't work, we need to try
blackbox. Could you please add
icmap_set_string("runtime.blackbox.dump_flight_data", "yes");
line before api->shutdown_request(); in cmap.c?
It should trigger dumping of the blackbox into /var/lib/corosync. When you
reproduce the nonworking_cluster1, could you please either:
- compress the file pointed to by the /var/lib/corosync/fdata symlink
- or execute corosync-blackbox
- or execute qb-blackbox "/var/lib/corosync/fdata"
and send it?
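For the first option, a minimal sketch of one way to compress the file behind the symlink. The helper name pack_fdata and the output name fdata.gz are only illustrative assumptions, not part of corosync, and it assumes GNU readlink and gzip are available:

```shell
#!/bin/sh
# Sketch: compress the blackbox file that the fdata symlink points to.
# pack_fdata is a hypothetical helper name, not a corosync tool.
pack_fdata() {
    link=${1:-/var/lib/corosync/fdata}   # symlink path named in this mail
    out=${2:-fdata.gz}                   # output file name is an assumption

    # Resolve the symlink to the real dump file (GNU readlink -f).
    target=$(readlink -f "$link") || return 1

    # Stop if no dump has been written yet.
    [ -f "$target" ] || return 1

    # Write a compressed copy and leave the original dump untouched.
    gzip -c "$target" > "$out"
}
```

Running pack_fdata without arguments after reproducing the problem (with enough permissions to read /var/lib/corosync) should leave fdata.gz in the current directory, ready to attach.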
Thank you for your help,
Honza
>
> Thanks,
> Jonathan