[ClusterLabs] corosync race condition when node leaves immediately after joining

Thu Oct 19 11:56:44 EDT 2017

Jonathan,

>
>
> On 18/10/17 16:18, Jan Friesse wrote:
>> Jonathan,
>>
>>>
>>> On 18/10/17 14:38, Jan Friesse wrote:
>>>> Can you please try to remove
>>>> "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c
>>>> in the votequorum_exec_init_fn function (around line 2306) and let me
>>>> know if problem persists?
>>>
>>> Wow! With that change, I'm pleased to say that I'm not able to reproduce
>>> the problem at all!
>>
>> Sounds good.
>>
>>>
>>> Is this a legitimate fix, or do we still need the call to
>>> votequorum_exec_send_nodeinfo for other reasons?
>>
>> That is a good question. The call to votequorum_exec_send_nodeinfo should
>> not be needed because it's called by sync_process only slightly later.
>>
>> But to mark this as a legitimate fix, I would like to find out why
>> this is happening and whether it is legal or not. Because I'm not
>> able to reproduce the bug at all (and I was really trying, also with
>> various usleeps/packet loss/...), I would like to have more information
>> about notworking_cluster1.log. Because tracing doesn't work, we need
>> to try the blackbox. Could you please add
>>
>> icmap_set_string("runtime.blackbox.dump_flight_data", "yes");
>>
>> line before api->shutdown_request(); in cmap.c ?
>>
>> It should trigger dumping the blackbox in /var/lib/corosync. When you
>> reproduce the nonworking_cluster1, could you please either:
>> - compress the file pointed to by the /var/lib/corosync/fdata symlink
>> - or execute corosync-blackbox
>> - or execute qb-blackbox "/var/lib/corosync/fdata"
>>
>> and send it?
>
> Attached, along with the "debug: trace" log from cluster2.

Thanks a lot for the logs. I'm - finally! - able to reproduce the bug
(with the 2 artificial pauses included at the end of this mail). I'll
try to fix the main bug (which may take some time, even though I have
some idea of what is happening) and let you know.

Thanks a lot for all the logs and your super helpful cooperation,
   Honza



>
> Thanks,
> Jonathan

diff --git a/exec/cpg.c b/exec/cpg.c
index 78ac1e9..8a2ce6a 100644
--- a/exec/cpg.c
+++ b/exec/cpg.c
@@ -591,6 +591,14 @@ static void cpg_sync_init (
  static int cpg_sync_process (void)
  {
         int res = -1;
+       static int pause = 0;
+
+       if (pause < 20) {
+               pause++;
+               return (-1);
+       } else {
+               pause = 0;
+       }

         if (my_sync_state == CPGSYNC_DOWNLIST) {
                 res = cpg_exec_send_downlist();
diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 91c5423..1240229 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -2156,6 +2156,7 @@ static void memb_state_operational_enter (struct totemsrp_instance *instance)
         memcpy (&instance->my_old_ring_id, &instance->my_ring_id,
                 sizeof (struct memb_ring_id));

+       poll(NULL, 0, 600);
         return;
  }