[ClusterLabs] corosync race condition when node leaves immediately after joining
Jonathan Davies
jonathan.davies at citrix.com
Mon Oct 16 11:51:53 EDT 2017
On 16/10/17 15:58, Jan Friesse wrote:
> Jonathan,
>
>>
>>
>> On 13/10/17 17:24, Jan Friesse wrote:
>>>> I've done a bit of digging and am getting closer to the root cause of
>>>> the race.
>>>>
>>>> We rely on having votequorum_sync_init called twice -- once when node 1
>>>> joins (with member_list_entries=2) and once when node 1 leaves (with
>>>> member_list_entries=1). This is important because votequorum_sync_init
>>>> marks nodes as NODESTATE_DEAD if they are not in quorum_members[] -- so
>>>> it needs to have seen the node appear then disappear. This is important
>>>> because get_total_votes only counts votes from nodes in state
>>>> NODESTATE_MEMBER.
>>>
>>> So there are basically two problems.
>>>
>>> Actually first (main) problem is that votequorum_sync_init is ever
>>> called when that node joins. It really shouldn't. And problem is
>>> simply because calling api->shutdown_request() is not enough. Can you
>>> try replace it with exit(1) (for testing) and reproduce the problem?
>>> I'm pretty sure problem disappears.
>>
>> No, the problem still happens :-(
>
> Not good.
>
>>
>> I am using the following patch:
>>
>> diff --git a/exec/cmap.c b/exec/cmap.c
>> index de730d2..1125cef 100644
>> --- a/exec/cmap.c
>> +++ b/exec/cmap.c
>> @@ -406,7 +406,7 @@ static void cmap_sync_activate (void)
>> log_printf(LOGSYS_LEVEL_ERROR,
>> "Received config version (%"PRIu64") is different
>> than my config version (%"PRIu64")! Exiting",
>> cmap_highest_config_version_received,
>> cmap_my_config_version);
>> - api->shutdown_request();
>> + exit(1);
>> return ;
>> }
>> }
>> diff --git a/exec/main.c b/exec/main.c
>> index b0d5639..4fd3e68 100644
>> --- a/exec/main.c
>> +++ b/exec/main.c
>> @@ -627,6 +627,7 @@ static void deliver_fn (
>> ((void *)msg);
>> }
>>
>> + log_printf(LOGSYS_LEVEL_NOTICE, "executing '%s' exec_handler_fn
>> %p for node %d (fn %d)", corosync_service[service]->name,
>> corosync_service[service]->exec_engine[fn_id].exec_handler_fn, nodeid,
>> fn_id);
>> corosync_service[service]->exec_engine[fn_id].exec_handler_fn
>> (msg, nodeid);
>> }
>> diff --git a/exec/votequorum.c b/exec/votequorum.c
>> index 1a97c6d..7c0f34f 100644
>> --- a/exec/votequorum.c
>> +++ b/exec/votequorum.c
>> @@ -2099,6 +2100,7 @@ static void
>> message_handler_req_exec_votequorum_nodeinfo (
>> node->flags = req_exec_quorum_nodeinfo->flags;
>> node->votes = req_exec_quorum_nodeinfo->votes;
>> node->state = NODESTATE_MEMBER;
>> + log_printf(LOGSYS_LEVEL_NOTICE,
>> "message_handler_req_exec_votequorum_nodeinfo (%p) marking node %d as
>> MEMBER", message_handler_req_exec_votequorum_nodeinfo, nodeid);
>>
>> if (node->flags & NODE_FLAGS_LEAVING) {
>> node->state = NODESTATE_LEAVING;
>>
>> When it's working correctly I see this:
>>
>> 1508151960.072927 notice [TOTEM ] A new membership (10.71.218.17:2304)
>> was formed. Members joined: 1
>> 1508151960.073082 notice [SYNC ] calling sync_init on service
>> 'corosync configuration map access' (0) with my_member_list_entries = 2
>> 1508151960.073150 notice [MAIN ] executing 'corosync configuration map
>> access' exec_handler_fn 0x55b5eb504ca0 for node 1 (fn 0)
>> 1508151960.073197 notice [MAIN ] executing 'corosync configuration map
>> access' exec_handler_fn 0x55b5eb504ca0 for node 2 (fn 0)
>> 1508151960.073238 notice [SYNC ] calling sync_init on service
>> 'corosync cluster closed process group service v1.01' (1) with
>> my_member_list_entries = 2
>> 1508151961.073033 notice [TOTEM ] A processor failed, forming new
>> configuration.
>>
>> When it's not working correctly I see this:
>>
>> 1508151908.447584 notice [TOTEM ] A new membership (10.71.218.17:2292)
>> was formed. Members joined: 1
>> 1508151908.447757 notice [MAIN ] executing 'corosync vote quorum
>> service v1.0' exec_handler_fn 0x558b39fbbaa0 for node 1 (fn 0)
>> 1508151908.447866 notice [VOTEQ ]
>> message_handler_req_exec_votequorum_nodeinfo (0x558b39fbbaa0) marking
>> node 1 as MEMBER
>> 1508151908.447972 notice [VOTEQ ] get_total_votes: node 1 is a MEMBER
>> so counting vote
>> 1508151908.448045 notice [VOTEQ ] get_total_votes: node 2 is a MEMBER
>> so counting vote
>> 1508151908.448091 notice [QUORUM] This node is within the primary
>> component and will provide service.
>> 1508151908.448134 notice [QUORUM] Members[1]: 2
>> 1508151908.448175 notice [SYNC ] calling sync_init on service
>> 'corosync configuration map access' (0) with my_member_list_entries = 2
>> 1508151908.448205 notice [MAIN ] executing 'corosync configuration map
>> access' exec_handler_fn 0x558b39fb3ca0 for node 1 (fn 0)
>> 1508151908.448247 notice [MAIN ] executing 'corosync configuration map
>> access' exec_handler_fn 0x558b39fb3ca0 for node 2 (fn 0)
>> 1508151908.448307 notice [SYNC ] calling sync_init on service
>> 'corosync cluster closed process group service v1.01' (1) with
>> my_member_list_entries = 2
>> 1508151909.447182 notice [TOTEM ] A processor failed, forming new
>> configuration.
>>
>> ... and at that point I already see "Total votes: 2" in the
>> corosync-quorumtool output.
>>
>> The key difference seems to be whether votequorum's nodeinfo
>> exec_handler_fn is called. If it is called, it marks the node as
>> NODESTATE_MEMBER, which is sufficient to trigger the bug because it is
>> never subsequently marked as NODESTATE_DEAD.
>>
>> (Note that in both cases there is now a one-second delay before the "A
>> processor failed" message. This is due to calling exit(1) rather than
>> api->shutdown_request().)
>>
>>>> When it goes wrong, I see that votequorum_sync_init is only called
>>>> *once* (with member_list_entries=1) -- after node 1 has joined and
>>>> left.
>>>> So it never sees node 1 in member_list, hence never marks it as
>>>> NODESTATE_DEAD. But message_handler_req_exec_votequorum_nodeinfo has
>>>> indepedently marked the node as NODESTATE_MEMBER, hence get_total_votes
>>>> counts it and quorate is set to 1.
>>>
>>> This is second problem and it has to be also fixed. You have to be
>>> pretty lucky to reproduce it so often.
>>>
>>> Anyway, here is theory:
>>> - node 1 calls api->shutdown_request() and continue processing (this
>>> is problem 1)
>>> - both nodes gets to a state where they should call
>>> votequorum_sync_init, but node 2 is now not scheduled (as I said, it
>>> must be pretty luck)
>>> - node 1 calls votequorum_sync_init and votequorum_sync_process
>>> (sending nodeinfo) and it's shutdown
>>> - node 2 gets nodeinfo
>>> - node 2 sees node 1 shutdown
>>> - node 2 calls votequorum_sync_init and votequorum_sync_process
>>>
>>> If this theory is true, we must probably fix the sync.c to have 2 or
>>> ideally 3-4 barriers instead of 1.
>>>
>>>>
>>>> So why is votequorum_sync_init sometimes only called once? It looks
>>>> like
>>>> it's all down to whether we manage to iterate through all the calls to
>>>> schedwrk_processor before entering the OPERATIONAL state. I haven't yet
>>>> looked into exactly what controls the timing of these two things.
>>>>
>>>> Adding the following patch helps me to demonstrate the problem more
>>>> clearly:
>>>>
>>>> diff --git a/exec/sync.c b/exec/sync.c
>>>> index e7b71bd..a2fb06d 100644
>>>> --- a/exec/sync.c
>>>> +++ b/exec/sync.c
>>>> @@ -544,6 +545,7 @@ static int schedwrk_processor (const void *context)
>>>> }
>>>>
>>>> if
>>>> (my_sync_callbacks_retrieve(my_service_list[my_processing_idx].service_id,
>>>>
>>>>
>>>> NULL) != -1) {
>>>> + log_printf(LOGSYS_LEVEL_NOTICE, "calling
>>>> sync_init on service '%s' (%d) with my_member_list_entries = %d",
>>>> my_service_list[my_processing_idx].name, my_processing_idx,
>>>> my_member_list_entries);
>>>> my_service_list[my_processing_idx].sync_init
>>>> (my_trans_list,
>>>> my_trans_list_entries, my_member_list,
>>>> my_member_list_entries,
>>>> diff --git a/exec/votequorum.c b/exec/votequorum.c
>>>> index d5f06c1..aab6c15 100644
>>>> --- a/exec/votequorum.c
>>>> +++ b/exec/votequorum.c
>>>> @@ -2336,6 +2353,8 @@ static void votequorum_sync_init (
>>>> int left_nodes;
>>>> struct cluster_node *node;
>>>>
>>>> + log_printf(LOGSYS_LEVEL_NOTICE, "votequorum_sync_init has %d
>>>> member_list_entries", member_list_entries);
>>>> +
>>>> ENTER();
>>>>
>>>> sync_in_progress = 1;
>>>>
>>>> When it works correctly I see the following (selected log lines):
>>>>
>>>> notice [TOTEM ] A new membership (10.71.218.17:2016) was formed.
>>>> Members joined: 1
>>>> notice [SYNC ] calling sync_init on service 'corosync configuration
>>>> map access' (0) with my_member_list_entries = 2
>>>> notice [SYNC ] calling sync_init on service 'corosync cluster closed
>>>> process group service v1.01' (1) with my_member_list_entries = 2
>>>> notice [SYNC ] calling sync_init on service 'corosync vote quorum
>>>> service v1.0' (2) with my_member_list_entries = 2
>>>> notice [VOTEQ ] votequorum_sync_init has 2 member_list_entries
>>>> notice [TOTEM ] A new membership (10.71.218.18:2020) was formed.
>>>> Members left: 1
>>>> notice [SYNC ] calling sync_init on service 'corosync configuration
>>>> map access' (0) with my_member_list_entries = 1
>>>> notice [SYNC ] calling sync_init on service 'corosync cluster closed
>>>> process group service v1.01' (1) with my_member_list_entries = 1
>>>> notice [SYNC ] calling sync_init on service 'corosync vote quorum
>>>> service v1.0' (2) with my_member_list_entries = 1
>>>> notice [VOTEQ ] votequorum_sync_init has 1 member_list_entries
>>>>
>>>> -- Notice that votequorum_sync_init is called once with 2 members and
>>>> once with 1 member.
>>>>
>>>> When it goes wrong I see the following (selected log lines):
>>>>
>>>> notice [TOTEM ] A new membership (10.71.218.17:2004) was formed.
>>>> Members joined: 1
>>>> notice [SYNC ] calling sync_init on service 'corosync configuration
>>>> map access' (0) with my_member_list_entries = 2
>>>> notice [SYNC ] calling sync_init on service 'corosync cluster closed
>>>> process group service v1.01' (1) with my_member_list_entries = 2
>>>> notice [TOTEM ] A new membership (10.71.218.18:2008) was formed.
>>>> Members left: 1
>>>> notice [SYNC ] calling sync_init on service 'corosync configuration
>>>> map access' (0) with my_member_list_entries = 1
>>>> notice [SYNC ] calling sync_init on service 'corosync cluster closed
>>>> process group service v1.01' (1) with my_member_list_entries = 1
>>>> notice [SYNC ] calling sync_init on service 'corosync vote quorum
>>>> service v1.0' (2) with my_member_list_entries = 1
>>>> notice [VOTEQ ] votequorum_sync_init has 1 member_list_entries
>>>>
>>>> -- Notice the value of my_member_list_entries in the different
>>>> sync_init calls, and that votequorum_sync_init is only called once.
>>>>
>>>> Does this help explain the issue?
>>>
>>> It's definitively helpful. I will try to thing about what may be
>>> happening a little deeper. Can you please add similar debug info as
>>> you have for sync_init also to other sync_ and send me the log of both
>>> nodes when everything works and when do not?
>>
>> Okay. I've rolled back to make the call to api->shutdown_request() on
>> config_version mismatch, and have added debug lines in sync.c just
>> before calls to sync_activate, sync_abort and sync_process.
>>
>> When it's working correctly:
>>
>> 1508158482.991782 notice [TOTEM ] A new membership (10.71.218.17:2384)
>> was formed. Members joined: 1
>> 1508158482.991930 notice [MAIN ] executing 'corosync vote quorum
>> service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 1 (fn 0)
>> 1508158482.992029 notice [VOTEQ ]
>> message_handler_req_exec_votequorum_nodeinfo (0x56492b0d6bf0) marking
>> node 1 as MEMBER
>> 1508158482.992122 notice [VOTEQ ] get_total_votes: node 1 is a MEMBER
>> so counting vote
>> 1508158482.992206 notice [VOTEQ ] get_total_votes: node 2 is a MEMBER
>> so counting vote
>> 1508158482.992263 notice [QUORUM] This node is within the primary
>> component and will provide service.
>> 1508158482.992307 notice [QUORUM] Members[1]: 2
>> 1508158482.992351 notice [SYNC ] calling sync_init on service
>> 'corosync configuration map access' (0) with my_member_list_entries = 2
>> 1508158482.992382 notice [SYNC ] calling sync_process on service
>> 'corosync configuration map access' (0)
>> 1508158482.992425 notice [MAIN ] executing 'corosync configuration map
>> access' exec_handler_fn 0x56492b0cedf0 for node 1 (fn 0)
>> 1508158482.992456 notice [MAIN ] executing 'corosync configuration map
>> access' exec_handler_fn 0x56492b0cedf0 for node 2 (fn 0)
>> 1508158482.992498 notice [SYNC ] calling sync_activate on service
>> 'corosync configuration map access' (0)
>> 1508158482.992529 notice [SYNC ] calling sync_init on service
>> 'corosync cluster closed process group service v1.01' (1) with
>> my_member_list_entries = 2
>> 1508158482.992572 notice [SYNC ] calling sync_process on service
>> 'corosync cluster closed process group service v1.01' (1)
>> 1508158482.992603 notice [MAIN ] executing 'corosync cluster closed
>> process group service v1.01' exec_handler_fn 0x56492b0d2940 for node 2
>> (fn 5)
>> 1508158482.992646 notice [MAIN ] executing 'corosync cluster closed
>> process group service v1.01' exec_handler_fn 0x56492b0d2940 for node 1
>> (fn 5)
>> 1508158482.992677 notice [SYNC ] calling sync_activate on service
>> 'corosync cluster closed process group service v1.01' (1)
>> 1508158482.992720 notice [SYNC ] calling sync_init on service
>> 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 2
>> 1508158482.992750 notice [VOTEQ ] votequorum_sync_init has 2
>> member_list_entries
>> 1508158482.992792 notice [SYNC ] calling sync_process on service
>> 'corosync vote quorum service v1.0' (2)
>> 1508158482.992850 notice [TOTEM ] A new membership (10.71.218.18:2388)
>> was formed. Members left: 1
>> 1508158482.992895 notice [MAIN ] executing 'corosync vote quorum
>> service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
>> 1508158482.992938 notice [VOTEQ ]
>> message_handler_req_exec_votequorum_nodeinfo (0x56492b0d6bf0) marking
>> node 2 as MEMBER
>> 1508158482.992968 notice [VOTEQ ] get_total_votes: node 1 is a MEMBER
>> so counting vote
>> 1508158482.993011 notice [VOTEQ ] get_total_votes: node 2 is a MEMBER
>> so counting vote
>> 1508158482.993042 notice [MAIN ] executing 'corosync vote quorum
>> service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
>> 1508158482.993084 notice [SYNC ] calling sync_init on service
>> 'corosync configuration map access' (0) with my_member_list_entries = 1
>> 1508158482.993115 notice [SYNC ] calling sync_process on service
>> 'corosync configuration map access' (0)
>> 1508158482.993157 notice [MAIN ] executing 'corosync configuration map
>> access' exec_handler_fn 0x56492b0cedf0 for node 2 (fn 0)
>> 1508158482.993188 notice [SYNC ] calling sync_activate on service
>> 'corosync configuration map access' (0)
>> 1508158482.993231 notice [SYNC ] calling sync_init on service
>> 'corosync cluster closed process group service v1.01' (1) with
>> my_member_list_entries = 1
>> 1508158482.993261 notice [SYNC ] calling sync_process on service
>> 'corosync cluster closed process group service v1.01' (1)
>> 1508158482.993304 notice [MAIN ] executing 'corosync cluster closed
>> process group service v1.01' exec_handler_fn 0x56492b0d2940 for node 2
>> (fn 5)
>> 1508158482.993335 notice [SYNC ] calling sync_activate on service
>> 'corosync cluster closed process group service v1.01' (1)
>> 1508158482.993378 notice [SYNC ] calling sync_init on service
>> 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
>> 1508158482.993409 notice [VOTEQ ] votequorum_sync_init has 1
>> member_list_entries
>> 1508158482.993452 notice [VOTEQ ] votequorum_sync_init marking node 1
>> as DEAD
>> 1508158482.993482 notice [SYNC ] calling sync_process on service
>> 'corosync vote quorum service v1.0' (2)
>> 1508158482.993525 notice [MAIN ] executing 'corosync vote quorum
>> service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
>> 1508158482.993555 notice [VOTEQ ]
>> message_handler_req_exec_votequorum_nodeinfo (0x56492b0d6bf0) marking
>> node 2 as MEMBER
>> 1508158482.993598 notice [VOTEQ ] get_total_votes: node 2 is a MEMBER
>> so counting vote
>> 1508158482.993628 notice [MAIN ] executing 'corosync vote quorum
>> service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
>> 1508158482.993673 notice [SYNC ] calling sync_activate on service
>> 'corosync vote quorum service v1.0' (2)
>> 1508158482.993703 notice [VOTEQ ] get_total_votes: node 2 is a MEMBER
>> so counting vote
>> 1508158482.993746 notice [QUORUM] This node is within the non-primary
>> component and will NOT provide any services.
>> 1508158482.993776 notice [QUORUM] Members[1]: 2
>> 1508158482.993818 notice [MAIN ] Completed service synchronization,
>> ready to provide service.
>>
>> When it's working incorrectly:
>>
>> 1508158493.905246 notice [TOTEM ] A new membership (10.71.218.17:2392)
>> was formed. Members joined: 1
>> 1508158493.905392 notice [MAIN ] executing 'corosync vote quorum
>> service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 1 (fn 0)
>> 1508158493.905522 notice [VOTEQ ]
>> message_handler_req_exec_votequorum_nodeinfo (0x56492b0d6bf0) marking
>> node 1 as MEMBER
>> 1508158493.905623 notice [VOTEQ ] get_total_votes: node 1 is a MEMBER
>> so counting vote
>> 1508158493.905697 notice [VOTEQ ] get_total_votes: node 2 is a MEMBER
>> so counting vote
>> 1508158493.905742 notice [QUORUM] This node is within the primary
>> component and will provide service.
>> 1508158493.905773 notice [QUORUM] Members[1]: 2
>> 1508158493.905815 notice [SYNC ] calling sync_init on service
>> 'corosync configuration map access' (0) with my_member_list_entries = 2
>> 1508158493.905889 notice [SYNC ] calling sync_process on service
>> 'corosync configuration map access' (0)
>> 1508158493.905921 notice [MAIN ] executing 'corosync configuration map
>> access' exec_handler_fn 0x56492b0cedf0 for node 1 (fn 0)
>> 1508158493.905951 notice [MAIN ] executing 'corosync configuration map
>> access' exec_handler_fn 0x56492b0cedf0 for node 2 (fn 0)
>> 1508158493.905993 notice [SYNC ] calling sync_activate on service
>> 'corosync configuration map access' (0)
>> 1508158493.906023 notice [SYNC ] calling sync_init on service
>> 'corosync cluster closed process group service v1.01' (1) with
>> my_member_list_entries = 2
>> 1508158493.906066 notice [SYNC ] calling sync_process on service
>> 'corosync cluster closed process group service v1.01' (1)
>> 1508158493.906096 notice [MAIN ] executing 'corosync cluster closed
>> process group service v1.01' exec_handler_fn 0x56492b0d2940 for node 2
>> (fn 5)
>> 1508158493.906138 notice [TOTEM ] A new membership (10.71.218.18:2396)
>> was formed. Members left: 1
>> 1508158493.906168 notice [SYNC ] calling sync_init on service
>> 'corosync configuration map access' (0) with my_member_list_entries = 1
>> 1508158493.906210 notice [SYNC ] calling sync_process on service
>> 'corosync configuration map access' (0)
>> 1508158493.906240 notice [MAIN ] executing 'corosync configuration map
>> access' exec_handler_fn 0x56492b0cedf0 for node 2 (fn 0)
>> 1508158493.906282 notice [SYNC ] calling sync_activate on service
>> 'corosync configuration map access' (0)
>> 1508158493.906312 notice [SYNC ] calling sync_init on service
>> 'corosync cluster closed process group service v1.01' (1) with
>> my_member_list_entries = 1
>> 1508158493.906354 notice [SYNC ] calling sync_process on service
>> 'corosync cluster closed process group service v1.01' (1)
>> 1508158493.906384 notice [MAIN ] executing 'corosync cluster closed
>> process group service v1.01' exec_handler_fn 0x56492b0d2940 for node 2
>> (fn 5)
>> 1508158493.906427 notice [SYNC ] calling sync_activate on service
>> 'corosync cluster closed process group service v1.01' (1)
>> 1508158493.906457 notice [SYNC ] calling sync_init on service
>> 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
>> 1508158493.906499 notice [VOTEQ ] votequorum_sync_init has 1
>> member_list_entries
>> 1508158493.906528 notice [SYNC ] calling sync_process on service
>> 'corosync vote quorum service v1.0' (2)
>> 1508158493.906570 notice [MAIN ] executing 'corosync vote quorum
>> service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
>> 1508158493.906600 notice [VOTEQ ]
>> message_handler_req_exec_votequorum_nodeinfo (0x56492b0d6bf0) marking
>> node 2 as MEMBER
>> 1508158493.906642 notice [VOTEQ ] get_total_votes: node 1 is a MEMBER
>> so counting vote
>> 1508158493.906672 notice [VOTEQ ] get_total_votes: node 2 is a MEMBER
>> so counting vote
>> 1508158493.906714 notice [MAIN ] executing 'corosync vote quorum
>> service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
>> 1508158493.906743 notice [SYNC ] calling sync_activate on service
>> 'corosync vote quorum service v1.0' (2)
>> 1508158493.906785 notice [VOTEQ ] get_total_votes: node 1 is a MEMBER
>> so counting vote
>> 1508158493.906815 notice [VOTEQ ] get_total_votes: node 2 is a MEMBER
>> so counting vote
>> 1508158493.906856 notice [QUORUM] Members[1]: 2
>> 1508158493.906886 notice [MAIN ] Completed service synchronization,
>> ready to provide service.
>>
>> Now it seems that sync calls votequorum's exec_handler_fn in both
>> working and non-working cases. So now it relies on votequorum_sync_init
>> being called once with both members and once without node 1 in order to
>> set the node as NODESTATE_DEAD.
>
> votequorum's exec_handler_fn is probably executed because of icmap events.
>
> Can you please try to set debug to trace (so config file will look like
>
> logging {
> to_syslog: yes
> to_stderr: yes
> debug: trace
> }
>
> and send full log (so including "notice [MAIN ] Corosync Cluster
> Engine ('2.4.2.45-2fbb'): started and ready to provide service.") from
> both nodes with description which one is which (ip + nodeid) for failing
> case?
Interestingly I didn't manage to reproduce the problem when I use
"debug: trace" on the node that joins and immediately leaves (cluster1).
That suggests to me that it's a timing issue (and is probably the reason
I can reproduce it consistently but you can't).
Therefore the best I can do is give you "debug: trace" logs from
cluster2 but only "debug: off" logs from cluster1 in the case where it
ends up in the wrong state. Is that at all useful?
IP addresses:
10.71.218.18 cluster2
10.71.218.17 cluster1
Nodeid mapping:
nodelist {
node {
nodeid: 1
ring0_addr: cluster1
}
node {
nodeid: 2
ring0_addr: cluster2
}
}
Method:
* I run the following command on cluster2:
corosync -f 2>&1 | ts '%.s' | tee cluster2.log
* Then run the following command on cluster1:
corosync -f 2>&1 | ts '%.s' | tee cluster1.log
* Then I take a look at corosync-quorumtool on cluster2 to check
whether it triggered the bug or not.
* Then I press Ctrl-C on cluster2.
I have attached zipped versions of cluster1.log and cluster2.log from
when it worked and when it broke. I hope you're able to spot something
useful!
Thanks,
Jonathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: notworking_cluster1.log.gz
Type: application/gzip
Size: 1133 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20171016/eaedfde4/attachment-0004.gz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: notworking_cluster2.log.gz
Type: application/gzip
Size: 7864 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20171016/eaedfde4/attachment-0005.gz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: working_cluster1.log.gz
Type: application/gzip
Size: 5103 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20171016/eaedfde4/attachment-0006.gz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: working_cluster2.log.gz
Type: application/gzip
Size: 7391 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20171016/eaedfde4/attachment-0007.gz>
More information about the Users
mailing list