[ClusterLabs] corosync 2.4 CPG config change callback

Thomas Lamprecht t.lamprecht at proxmox.com
Mon May 7 07:20:25 EDT 2018


Hi,

On 04/25/2018 at 09:57 AM, Jan Friesse wrote:
> Thomas Lamprecht wrote:
>> On 4/24/18 6:38 PM, Jan Friesse wrote:
>>>> On 4/6/18 10:59 AM, Jan Friesse wrote:
>>>>> Thomas Lamprecht wrote:
>>>>>> On 03/09/2018 at 05:26 PM, Jan Friesse wrote:
>>>>>>> I've tested it too and yes, you are 100% right. The bug is there and
>>>>>>> it's pretty easy to reproduce when the node with the lowest nodeid is
>>>>>>> paused. It's slightly harder when a node with a higher nodeid is paused.
>>>>>>>
>>>>>>
>>>>>> Were you able to make some progress on this issue?
>>>>>
>>>>> Yes, kind of. Sadly I had to work on a different problem, but I
>>>>> expect to send a patch next week.
>>>>>
>>>>
>>>> I guess the different problems were the ones related to the issued
>>>> CVEs :)
>>>
>>> Yep.
>>>
>>> Also, I've spent quite a lot of time thinking about the best possible
>>> solution. CPG is quite old, it was full of weird bugs, and the risk of
>>> breakage is very high.
>>>
>>> Anyway, I've decided not to try to hack what is apparently broken and
>>> to just go for the risky but proper solution (= needs a LOT more
>>> testing, but so far looks good).
>>>
>>
>> I did not look deeply into how your revert plays out with the
>> mentioned commits of the heuristics approach, but would this fix
>> bring corosync back to a state it was already in, and which was
>> thus already battle-tested?
> 
> Yep, but not fully. The important change was to use joinlists as the
> authoritative source of information about other nodes' clients, so I
> believe that solved the problems which should have been "solved" by the
> downlist heuristics.
> 
> 
>>
>> The patch and approach seem good to me, with my limited knowledge,
>> when looking at the various "bandaid" fix commits you mentioned.
>>
>>> Patch is in PR (needle): https://github.com/corosync/corosync/pull/347
>>>
>>
>> Many thanks! First tests work well here.
>> With the patch applied, I could not yet reproduce the problem in either
>> testcpg or our cluster configuration file system.
> 
> That's good to hear :)
> 
>>
>> I'll let it run
> 
> Perfect.
> 


Just wanted to give some quick feedback.
We deployed this to our community repository about a week ago (after
another week of successful testing). So far we have had no negative
feedback, and no issues have been reported or seen, with more than 10k
systems (a conservative lower bound) running the fix by now.

I just saw that you merged it into needle and master, so, while a
bit late, this just backs up the confidence in the fix.

Many thanks for your work, and the reviewers' work!

cheers,
Thomas
