[ClusterLabs] corosync 2.4 CPG config change callback
Thomas Lamprecht
t.lamprecht at proxmox.com
Mon May 7 07:20:25 EDT 2018
Hi,
On 04/25/2018 at 09:57 AM, Jan Friesse wrote:
> Thomas Lamprecht wrote:
>> On 4/24/18 6:38 PM, Jan Friesse wrote:
>>>> On 4/6/18 10:59 AM, Jan Friesse wrote:
>>>>> Thomas Lamprecht wrote:
>>>>>> On 03/09/2018 at 05:26 PM, Jan Friesse wrote:
>>>>>>> I've tested it too and yes, you are 100% right. The bug is there
>>>>>>> and it's pretty easy to reproduce when the node with the lowest
>>>>>>> nodeid is paused. It's slightly harder when a node with a higher
>>>>>>> nodeid is paused.
>>>>>>>
>>>>>>
>>>>>> Were you able to make some progress on this issue?
>>>>>
>>>>> Yes, kind of. Sadly I had to work on a different problem, but I
>>>>> expect to send a patch next week.
>>>>>
>>>>
>>>> I guess the different problems were the ones related to the issued
>>>> CVEs :)
>>>
>>> Yep.
>>>
>>> Also, I've spent quite a lot of time thinking about the best possible
>>> solution. CPG is quite old, it was full of weird bugs, and the risk
>>> of breakage is very high.
>>>
>>> Anyway, I've decided not to try to hack around what is apparently
>>> broken and to just go for the risky but proper solution (= needs a
>>> LOT more testing, but so far it looks good).
>>>
>>
>> I did not look deeply into how your revert interacts with the
>> mentioned commits of the heuristics approach, but this fix would
>> bring corosync back to a state it was already in, one that was
>> thus already battle tested?
>
> Yep, but not fully. The important change was to use joinlists as the
> authoritative source of information about other nodes' clients, so I
> believe that solved the problems which should have been "solved" by
> the downlist heuristics.
>
>
>>
>> The patch and approach seem good to me, with my limited knowledge,
>> when looking at the various "bandaid" fix commits you mentioned.
>>
>>> Patch is in PR (needle): https://github.com/corosync/corosync/pull/347
>>>
>>
>> Many thanks! First tests work well here.
>> With the patch applied, I could not yet reproduce the problem in
>> either testcpg or our cluster configuration file system.
>
> That's good to hear :)
>
>>
>> I'll let it run
>
> Perfect.
>
Just wanted to give some quick feedback.
We deployed this to our community repository about a week ago (after
another week of successful testing), and so far no negative feedback or
issues have been reported or seen, with (as a strong lower bound) > 10k
systems running the fix by now.
I just saw that you merged it into needle and master, so, while a
bit late, this just backs up the confidence in the fix.
Many thanks for your work, and that of the reviewers!
cheers,
Thomas