[ClusterLabs] Upgrade corosync problem

Mon Jul 2 02:54:51 EDT 2018

On 29/06/18 17:20, Jan Pokorný wrote:
> On 29/06/18 10:00 +0100, Christine Caulfield wrote:
>> On 27/06/18 08:35, Salvatore D'angelo wrote:
>>> One thing that I do not understand is that I tried to compare corosync
>>> 2.3.5 (the old version that worked fine) and 2.4.4 to understand
>>> differences but I haven’t found anything related to the piece of code
>>> that affects the issue. The quorum tool.c and cfg.c are almost the same.
>>> Probably the issue is somewhere else.
>>>
>>
>> This might be asking a bit much, but would it be possible to try this
>> using Virtual Machines rather than Docker images? That would at least
>> eliminate a lot of complex variables.
> 
> Salvatore, you can ignore the part below, try following the "--shm"
> advice in other part of this thread.  Also the previous suggestion
> to compile corosync with --small-memory-footprint may be of help,
> but comes with other costs (expect lower throughput).
> 
> 
> Chrissie, I have a plausible explanation and if it's true, then the
> same will be reproduced wherever /dev/shm is small enough.
> 
> If I am right, then the offending commit is
> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
> (present since 2.4.3), and while it arranges things for the better
> in the context of prioritized, low jitter process, it all of
> a sudden prevents as-you-need memory acquisition from the system,
> meaning that the memory consumption constraints are checked immediately
> when the memory is claimed (as it must fit into dedicated physical
> memory in full).  Hence this impact we likely never realized may
> be perceived as a sort of a regression.
> 
> Since we can calculate the approximate requirements statically, might
> be worthy to add something like README.requirements, detailing how much
> space will be occupied for typical configurations at minimum, e.g.:
> 
> - standard + --small-memory-footprint configuration
> - 2 + 3 + X nodes (5?)
> - without any service on top + teamed with qnetd + teamed with
>   pacemaker atop (including just IPC channels between pacemaker
>   daemons and corosync's CPG service, indeed)
> 

That is possible explanation I suppose, yes.it's not something we can
sensibly revert because it was already fixing another regression!

I like the idea of documenting the /dev/shm requrements - that would
certainly help with other people using containers - Salvatore mentioned
earlier that there was nothing to guide him about the size needed. I'll
raise an issue in github to cover it. Your input on how to do it for
containers would also be helpful.

Chrissie