[ClusterLabs] Upgrade corosync problem

Tue Jun 26 12:08:40 UTC 2018

On 26/06/18 12:16, Salvatore D'angelo wrote:
> I updated libqb to 1.0.3, but the issue is the same.
> 
> I know corosync also depends on nspr and nss3. I updated
> them using apt-get install; here are the installed versions:
> 
>    libnspr4, libnspr4-dev  2:4.13.1-0ubuntu0.14.04.1
>    libnss3, libnss3-dev, libnss3-nssdb   2:3.28.4-0ubuntu0.14.04.3
> 
> but the problem is the same.
> 
> I am working on an Ubuntu 14.04 image and I know that the packages could be
> quite old here. Are there newer versions of these libraries?
> Where can I download them? I tried searching on Google but the results were
> quite confusing.
> 

It's pretty unlikely to be the crypto libraries. It's almost certainly
in libqb, with a small possibility that it's in corosync.  Which versions did
you have that worked (libqb and corosync)?

Chrissie
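
To collect the version information asked for above, something like the
following could report what is installed and which libqb the corosync
binary resolves at run time. This is only a sketch: the helper names are
made up for illustration, the package name passed to pkg_version (for
example libqb0) is an assumption about the Ubuntu packaging, and sort -V
only approximates Debian version ordering (dpkg --compare-versions is
the authoritative check).

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of corosync or libqb.

# Print the installed version of a Debian/Ubuntu package,
# or "not-installed" if dpkg does not know it.
pkg_version() {
    v=$(dpkg-query -W -f='${Version}' "$1" 2>/dev/null) || v=""
    if [ -n "$v" ]; then
        echo "$v"
    else
        echo "not-installed"
    fi
}

# Succeed when version $1 sorts strictly before version $2.
# Relies on "sort -V" (GNU coreutils), which is an approximation
# of the Debian version comparison rules.
version_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Show which libqb shared object the corosync binary would load.
linked_libqb() {
    bin=$(command -v corosync) || {
        echo "corosync not found in PATH" >&2
        return 1
    }
    ldd "$bin" | grep libqb
}
```

For example, pkg_version libqb0 prints the packaged version, and
linked_libqb shows whether the binary picks up a distribution libqb or
one installed from source. Anything installed with make install is
invisible to dpkg, so corosync -v is worth running as well.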

> 
>> On 26 Jun 2018, at 12:27, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>
>> On 26/06/18 11:24, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> I have tried with:
>>> 0.16.0.real-1ubuntu4
>>> 0.16.0.real-1ubuntu5
>>>
>>> Which version should I try?
>>
>>
>> Hmm, both of those are actually quite old! Maybe try a newer one?
>>
>> Chrissie
>>
>>>
>>>> On 26 Jun 2018, at 12:03, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>
>>>> On 26/06/18 11:00, Salvatore D'angelo wrote:
>>>>> Consider that the container is the same one in which corosync 2.3.5 ran.
>>>>> If the problem is related to the container, then 2.4.4 probably
>>>>> introduced a feature that affects containers.
>>>>> According to the code, it should be something related to libqb.
>>>>> Can anyone help?
>>>>>
>>>>
>>>>
>>>> Have you tried downgrading libqb to the previous version to see if it
>>>> still happens?
>>>>
>>>> Chrissie
>>>>
>>>>>> On 26 Jun 2018, at 11:56, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>>
>>>>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>>>>>> Sorry, after running the command:
>>>>>>>
>>>>>>> corosync-quorumtool -ps
>>>>>>>
>>>>>>> the errors in the log are still visible. Looking at the source code,
>>>>>>> the problem seems to be in this code:
>>>>>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>>>>>
>>>>>>>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>>>>>>         fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>>>>>         q_handle = 0;
>>>>>>>         goto out;
>>>>>>>     }
>>>>>>>
>>>>>>>     if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>>>>>>         fprintf(stderr, "Cannot initialise CFG service\n");
>>>>>>>         c_handle = 0;
>>>>>>>         goto out;
>>>>>>>     }
>>>>>>>
>>>>>>> The quorum_initialize function is defined here:
>>>>>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>>>>>
>>>>>>> It seems to interact with libqb to allocate space on /dev/shm, but
>>>>>>> something fails. I tried to update libqb with apt-get install, but
>>>>>>> without success.
>>>>>>>
>>>>>>> The same applies to the second function:
>>>>>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>>>>>
>>>>>>> I am not a libqb expert. I have
>>>>>>> version 0.16.0.real-1ubuntu5.
>>>>>>>
>>>>>>> The folder /dev/shm has 777 permissions, like the other nodes with older
>>>>>>> corosync and pacemaker that work fine. The only difference is that I
>>>>>>> only see files created by root and none created by hacluster, unlike
>>>>>>> the other two nodes (probably because pacemaker didn’t start correctly).
>>>>>>>
>>>>>>> This is the analysis I have done so far.
>>>>>>> Any suggestion?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Hmm. It seems very likely to be something to do with the way the container is
>>>>>> set up then - and I know nothing about containers. Sorry :/
>>>>>>
>>>>>> Can anyone else help here?
>>>>>>
>>>>>> Chrissie
>>>>>>
>>>>>>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo
>>>>>>>> <sasadangelo at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Yes, sorry, you’re right, I could have found it by myself.
>>>>>>>> However, I did the following:
>>>>>>>>
>>>>>>>> 1. Added the line you suggested to /etc/fstab
>>>>>>>> 2. mount -o remount /dev/shm
>>>>>>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>>>> overlay          63G   11G   49G  19% /
>>>>>>>> tmpfs            64M  4.0K   64M   1% /dev
>>>>>>>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>>>>>>>> osxfs           466G  158G  305G  35% /Users
>>>>>>>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>>>>>>>> *shm             512M   15M  498M   3% /dev/shm*
>>>>>>>> tmpfs          1000M     0 1000M   0% /sys/firmware
>>>>>>>> tmpfs           128M     0  128M   0% /tmp
>>>>>>>>
>>>>>>>> The errors in the log went away. Note that I remove the log file
>>>>>>>> before starting corosync, so it does not contain lines from previous
>>>>>>>> executions.
>>>>>>>> <corosync.log>
>>>>>>>>
>>>>>>>> But the command:
>>>>>>>> corosync-quorumtool -ps
>>>>>>>>
>>>>>>>> still gives:
>>>>>>>> Cannot initialize QUORUM service
>>>>>>>>
>>>>>>>> Note that a few minutes earlier it gave me the message:
>>>>>>>> Cannot initialize CFG service
>>>>>>>>
>>>>>>>> I do not know the differences between CFG and QUORUM in this case.
>>>>>>>>
>>>>>>>> If I try to start pacemaker the service is OK, but I see only
>>>>>>>> pacemaker, and the transport does not work if I try to run a
>>>>>>>> crm command.
>>>>>>>> Any suggestion?
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 26 Jun 2018, at 10:49, Christine Caulfield
>>>>>>>>> <ccaulfie at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Yes,
>>>>>>>>>>
>>>>>>>>>> I am reproducing only the part required for the test. I think the
>>>>>>>>>> original system has a larger shm. The problem is that I do not know
>>>>>>>>>> exactly how to change it.
>>>>>>>>>> I tried the following steps, but I have the impression I didn’t
>>>>>>>>>> perform the right ones:
>>>>>>>>>>
>>>>>>>>>> 1. remove everything under /tmp
>>>>>>>>>> 2. Added the following line to /etc/fstab
>>>>>>>>>> tmpfs   /tmp   tmpfs   defaults,nodev,nosuid,mode=1777,size=128M   0  0
>>>>>>>>>> 3. mount /tmp
>>>>>>>>>> 4. df -h
>>>>>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>>>>>> overlay          63G   11G   49G  19% /
>>>>>>>>>> tmpfs            64M  4.0K   64M   1% /dev
>>>>>>>>>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>>>>>>>>>> osxfs           466G  158G  305G  35% /Users
>>>>>>>>>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>>>>>>>>>> shm              64M   11M   54M  16% /dev/shm
>>>>>>>>>> tmpfs          1000M     0 1000M   0% /sys/firmware
>>>>>>>>>> *tmpfs           128M     0  128M   0% /tmp*
>>>>>>>>>>
>>>>>>>>>> The errors are exactly the same.
>>>>>>>>>> I have the impression that I changed the wrong parameter.
>>>>>>>>>> I probably have to change:
>>>>>>>>>> shm              64M   11M   54M  16% /dev/shm
>>>>>>>>>>
>>>>>>>>>> but I do not know how to do that. Any suggestion?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> According to Google, you just add a new line to /etc/fstab for
>>>>>>>>> /dev/shm:
>>>>>>>>>
>>>>>>>>> tmpfs      /dev/shm      tmpfs   defaults,size=512m   0   0
>>>>>>>>>
>>>>>>>>> Chrissie
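
The fstab entry and the remount step can be sketched as below. The
helper names are invented for illustration. The remount needs root and,
inside a Docker container, may need extra privileges. If the container
is started with docker run, the --shm-size option (for example
--shm-size=512m) sets the size when the container is created instead.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print an /etc/fstab entry that mounts a tmpfs of the given size.
#   $1 = mount point (e.g. /dev/shm), $2 = size (e.g. 512m)
tmpfs_fstab_line() {
    printf 'tmpfs\t%s\ttmpfs\tdefaults,size=%s\t0\t0\n' "$1" "$2"
}

# Resize an already mounted tmpfs in place (requires root).
#   $1 = mount point, $2 = new size
resize_tmpfs() {
    mount -o "remount,size=$2" "$1"
}
```

Appending the output of tmpfs_fstab_line /dev/shm 512m to /etc/fstab
and then calling resize_tmpfs /dev/shm 512m mirrors the steps discussed
here; df -h /dev/shm afterwards shows whether the new size took effect.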
>>>>>>>>>
>>>>>>>>>>> On 26 Jun 2018, at 09:48, Christine Caulfield
>>>>>>>>>>> <ccaulfie at redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 25/06/18 20:41, Salvatore D'angelo wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Let me add one important detail here. I use Docker for my tests,
>>>>>>>>>>>> with 5 containers deployed on my Mac.
>>>>>>>>>>>> The team that worked on this project installed the cluster
>>>>>>>>>>>> on SoftLayer bare metal.
>>>>>>>>>>>> The PostgreSQL cluster was hard to test, and if a misconfiguration
>>>>>>>>>>>> occurred, recreating the cluster from scratch was not easy.
>>>>>>>>>>>> Testing it was cumbersome, considering that we access the
>>>>>>>>>>>> machines through a complex system that is hard to describe here.
>>>>>>>>>>>> For this reason I ported the cluster to Docker for test purposes.
>>>>>>>>>>>> I am not interested in having it work for months; I just need a
>>>>>>>>>>>> proof of concept.
>>>>>>>>>>>>
>>>>>>>>>>>> When the migration works I’ll port everything to bare metal,
>>>>>>>>>>>> where resources are abundant.
>>>>>>>>>>>>
>>>>>>>>>>>> I have enough RAM and disk space on my Mac, so if you tell me
>>>>>>>>>>>> what an acceptable size for several days of running would be, that
>>>>>>>>>>>> is OK for me.
>>>>>>>>>>>> It is also OK to have commands to clean the shm when required.
>>>>>>>>>>>> I know I can find them on Google, but if you can give me this
>>>>>>>>>>>> information I’d appreciate it. I have the OS knowledge to do that, but
>>>>>>>>>>>> I would like to avoid days of guesswork and trial and error if possible.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I would recommend at least 128MB of space on /dev/shm, 256MB if
>>>>>>>>>>> you can
>>>>>>>>>>> spare it. My 'standard' system uses 75MB under normal running
>>>>>>>>>>> allowing
>>>>>>>>>>> for one command-line query to run.
>>>>>>>>>>>
>>>>>>>>>>> If I read this right, you're now reproducing a bare-metal
>>>>>>>>>>> system in containers? So the original systems will have a default
>>>>>>>>>>> /dev/shm size which is probably much larger than the one in your
>>>>>>>>>>> containers?
>>>>>>>>>>>
>>>>>>>>>>> I'm just checking here that we don't have a regression in memory
>>>>>>>>>>> usage
>>>>>>>>>>> as Poki suggested.
>>>>>>>>>>>
>>>>>>>>>>> Chrissie
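
A quick way to compare a mount against the 128MB figure suggested above
could look like this. It is a sketch with invented helper names; it
parses the POSIX output format of df (-P) in 1K blocks (-k) and assumes
the mount point name contains no spaces.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Total size, in whole MB, of the filesystem holding path $1.
fs_size_mb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $2 / 1024 }'
}

# Space still available, in whole MB, on the filesystem holding path $1.
fs_avail_mb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 }'
}

# Succeed when the filesystem holding $1 is at least $2 MB in total
# (default threshold: 128 MB).
fs_at_least_mb() {
    [ "$(fs_size_mb "$1")" -ge "${2:-128}" ]
}
```

Running fs_at_least_mb /dev/shm || echo "/dev/shm is below 128MB"
before starting corosync flags an undersized mount early, and
fs_avail_mb /dev/shm shows how much headroom is left while the cluster
is running.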
>>>>>>>>>>>
>>>>>>>>>>>>> On 25 Jun 2018, at 21:18, Jan Pokorný <jpokorny at redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>>>>>>>>>>>>>> Thanks for the reply. I scrapped my cluster, created it again and
>>>>>>>>>>>>>> then migrated as before. This time I uninstalled pacemaker,
>>>>>>>>>>>>>> corosync, crmsh and resource agents with make uninstall,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> then I installed the new packages. The problem is the same: when
>>>>>>>>>>>>>> I launch:
>>>>>>>>>>>>>> corosync-quorumtool -ps
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I get: Cannot initialize QUORUM service
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is the log with debug enabled:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap on /dev/shm/qb-cfg-event-18020-18028-23-data
>>>>>>>>>>>>>> [18019] pg3 corosyncerror   [QB    ] qb_rb_open:cfg-event-18020-18028-23: Resource temporarily unavailable (11)
>>>>>>>>>>>>>> [18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-18020-18028-23-header
>>>>>>>>>>>>>> [18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-18020-18028-23-header
>>>>>>>>>>>>>> [18019] pg3 corosyncerror   [QB    ] shm connection FAILED: Resource temporarily unavailable (11)
>>>>>>>>>>>>>> [18019] pg3 corosyncerror   [QB    ] Error in connection setup (18020-18028-23): Resource temporarily unavailable (11)
>>>>>>>>>>>>>>
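
The paths in the log above are files named qb-* under /dev/shm. A rough
way to see how many of them exist and how much space each takes, and to
clear leftovers once the cluster stack is stopped, might be the
following. The helper names are invented, find -maxdepth is a GNU/BSD
extension, and removing these files while corosync or pacemaker is
running is not safe, so the cleanup helper refuses when it can see a
corosync process (it does not check for pacemaker).

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Count qb-* files in a directory (default: /dev/shm).
qb_file_count() {
    find "${1:-/dev/shm}" -maxdepth 1 -type f -name 'qb-*' 2>/dev/null |
        wc -l | tr -d ' '
}

# List qb-* files with their disk usage in KB, largest last.
qb_usage() {
    find "${1:-/dev/shm}" -maxdepth 1 -type f -name 'qb-*' \
        -exec du -k {} + 2>/dev/null | sort -n
}

# Remove qb-* files from a directory, but only when no corosync
# process is visible. Stop pacemaker first as well.
qb_clean() {
    if pgrep -x corosync >/dev/null 2>&1; then
        echo "corosync is running; not removing anything" >&2
        return 1
    fi
    find "${1:-/dev/shm}" -maxdepth 1 -type f -name 'qb-*' -exec rm -f {} +
}
```

qb_usage gives a per-file view to set against the df -h /dev/shm
totals, which helps to tell a mount that is simply too small from one
that is filled with leftovers of earlier runs.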
>>>>>>>>>>>>>> I tried to check /dev/shm and I am not sure these are the
>>>>>>>>>>>>>> right
>>>>>>>>>>>>>> commands, however:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> df -h /dev/shm
>>>>>>>>>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>>>>>>>>>> shm              64M   16M   49M  24% /dev/shm
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ls /dev/shm
>>>>>>>>>>>>>> qb-cmap-request-18020-18036-25-data
>>>>>>>>>>>>>> qb-cmap-request-18020-18036-25-header
>>>>>>>>>>>>>> qb-corosync-blackbox-data
>>>>>>>>>>>>>> qb-corosync-blackbox-header
>>>>>>>>>>>>>> qb-quorum-request-18020-18095-32-data
>>>>>>>>>>>>>> qb-quorum-request-18020-18095-32-header
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is 64 MB enough for /dev/shm? If not, why did it work with the
>>>>>>>>>>>>>> previous corosync release?
>>>>>>>>>>>>>
>>>>>>>>>>>>> For a start, can you try configuring corosync with
>>>>>>>>>>>>> --enable-small-memory-footprint switch?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hard to say why the space provisioned to /dev/shm is the direct
>>>>>>>>>>>>> opposite of generous (per today's standards), but it may be the
>>>>>>>>>>>>> result of automatic HW adaptation, and if RAM is so scarce in your
>>>>>>>>>>>>> case, the above build-time toggle might help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If not, then exponentially increasing the size of the /dev/shm space
>>>>>>>>>>>>> is likely your best bet (I don't recommend fiddling with mlockall()
>>>>>>>>>>>>> and similar measures in corosync).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Of course, feel free to raise a regression if you have a
>>>>>>>>>>>>> reproducible
>>>>>>>>>>>>> comparison between two corosync (plus possibly different
>>>>>>>>>>>>> libraries
>>>>>>>>>>>>> like libqb) versions, one that works and one that won't, in
>>>>>>>>>>>>> reproducible conditions (like this small /dev/shm, VM image,
>>>>>>>>>>>>> etc.).
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Jan (Poki)