[ClusterLabs] Upgrade corosync problem
Salvatore D'angelo
sasadangelo at gmail.com
Tue Jun 26 08:12:12 EDT 2018
The versions that worked: corosync 2.3.5 and libqb 0.16.0.
> On 26 Jun 2018, at 14:08, Christine Caulfield <ccaulfie at redhat.com> wrote:
>
> On 26/06/18 12:16, Salvatore D'angelo wrote:
>> I updated libqb to 1.0.3, but the issue is the same.
>>
>> I know corosync also has these dependencies: nspr and nss3. I updated
>> them using apt-get install; here are the versions installed:
>>
>> libnspr4, libnspr4-dev 2:4.13.1-0ubuntu0.14.04.1
>> libnss3, libnss3-dev, libnss3-nssdb 2:3.28.4-0ubuntu0.14.04.3
>>
>> but same problem.
>>
>> I am working on an Ubuntu 14.04 image and I know that packages could be
>> quite old here. Are there newer versions of these libraries?
>> Where can I download them? I tried searching on Google, but the results
>> were quite confusing.
>>
>
> It's pretty unlikely to be the crypto libraries. It's almost certainly
> in libqb, with a small possibility of corosync. Which versions did
> you have that worked (libqb and corosync)?
>
> Chrissie
>
>
>>
>>> On 26 Jun 2018, at 12:27, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>
>>> On 26/06/18 11:24, Salvatore D'angelo wrote:
>>>> Hi,
>>>>
>>>> I have tried with:
>>>> 0.16.0.real-1ubuntu4
>>>> 0.16.0.real-1ubuntu5
>>>>
>>>> Which version should I try?
>>>
>>>
>>> Hmm, both of those are actually quite old! Maybe try a newer one?
>>>
>>> Chrissie
>>>
>>>>
>>>>> On 26 Jun 2018, at 12:03, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>
>>>>> On 26/06/18 11:00, Salvatore D'angelo wrote:
>>>>>> Consider that the container is the same one where corosync 2.3.5 ran
>>>>>> fine. If it is something related to the container, then probably 2.4.4
>>>>>> introduced a feature that has an impact on containers.
>>>>>> According to the code, it should be something related to libqb.
>>>>>> Can anyone help?
>>>>>>
>>>>>
>>>>>
>>>>> Have you tried downgrading libqb to the previous version to see if it
>>>>> still happens?
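>>>>>
>>>>> For example, pinning an older build with apt (assuming Ubuntu's libqb0
>>>>> binary package name; substitute whatever version your mirror carries):
>>>>>
>>>>> apt-get install libqb0=0.16.0.real-1ubuntu4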
>>>>>
>>>>> Chrissie
>>>>>
>>>>>>> On 26 Jun 2018, at 11:56, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>>>
>>>>>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>>>>>>> Sorry, after the command:
>>>>>>>>
>>>>>>>> corosync-quorumtool -ps
>>>>>>>>
>>>>>>>> the errors in the log are still visible. Looking at the source code,
>>>>>>>> it seems the problem is at these lines:
>>>>>>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>>>>>>
>>>>>>>> if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>>>>>>>         fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>>>>>>         q_handle = 0;
>>>>>>>>         goto out;
>>>>>>>> }
>>>>>>>>
>>>>>>>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>>>>>>>         fprintf(stderr, "Cannot initialise CFG service\n");
>>>>>>>>         c_handle = 0;
>>>>>>>>         goto out;
>>>>>>>> }
>>>>>>>>
>>>>>>>> The quorum_initialize function is defined here:
>>>>>>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>>>>>>
>>>>>>>> It seems it interacts with libqb to allocate space on /dev/shm, but
>>>>>>>> something fails. I tried to update libqb with apt-get install, but
>>>>>>>> with no success.
>>>>>>>>
>>>>>>>> The same goes for the second function:
>>>>>>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
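>>>>>>>>
>>>>>>>> For what it's worth, the same failure can be reproduced outside the
>>>>>>>> tool with a minimal probe of those two init calls. A sketch, assuming
>>>>>>>> the corosync 2.x client headers and libraries (build with something
>>>>>>>> like: gcc probe.c -o probe -lquorum -lcfg):
>>>>>>>>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <stdint.h>
>>>>>>>> #include <corosync/quorum.h>
>>>>>>>> #include <corosync/cfg.h>
>>>>>>>>
>>>>>>>> int main(void)
>>>>>>>> {
>>>>>>>>         quorum_handle_t q_handle;
>>>>>>>>         corosync_cfg_handle_t c_handle;
>>>>>>>>         quorum_callbacks_t q_callbacks = { 0 };
>>>>>>>>         corosync_cfg_callbacks_t c_callbacks = { 0 };
>>>>>>>>         uint32_t q_type;
>>>>>>>>
>>>>>>>>         /* Each *_initialize() opens a libqb IPC connection to the
>>>>>>>>          * daemon, which maps ringbuffer files under /dev/shm, so
>>>>>>>>          * with too little shm space these calls fail the same way
>>>>>>>>          * the ones in corosync-quorumtool do. */
>>>>>>>>         if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>>>>>>>                 fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>>>>>>                 return 1;
>>>>>>>>         }
>>>>>>>>         if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>>>>>>>                 fprintf(stderr, "Cannot initialise CFG service\n");
>>>>>>>>                 quorum_finalize(q_handle);
>>>>>>>>                 return 1;
>>>>>>>>         }
>>>>>>>>         corosync_cfg_finalize(c_handle);
>>>>>>>>         quorum_finalize(q_handle);
>>>>>>>>         return 0;
>>>>>>>> }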
>>>>>>>>
>>>>>>>> Now, I am not an expert on libqb. I have
>>>>>>>> version 0.16.0.real-1ubuntu5.
>>>>>>>>
>>>>>>>> The folder /dev/shm has 777 permissions, like the other nodes with
>>>>>>>> older corosync and pacemaker that work fine. The only difference is
>>>>>>>> that I only see files created by root, none created by hacluster as
>>>>>>>> on the other two nodes (probably because pacemaker didn't start
>>>>>>>> correctly).
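>>>>>>>>
>>>>>>>> (To compare ownership across nodes, ls -l /dev/shm lists the owner of
>>>>>>>> each ringbuffer file.)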
>>>>>>>>
>>>>>>>> This is the analysis I have done so far.
>>>>>>>> Any suggestion?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Hmm. It seems very likely to be something to do with the way the
>>>>>>> container is set up then - and I know nothing about containers. Sorry :/
>>>>>>>
>>>>>>> Can anyone else help here?
>>>>>>>
>>>>>>> Chrissie
>>>>>>>
>>>>>>>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo <sasadangelo at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Yes, sorry, you're right, I could have found it by myself.
>>>>>>>>> However, I did the following:
>>>>>>>>>
>>>>>>>>> 1. Added the line you suggested to /etc/fstab
>>>>>>>>> 2. mount -o remount /dev/shm
>>>>>>>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>>>>>>>> Filesystem Size Used Avail Use% Mounted on
>>>>>>>>> overlay 63G 11G 49G 19% /
>>>>>>>>> tmpfs 64M 4.0K 64M 1% /dev
>>>>>>>>> tmpfs 1000M 0 1000M 0% /sys/fs/cgroup
>>>>>>>>> osxfs 466G 158G 305G 35% /Users
>>>>>>>>> /dev/sda1 63G 11G 49G 19% /etc/hosts
>>>>>>>>> *shm 512M 15M 498M 3% /dev/shm*
>>>>>>>>> tmpfs 1000M 0 1000M 0% /sys/firmware
>>>>>>>>> tmpfs 128M 0 128M 0% /tmp
>>>>>>>>>
>>>>>>>>> The errors in the log went away. Consider that I removed the log file
>>>>>>>>> before starting corosync, so it does not contain lines from previous
>>>>>>>>> executions.
>>>>>>>>> <corosync.log>
>>>>>>>>>
>>>>>>>>> But the command:
>>>>>>>>> corosync-quorumtool -ps
>>>>>>>>>
>>>>>>>>> still give:
>>>>>>>>> Cannot initialize QUORUM service
>>>>>>>>>
>>>>>>>>> Consider that a few minutes before it gave me the message:
>>>>>>>>> Cannot initialize CFG service
>>>>>>>>>
>>>>>>>>> I do not know the differences between CFG and QUORUM in this case.
>>>>>>>>>
>>>>>>>>> If I try to start pacemaker, the service is OK, but I see only
>>>>>>>>> pacemaker, and the transport does not work if I try to run a crm
>>>>>>>>> command.
>>>>>>>>> Any suggestion?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 26 Jun 2018, at 10:49, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Yes,
>>>>>>>>>>>
>>>>>>>>>>> I am reproducing only the required part for testing. I think the
>>>>>>>>>>> original system has a larger shm. The problem is that I do not know
>>>>>>>>>>> exactly how to change it.
>>>>>>>>>>> I tried the following steps, but I have the impression I didn't
>>>>>>>>>>> perform the right ones:
>>>>>>>>>>>
>>>>>>>>>>> 1. remove everything under /tmp
>>>>>>>>>>> 2. Added the following line to /etc/fstab
>>>>>>>>>>> tmpfs /tmp tmpfs defaults,nodev,nosuid,mode=1777,size=128M 0 0
>>>>>>>>>>> 3. mount /tmp
>>>>>>>>>>> 4. df -h
>>>>>>>>>>> Filesystem Size Used Avail Use% Mounted on
>>>>>>>>>>> overlay 63G 11G 49G 19% /
>>>>>>>>>>> tmpfs 64M 4.0K 64M 1% /dev
>>>>>>>>>>> tmpfs 1000M 0 1000M 0% /sys/fs/cgroup
>>>>>>>>>>> osxfs 466G 158G 305G 35% /Users
>>>>>>>>>>> /dev/sda1 63G 11G 49G 19% /etc/hosts
>>>>>>>>>>> shm 64M 11M 54M 16% /dev/shm
>>>>>>>>>>> tmpfs 1000M 0 1000M 0% /sys/firmware
>>>>>>>>>>> *tmpfs 128M 0 128M 0% /tmp*
>>>>>>>>>>>
>>>>>>>>>>> The errors are exactly the same.
>>>>>>>>>>> I have the impression that I changed the wrong parameter.
>>>>>>>>>>> Probably I
>>>>>>>>>>> have to change:
>>>>>>>>>>> shm 64M 11M 54M 16% /dev/shm
>>>>>>>>>>>
>>>>>>>>>>> but I do not know how to do that. Any suggestion?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> According to Google, you just add a new line to /etc/fstab for
>>>>>>>>>> /dev/shm:
>>>>>>>>>>
>>>>>>>>>> tmpfs /dev/shm tmpfs defaults,size=512m 0 0
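>>>>>>>>>>
>>>>>>>>>> then remount it so the new size takes effect without a reboot:
>>>>>>>>>>
>>>>>>>>>> mount -o remount /dev/shm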
>>>>>>>>>>
>>>>>>>>>> Chrissie
>>>>>>>>>>
>>>>>>>>>>>> On 26 Jun 2018, at 09:48, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 25/06/18 20:41, Salvatore D'angelo wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me add one important detail here. I use Docker for my tests,
>>>>>>>>>>>>> with 5 containers deployed on my Mac.
>>>>>>>>>>>>> Basically, the team that worked on this project installed the
>>>>>>>>>>>>> cluster on SoftLayer bare metal.
>>>>>>>>>>>>> The PostgreSQL cluster was hard to test, and if a misconfiguration
>>>>>>>>>>>>> occurred, recreating the cluster from scratch was not easy.
>>>>>>>>>>>>> Testing it was cumbersome if you consider that we access the
>>>>>>>>>>>>> machines through a complex system that is hard to describe here.
>>>>>>>>>>>>> For this reason I ported the cluster to Docker for test purposes.
>>>>>>>>>>>>> I am not interested in having it work for months, I just need a
>>>>>>>>>>>>> proof of concept.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When the migration works I'll port everything to bare metal,
>>>>>>>>>>>>> where resources are abundant.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I have enough RAM and disk space on my Mac, so if you tell me
>>>>>>>>>>>>> what an acceptable size would be for several days of running,
>>>>>>>>>>>>> that is OK for me.
>>>>>>>>>>>>> It is also OK to have commands to clean the shm when required.
>>>>>>>>>>>>> I know I can find them on Google, but if you can suggest this
>>>>>>>>>>>>> info I'll appreciate it. I have the OS knowledge to do that, but
>>>>>>>>>>>>> I would like to avoid days of guesswork and trial and error if
>>>>>>>>>>>>> possible.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I would recommend at least 128MB of space on /dev/shm, 256MB if
>>>>>>>>>>>> you can
>>>>>>>>>>>> spare it. My 'standard' system uses 75MB under normal running
>>>>>>>>>>>> allowing
>>>>>>>>>>>> for one command-line query to run.
>>>>>>>>>>>>
>>>>>>>>>>>> If I read this right, then you're reproducing a bare-metal system
>>>>>>>>>>>> in containers now? So the original systems will have a default
>>>>>>>>>>>> /dev/shm size which is probably much larger than your containers?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm just checking here that we don't have a regression in memory
>>>>>>>>>>>> usage
>>>>>>>>>>>> as Poki suggested.
>>>>>>>>>>>>
>>>>>>>>>>>> Chrissie
>>>>>>>>>>>>
>>>>>>>>>>>>>> On 25 Jun 2018, at 21:18, Jan Pokorný <jpokorny at redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>>>>>>>>>>>>>>> Thanks for the reply. I scratched my cluster, created it again,
>>>>>>>>>>>>>>> and then migrated as before. This time I uninstalled pacemaker,
>>>>>>>>>>>>>>> corosync, crmsh, and the resource agents with make uninstall;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> then I installed the new packages. The problem is the same when
>>>>>>>>>>>>>>> I launch:
>>>>>>>>>>>>>>> corosync-quorumtool -ps
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I got: Cannot initialize QUORUM service
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is the log with debug enabled:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [18019] pg3 corosyncerror [QB ] couldn't create circular mmap on /dev/shm/qb-cfg-event-18020-18028-23-data
>>>>>>>>>>>>>>> [18019] pg3 corosyncerror [QB ] qb_rb_open:cfg-event-18020-18028-23: Resource temporarily unavailable (11)
>>>>>>>>>>>>>>> [18019] pg3 corosyncdebug [QB ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-18020-18028-23-header
>>>>>>>>>>>>>>> [18019] pg3 corosyncdebug [QB ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-18020-18028-23-header
>>>>>>>>>>>>>>> [18019] pg3 corosyncerror [QB ] shm connection FAILED: Resource temporarily unavailable (11)
>>>>>>>>>>>>>>> [18019] pg3 corosyncerror [QB ] Error in connection setup (18020-18028-23): Resource temporarily unavailable (11)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried to check /dev/shm, though I am not sure these are the
>>>>>>>>>>>>>>> right commands:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> df -h /dev/shm
>>>>>>>>>>>>>>> Filesystem Size Used Avail Use% Mounted on
>>>>>>>>>>>>>>> shm 64M 16M 49M 24% /dev/shm
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ls /dev/shm
>>>>>>>>>>>>>>> qb-cmap-request-18020-18036-25-data
>>>>>>>>>>>>>>> qb-corosync-blackbox-data
>>>>>>>>>>>>>>> qb-quorum-request-18020-18095-32-data
>>>>>>>>>>>>>>> qb-cmap-request-18020-18036-25-header
>>>>>>>>>>>>>>> qb-corosync-blackbox-header
>>>>>>>>>>>>>>> qb-quorum-request-18020-18095-32-header
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is 64 MB enough for /dev/shm? If not, why did it work with the
>>>>>>>>>>>>>>> previous corosync release?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For a start, can you try configuring corosync with the
>>>>>>>>>>>>>> --enable-small-memory-footprint switch?
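>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (That's a build-time option, so it means rebuilding from your
>>>>>>>>>>>>>> source tree; roughly, assuming the usual autotools flow you
>>>>>>>>>>>>>> already used for the upgrade:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ./configure --enable-small-memory-footprint
>>>>>>>>>>>>>> make && make install
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It makes corosync use much smaller IPC/ringbuffer allocations,
>>>>>>>>>>>>>> so it needs far less /dev/shm.)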
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hard to say why the space provisioned to /dev/shm is the direct
>>>>>>>>>>>>>> opposite of generous (per today's standards), but may be the
>>>>>>>>>>>>>> result
>>>>>>>>>>>>>> of automatic HW adaptation, and if RAM is so scarce in your
>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>> the above build-time toggle might help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If not, then exponentially increasing the size of the /dev/shm
>>>>>>>>>>>>>> space is likely your best bet (I don't recommend fiddling with
>>>>>>>>>>>>>> mlockall() and similar measures in corosync).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Of course, feel free to raise a regression if you have a
>>>>>>>>>>>>>> reproducible
>>>>>>>>>>>>>> comparison between two corosync (plus possibly different
>>>>>>>>>>>>>> libraries
>>>>>>>>>>>>>> like libqb) versions, one that works and one that won't, in
>>>>>>>>>>>>>> reproducible conditions (like this small /dev/shm, VM image,
>>>>>>>>>>>>>> etc.).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jan (Poki)
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org