[ClusterLabs] Upgrade corosync problem
Salvatore D'angelo
sasadangelo at gmail.com
Tue Jun 26 08:12:12 EDT 2018
The versions that worked: corosync 2.3.5 and libqb 0.16.0.
> On 26 Jun 2018, at 14:08, Christine Caulfield <ccaulfie at redhat.com> wrote:
>
> On 26/06/18 12:16, Salvatore D'angelo wrote:
>> I updated libqb to 1.0.3, but the issue is the same.
>>
>> I know corosync also has these dependencies: nspr and nss3. I updated
>> them using apt-get install; here are the versions installed:
>>
>> libnspr4, libnspr4-dev 2:4.13.1-0ubuntu0.14.04.1
>> libnss3, libnss3-dev, libnss3-nssdb 2:3.28.4-0ubuntu0.14.04.3
>>
>> but same problem.
>>
>> I am working on an Ubuntu 14.04 image and I know that packages could be
>> quite old here. Are there newer versions of these libraries?
>> Where can I download them? I tried searching on Google, but the results
>> were quite confusing.
>>
>
> It's pretty unlikely to be the crypto libraries. It's almost certainly
> in libqb, with a small possibility of corosync. Which versions did
> you have that worked (libqb and corosync)?
>
> Chrissie
>
>
>>
>>> On 26 Jun 2018, at 12:27, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>
>>> On 26/06/18 11:24, Salvatore D'angelo wrote:
>>>> Hi,
>>>>
>>>> I have tried with:
>>>> 0.16.0.real-1ubuntu4
>>>> 0.16.0.real-1ubuntu5
>>>>
>>>> Which version should I try?
>>>
>>>
>>> Hmm, both of those are actually quite old! Maybe try a newer one?
>>>
>>> Chrissie
>>>
>>>>
>>>>> On 26 Jun 2018, at 12:03, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>
>>>>> On 26/06/18 11:00, Salvatore D'angelo wrote:
>>>>>> Consider that the container is the same one where corosync 2.3.5 ran
>>>>>> fine. If it is something related to the container, then probably 2.4.4
>>>>>> introduced a feature that has an impact on containers.
>>>>>> According to the code, it should be something related to libqb.
>>>>>> Can anyone help?
>>>>>>
>>>>>
>>>>>
>>>>> Have you tried downgrading libqb to the previous version to see if it
>>>>> still happens?
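>>>>>
>>>>> For example, pinning an older build with apt (assuming Ubuntu's libqb0
>>>>> binary package name; substitute whatever version your mirror carries):
>>>>>
>>>>> apt-get install libqb0=0.16.0.real-1ubuntu4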
>>>>>
>>>>> Chrissie
>>>>>
>>>>>>> On 26 Jun 2018, at 11:56, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>>>
>>>>>>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>>>>>>> Sorry, after the command:
>>>>>>>>
>>>>>>>> corosync-quorumtool -ps
>>>>>>>>
>>>>>>>> the errors in the log are still visible. Looking at the source code,
>>>>>>>> it seems the problem is at these lines:
>>>>>>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>>>>>>
>>>>>>>> if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>>>>>>>         fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>>>>>>         q_handle = 0;
>>>>>>>>         goto out;
>>>>>>>> }
>>>>>>>>
>>>>>>>> if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>>>>>>>         fprintf(stderr, "Cannot initialise CFG service\n");
>>>>>>>>         c_handle = 0;
>>>>>>>>         goto out;
>>>>>>>> }
>>>>>>>>
>>>>>>>> The quorum_initialize function is defined here:
>>>>>>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>>>>>>
>>>>>>>> It seems it interacts with libqb to allocate space on /dev/shm, but
>>>>>>>> something fails. I tried to update libqb with apt-get install, but
>>>>>>>> with no success.
>>>>>>>>
>>>>>>>> The same goes for the second function:
>>>>>>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
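>>>>>>>>
>>>>>>>> For what it's worth, the same failure can be reproduced outside the
>>>>>>>> tool with a minimal probe of those two init calls. A sketch, assuming
>>>>>>>> the corosync 2.x client headers and libraries (build with something
>>>>>>>> like: gcc probe.c -o probe -lquorum -lcfg):
>>>>>>>>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <stdint.h>
>>>>>>>> #include <corosync/quorum.h>
>>>>>>>> #include <corosync/cfg.h>
>>>>>>>>
>>>>>>>> int main(void)
>>>>>>>> {
>>>>>>>>         quorum_handle_t q_handle;
>>>>>>>>         corosync_cfg_handle_t c_handle;
>>>>>>>>         quorum_callbacks_t q_callbacks = { 0 };
>>>>>>>>         corosync_cfg_callbacks_t c_callbacks = { 0 };
>>>>>>>>         uint32_t q_type;
>>>>>>>>
>>>>>>>>         /* Each *_initialize() opens a libqb IPC connection to the
>>>>>>>>          * daemon, which maps ringbuffer files under /dev/shm, so
>>>>>>>>          * with too little shm space these calls fail the same way
>>>>>>>>          * the ones in corosync-quorumtool do. */
>>>>>>>>         if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>>>>>>>                 fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>>>>>>                 return 1;
>>>>>>>>         }
>>>>>>>>         if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>>>>>>>                 fprintf(stderr, "Cannot initialise CFG service\n");
>>>>>>>>                 quorum_finalize(q_handle);
>>>>>>>>                 return 1;
>>>>>>>>         }
>>>>>>>>         corosync_cfg_finalize(c_handle);
>>>>>>>>         quorum_finalize(q_handle);
>>>>>>>>         return 0;
>>>>>>>> }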
>>>>>>>>
>>>>>>>> Now, I am not an expert on libqb. I have
>>>>>>>> version 0.16.0.real-1ubuntu5.
>>>>>>>>
>>>>>>>> The folder /dev/shm has 777 permissions, like the other nodes with
>>>>>>>> older corosync and pacemaker that work fine. The only difference is
>>>>>>>> that I only see files created by root, none created by hacluster as
>>>>>>>> on the other two nodes (probably because pacemaker didn't start
>>>>>>>> correctly).
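>>>>>>>>
>>>>>>>> (To compare ownership across nodes, ls -l /dev/shm lists the owner of
>>>>>>>> each ringbuffer file.)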
>>>>>>>>
>>>>>>>> This is the analysis I have done so far.
>>>>>>>> Any suggestion?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Hmm. It seems very likely to be something to do with the way the
>>>>>>> container is set up then - and I know nothing about containers. Sorry :/
>>>>>>>
>>>>>>> Can anyone else help here?
>>>>>>>
>>>>>>> Chrissie
>>>>>>>
>>>>>>>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo <sasadangelo at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Yes, sorry, you're right, I could have found it by myself.
>>>>>>>>> However, I did the following:
>>>>>>>>>
>>>>>>>>> 1. Added the line you suggested to /etc/fstab
>>>>>>>>> 2. mount -o remount /dev/shm
>>>>>>>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>>>>>>>> Filesystem Size Used Avail Use% Mounted on
>>>>>>>>> overlay 63G 11G 49G 19% /
>>>>>>>>> tmpfs 64M 4.0K 64M 1% /dev
>>>>>>>>> tmpfs 1000M 0 1000M 0% /sys/fs/cgroup
>>>>>>>>> osxfs 466G 158G 305G 35% /Users
>>>>>>>>> /dev/sda1 63G 11G 49G 19% /etc/hosts
>>>>>>>>> *shm 512M 15M 498M 3% /dev/shm*
>>>>>>>>> tmpfs 1000M 0 1000M 0% /sys/firmware
>>>>>>>>> tmpfs 128M 0 128M 0% /tmp
>>>>>>>>>
>>>>>>>>> The errors in the log went away. Consider that I removed the log file
>>>>>>>>> before starting corosync, so it does not contain lines from previous
>>>>>>>>> executions.
>>>>>>>>> <corosync.log>
>>>>>>>>>
>>>>>>>>> But the command:
>>>>>>>>> corosync-quorumtool -ps
>>>>>>>>>
>>>>>>>>> still give:
>>>>>>>>> Cannot initialize QUORUM service
>>>>>>>>>
>>>>>>>>> Consider that a few minutes before it gave me the message:
>>>>>>>>> Cannot initialize CFG service
>>>>>>>>>
>>>>>>>>> I do not know the differences between CFG and QUORUM in this case.
>>>>>>>>>
>>>>>>>>> If I try to start pacemaker, the service is OK, but I see only
>>>>>>>>> pacemaker, and the transport does not work if I try to run a crm
>>>>>>>>> command.
>>>>>>>>> Any suggestion?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 26 Jun 2018, at 10:49, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Yes,
>>>>>>>>>>>
>>>>>>>>>>> I am reproducing only the required part for testing. I think the
>>>>>>>>>>> original system has a larger shm. The problem is that I do not know
>>>>>>>>>>> exactly how to change it.
>>>>>>>>>>> I tried the following steps, but I have the impression I didn't
>>>>>>>>>>> perform the right ones:
>>>>>>>>>>>
>>>>>>>>>>> 1. remove everything under /tmp
>>>>>>>>>>> 2. Added the following line to /etc/fstab
>>>>>>>>>>> tmpfs /tmp tmpfs defaults,nodev,nosuid,mode=1777,size=128M 0 0
>>>>>>>>>>> 3. mount /tmp
>>>>>>>>>>> 4. df -h
>>>>>>>>>>> Filesystem Size Used Avail Use% Mounted on
>>>>>>>>>>> overlay 63G 11G 49G 19% /
>>>>>>>>>>> tmpfs 64M 4.0K 64M 1% /dev
>>>>>>>>>>> tmpfs 1000M 0 1000M 0% /sys/fs/cgroup
>>>>>>>>>>> osxfs 466G 158G 305G 35% /Users
>>>>>>>>>>> /dev/sda1 63G 11G 49G 19% /etc/hosts
>>>>>>>>>>> shm 64M 11M 54M 16% /dev/shm
>>>>>>>>>>> tmpfs 1000M 0 1000M 0% /sys/firmware
>>>>>>>>>>> *tmpfs 128M 0 128M 0% /tmp*
>>>>>>>>>>>
>>>>>>>>>>> The errors are exactly the same.
>>>>>>>>>>> I have the impression that I changed the wrong parameter.
>>>>>>>>>>> Probably I
>>>>>>>>>>> have to change:
>>>>>>>>>>> shm 64M 11M 54M 16% /dev/shm
>>>>>>>>>>>
>>>>>>>>>>> but I do not know how to do that. Any suggestion?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> According to Google, you just add a new line to /etc/fstab for
>>>>>>>>>> /dev/shm:
>>>>>>>>>>
>>>>>>>>>> tmpfs /dev/shm tmpfs defaults,size=512m 0 0
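>>>>>>>>>>
>>>>>>>>>> then remount it so the new size takes effect without a reboot:
>>>>>>>>>>
>>>>>>>>>> mount -o remount /dev/shm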
>>>>>>>>>>
>>>>>>>>>> Chrissie
>>>>>>>>>>
>>>>>>>>>>>> On 26 Jun 2018, at 09:48, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 25/06/18 20:41, Salvatore D'angelo wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me add one important detail here. I use Docker for my tests,
>>>>>>>>>>>>> with 5 containers deployed on my Mac.
>>>>>>>>>>>>> Basically, the team that worked on this project installed the
>>>>>>>>>>>>> cluster on SoftLayer bare metal.
>>>>>>>>>>>>> The PostgreSQL cluster was hard to test, and if a misconfiguration
>>>>>>>>>>>>> occurred, recreating the cluster from scratch was not easy.
>>>>>>>>>>>>> Testing it was cumbersome if you consider that we access the
>>>>>>>>>>>>> machines through a complex system that is hard to describe here.
>>>>>>>>>>>>> For this reason I ported the cluster to Docker for test purposes.
>>>>>>>>>>>>> I am not interested in having it work for months, I just need a
>>>>>>>>>>>>> proof of concept.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When the migration works I'll port everything to bare metal,
>>>>>>>>>>>>> where resources are abundant.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I have enough RAM and disk space on my Mac, so if you tell me
>>>>>>>>>>>>> what an acceptable size would be for several days of running,
>>>>>>>>>>>>> that is OK for me.
>>>>>>>>>>>>> It is also OK to have commands to clean the shm when required.
>>>>>>>>>>>>> I know I can find them on Google, but if you can suggest this
>>>>>>>>>>>>> info I'll appreciate it. I have the OS knowledge to do that, but
>>>>>>>>>>>>> I would like to avoid days of guesswork and trial and error if
>>>>>>>>>>>>> possible.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I would recommend at least 128MB of space on /dev/shm, 256MB if
>>>>>>>>>>>> you can
>>>>>>>>>>>> spare it. My 'standard' system uses 75MB under normal running
>>>>>>>>>>>> allowing
>>>>>>>>>>>> for one command-line query to run.
>>>>>>>>>>>>
>>>>>>>>>>>> If I read this right, then you're reproducing a bare-metal system
>>>>>>>>>>>> in containers now? So the original systems will have a default
>>>>>>>>>>>> /dev/shm size which is probably much larger than your containers?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm just checking here that we don't have a regression in memory
>>>>>>>>>>>> usage
>>>>>>>>>>>> as Poki suggested.
>>>>>>>>>>>>
>>>>>>>>>>>> Chrissie
>>>>>>>>>>>>
>>>>>>>>>>>>>> On 25 Jun 2018, at 21:18, Jan Pokorný <jpokorny at redhat.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>>>>>>>>>>>>>>> Thanks for the reply. I scratched my cluster, created it again,
>>>>>>>>>>>>>>> and then migrated as before. This time I uninstalled pacemaker,
>>>>>>>>>>>>>>> corosync, crmsh, and the resource agents with make uninstall;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> then I installed the new packages. The problem is the same when
>>>>>>>>>>>>>>> I launch:
>>>>>>>>>>>>>>> corosync-quorumtool -ps
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I got: Cannot initialize QUORUM service
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is the log with debug enabled:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [18019] pg3 corosyncerror [QB ] couldn't create circular mmap on /dev/shm/qb-cfg-event-18020-18028-23-data
>>>>>>>>>>>>>>> [18019] pg3 corosyncerror [QB ] qb_rb_open:cfg-event-18020-18028-23: Resource temporarily unavailable (11)
>>>>>>>>>>>>>>> [18019] pg3 corosyncdebug [QB ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-18020-18028-23-header
>>>>>>>>>>>>>>> [18019] pg3 corosyncdebug [QB ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-18020-18028-23-header
>>>>>>>>>>>>>>> [18019] pg3 corosyncerror [QB ] shm connection FAILED: Resource temporarily unavailable (11)
>>>>>>>>>>>>>>> [18019] pg3 corosyncerror [QB ] Error in connection setup (18020-18028-23): Resource temporarily unavailable (11)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried to check /dev/shm, though I am not sure these are the
>>>>>>>>>>>>>>> right commands:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> df -h /dev/shm
>>>>>>>>>>>>>>> Filesystem Size Used Avail Use% Mounted on
>>>>>>>>>>>>>>> shm 64M 16M 49M 24% /dev/shm
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ls /dev/shm
>>>>>>>>>>>>>>> qb-cmap-request-18020-18036-25-data
>>>>>>>>>>>>>>> qb-corosync-blackbox-data
>>>>>>>>>>>>>>> qb-quorum-request-18020-18095-32-data
>>>>>>>>>>>>>>> qb-cmap-request-18020-18036-25-header
>>>>>>>>>>>>>>> qb-corosync-blackbox-header
>>>>>>>>>>>>>>> qb-quorum-request-18020-18095-32-header
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is 64 MB enough for /dev/shm? If not, why did it work with the
>>>>>>>>>>>>>>> previous corosync release?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For a start, can you try configuring corosync with the
>>>>>>>>>>>>>> --enable-small-memory-footprint switch?
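>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (That's a build-time option, so it means rebuilding from your
>>>>>>>>>>>>>> source tree; roughly, assuming the usual autotools flow you
>>>>>>>>>>>>>> already used for the upgrade:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ./configure --enable-small-memory-footprint
>>>>>>>>>>>>>> make && make install
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It makes corosync use much smaller IPC/ringbuffer allocations,
>>>>>>>>>>>>>> so it needs far less /dev/shm.)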
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hard to say why the space provisioned to /dev/shm is the direct
>>>>>>>>>>>>>> opposite of generous (per today's standards), but may be the
>>>>>>>>>>>>>> result
>>>>>>>>>>>>>> of automatic HW adaptation, and if RAM is so scarce in your
>>>>>>>>>>>>>> case,
>>>>>>>>>>>>>> the above build-time toggle might help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If not, then exponentially increasing the size of the /dev/shm
>>>>>>>>>>>>>> space is likely your best bet (I don't recommend fiddling with
>>>>>>>>>>>>>> mlockall() and similar measures in corosync).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Of course, feel free to raise a regression if you have a
>>>>>>>>>>>>>> reproducible
>>>>>>>>>>>>>> comparison between two corosync (plus possibly different
>>>>>>>>>>>>>> libraries
>>>>>>>>>>>>>> like libqb) versions, one that works and one that won't, in
>>>>>>>>>>>>>> reproducible conditions (like this small /dev/shm, VM image,
>>>>>>>>>>>>>> etc.).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jan (Poki)
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org