[ClusterLabs] Upgrade corosync problem

Tue Jun 26 06:03:37 EDT 2018

On 26/06/18 11:00, Salvatore D'angelo wrote:
> Consider that the container is the same one in which corosync 2.3.5 ran.
> If it is something related to the container, then 2.4.4 probably
> introduced a feature that has an impact on containers.
> According to the code, it should be something related to libqb.
> Can anyone help?
> 

Have you tried downgrading libqb to the previous version to see if it
still happens?

Chrissie

>> On 26 Jun 2018, at 11:56, Christine Caulfield <ccaulfie at redhat.com>
>> wrote:
>>
>> On 26/06/18 10:35, Salvatore D'angelo wrote:
>>> Sorry, after the command:
>>>
>>> corosync-quorumtool -ps
>>>
>>> the errors in the log are still visible. Looking at the source code, it
>>> seems the problem is at these lines:
>>> https://github.com/corosync/corosync/blob/master/tools/corosync-quorumtool.c
>>>
>>>     if (quorum_initialize(&q_handle, &q_callbacks, &q_type) != CS_OK) {
>>>         fprintf(stderr, "Cannot initialize QUORUM service\n");
>>>         q_handle = 0;
>>>         goto out;
>>>     }
>>>
>>>     if (corosync_cfg_initialize(&c_handle, &c_callbacks) != CS_OK) {
>>>         fprintf(stderr, "Cannot initialise CFG service\n");
>>>         c_handle = 0;
>>>         goto out;
>>>     }
>>>
>>> The quorum_initialize function is defined here:
>>> https://github.com/corosync/corosync/blob/master/lib/quorum.c
>>>
>>> It seems to interact with libqb to allocate space on /dev/shm, but
>>> something fails. I tried to update libqb with apt-get install, but
>>> without success.
>>>
>>> The same goes for the second function:
>>> https://github.com/corosync/corosync/blob/master/lib/cfg.c
>>>
>>> I am not a libqb expert. I have version 0.16.0.real-1ubuntu5.
>>>
>>> The folder /dev/shm has 777 permissions, as on the other nodes with the
>>> older corosync and pacemaker that work fine. The only difference is that
>>> I only see files created by root, none created by hacluster as on the
>>> other two nodes (probably because pacemaker didn’t start correctly).
>>>
>>> This is the analysis I have done so far.
>>> Any suggestion?
>>>
>>>
>>
>> Hmm. It seems very likely something to do with the way the container is
>> set up then - and I know nothing about containers. Sorry :/
>>
>> Can anyone else help here?
>>
>> Chrissie
>>
>>>> On 26 Jun 2018, at 11:03, Salvatore D'angelo <sasadangelo at gmail.com>
>>>> wrote:
>>>>
>>>> Yes, sorry, you’re right, I could have found it myself.
>>>> However, I did the following:
>>>>
>>>> 1. Added the line you suggested to /etc/fstab
>>>> 2. mount -o remount /dev/shm
>>>> 3. Now I correctly see /dev/shm of 512M with df -h
>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>> overlay          63G   11G   49G  19% /
>>>> tmpfs            64M  4.0K   64M   1% /dev
>>>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>>>> osxfs           466G  158G  305G  35% /Users
>>>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>>>> *shm             512M   15M  498M   3% /dev/shm*
>>>> tmpfs          1000M     0 1000M   0% /sys/firmware
>>>> tmpfs           128M     0  128M   0% /tmp
>>>>
>>>> The errors in the log went away. Note that I remove the log file
>>>> before starting corosync, so it does not contain lines from previous
>>>> executions.
>>>> <corosync.log>
>>>>
>>>> But the command:
>>>> corosync-quorumtool -ps
>>>>
>>>> still gives:
>>>> Cannot initialize QUORUM service
>>>>
>>>> Note that a few minutes earlier it gave me the message:
>>>> Cannot initialize CFG service
>>>>
>>>> I do not know the difference between CFG and QUORUM in this case.
>>>>
>>>> If I try to start pacemaker, the service is OK, but I see only pacemaker,
>>>> and the transport does not work if I try to run a crm command.
>>>> Any suggestion?
>>>>
>>>>
>>>>> On 26 Jun 2018, at 10:49, Christine Caulfield <ccaulfie at redhat.com>
>>>>> wrote:
>>>>>
>>>>> On 26/06/18 09:40, Salvatore D'angelo wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Yes,
>>>>>>
>>>>>> I am reproducing only the part required for testing. I think the
>>>>>> original system has a larger shm. The problem is that I do not know
>>>>>> exactly how to change it.
>>>>>> I tried the following steps, but I have the impression I didn’t
>>>>>> perform the right ones:
>>>>>>
>>>>>> 1. Removed everything under /tmp
>>>>>> 2. Added the following line to /etc/fstab
>>>>>> tmpfs   /tmp   tmpfs   defaults,nodev,nosuid,mode=1777,size=128M   0  0
>>>>>> 3. mount /tmp
>>>>>> 4. df -h
>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>> overlay          63G   11G   49G  19% /
>>>>>> tmpfs            64M  4.0K   64M   1% /dev
>>>>>> tmpfs          1000M     0 1000M   0% /sys/fs/cgroup
>>>>>> osxfs           466G  158G  305G  35% /Users
>>>>>> /dev/sda1        63G   11G   49G  19% /etc/hosts
>>>>>> shm              64M   11M   54M  16% /dev/shm
>>>>>> tmpfs          1000M     0 1000M   0% /sys/firmware
>>>>>> *tmpfs           128M     0  128M   0% /tmp*
>>>>>>
>>>>>> The errors are exactly the same.
>>>>>> I have the impression that I changed the wrong parameter. Probably I
>>>>>> have to change:
>>>>>> shm              64M   11M   54M  16% /dev/shm
>>>>>>
>>>>>> but I do not know how to do that. Any suggestion?
>>>>>>
>>>>>
>>>>> According to google, you just add a new line to /etc/fstab for /dev/shm
>>>>>
>>>>> tmpfs      /dev/shm      tmpfs   defaults,size=512m   0   0
>>>>>
>>>>> Chrissie
>>>>>
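For reference, the fstab change discussed here can be checked with a small shell helper. This is only a minimal sketch, not something taken from corosync or libqb: `shm_fstab_size` is a hypothetical helper name, and it assumes the usual six-field fstab layout. Also note that a container typically does not process /etc/fstab at start-up, so with Docker the `--shm-size` option of `docker run` may be the more durable way to get a larger /dev/shm.

```shell
#!/bin/sh
# shm_fstab_size [FILE]: print the size= option of the /dev/shm entry in an
# fstab-style file. FILE defaults to /etc/fstab. Commented-out entries are
# skipped. (Hypothetical helper, for illustration only.)
shm_fstab_size() {
    awk '$1 !~ /^#/ && $2 == "/dev/shm" {
             n = split($4, opts, ",")
             for (i = 1; i <= n; i++)
                 if (opts[i] ~ /^size=/) {
                     sub(/^size=/, "", opts[i])
                     print opts[i]
                 }
         }' "${1:-/etc/fstab}"
}

# After editing /etc/fstab, apply the new size without a reboot (needs root):
#   mount -o remount /dev/shm
# Or set the size directly, without touching fstab:
#   mount -o remount,size=512m /dev/shm
```

With the line suggested above in place, `shm_fstab_size` should print `512m`, and `df -h /dev/shm` should show the new size after the remount.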
>>>>>>> On 26 Jun 2018, at 09:48, Christine Caulfield
>>>>>>> <ccaulfie at redhat.com> wrote:
>>>>>>>
>>>>>>> On 25/06/18 20:41, Salvatore D'angelo wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Let me add one important detail here. I use Docker for my tests,
>>>>>>>> with 5 containers deployed on my Mac.
>>>>>>>> The team that worked on this project installed the cluster
>>>>>>>> on SoftLayer bare metal.
>>>>>>>> The PostgreSQL cluster was hard to test, and if a misconfiguration
>>>>>>>> occurred, recreating the cluster from scratch was not easy.
>>>>>>>> Testing it was cumbersome, considering that we access the
>>>>>>>> machines through a complex system that is hard to describe here.
>>>>>>>> For this reason I ported the cluster to Docker for test purposes.
>>>>>>>> I am not interested in having it work for months, I just need a
>>>>>>>> proof of concept.
>>>>>>>>
>>>>>>>> When the migration works I’ll port everything to bare metal,
>>>>>>>> where resources are abundant.
>>>>>>>>
>>>>>>>> I have enough RAM and disk space on my Mac, so if you tell me
>>>>>>>> what an acceptable size for several days of running would be, that
>>>>>>>> is OK for me.
>>>>>>>> It is also OK to have commands to clean the shm when required.
>>>>>>>> I know I can find them on Google, but if you can suggest this
>>>>>>>> information I’d appreciate it. I have the OS knowledge to do that,
>>>>>>>> but I would like to avoid days of guesswork and trial and error if
>>>>>>>> possible.
>>>>>>>
>>>>>>>
>>>>>>> I would recommend at least 128MB of space on /dev/shm, 256MB if you
>>>>>>> can spare it. My 'standard' system uses 75MB under normal running,
>>>>>>> allowing for one command-line query to run.
>>>>>>>
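A quick way to compare a container's /dev/shm against the 128MB figure is sketched below. It is an illustration under stated assumptions: the helper names `shm_size_mb` and `shm_big_enough` are made up for this example, and the parsing relies on the POSIX `df -P` column layout (total size in 1024-byte blocks in the second column).

```shell
#!/bin/sh
# shm_size_mb: read `df -Pk` output on stdin and print the total size of
# the last listed filesystem in MiB, rounded down.
shm_size_mb() {
    awk 'NR > 1 { kb = $2 } END { print int(kb / 1024) }'
}

# shm_big_enough SIZE_MB [MIN_MB]: succeed if SIZE_MB is at least MIN_MB.
# MIN_MB defaults to 128, the lower figure recommended above.
shm_big_enough() {
    [ "$1" -ge "${2:-128}" ]
}

# Example (run inside the container):
#   size=$(df -Pk /dev/shm | shm_size_mb)
#   shm_big_enough "$size" 128 || echo "/dev/shm is only ${size}MB" >&2
```

On the 64M /dev/shm shown elsewhere in this thread, the check would fail and print the warning.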
>>>>>>> If I read this right, you're reproducing a bare-metal system in
>>>>>>> containers now? So the original systems will have a default /dev/shm
>>>>>>> size which is probably much larger than your containers?
>>>>>>>
>>>>>>> I'm just checking here that we don't have a regression in memory
>>>>>>> usage, as Poki suggested.
>>>>>>>
>>>>>>> Chrissie
>>>>>>>
>>>>>>>>> On 25 Jun 2018, at 21:18, Jan Pokorný <jpokorny at redhat.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 25/06/18 19:06 +0200, Salvatore D'angelo wrote:
>>>>>>>>>> Thanks for the reply. I scrapped my cluster, created it again, and
>>>>>>>>>> then migrated as before. This time I uninstalled pacemaker,
>>>>>>>>>> corosync, crmsh and the resource agents with make uninstall,
>>>>>>>>>>
>>>>>>>>>> then I installed the new packages. The problem is the same: when
>>>>>>>>>> I launch:
>>>>>>>>>> corosync-quorumtool -ps
>>>>>>>>>>
>>>>>>>>>> I got: Cannot initialize QUORUM service
>>>>>>>>>>
>>>>>>>>>> Here is the log with debug enabled:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap on /dev/shm/qb-cfg-event-18020-18028-23-data
>>>>>>>>>> [18019] pg3 corosyncerror   [QB    ] qb_rb_open:cfg-event-18020-18028-23: Resource temporarily unavailable (11)
>>>>>>>>>> [18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-18020-18028-23-header
>>>>>>>>>> [18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-18020-18028-23-header
>>>>>>>>>> [18019] pg3 corosyncerror   [QB    ] shm connection FAILED: Resource temporarily unavailable (11)
>>>>>>>>>> [18019] pg3 corosyncerror   [QB    ] Error in connection setup (18020-18028-23): Resource temporarily unavailable (11)
>>>>>>>>>>
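Error 11 in these messages is EAGAIN on Linux, and the first line names the ring-buffer file that libqb could not map. To see which files are affected across a longer log, something like the following sketch may help. The helper name `failed_ringbuffers` is made up for this example, it assumes each log entry sits on a single line (as in the real log file, unlike the wrapped mail text), and the log path in the usage comment is only a guess that depends on the local logging configuration.

```shell
#!/bin/sh
# failed_ringbuffers: read a corosync log on stdin and print the /dev/shm
# path from every "couldn't create circular mmap" line, one per line.
failed_ringbuffers() {
    sed -n "s|.*couldn't create circular mmap on \(/dev/shm/[^ ]*\).*|\1|p"
}

# Example (the log location is an assumption, adjust to your setup):
#   failed_ringbuffers < /var/log/cluster/corosync.log | sort | uniq -c
```

If the same kind of buffer fails repeatedly while /dev/shm is nearly full, that points towards the size of /dev/shm rather than permissions.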
>>>>>>>>>> I tried to check /dev/shm and I am not sure these are the right
>>>>>>>>>> commands, however:
>>>>>>>>>>
>>>>>>>>>> df -h /dev/shm
>>>>>>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>>>>>>> shm              64M   16M   49M  24% /dev/shm
>>>>>>>>>>
>>>>>>>>>> ls /dev/shm
>>>>>>>>>> qb-cmap-request-18020-18036-25-data
>>>>>>>>>> qb-cmap-request-18020-18036-25-header
>>>>>>>>>> qb-corosync-blackbox-data
>>>>>>>>>> qb-corosync-blackbox-header
>>>>>>>>>> qb-quorum-request-18020-18095-32-data
>>>>>>>>>> qb-quorum-request-18020-18095-32-header
>>>>>>>>>>
>>>>>>>>>> Is 64 MB enough for /dev/shm? If not, why did it work with the
>>>>>>>>>> previous corosync release?
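To put a number on how much of /dev/shm the libqb files take, the sizes of the `qb-*` files can be summed. The sketch below is illustrative only: `qb_shm_usage_kb` is a hypothetical helper, and it adds up apparent file sizes as reported by `wc -c`, which can differ from the space actually allocated on tmpfs (compare with `du -sk` if in doubt).

```shell
#!/bin/sh
# qb_shm_usage_kb [DIR]: print the combined apparent size, in KiB (rounded
# down), of the regular files named qb-* in DIR. DIR defaults to /dev/shm.
qb_shm_usage_kb() {
    dir=${1:-/dev/shm}
    total=0
    for f in "$dir"/qb-*; do
        [ -f "$f" ] || continue        # skip the unexpanded glob if no match
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo $((total / 1024))
}

# Example:
#   echo "libqb files use $(qb_shm_usage_kb) KiB of /dev/shm"
```

Running it before and after a command such as `corosync-quorumtool -ps` gives a rough idea of how much each extra IPC connection costs.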
>>>>>>>>>
>>>>>>>>> For a start, can you try configuring corosync with
>>>>>>>>> --enable-small-memory-footprint switch?
>>>>>>>>>
>>>>>>>>> Hard to say why the space provisioned to /dev/shm is the direct
>>>>>>>>> opposite of generous (per today's standards), but may be the result
>>>>>>>>> of automatic HW adaptation, and if RAM is so scarce in your case,
>>>>>>>>> the above build-time toggle might help.
>>>>>>>>>
>>>>>>>>> If not, then exponentially increasing the size of the /dev/shm space
>>>>>>>>> is likely your best bet (I don't recommend fiddling with mlockall()
>>>>>>>>> and similar measures in corosync).
>>>>>>>>>
>>>>>>>>> Of course, feel free to raise a regression if you have a reproducible
>>>>>>>>> comparison between two corosync (plus possibly different libraries
>>>>>>>>> like libqb) versions, one that works and one that won't, in
>>>>>>>>> reproducible conditions (like this small /dev/shm, VM image, etc.).
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Jan (Poki)
>>>>>>>>> _______________________________________________
>>>>>>>>> Users mailing list: Users at clusterlabs.org
>>>>>>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>>>>>
>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>> Bugs: http://bugs.clusterlabs.org