[ClusterLabs] Upgrade corosync problem

Mon Jun 25 13:06:03 EDT 2018

Hi,

Thanks for the reply. I rebuilt my cluster from scratch and then migrated as before. This time I uninstalled pacemaker, corosync, crmsh and resource agents with
make uninstall

and then I installed the new packages. The problem is the same. When I launch:
corosync-quorumtool -ps

I get: Cannot initialize QUORUM service
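
Since the old and new versions were both built from source, one thing worth ruling out is a stale binary or library left behind by the earlier install. The sketch below is one way to check. The function names are made up for illustration, and the two install prefixes it looks in are assumed defaults, not taken from this cluster.

```shell
#!/bin/sh
# Sketch: see which corosync binary and which libqb are actually picked up.
# All function names here are hypothetical helpers, not part of corosync.

# Print the path of the first match for $1 on PATH, or "not found".
which_bin() {
    command -v "$1" 2>/dev/null || echo "not found"
}

# Print the libqb entries the dynamic linker resolves for binary $1.
linked_libqb() {
    ldd "$1" 2>/dev/null | grep 'libqb' || echo "no libqb dependency reported"
}

# Print a short report about the corosync install on this node.
report_install() {
    bin=$(which_bin corosync)
    echo "corosync binary: $bin"
    if [ "$bin" != "not found" ]; then
        "$bin" -v              # corosync prints its version and exits
        linked_libqb "$bin"
    fi
    # A copy in more than one prefix would suggest a mixed install.
    # These two paths are assumed defaults (package vs. source build).
    for p in /usr/sbin/corosync /usr/local/sbin/corosync; do
        [ -x "$p" ] && echo "found: $p"
    done
    return 0
}
```

Calling `report_install` on each node and comparing the output would show whether every node runs the same corosync build against the same libqb.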

Here is the log with debug enabled:

[18019] pg3 corosyncerror   [QB    ] couldn't create circular mmap on /dev/shm/qb-cfg-event-18020-18028-23-data
[18019] pg3 corosyncerror   [QB    ] qb_rb_open:cfg-event-18020-18028-23: Resource temporarily unavailable (11)
[18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-18020-18028-23-header
[18019] pg3 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-18020-18028-23-header
[18019] pg3 corosyncerror   [QB    ] shm connection FAILED: Resource temporarily unavailable (11)
[18019] pg3 corosyncerror   [QB    ] Error in connection setup (18020-18028-23): Resource temporarily unavailable (11)

I tried to check /dev/shm. I am not sure these are the right commands, but here is what I ran:

df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
shm              64M   16M   49M  24% /dev/shm

ls /dev/shm
qb-cmap-request-18020-18036-25-data    qb-corosync-blackbox-data    qb-quorum-request-18020-18095-32-data
qb-cmap-request-18020-18036-25-header  qb-corosync-blackbox-header  qb-quorum-request-18020-18095-32-header

Is 64 MB enough for /dev/shm? If not, why did it work with the previous corosync release?
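
One way to put numbers on the /dev/shm question is sketched below. It is only a sketch: the function names and the 90% threshold are illustrative assumptions, not anything defined by corosync or libqb. It reports how full the mount is and how many libqb ring-buffer files (`qb-*`) it holds.

```shell
#!/bin/sh
# Sketch: check how full a tmpfs such as /dev/shm is.
# Function names and the default threshold are assumptions for illustration.

# Print the use percentage (digits only) of the filesystem holding $1.
shm_use_percent() {
    # -P forces single-line POSIX output, so field 5 is the capacity column
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Count the libqb ring-buffer files (qb-*) directly inside directory $1.
count_qb_files() {
    find "$1" -maxdepth 1 -name 'qb-*' -type f | wc -l | tr -d ' '
}

# Print a one-line summary for $1 (default /dev/shm); return non-zero
# when usage is at or above the threshold $2 (default 90).
check_shm() {
    dir=${1:-/dev/shm}
    limit=${2:-90}
    used=$(shm_use_percent "$dir")
    echo "$dir: ${used}% used, $(count_qb_files "$dir") qb-* files"
    [ "$used" -lt "$limit" ]
}
```

Running `check_shm /dev/shm` before and after starting corosync would show how much of the 64M the ring buffers take. If the mount does turn out to be the limit, remounting it larger as root, for example `mount -o remount,size=256M /dev/shm` (256M is an arbitrary test value), would be one way to confirm.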

> On 25 Jun 2018, at 09:09, Christine Caulfield <ccaulfie at redhat.com> wrote:
> 
> On 22/06/18 11:23, Salvatore D'angelo wrote:
>> Hi,
>> Here the log:
>> 
>> 
>> 
> [17323] pg1 corosyncerror   [QB    ] couldn't create circular mmap on
> /dev/shm/qb-cfg-event-17324-17334-23-data
> [17323] pg1 corosyncerror   [QB    ]
> qb_rb_open:cfg-event-17324-17334-23: Resource temporarily unavailable (11)
> [17323] pg1 corosyncdebug   [QB    ] Free'ing ringbuffer:
> /dev/shm/qb-cfg-request-17324-17334-23-header
> [17323] pg1 corosyncdebug   [QB    ] Free'ing ringbuffer:
> /dev/shm/qb-cfg-response-17324-17334-23-header
> [17323] pg1 corosyncerror   [QB    ] shm connection FAILED: Resource
> temporarily unavailable (11)
> [17323] pg1 corosyncerror   [QB    ] Error in connection setup
> (17324-17334-23): Resource temporarily unavailable (11)
> [17323] pg1 corosyncdebug   [QB    ] qb_ipcs_disconnect(17324-17334-23)
> state:0
> 
> 
> 
> is /dev/shm full?
> 
> 
> Chrissie
> 
> 
>> 
>> 
>>> On 22 Jun 2018, at 12:10, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>> 
>>> On 22/06/18 10:39, Salvatore D'angelo wrote:
>>>> Hi,
>>>> 
>>>> Can you tell me exactly which log you need? I’ll provide it as soon as possible.
>>>> 
>>>> Regarding some settings, I am not the original author of this cluster. The people who created it have left the company I work for; I inherited the code, and sometimes I do not know why some settings are used.
>>>> The old versions of pacemaker, corosync, crmsh and resource agents were compiled and installed.
>>>> I simply downloaded the new versions, compiled them and installed them. I didn’t get any complaint during ./configure, which usually checks for library compatibility.
>>>> 
>>>> To be honest I do not know if this is the right approach. Should I “make uninstall” the old versions before installing the new ones?
>>>> What is the suggested approach?
>>>> Thanks in advance for your help.
>>>> 
>>> 
>>> OK fair enough!
>>> 
>>> To be honest the best approach is almost always to get the latest
>>> packages from the distributor rather than compile from source. That way
>>> you can be more sure that upgrades will go more smoothly. Though, to be
>>> honest, I'm not sure how good the Ubuntu packages are (they might be
>>> great, they might not, I genuinely don't know)
>>> 
>>> When building from source and if you don't know the provenance of the
>>> previous version then I would recommend a 'make uninstall' first - or
>>> removal of the packages if that's where they came from.
>>> 
>>> One thing you should do is make sure that all the cluster nodes are
>>> running the same version. If some are running older versions then nodes
>>> could drop out for obscure reasons. We try and keep minor versions
>>> on-wire compatible but it's always best to be cautious.
>>> 
>>> The tidying of your corosync.conf can wait for the moment, let's get
>>> things mostly working first. If you enable debug logging in corosync.conf:
>>> 
>>> logging {
>>>       to_syslog: yes
>>>       debug: on
>>> }
>>> 
>>> Then see what happens and post the syslog file that has all of the
>>> corosync messages in it, we'll take it from there.
>>> 
>>> Chrissie
>>> 
>>>>> On 22 Jun 2018, at 11:30, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>> 
>>>>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>>>>> Hi Christine,
>>>>>> 
>>>>>> Thanks for the reply. Let me add a few details. When I start the corosync
>>>>>> service I see the corosync process running. If I stop it and run:
>>>>>> 
>>>>>> corosync -f 
>>>>>> 
>>>>>> I see these warnings:
>>>>>> warning [MAIN  ] interface section bindnetaddr is used together with
>>>>>> nodelist. Nodelist one is going to be used.
>>>>>> warning [MAIN  ] Please migrate config file to nodelist.
>>>>>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>>>>>> permitted (1)
>>>>>> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>>>>>> 
>>>>>> but I see that the node joined.
>>>>>> 
>>>>> 
>>>>> Those certainly need fixing but are probably not the cause. Also why do
>>>>> you have these values below set?
>>>>> 
>>>>> max_network_delay: 100
>>>>> retransmits_before_loss_const: 25
>>>>> window_size: 150
>>>>> 
>>>>> I'm not saying they are causing the trouble, but they aren't going to
>>>>> help keep a stable cluster.
>>>>> 
>>>>> Without more logs (full logs are always better than just the bits you
>>>>> think are meaningful) I still can't be sure. It could easily be just
>>>>> that you've overwritten a packaged version of corosync with your own
>>>>> compiled one and they have different configure options or that the
>>>>> libraries now don't match.
>>>>> 
>>>>> Chrissie
>>>>> 
>>>>> 
>>>>>> My corosync.conf file is below.
>>>>>> 
>>>>>> With service corosync up and running I have the following output:
>>>>>> *corosync-cfgtool -s*
>>>>>> Printing ring status.
>>>>>> Local node ID 1
>>>>>> RING ID 0
>>>>>> id= 10.0.0.11
>>>>>> status= ring 0 active with no faults
>>>>>> RING ID 1
>>>>>> id= 192.168.0.11
>>>>>> status= ring 1 active with no faults
>>>>>> 
>>>>>> *corosync-cmapctl  | grep members*
>>>>>> runtime.totem.pg.mrp.srp.*members*.1.config_version (u64) = 0
>>>>>> runtime.totem.pg.mrp.srp.*members*.1.ip (str) = r(0) ip(10.0.0.11) r(1)
>>>>>> ip(192.168.0.11) 
>>>>>> runtime.totem.pg.mrp.srp.*members*.1.join_count (u32) = 1
>>>>>> runtime.totem.pg.mrp.srp.*members*.1.status (str) = joined
>>>>>> runtime.totem.pg.mrp.srp.*members*.2.config_version (u64) = 0
>>>>>> runtime.totem.pg.mrp.srp.*members*.2.ip (str) = r(0) ip(10.0.0.12) r(1)
>>>>>> ip(192.168.0.12) 
>>>>>> runtime.totem.pg.mrp.srp.*members*.2.join_count (u32) = 1
>>>>>> runtime.totem.pg.mrp.srp.*members*.2.status (str) = joined
>>>>>> 
>>>>>> For the moment I have two nodes in my cluster (the third node has some
>>>>>> issues, so for now I have run crm node standby on it).
>>>>>> 
>>>>>> Here are the dependencies I have installed for corosync (they work fine with
>>>>>> pacemaker 1.1.14 and corosync 2.3.5):
>>>>>>    libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>>>>    libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>>>>    libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>>>>    libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>>>>>>    libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>>>>    libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>>>>>>    libqb0_0.16.0.real-1ubuntu4_amd64.deb
>>>>>> 
>>>>>> *corosync.conf*
>>>>>> ---------------------
>>>>>> quorum {
>>>>>>       provider: corosync_votequorum
>>>>>>       expected_votes: 3
>>>>>> }
>>>>>> totem {
>>>>>>       version: 2
>>>>>>       crypto_cipher: none
>>>>>>       crypto_hash: none
>>>>>>       rrp_mode: passive
>>>>>>       interface {
>>>>>>               ringnumber: 0
>>>>>>               bindnetaddr: 10.0.0.0
>>>>>>               mcastport: 5405
>>>>>>               ttl: 1
>>>>>>       }
>>>>>>       interface {
>>>>>>               ringnumber: 1
>>>>>>               bindnetaddr: 192.168.0.0
>>>>>>               mcastport: 5405
>>>>>>               ttl: 1
>>>>>>       }
>>>>>>       transport: udpu
>>>>>>       max_network_delay: 100
>>>>>>       retransmits_before_loss_const: 25
>>>>>>       window_size: 150
>>>>>> }
>>>>>> nodelist {
>>>>>>       node {
>>>>>>               ring0_addr: pg1
>>>>>>               ring1_addr: pg1p
>>>>>>               nodeid: 1
>>>>>>       }
>>>>>>       node {
>>>>>>               ring0_addr: pg2
>>>>>>               ring1_addr: pg2p
>>>>>>               nodeid: 2
>>>>>>       }
>>>>>>       node {
>>>>>>               ring0_addr: pg3
>>>>>>               ring1_addr: pg3p
>>>>>>               nodeid: 3
>>>>>>       }
>>>>>> }
>>>>>> logging {
>>>>>>       to_syslog: yes
>>>>>> }
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 22 Jun 2018, at 09:24, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>>> 
>>>>>>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>>>>>>>> Pacemaker 1.1.14 -> 1.1.18
>>>>>>>> Corosync 2.3.5 -> 2.4.4
>>>>>>>> Crmsh 2.2.0 -> 3.0.1
>>>>>>>> Resource agents 3.9.7 -> 4.1.1
>>>>>>>> 
>>>>>>>> I started with the first node (I am upgrading one node at a time).
>>>>>>>> On a PostgreSQL slave node I did:
>>>>>>>> 
>>>>>>>> *crm node standby <node>*
>>>>>>>> *service pacemaker stop*
>>>>>>>> *service corosync stop*
>>>>>>>> 
>>>>>>>> Then I built the tools above as described on their GitHub.com pages.
>>>>>>>> 
>>>>>>>> *./autogen.sh (where required)*
>>>>>>>> *./configure*
>>>>>>>> *make (where required)*
>>>>>>>> *make install*
>>>>>>>> 
>>>>>>>> Everything went ok. I expected the new files to overwrite the old ones. I left the
>>>>>>>> dependencies I had with the old software in place because I noticed that ./configure
>>>>>>>> didn’t complain.
>>>>>>>> I started corosync.
>>>>>>>> 
>>>>>>>> *service corosync start*
>>>>>>>> 
>>>>>>>> To verify that corosync works properly I used the following commands:
>>>>>>>> *corosync-cfgtool -s*
>>>>>>>> *corosync-cmapctl | grep members*
>>>>>>>> 
>>>>>>>> Everything seemed ok and I verified my node joined the cluster (at least
>>>>>>>> this is my impression).
>>>>>>>> 
>>>>>>>> Here I ran into a problem. Running the command:
>>>>>>>> corosync-quorumtool -ps
>>>>>>>> 
>>>>>>>> I got the following problem:
>>>>>>>> Cannot initialise CFG service
>>>>>>>> 
>>>>>>> That says that corosync is not running. Have a look in the log files to
>>>>>>> see why it stopped. The pacemaker logs below are showing the same thing,
>>>>>>> but we can't make any more guesses until we see what corosync itself is
>>>>>>> doing. Enabling debug in corosync.conf will also help if more detail is
>>>>>>> needed.
>>>>>>> 
>>>>>>> Also starting corosync with 'corosync -pf' on the command-line is often
>>>>>>> a quick way of checking things are starting OK.
>>>>>>> 
>>>>>>> Chrissie
>>>>>>> 
>>>>>>> 
>>>>>>>> If I try to start pacemaker, I only see pacemaker process running and
>>>>>>>> pacemaker.log containing the following lines:
>>>>>>>> 
>>>>>>>> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: crm_log_init:Changed
>>>>>>>> active directory to /var/lib/pacemaker/cores/
>>>>>>>> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
>>>>>>>> get_cluster_type:Detected an active 'corosync' cluster/
>>>>>>>> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
>>>>>>>> mcp_read_config:Reading configure for stack: corosync/
>>>>>>>> /Jun 21 15:09:38 [17115] pg1 pacemakerd:   notice: main:Starting
>>>>>>>> Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc
>>>>>>>> lha-fencing nagios  corosync-native atomic-attrd acls/
>>>>>>>> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: main:Maximum core
>>>>>>>> file size is: 18446744073709551615/
>>>>>>>> /Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
>>>>>>>> qb_ipcs_us_publish:server name: pacemakerd/
>>>>>>>> /Jun 21 15:09:53 [17115] pg1 pacemakerd:  warning:
>>>>>>>> corosync_node_name:Could not connect to Cluster Configuration Database
>>>>>>>> API, error CS_ERR_TRY_AGAIN/
>>>>>>>> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
>>>>>>>> corosync_node_name:Unable to get node name for nodeid 1/
>>>>>>>> /Jun 21 15:09:53 [17115] pg1 pacemakerd:   notice: get_node_name:Could
>>>>>>>> not obtain a node name for corosync nodeid 1/
>>>>>>>> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Created
>>>>>>>> entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node
>>>>>>>> (null)/1 (1 total)/
>>>>>>>> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Node 1
>>>>>>>> has uuid 1/
>>>>>>>> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
>>>>>>>> crm_update_peer_proc:cluster_connect_cpg: Node (null)[1] - corosync-cpg
>>>>>>>> is now online/
>>>>>>>> /Jun 21 15:09:53 [17115] pg1 pacemakerd:    error:
>>>>>>>> cluster_connect_quorum:Could not connect to the Quorum API: 2/
>>>>>>>> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
>>>>>>>> qb_ipcs_us_withdraw:withdrawing server sockets/
>>>>>>>> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: main:Exiting
>>>>>>>> pacemakerd/
>>>>>>>> /Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
>>>>>>>> crm_xml_cleanup:Cleaning up memory from libxml2/
>>>>>>>> 
>>>>>>>> *What is wrong in my procedure?*
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 
>> 
>> 
>> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log
Type: application/octet-stream
Size: 39718 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180625/cca3dd9c/attachment-0002.obj>