[ClusterLabs] Upgrade corosync problem

Mon Jun 25 07:09:31 UTC 2018

On 22/06/18 11:23, Salvatore D'angelo wrote:
> Hi,
> Here is the log:
> 
> 
> 
[17323] pg1 corosyncerror   [QB    ] couldn't create circular mmap on /dev/shm/qb-cfg-event-17324-17334-23-data
[17323] pg1 corosyncerror   [QB    ] qb_rb_open:cfg-event-17324-17334-23: Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-17324-17334-23-header
[17323] pg1 corosyncdebug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-17324-17334-23-header
[17323] pg1 corosyncerror   [QB    ] shm connection FAILED: Resource temporarily unavailable (11)
[17323] pg1 corosyncerror   [QB    ] Error in connection setup (17324-17334-23): Resource temporarily unavailable (11)
[17323] pg1 corosyncdebug   [QB    ] qb_ipcs_disconnect(17324-17334-23) state:0

Is /dev/shm full?
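
One quick way to check is sketched below. It assumes a POSIX shell with df and awk available; shm_use_percent is a hypothetical helper name used only for illustration, not something shipped with corosync or libqb.

```shell
#!/bin/sh
# Print the use% (digits only) of the filesystem that holds a path.
# shm_use_percent is a hypothetical helper, not part of corosync or libqb.
shm_use_percent() {
    df -P "${1:-/dev/shm}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Human-readable size, used and available space for /dev/shm.
df -h /dev/shm

# libqb ring buffer files, like the ones named in the log above.
# Leftovers from processes that have died can use up the space.
ls -l /dev/shm/qb-* 2>/dev/null || echo "no qb-* files found in /dev/shm"

echo "/dev/shm use: $(shm_use_percent)%"
```

If the reported use is at or near 100%, the mmap failures in the log would be consistent with the filesystem having run out of room.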

Chrissie

> 
> 
>> On 22 Jun 2018, at 12:10, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>
>> On 22/06/18 10:39, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> Can you tell me exactly which log you need? I'll provide it as soon as possible.
>>>
>>> Regarding some settings, I am not the original author of this cluster. The people who created it left the company I work for, and I inherited the code, so sometimes I do not know why some settings are used.
>>> The old versions of pacemaker, corosync, crmsh and resource agents were compiled and installed.
>>> I simply downloaded the new versions, compiled them and installed them. I didn't get any complaint during ./configure, which usually checks for library compatibility.
>>>
>>> To be honest I do not know if this is the right approach. Should I "make uninstall" the old versions before installing the new ones?
>>> Which is the suggested approach?
>>> Thanks in advance for your help.
>>>
>>
>> OK fair enough!
>>
>> To be honest the best approach is almost always to get the latest
>> packages from the distributor rather than compile from source. That way
>> you can be more confident that upgrades will go smoothly. Though, to be
>> honest, I'm not sure how good the Ubuntu packages are (they might be
>> great, they might not, I genuinely don't know).
>>
>> When building from source, if you don't know the provenance of the
>> previous version, I would recommend a 'make uninstall' first - or
>> removal of the packages if that's where they came from.
>>
>> One thing you should do is make sure that all the cluster nodes are
>> running the same version. If some are running older versions then nodes
>> could drop out for obscure reasons. We try and keep minor versions
>> on-wire compatible but it's always best to be cautious.
>>
>> The tidying of your corosync.conf can wait for the moment, let's get
>> things mostly working first. If you enable debug logging in corosync.conf:
>>
>> logging {
>>        to_syslog: yes
>>        debug: on
>> }
>>
>> Then see what happens and post the syslog file that has all of the
>> corosync messages in it, we'll take it from there.
>>
>> Chrissie
>>
>>>> On 22 Jun 2018, at 11:30, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>
>>>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>>>> Hi Christine,
>>>>>
>>>>> Thanks for the reply. Let me add a few details. When I run the corosync
>>>>> service I see the corosync process running. If I stop it and run:
>>>>>
>>>>> corosync -f 
>>>>>
>>>>> I see the following warnings:
>>>>> warning [MAIN  ] interface section bindnetaddr is used together with
>>>>> nodelist. Nodelist one is going to be used.
>>>>> warning [MAIN  ] Please migrate config file to nodelist.
>>>>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>>>>> permitted (1)
>>>>> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>>>>>
>>>>> but I see that the node joined.
>>>>>
>>>>
>>>> Those certainly need fixing but are probably not the cause. Also why do
>>>> you have these values below set?
>>>>
>>>> max_network_delay: 100
>>>> retransmits_before_loss_const: 25
>>>> window_size: 150
>>>>
>>>> I'm not saying they are causing the trouble, but they aren't going to
>>>> help keep a stable cluster.
>>>>
>>>> Without more logs (full logs are always better than just the bits you
>>>> think are meaningful) I still can't be sure. It could easily be just
>>>> that you've overwritten a packaged version of corosync with your own
>>>> compiled one and they have different configure options, or that the
>>>> libraries now don't match.
>>>>
>>>> Chrissie
>>>>
>>>>
>>>>> My corosync.conf file is below.
>>>>>
>>>>> With service corosync up and running I have the following output:
>>>>> corosync-cfgtool -s
>>>>> Printing ring status.
>>>>> Local node ID 1
>>>>> RING ID 0
>>>>> id= 10.0.0.11
>>>>> status= ring 0 active with no faults
>>>>> RING ID 1
>>>>> id= 192.168.0.11
>>>>> status= ring 1 active with no faults
>>>>>
>>>>> corosync-cmapctl | grep members
>>>>> runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
>>>>> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) ip(192.168.0.11)
>>>>> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
>>>>> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
>>>>> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
>>>>> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) ip(192.168.0.12)
>>>>> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
>>>>> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
>>>>>
>>>>> For the moment I have two nodes in my cluster (the third node had some
>>>>> issues, and for now I have run crm node standby on it).
>>>>>
>>>>> Here are the dependencies I have installed for corosync (they work fine
>>>>> with pacemaker 1.1.14 and corosync 2.3.5):
>>>>>     libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>>>     libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>>>     libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>>>     libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>>>>>     libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>>>     libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>>>>>     libqb0_0.16.0.real-1ubuntu4_amd64.deb
>>>>>
>>>>> corosync.conf
>>>>> ---------------------
>>>>> quorum {
>>>>>        provider: corosync_votequorum
>>>>>        expected_votes: 3
>>>>> }
>>>>> totem {
>>>>>        version: 2
>>>>>        crypto_cipher: none
>>>>>        crypto_hash: none
>>>>>        rrp_mode: passive
>>>>>        interface {
>>>>>                ringnumber: 0
>>>>>                bindnetaddr: 10.0.0.0
>>>>>                mcastport: 5405
>>>>>                ttl: 1
>>>>>        }
>>>>>        interface {
>>>>>                ringnumber: 1
>>>>>                bindnetaddr: 192.168.0.0
>>>>>                mcastport: 5405
>>>>>                ttl: 1
>>>>>        }
>>>>>        transport: udpu
>>>>>        max_network_delay: 100
>>>>>        retransmits_before_loss_const: 25
>>>>>        window_size: 150
>>>>> }
>>>>> nodelist {
>>>>>        node {
>>>>>                ring0_addr: pg1
>>>>>                ring1_addr: pg1p
>>>>>                nodeid: 1
>>>>>        }
>>>>>        node {
>>>>>                ring0_addr: pg2
>>>>>                ring1_addr: pg2p
>>>>>                nodeid: 2
>>>>>        }
>>>>>        node {
>>>>>                ring0_addr: pg3
>>>>>                ring1_addr: pg3p
>>>>>                nodeid: 3
>>>>>        }
>>>>> }
>>>>> logging {
>>>>>        to_syslog: yes
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 22 Jun 2018, at 09:24, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>>>
>>>>>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>>>>>>> Pacemaker 1.1.14 -> 1.1.18
>>>>>>> Corosync 2.3.5 -> 2.4.4
>>>>>>> Crmsh 2.2.0 -> 3.0.1
>>>>>>> Resource agents 3.9.7 -> 4.1.1
>>>>>>>
>>>>>>> I started with the first node (I am upgrading one node at a time).
>>>>>>> On a PostgreSQL slave node I did:
>>>>>>>
>>>>>>> crm node standby <node>
>>>>>>> service pacemaker stop
>>>>>>> service corosync stop
>>>>>>>
>>>>>>> Then I built the tools above as described on their GitHub.com pages.
>>>>>>>
>>>>>>> ./autogen.sh (where required)
>>>>>>> ./configure
>>>>>>> make (where required)
>>>>>>> make install
>>>>>>>
>>>>>>> Everything went OK. I expected the new files to overwrite the old ones. I
>>>>>>> left the dependencies from the old software in place because ./configure
>>>>>>> didn't complain.
>>>>>>> I started corosync.
>>>>>>>
>>>>>>> service corosync start
>>>>>>>
>>>>>>> To verify that corosync works properly I used the following commands:
>>>>>>> corosync-cfgtool -s
>>>>>>> corosync-cmapctl | grep members
>>>>>>>
>>>>>>> Everything seemed ok and I verified my node joined the cluster (at least
>>>>>>> this is my impression).
>>>>>>>
>>>>>>> Then I noticed a problem. Running the command:
>>>>>>> corosync-quorumtool -ps
>>>>>>>
>>>>>>> I got the following error:
>>>>>>> Cannot initialise CFG service
>>>>>>>
>>>>>> That says that corosync is not running. Have a look in the log files to
>>>>>> see why it stopped. The pacemaker logs below are showing the same thing,
>>>>>> but we can't make any more guesses until we see what corosync itself is
>>>>>> doing. Enabling debug in corosync.conf will also help if more detail is
>>>>>> needed.
>>>>>>
>>>>>> Also starting corosync with 'corosync -pf' on the command-line is often
>>>>>> a quick way of checking things are starting OK.
>>>>>>
>>>>>> Chrissie
>>>>>>
>>>>>>
>>>>>>> If I try to start pacemaker, I only see the pacemaker process running, and
>>>>>>> pacemaker.log contains the following lines:
>>>>>>>
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: crm_log_init:Changed active directory to /var/lib/pacemaker/cores
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: get_cluster_type:Detected an active 'corosync' cluster
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: mcp_read_config:Reading configure for stack: corosync
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:   notice: main:Starting Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc lha-fencing nagios  corosync-native atomic-attrd acls
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: main:Maximum core file size is: 18446744073709551615
>>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: qb_ipcs_us_publish:server name: pacemakerd
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:  warning: corosync_node_name:Could not connect to Cluster Configuration Database API, error CS_ERR_TRY_AGAIN
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: corosync_node_name:Unable to get node name for nodeid 1
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:   notice: get_node_name:Could not obtain a node name for corosync nodeid 1
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Created entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node (null)/1 (1 total)
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Node 1 has uuid 1
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_update_peer_proc:cluster_connect_cpg: Node (null)[1] - corosync-cpg is now online
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:    error: cluster_connect_quorum:Could not connect to the Quorum API: 2
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: qb_ipcs_us_withdraw:withdrawing server sockets
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: main:Exiting pacemakerd
>>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_xml_cleanup:Cleaning up memory from libxml2
>>>>>>>
>>>>>>> What is wrong with my procedure?
>>>>>>>
>>>>>>>
>>>>>>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org