[ClusterLabs] Upgrade corosync problem
Salvatore D'angelo
sasadangelo at gmail.com
Fri Jun 22 05:39:52 EDT 2018
Hi,
Can you tell me exactly which logs you need? I'll provide them as soon as possible.
Regarding some of the settings: I am not the original author of this cluster. The people who created it left the company I work for, and I inherited the code, so sometimes I do not know why certain settings are used.
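Meanwhile I will enable debug logging to capture more detail. If I read the documentation correctly, it should be enough to extend the logging section of my corosync.conf like this (my assumption, please correct me if I am wrong):

```
logging {
    to_syslog: yes
    debug: on
}
```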
The old versions of pacemaker, corosync, crmsh and resource agents were compiled and installed.
I simply downloaded the new versions, compiled them and installed them. I didn't get any complaint from ./configure, which usually checks for library compatibility.
To be honest, I do not know if this is the right approach. Should I "make uninstall" the old versions before installing the new ones?
Which is the suggested approach?
Thanks in advance for your help.
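In case it helps clarify my question, this is the kind of clean sequence I had in mind for each node. The /usr/src paths and the node name are just examples from my environment, and DRY_RUN=1 (the default here) only prints the commands instead of running them:

```shell
#!/bin/sh
# Sketch of a single-node upgrade that uninstalls the old build first.
# The /usr/src source-tree paths and node name "pg1" are examples only.
# DRY_RUN=1 (the default) prints the commands instead of executing them.
set -e

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 1. Take the node out of the cluster.
run crm node standby pg1
run service pacemaker stop
run service corosync stop

# 2. Remove the files the old build installed, from the old source trees,
#    so stale binaries and libraries cannot shadow the new ones.
run make -C /usr/src/corosync-2.3.5 uninstall
run make -C /usr/src/pacemaker-1.1.14 uninstall

# 3. Build and install the new versions (./configure already run in each tree).
for tree in /usr/src/corosync-2.4.4 /usr/src/pacemaker-1.1.18; do
    run make -C "$tree"
    run make -C "$tree" install
done

# 4. Bring the node back into the cluster.
run service corosync start
run service pacemaker start
run crm node online pg1
```

Is that roughly the right idea, or is there a better-supported path?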
> On 22 Jun 2018, at 11:30, Christine Caulfield <ccaulfie at redhat.com> wrote:
>
> On 22/06/18 10:14, Salvatore D'angelo wrote:
>> Hi Christine,
>>
>> Thanks for the reply. Let me add a few details. When I run the corosync
>> service I see the corosync process running. If I stop it and run:
>>
>> corosync -f
>>
>> I see three warnings:
>> warning [MAIN ] interface section bindnetaddr is used together with
>> nodelist. Nodelist one is going to be used.
>> warning [MAIN ] Please migrate config file to nodelist.
>> warning [MAIN ] Could not set SCHED_RR at priority 99: Operation not
>> permitted (1)
>> warning [MAIN ] Could not set priority -2147483648: Permission denied (13)
>>
>> but I see node joined.
>>
>
> Those certainly need fixing but are probably not the cause. Also why do
> you have these values below set?
>
> max_network_delay: 100
> retransmits_before_loss_const: 25
> window_size: 150
>
> I'm not saying they are causing the trouble, but they aren't going to
> help keep a stable cluster.
>
> Without more logs (full logs are always better than just the bits you
> think are meaningful) I still can't be sure. It could easily be just
> that you've overwritten a packaged version of corosync with your own
> compiled one and they have different configure options or that the
> libraries now don't match.
>
> Chrissie
>
>
>> My corosync.conf file is below.
>>
>> With service corosync up and running I have the following output:
>> *corosync-cfgtool -s*
>> Printing ring status.
>> Local node ID 1
>> RING ID 0
>> id= 10.0.0.11
>> status= ring 0 active with no faults
>> RING ID 1
>> id= 192.168.0.11
>> status= ring 1 active with no faults
>>
>> *corosync-cmapctl | grep members*
>> runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
>> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) ip(192.168.0.11)
>> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
>> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
>> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
>> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) ip(192.168.0.12)
>> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
>> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
>>
>> For the moment I have two nodes in my cluster (the third node had some
>> issues, so I did crm node standby on it for now).
>>
>> Here the dependency I have installed for corosync (that works fine with
>> pacemaker 1.1.14 and corosync 2.3.5):
>> libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>> libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>> libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>> libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>> libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>> libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>> libqb0_0.16.0.real-1ubuntu4_amd64.deb
>>
>> *corosync.conf*
>> ---------------------
>> quorum {
>>     provider: corosync_votequorum
>>     expected_votes: 3
>> }
>> totem {
>>     version: 2
>>     crypto_cipher: none
>>     crypto_hash: none
>>     rrp_mode: passive
>>     interface {
>>         ringnumber: 0
>>         bindnetaddr: 10.0.0.0
>>         mcastport: 5405
>>         ttl: 1
>>     }
>>     interface {
>>         ringnumber: 1
>>         bindnetaddr: 192.168.0.0
>>         mcastport: 5405
>>         ttl: 1
>>     }
>>     transport: udpu
>>     max_network_delay: 100
>>     retransmits_before_loss_const: 25
>>     window_size: 150
>> }
>> nodelist {
>>     node {
>>         ring0_addr: pg1
>>         ring1_addr: pg1p
>>         nodeid: 1
>>     }
>>     node {
>>         ring0_addr: pg2
>>         ring1_addr: pg2p
>>         nodeid: 2
>>     }
>>     node {
>>         ring0_addr: pg3
>>         ring1_addr: pg3p
>>         nodeid: 3
>>     }
>> }
>> logging {
>>     to_syslog: yes
>> }
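About the first two warnings in my earlier mail: if I understand them correctly, with transport udpu the ring addresses come from the nodelist, so I think the migrated configuration would drop the bindnetaddr interface sections and look roughly like this (an untested guess on my side, same node names and ports):

```
quorum {
    provider: corosync_votequorum
    expected_votes: 3
}
totem {
    version: 2
    crypto_cipher: none
    crypto_hash: none
    rrp_mode: passive
    transport: udpu
}
nodelist {
    node {
        ring0_addr: pg1
        ring1_addr: pg1p
        nodeid: 1
    }
    node {
        ring0_addr: pg2
        ring1_addr: pg2p
        nodeid: 2
    }
    node {
        ring0_addr: pg3
        ring1_addr: pg3p
        nodeid: 3
    }
}
logging {
    to_syslog: yes
}
```

Does that look like the right migration?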
>>
>>
>>
>>
>>> On 22 Jun 2018, at 09:24, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>
>>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>>> Hi,
>>>>
>>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>>>> Pacemaker 1.1.14 -> 1.1.18
>>>> Corosync 2.3.5 -> 2.4.4
>>>> Crmsh 2.2.0 -> 3.0.1
>>>> Resource agents 3.9.7 -> 4.1.1
>>>>
>>>> I started on the first node (I am trying a one-node-at-a-time upgrade).
>>>> On a PostgreSQL slave node I did:
>>>>
>>>> *crm node standby <node>*
>>>> *service pacemaker stop*
>>>> *service corosync stop*
>>>>
>>>> Then I built the tools above as described on their GitHub pages.
>>>>
>>>> *./autogen.sh (where required)*
>>>> *./configure*
>>>> *make (where required)*
>>>> *make install*
>>>>
>>>> Everything went OK. I expected the new files to overwrite the old ones. I
>>>> kept the dependencies I had with the old software because I noticed that
>>>> ./configure didn't complain.
>>>> I started corosync.
>>>>
>>>> *service corosync start*
>>>>
>>>> To verify that corosync works properly I used the following commands:
>>>> *corosync-cfgtool -s*
>>>> *corosync-cmapctl | grep members*
>>>>
>>>> Everything seemed ok and I verified my node joined the cluster (at least
>>>> this is my impression).
>>>>
>>>> Here I hit a problem. Running the command:
>>>> corosync-quorumtool -ps
>>>>
>>>> I got the following problem:
>>>> Cannot initialise CFG service
>>>>
>>> That says that corosync is not running. Have a look in the log files to
>>> see why it stopped. The pacemaker logs below are showing the same thing,
>>> but we can't make any more guesses until we see what corosync itself is
>>> doing. Enabling debug in corosync.conf will also help if more detail is
>>> needed.
>>>
>>> Also starting corosync with 'corosync -pf' on the command-line is often
>>> a quick way of checking things are starting OK.
>>>
>>> Chrissie
>>>
>>>
>>>> If I try to start pacemaker, I only see pacemaker process running and
>>>> pacemaker.log containing the following lines:
>>>>
>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: get_cluster_type: Detected an active 'corosync' cluster
>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: mcp_read_config: Reading configure for stack: corosync
>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: notice: main: Starting Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc lha-fencing nagios corosync-native atomic-attrd acls
>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: main: Maximum core file size is: 18446744073709551615
>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd: info: qb_ipcs_us_publish: server name: pacemakerd
>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: warning: corosync_node_name: Could not connect to Cluster Configuration Database API, error CS_ERR_TRY_AGAIN
>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: corosync_node_name: Unable to get node name for nodeid 1
>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: notice: get_node_name: Could not obtain a node name for corosync nodeid 1
>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Created entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node (null)/1 (1 total)
>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_get_peer: Node 1 has uuid 1
>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] - corosync-cpg is now online
>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: error: cluster_connect_quorum: Could not connect to the Quorum API: 2
>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: main: Exiting pacemakerd
>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>
>>>> *What is wrong in my procedure?*
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Users mailing list: Users at clusterlabs.org
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>>
>>>