[ClusterLabs] Upgrade corosync problem

Christine Caulfield ccaulfie at redhat.com
Fri Jun 22 09:30:47 UTC 2018


On 22/06/18 10:14, Salvatore D'angelo wrote:
> Hi Christine,
> 
> Thanks for the reply. Let me add a few details. When I run the corosync
> service I see the corosync process running. If I stop it and run:
> 
> corosync -f 
> 
> I see three warnings:
> warning [MAIN  ] interface section bindnetaddr is used together with
> nodelist. Nodelist one is going to be used.
> warning [MAIN  ] Please migrate config file to nodelist.
> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
> permitted (1)
> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
> 
> but I see the node joined.
> 

Those certainly need fixing but are probably not the cause. Also why do
you have these values below set?

max_network_delay: 100
retransmits_before_loss_const: 25
window_size: 150

I'm not saying they are causing the trouble, but they aren't going to
help keep a stable cluster.
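
On the first warning: since you're on udpu with a full nodelist and the
default port, you can most likely drop the two interface {} blocks (or at
least their bindnetaddr lines) along with those tuning values. A rough,
untested sketch of what the totem section might then look like:

totem {
        version: 2
        crypto_cipher: none
        crypto_hash: none
        rrp_mode: passive
        transport: udpu
}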

Without more logs (full logs are always better than just the bits you
think are meaningful) I still can't be sure. It could easily be that
you've overwritten a packaged version of corosync with your own compiled
one and they have different configure options, or that the libraries now
don't match.
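
If you want to rule that out, something like the following (paths are a
guess; adjust them for your configure prefix) will show which binary and
libraries are actually being picked up. The usual trap when building from
source on Ubuntu is that ./configure defaults to /usr/local while the
packaged files live under /usr, so you can end up with two copies:

which -a corosync
corosync -v
ldd $(which corosync) | grep libqb
dpkg -l | grep -E 'corosync|libqb'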

Chrissie


> My corosync.conf file is below.
> 
> With the corosync service up and running I have the following output:
> corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
> id= 10.0.0.11
> status= ring 0 active with no faults
> RING ID 1
> id= 192.168.0.11
> status= ring 1 active with no faults
> 
> corosync-cmapctl | grep members
> runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) ip(192.168.0.11)
> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) ip(192.168.0.12)
> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
> 
> For the moment I have two nodes in my cluster (the third node has some
> issues, so for now I have put it in standby with crm node standby).
> 
> Here are the dependencies I have installed for corosync (they work fine
> with pacemaker 1.1.14 and corosync 2.3.5):
>      libnspr4-dev_2%3a4.10.10-0ubuntu0.14.04.1_amd64.deb
>      libnspr4_2%3a4.10.10-0ubuntu0.14.04.1_amd64.deb
>      libnss3-dev_2%3a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>      libnss3-nssdb_2%3a3.19.2.1-0ubuntu0.14.04.2_all.deb
>      libnss3_2%3a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>      libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>      libqb0_0.16.0.real-1ubuntu4_amd64.deb
> 
> corosync.conf
> ---------------------
> quorum {
>         provider: corosync_votequorum
>         expected_votes: 3
> }
> totem {
>         version: 2
>         crypto_cipher: none
>         crypto_hash: none
>         rrp_mode: passive
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 10.0.0.0
>                 mcastport: 5405
>                 ttl: 1
>         }
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 192.168.0.0
>                 mcastport: 5405
>                 ttl: 1
>         }
>         transport: udpu
>         max_network_delay: 100
>         retransmits_before_loss_const: 25
>         window_size: 150
> }
> nodelist {
>         node {
>                 ring0_addr: pg1
>                 ring1_addr: pg1p
>                 nodeid: 1
>         }
>         node {
>                 ring0_addr: pg2
>                 ring1_addr: pg2p
>                 nodeid: 2
>         }
>         node {
>                 ring0_addr: pg3
>                 ring1_addr: pg3p
>                 nodeid: 3
>         }
> }
> logging {
>         to_syslog: yes
> }
> 
> 
> 
> 
>> On 22 Jun 2018, at 09:24, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>
>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>> Hi,
>>>
>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>>> Pacemaker 1.1.14 -> 1.1.18
>>> Corosync 2.3.5 -> 2.4.4
>>> Crmsh 2.2.0 -> 3.0.1
>>> Resource agents 3.9.7 -> 4.1.1
>>>
>>> I started on a first node (I am trying a one-node-at-a-time upgrade).
>>> On a PostgreSQL slave node I did:
>>>
>>> crm node standby <node>
>>> service pacemaker stop
>>> service corosync stop
>>>
>>> Then I built the tools above as described on their GitHub.com pages.
>>>
>>> ./autogen.sh (where required)
>>> ./configure
>>> make (where required)
>>> make install
>>>
>>> Everything went OK. I expect the new files overwrote the old ones. I
>>> kept the dependencies I had from the old software because ./configure
>>> didn't complain.
>>> I started corosync.
>>>
>>> service corosync start
>>>
>>> To verify corosync was working properly I used the following commands:
>>> corosync-cfgtool -s
>>> corosync-cmapctl | grep members
>>>
>>> Everything seemed OK and I verified the node joined the cluster (at
>>> least that is my impression).
>>>
>>> Here I hit a problem. Running the command:
>>> corosync-quorumtool -ps
>>>
>>> I got the following error:
>>> Cannot initialise CFG service
>>>
>> That says that corosync is not running. Have a look in the log files to
>> see why it stopped. The pacemaker logs below are showing the same thing,
>> but we can't make any more guesses until we see what corosync itself is
>> doing. Enabling debug in corosync.conf will also help if more detail is
>> needed.
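>>
>> In case it helps, a debug logging stanza looks roughly like this (the
>> logfile path is just an example; make sure the directory exists if you
>> use it):
>>
>> logging {
>>         to_syslog: yes
>>         to_logfile: yes
>>         logfile: /var/log/corosync/corosync.log
>>         debug: on
>> }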
>>
>> Also starting corosync with 'corosync -pf' on the command-line is often
>> a quick way of checking things are starting OK.
>>
>> Chrissie
>>
>>
>>> If I try to start pacemaker, I only see the pacemakerd process running
>>> and pacemaker.log contains the following lines:
>>>
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores/
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: get_cluster_type: Detected an active 'corosync' cluster
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: mcp_read_config: Reading configure for stack: corosync
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:   notice: main: Starting Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc lha-fencing nagios corosync-native atomic-attrd acls
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: main: Maximum core file size is: 18446744073709551615
>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: qb_ipcs_us_publish: server name: pacemakerd
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:  warning: corosync_node_name: Could not connect to Cluster Configuration Database API, error CS_ERR_TRY_AGAIN
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: corosync_node_name: Unable to get node name for nodeid 1
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:   notice: get_node_name: Could not obtain a node name for corosync nodeid 1
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer: Created entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node (null)/1 (1 total)
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer: Node 1 has uuid 1
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] - corosync-cpg is now online
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:    error: cluster_connect_quorum: Could not connect to the Quorum API: 2
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: qb_ipcs_us_withdraw: withdrawing server sockets
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: main: Exiting pacemakerd
>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>
>>> What is wrong in my procedure?
>>>
>>>
>>>
>>>
>>>
>>
> 
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


