[ClusterLabs] Upgrade corosync problem

Christine Caulfield ccaulfie at redhat.com
Fri Jun 22 06:10:26 EDT 2018


On 22/06/18 10:39, Salvatore D'angelo wrote:
> Hi,
> 
> Can you tell me exactly which log you need. I’ll provide you as soon as possible.
> 
> Regarding some settings, I am not the original author of this cluster. The people who created it left the company I am working with, and I inherited the code, so sometimes I do not know why some settings are used.
> The old versions of pacemaker, corosync, crmsh and resource agents were compiled and installed.
> I simply downloaded the new versions, compiled and installed them. I didn't get any complaint from ./configure, which usually checks for library compatibility.
> 
> To be honest I do not know if this is the right approach. Should I "make uninstall" the old versions before installing the new ones?
> Which is the suggested approach?
> Thanks in advance for your help.
> 

OK fair enough!

To be honest the best approach is almost always to get the latest
packages from the distributor rather than compile from source. That way
you can be more confident that upgrades will go smoothly. Though, to be
honest, I'm not sure how good the Ubuntu packages are (they might be
great, they might not, I genuinely don't know)

When building from source and if you don't know the provenance of the
previous version then I would recommend a 'make uninstall' first - or
removal of the packages if that's where they came from.
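Roughly something like this, assuming the old source trees are still on
the node and were configured with the same --prefix (the paths here are
only examples, use wherever you built them from):

# remove what the old source builds installed
cd /usr/src/corosync-2.3.5 && make uninstall
cd /usr/src/pacemaker-1.1.14 && make uninstall
# then ./configure, make and make install the new releases as before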

One thing you should do is make sure that all the cluster nodes are
running the same version. If some are running older versions then nodes
could drop out for obscure reasons. We try and keep minor versions
on-wire compatible but it's always best to be cautious.
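A quick way to check is something along these lines, assuming the node
names from your nodelist and working ssh between the nodes:

# print the corosync and pacemaker versions on every node
for h in pg1 pg2 pg3; do
        echo "== $h =="
        ssh "$h" 'corosync -v; pacemakerd --version'
done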

The tidying of your corosync.conf can wait for the moment, let's get
things mostly working first. If you enable debug logging in corosync.conf:

logging {
        to_syslog: yes
        debug: on
}

Then see what happens and post the syslog file that has all of the
corosync messages in it, and we'll take it from there.
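To pull the corosync messages out of syslog for posting, something like
this works (assuming syslog ends up in /var/log/syslog on your Ubuntu
nodes; the full file is fine too):

# collect everything corosync logged around the failed start
grep -i corosync /var/log/syslog > corosync-debug.log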

Chrissie

>> On 22 Jun 2018, at 11:30, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>
>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>> Hi Christine,
>>>
>>> Thanks for reply. Let me add few details. When I run the corosync
>>> service I se the corosync process running. If I stop it and run:
>>>
>>> corosync -f 
>>>
>>> I see three warnings:
>>> warning [MAIN  ] interface section bindnetaddr is used together with
>>> nodelist. Nodelist one is going to be used.
>>> warning [MAIN  ] Please migrate config file to nodelist.
>>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>>> permitted (1)
>>> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>>>
>>> but I see node joined.
>>>
>>
>> Those certainly need fixing but are probably not the cause. Also why do
>> you have these values below set?
>>
>> max_network_delay: 100
>> retransmits_before_loss_const: 25
>> window_size: 150
>>
>> I'm not saying they are causing the trouble, but they aren't going to
>> help keep a stable cluster.
>>
>> Without more logs (full logs are always better than just the bits you
>> think are meaningful) I still can't be sure. it could easily be just
>> that you've overwritten a packaged version of corosync with your own
>> compiled one and they have different configure options or that the
>> libraries now don't match.
>>
>> Chrissie
>>
>>
>>> My corosync.conf file is below.
>>>
>>> With service corosync up and running I have the following output:
>>> corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 1
>>> RING ID 0
>>> id= 10.0.0.11
>>> status= ring 0 active with no faults
>>> RING ID 1
>>> id= 192.168.0.11
>>> status= ring 1 active with no faults
>>>
>>> corosync-cmapctl | grep members
>>> runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
>>> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1)
>>> ip(192.168.0.11)
>>> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
>>> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
>>> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
>>> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1)
>>> ip(192.168.0.12)
>>> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
>>> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
>>>
>>> For the moment I have two nodes in my cluster (the third node had some
>>> issues and at the moment I did crm node standby on it).
>>>
>>> Here are the dependencies I have installed for corosync (they work fine
>>> with pacemaker 1.1.14 and corosync 2.3.5):
>>>      libnspr4-dev_2%3a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>      libnspr4_2%3a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>      libnss3-dev_2%3a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>      libnss3-nssdb_2%3a3.19.2.1-0ubuntu0.14.04.2_all.deb
>>>      libnss3_2%3a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>      libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>>>      libqb0_0.16.0.real-1ubuntu4_amd64.deb
>>>
>>> corosync.conf
>>> ---------------------
>>> quorum {
>>>         provider: corosync_votequorum
>>>         expected_votes: 3
>>> }
>>> totem {
>>>         version: 2
>>>         crypto_cipher: none
>>>         crypto_hash: none
>>>         rrp_mode: passive
>>>         interface {
>>>                 ringnumber: 0
>>>                 bindnetaddr: 10.0.0.0
>>>                 mcastport: 5405
>>>                 ttl: 1
>>>         }
>>>         interface {
>>>                 ringnumber: 1
>>>                 bindnetaddr: 192.168.0.0
>>>                 mcastport: 5405
>>>                 ttl: 1
>>>         }
>>>         transport: udpu
>>>         max_network_delay: 100
>>>         retransmits_before_loss_const: 25
>>>         window_size: 150
>>> }
>>> nodelist {
>>>         node {
>>>                 ring0_addr: pg1
>>>                 ring1_addr: pg1p
>>>                 nodeid: 1
>>>         }
>>>         node {
>>>                 ring0_addr: pg2
>>>                 ring1_addr: pg2p
>>>                 nodeid: 2
>>>         }
>>>         node {
>>>                 ring0_addr: pg3
>>>                 ring1_addr: pg3p
>>>                 nodeid: 3
>>>         }
>>> }
>>> logging {
>>>         to_syslog: yes
>>> }
>>>
>>>
>>>
>>>
>>>> On 22 Jun 2018, at 09:24, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>
>>>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>>>> Hi,
>>>>>
>>>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>>>>> Pacemaker 1.1.14 -> 1.1.18
>>>>> Corosync 2.3.5 -> 2.4.4
>>>>> Crmsh 2.2.0 -> 3.0.1
>>>>> Resource agents 3.9.7 -> 4.1.1
>>>>>
>>>>> I started on the first node (I am trying a one-node-at-a-time upgrade).
>>>>> On a PostgreSQL slave node I did:
>>>>>
>>>>> crm node standby <node>
>>>>> service pacemaker stop
>>>>> service corosync stop
>>>>>
>>>>> Then I built the tools above as described on their GitHub.com pages.
>>>>>
>>>>> ./autogen.sh (where required)
>>>>> ./configure
>>>>> make (where required)
>>>>> make install
>>>>>
>>>>> Everything went ok. I expect the new files overwrote the old ones. I left
>>>>> the dependencies I had from the old software because I noticed ./configure
>>>>> didn't complain.
>>>>> I started corosync.
>>>>>
>>>>> service corosync start
>>>>>
>>>>> To verify corosync work properly I used the following commands:
>>>>> corosync-cfgtool -s
>>>>> corosync-cmapctl | grep members
>>>>>
>>>>> Everything seemed ok and I verified my node joined the cluster (at least
>>>>> this is my impression).
>>>>>
>>>>> Here I verified a problem. Doing the command:
>>>>> corosync-quorumtool -ps
>>>>>
>>>>> I got the following problem:
>>>>> Cannot initialise CFG service
>>>>>
>>>> That says that corosync is not running. Have a look in the log files to
>>>> see why it stopped. The pacemaker logs below are showing the same thing,
>>>> but we can't make any more guesses until we see what corosync itself is
>>>> doing. Enabling debug in corosync.conf will also help if more detail is
>>>> needed.
>>>>
>>>> Also starting corosync with 'corosync -pf' on the command-line is often
>>>> a quick way of checking things are starting OK.
>>>>
>>>> Chrissie
>>>>
>>>>
>>>>> If I try to start pacemaker, I only see the pacemaker process running, and
>>>>> pacemaker.log contains the following lines:
>>>>>
>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: crm_log_init:Changed
>>>>> active directory to /var/lib/pacemaker/cores
>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
>>>>> get_cluster_type:Detected an active 'corosync' cluster
>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
>>>>> mcp_read_config:Reading configure for stack: corosync
>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:   notice: main:Starting
>>>>> Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc
>>>>> lha-fencing nagios  corosync-native atomic-attrd acls
>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: main:Maximum core
>>>>> file size is: 18446744073709551615
>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info:
>>>>> qb_ipcs_us_publish:server name: pacemakerd
>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:  warning:
>>>>> corosync_node_name:Could not connect to Cluster Configuration Database
>>>>> API, error CS_ERR_TRY_AGAIN
>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
>>>>> corosync_node_name:Unable to get node name for nodeid 1
>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:   notice: get_node_name:Could
>>>>> not obtain a node name for corosync nodeid 1
>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Created
>>>>> entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node
>>>>> (null)/1 (1 total)
>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Node 1
>>>>> has uuid 1
>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
>>>>> crm_update_peer_proc:cluster_connect_cpg: Node (null)[1] - corosync-cpg
>>>>> is now online
>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:    error:
>>>>> cluster_connect_quorum:Could not connect to the Quorum API: 2
>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
>>>>> qb_ipcs_us_withdraw:withdrawing server sockets
>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: main:Exiting
>>>>> pacemakerd
>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info:
>>>>> crm_xml_cleanup:Cleaning up memory from libxml2
>>>>>
>>>>> What is wrong in my procedure?
>>>>>
>>>>>
>>>>>


