[ClusterLabs] Upgrade corosync problem

Fri Jun 22 06:23:20 EDT 2018

Hi,
Here is the log:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log
Type: application/octet-stream
Size: 22676 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180622/594ef1f7/attachment-0002.obj>
-------------- next part --------------


> On 22 Jun 2018, at 12:10, Christine Caulfield <ccaulfie at redhat.com> wrote:
> 
> On 22/06/18 10:39, Salvatore D'angelo wrote:
>> Hi,
>> 
>> Can you tell me exactly which log you need? I'll provide it as soon as possible.
>> 
>> Regarding some of the settings, I am not the original author of this cluster. The people who created it have left the company I work for; I inherited the code, and sometimes I do not know why certain settings are used.
>> The old versions of pacemaker, corosync, crmsh and resource agents were compiled and installed.
>> I simply downloaded the new versions, compiled and installed them. I didn't get any complaints from ./configure, which usually checks for library compatibility.
>> 
>> To be honest, I do not know if this is the right approach. Should I "make uninstall" the old versions before installing the new ones?
>> Which approach do you suggest?
>> Thanks in advance for your help.
>> 
> 
> OK fair enough!
> 
> To be honest the best approach is almost always to get the latest
> packages from the distributor rather than compile from source. That way
> you can be more sure that upgrades will go more smoothly. Though, to be
> honest, I'm not sure how good the Ubuntu packages are (they might be
> great, they might not, I genuinely don't know)
> 
> When building from source and if you don't know the provenance of the
> previous version then I would recommend a 'make uninstall' first - or
> removal of the packages if that's where they came from.
> 
> One thing you should do is make sure that all the cluster nodes are
> running the same version. If some are running older versions then nodes
> could drop out for obscure reasons. We try and keep minor versions
> on-wire compatible but it's always best to be cautious.
> 
> The tidying of your corosync.conf can wait for the moment, let's get
> things mostly working first. If you enable debug logging in corosync.conf:
> 
> logging {
>        to_syslog: yes
>        debug: on
> }
> 
> Then see what happens and post the syslog file that has all of the
> corosync messages in it, we'll take it from there.
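
Once debug logging is enabled, the corosync messages have to be separated from everything else in the syslog file before posting. The following is only a minimal sketch of that filtering step, not something from the thread: the function name `corosync_lines` is hypothetical, and it assumes the usual syslog layout in which the program tag appears as `corosync[PID]:` or `corosync:`.

```python
import re

# Matches the syslog program tag, e.g. " corosync[1234]:" or " corosync:".
# Assumption: the standard "host program[pid]: message" syslog layout.
COROSYNC_TAG = re.compile(r"\scorosync(\[\d+\])?:")


def corosync_lines(lines):
    """Return only the lines whose syslog program tag is corosync."""
    return [line for line in lines if COROSYNC_TAG.search(line)]
```

It could be used as `corosync_lines(open("/var/log/syslog", errors="replace"))`, where the path is an assumption (the usual Ubuntu location) and may differ on other systems. Posting the full log is still preferable if there is any doubt about what was filtered out.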
> 
> Chrissie
> 
>>> On 22 Jun 2018, at 11:30, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>> 
>>> On 22/06/18 10:14, Salvatore D'angelo wrote:
>>>> Hi Christine,
>>>> 
>>>> Thanks for the reply. Let me add a few details. When I start the corosync
>>>> service I see the corosync process running. If I stop it and run:
>>>> 
>>>> corosync -f 
>>>> 
>>>> I see these warnings:
>>>> warning [MAIN  ] interface section bindnetaddr is used together with
>>>> nodelist. Nodelist one is going to be used.
>>>> warning [MAIN  ] Please migrate config file to nodelist.
>>>> warning [MAIN  ] Could not set SCHED_RR at priority 99: Operation not
>>>> permitted (1)
>>>> warning [MAIN  ] Could not set priority -2147483648: Permission denied (13)
>>>> 
>>>> but I see node joined.
>>>> 
>>> 
>>> Those certainly need fixing but are probably not the cause. Also why do
>>> you have these values below set?
>>> 
>>> max_network_delay: 100
>>> retransmits_before_loss_const: 25
>>> window_size: 150
>>> 
>>> I'm not saying they are causing the trouble, but they aren't going to
>>> help keep a stable cluster.
>>> 
>>> Without more logs (full logs are always better than just the bits you
>>> think are meaningful) I still can't be sure. It could easily be just
>>> that you've overwritten a packaged version of corosync with your own
>>> compiled one and they have different configure options or that the
>>> libraries now don't match.
>>> 
>>> Chrissie
>>> 
>>> 
>>>> My corosync.conf file is below.
>>>> 
>>>> With service corosync up and running I have the following output:
>>>> *corosync-cfgtool -s*
>>>> Printing ring status.
>>>> Local node ID 1
>>>> RING ID 0
>>>> id= 10.0.0.11
>>>> status= ring 0 active with no faults
>>>> RING ID 1
>>>> id= 192.168.0.11
>>>> status= ring 1 active with no faults
>>>> 
>>>> *corosync-cmapctl | grep members*
>>>> runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
>>>> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.0.0.11) r(1) ip(192.168.0.11)
>>>> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
>>>> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
>>>> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
>>>> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.0.0.12) r(1) ip(192.168.0.12)
>>>> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
>>>> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
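
Reading the membership output by eye works for two nodes but is easy to get wrong with more. As an illustration only, the check can be scripted. This sketch assumes the key layout seen in the quoted `corosync-cmapctl | grep members` output (`runtime.totem.pg.mrp.srp.members.<nodeid>.status`); the function names `member_status` and `missing_nodes` are hypothetical, and any `*` emphasis markers added by a mail client are stripped first.

```python
import re

# Key layout taken from the quoted `corosync-cmapctl | grep members` output.
STATUS_KEY = re.compile(
    r"runtime\.totem\.pg\.mrp\.srp\.members\.(\d+)\.status \(str\) = (\S+)"
)


def member_status(cmapctl_output):
    """Map node ID -> status string (for example "joined")."""
    # Drop '*' emphasis markers that a mail client may have inserted.
    text = cmapctl_output.replace("*", "")
    return {int(node): status for node, status in STATUS_KEY.findall(text)}


def missing_nodes(cmapctl_output, expected_ids):
    """Return the expected node IDs that are not reported as joined."""
    status = member_status(cmapctl_output)
    return sorted(n for n in expected_ids if status.get(n) != "joined")
```

Given the output quoted above and the node IDs 1, 2 and 3 from the nodelist, `missing_nodes` would return `[3]`, because only nodes 1 and 2 appear as joined.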
>>>> 
>>>> For the moment I have two nodes in my cluster (the third node has some
>>>> issues, and for now I have run crm node standby on it).
>>>> 
>>>> Here are the dependencies I have installed for corosync (they work fine
>>>> with pacemaker 1.1.14 and corosync 2.3.5):
>>>>     libnspr4-dev_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>>     libnspr4_2%253a4.10.10-0ubuntu0.14.04.1_amd64.deb
>>>>     libnss3-dev_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>>     libnss3-nssdb_2%253a3.19.2.1-0ubuntu0.14.04.2_all.deb
>>>>     libnss3_2%253a3.19.2.1-0ubuntu0.14.04.2_amd64.deb
>>>>     libqb-dev_0.16.0.real-1ubuntu4_amd64.deb
>>>>     libqb0_0.16.0.real-1ubuntu4_amd64.deb
>>>> 
>>>> *corosync.conf*
>>>> ---------------------
>>>> quorum {
>>>>        provider: corosync_votequorum
>>>>        expected_votes: 3
>>>> }
>>>> totem {
>>>>        version: 2
>>>>        crypto_cipher: none
>>>>        crypto_hash: none
>>>>        rrp_mode: passive
>>>>        interface {
>>>>                ringnumber: 0
>>>>                bindnetaddr: 10.0.0.0
>>>>                mcastport: 5405
>>>>                ttl: 1
>>>>        }
>>>>        interface {
>>>>                ringnumber: 1
>>>>                bindnetaddr: 192.168.0.0
>>>>                mcastport: 5405
>>>>                ttl: 1
>>>>        }
>>>>        transport: udpu
>>>>        max_network_delay: 100
>>>>        retransmits_before_loss_const: 25
>>>>        window_size: 150
>>>> }
>>>> nodelist {
>>>>        node {
>>>>                ring0_addr: pg1
>>>>                ring1_addr: pg1p
>>>>                nodeid: 1
>>>>        }
>>>>        node {
>>>>                ring0_addr: pg2
>>>>                ring1_addr: pg2p
>>>>                nodeid: 2
>>>>        }
>>>>        node {
>>>>                ring0_addr: pg3
>>>>                ring1_addr: pg3p
>>>>                nodeid: 3
>>>>        }
>>>> }
>>>> logging {
>>>>        to_syslog: yes
>>>> }
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 22 Jun 2018, at 09:24, Christine Caulfield <ccaulfie at redhat.com> wrote:
>>>>> 
>>>>> On 21/06/18 16:16, Salvatore D'angelo wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I upgraded my PostgreSQL/Pacemaker cluster with these versions.
>>>>>> Pacemaker 1.1.14 -> 1.1.18
>>>>>> Corosync 2.3.5 -> 2.4.4
>>>>>> Crmsh 2.2.0 -> 3.0.1
>>>>>> Resource agents 3.9.7 -> 4.1.1
>>>>>> 
>>>>>> I started with the first node (I am upgrading one node at a time).
>>>>>> On a PostgreSQL slave node I ran:
>>>>>> 
>>>>>> *crm node standby <node>*
>>>>>> *service pacemaker stop*
>>>>>> *service corosync stop*
>>>>>> 
>>>>>> Then I built the tools above as described on their GitHub.com pages.
>>>>>> 
>>>>>> *./autogen.sh (where required)*
>>>>>> *./configure*
>>>>>> *make (where required)*
>>>>>> *make install*
>>>>>> 
>>>>>> Everything went OK. I expected the new files to overwrite the old ones.
>>>>>> I left the dependencies from the old software in place because I noticed
>>>>>> that ./configure didn't complain.
>>>>>> I started corosync.
>>>>>> 
>>>>>> *service corosync start*
>>>>>> 
>>>>>> To verify that corosync works properly I used the following commands:
>>>>>> *corosync-cfgtool -s*
>>>>>> *corosync-cmapctl | grep members*
>>>>>> 
>>>>>> Everything seemed ok and I verified my node joined the cluster (at least
>>>>>> this is my impression).
>>>>>> 
>>>>>> Here I noticed a problem. Running the command:
>>>>>> corosync-quorumtool -ps
>>>>>> 
>>>>>> I got the following error:
>>>>>> Cannot initialise CFG service
>>>>>> 
>>>>> That says that corosync is not running. Have a look in the log files to
>>>>> see why it stopped. The pacemaker logs below are showing the same thing,
>>>>> but we can't make any more guesses until we see what corosync itself is
>>>>> doing. Enabling debug in corosync.conf will also help if more detail is
>>>>> needed.
>>>>> 
>>>>> Also starting corosync with 'corosync -pf' on the command-line is often
>>>>> a quick way of checking things are starting OK.
>>>>> 
>>>>> Chrissie
>>>>> 
>>>>> 
>>>>>> If I try to start pacemaker, I only see the pacemaker process running, and
>>>>>> pacemaker.log contains the following lines:
>>>>>> 
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: crm_log_init:Changed active directory to /var/lib/pacemaker/cores
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: get_cluster_type:Detected an active 'corosync' cluster
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: mcp_read_config:Reading configure for stack: corosync
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:   notice: main:Starting Pacemaker 1.1.18 | build=2b07d5c5a9 features: libqb-logging libqb-ipc lha-fencing nagios  corosync-native atomic-attrd acls
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: main:Maximum core file size is: 18446744073709551615
>>>>>> Jun 21 15:09:38 [17115] pg1 pacemakerd:     info: qb_ipcs_us_publish:server name: pacemakerd
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:  warning: corosync_node_name:Could not connect to Cluster Configuration Database API, error CS_ERR_TRY_AGAIN
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: corosync_node_name:Unable to get node name for nodeid 1
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:   notice: get_node_name:Could not obtain a node name for corosync nodeid 1
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Created entry 1aeef8ac-643b-44f7-8ce3-d82bbf40bbc1/0x557dc7f05d30 for node (null)/1 (1 total)
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_get_peer:Node 1 has uuid 1
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_update_peer_proc:cluster_connect_cpg: Node (null)[1] - corosync-cpg is now online
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:    error: cluster_connect_quorum:Could not connect to the Quorum API: 2
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: qb_ipcs_us_withdraw:withdrawing server sockets
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: main:Exiting pacemakerd
>>>>>> Jun 21 15:09:53 [17115] pg1 pacemakerd:     info: crm_xml_cleanup:Cleaning up memory from libxml2
>>>>>> 
>>>>>> *What is wrong in my procedure?*
>>>>>> 
>>>>>> 
>>>>>> 