[ClusterLabs] pacemaker after upgrade from wheezy to jessie

Thu Nov 3 10:39:07 EDT 2016

On 11/03/2016 05:51 AM, Toni Tschampke wrote:
> Hi,
> 
> we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
> (pacemaker 1.1.15, corosync 2.3.6).
> During the upgrade pacemaker was removed (rc) and reinstalled afterwards
> from jessie-backports; the same goes for crmsh.
> 
> Now we are encountering multiple problems:
> 
> First I checked the configuration on a single node running pacemaker &
> corosync, which produced a strange error followed by multiple lines
> stating that the syntax is wrong. crm configure show then showed a mixed
> view of XML and crmsh single-line syntax.
> 
>> ERROR: Cannot read schema file
>> '/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
>> directory: '/usr/share/pacemaker/pacemaker-1.1.rng'

pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker 1.1.12,
as it was used to hold experimental new features rather than as the
actual next version of the schema. As a result, the numbered schemas skip
from 1.0 straight to 1.2.

I'm going to guess you were using the experimental 1.1 schema as the
"validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
you get better results. Don't edit the file directly, though; use the
cibadmin command so that the end result is signed properly.
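
Before changing anything, it may help to confirm which schema the CIB
currently names and whether that schema file is actually installed. The
following is only a minimal read-only sketch: cib_validate_with and
schema_installed are hypothetical helper names, and the sed pattern
assumes the opening <cib> tag sits on a single line, as it normally does
in cib.xml.

```shell
#!/bin/sh
# Hypothetical helpers for inspection only; they never modify the CIB.

# Print the validate-with value from the opening <cib ...> tag of a CIB file.
# Assumes the whole opening tag is on one line.
cib_validate_with() {
    sed -n 's/.*<cib[^>]*[[:space:]]validate-with="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Succeed if <schema>.rng exists in the schema directory.
# $1 = schema name (e.g. pacemaker-1.2)
# $2 = schema directory (defaults to /usr/share/pacemaker, the path
#      shown in the error message)
schema_installed() {
    [ -f "${2:-/usr/share/pacemaker}/$1.rng" ]
}
```

For example, "cib_validate_with /var/lib/pacemaker/cib/cib.xml" would print
pacemaker-1.1 if the guess above is right, and "schema_installed
pacemaker-1.2" tells you whether the 1.2 schema is available as a target.
With the cluster running, the change itself can probably be made with
something along the lines of
cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>',
but check cibadmin --help on your version first.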

After changing the validate-with, run:

  crm_verify -x /var/lib/pacemaker/cib/cib.xml

and fix any errors that show up.

> When we looked into that folder there were pacemaker-1.0.rng, 1.2 and so
> on. As a quick try we symlinked 1.2 to 1.1 and the syntax errors
> were gone. Running crm resource show listed all resources, but
> crm_mon -1fA unexpectedly showed all nodes offline, with no DC
> elected:
> 
>> Stack: corosync
>> Current DC: NONE
>> Last updated: Thu Nov  3 11:11:16 2016
>> Last change: Thu Nov  3 09:54:52 2016 by root via cibadmin on nebel1
>>
>>              *** Resource management is DISABLED ***
>>  The cluster will not attempt to start, stop or recover services
>>
>> 3 nodes and 73 resources configured:
>> 5 resources DISABLED and 0 BLOCKED from being started due to failures
>>
>> OFFLINE: [ nebel1 nebel2 nebel3 ]
> 
> we tried to manually change dc-version
> 
> when issuing a simple cleanup command I got the following error:
> 
>> crm resource cleanup DrbdBackuppcMs
>> Error signing on to the CRMd service
>> Error performing operation: Transport endpoint is not connected
> 
> which looks like crmsh is not able to communicate with crmd and nothing
> is logged in this case in corosync.log
> 
> we experimented with multiple config changes (corosync.conf: pacemaker
> ver 0 > 1)
> cib-bootstrap-options: cluster-infrastructure from openais to corosync
> 
>> Package versions:
>> cman 3.1.8-1.2+b1
>> corosync 2.3.6-3~bpo8+1
>> crmsh 2.2.0-1~bpo8+1
>> csync2 1.34-2.3+b1
>> dlm-pcmk 3.0.12-3.2+deb7u2
>> libcman3 3.1.8-1.2+b1
>> libcorosync-common4:amd64 2.3.6-3~bpo8+1
>> munin-libvirt-plugins 0.0.6-1
>> pacemaker 1.1.15-2~bpo8+1
>> pacemaker-cli-utils 1.1.15-2~bpo8+1
>> pacemaker-common 1.1.15-2~bpo8+1
>> pacemaker-resource-agents 1.1.15-2~bpo8+1
> 
>> Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
> 
> I attached our cib before upgrade and after, as well as the one with the
> mixed syntax and our corosync.conf.
> 
> When we tried to connect a second node to the cluster, pacemaker started
> its daemons, started corosync and died after 15 tries with the following
> in the corosync log:
> 
>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>> crmd: info: do_cib_control: Could not connect to the CIB service:
>> Transport endpoint is not connected
>> crmd:  warning: do_cib_control:
>> Couldn't complete CIB registration 15 times... pause and retry
>> attrd: error: attrd_cib_connect: Signon to CIB failed:
>> Transport endpoint is not connected (-107)
>> attrd: info: main: Shutting down attribute manager
>> attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>> attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>> pacemakerd:  warning: pcmk_child_exit:
>> The attrd process (12761) can no longer be respawned,
>> shutting the cluster down.
>> pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
> 
> A third node joins without the above error, but crm_mon still shows all
> nodes as offline.
> 
> Thanks for any advice on how to solve this; I'm out of ideas now.
> 
> Regards, Toni
>