[ClusterLabs] pacemaker after upgrade from wheezy to jessie

Thu Nov 3 15:39:47 UTC 2016

 > I'm going to guess you were using the experimental 1.1 schema as the
 > "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
 > changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
 > you get better results. Don't edit the file directly though; use the
 > cibadmin command so it signs the end result properly.
 >
 > After changing the validate-with, run:
 >
 >    crm_verify -x /var/lib/pacemaker/cib/cib.xml
 >
 > and fix any errors that show up.

Strange, the location of our cib.xml differs from your path; our CIB is
located in /var/lib/heartbeat/crm/.

Running

   cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'

gave no output, but the change was logged in corosync.log:

cib:     info: cib_perform_op:    -- <cib num_updates="0" validate-with="pacemaker-1.1"/>
cib:     info: cib_perform_op:    ++ <cib admin_epoch="0" epoch="8462" num_updates="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" have-quorum="1" cib-last-written="Thu Nov  3 10:05:52 2016" update-origin="nebel1" update-client="cibadmin" update-user="root"/>

I'm guessing this change should be written to the XML file immediately?
If so, something is wrong: grepping the file for validate still returns
the old string.

<cib admin_epoch="0" epoch="8462" num_updates="0" 
validate-with="pacemaker-1.1" crm_feature_set="3.0.6" have-quorum="1" 
cib-last-written="Thu Nov  3 16:19:51 2016" update-origin="nebel1" 
update-client="cibadmin" update-user="root">
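
To compare the schema recorded on disk with the one the running cib
daemon reports, a minimal sketch could look like the following. The
helper name cib_schema is hypothetical; it only pulls the validate-with
attribute out of the <cib ...> root element read from standard input,
and it assumes the input contains exactly one such element.

```shell
#!/bin/sh
# Hypothetical helper: print the validate-with value of the <cib ...> element.
# Newlines are folded into spaces first, because the attributes of the root
# element may be wrapped across several lines.
cib_schema() {
    tr '\n' ' ' | sed -n 's/.*<cib [^>]*validate-with="\([^"]*\)".*/\1/p'
}
```

It could then be fed the on-disk copy and the live CIB in turn:

   cib_schema < /var/lib/heartbeat/crm/cib.xml
   cibadmin --query | cib_schema

If the two values differ, the change has not reached that file.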

pacemakerd --features
Pacemaker 1.1.15 (Build: e174ec8)
Supporting v3.0.10:

Should the crm_feature_set be updated this way too? I'm guessing this is 
done when "cibadmin --upgrade" succeeds?

We just get a timeout error when trying to upgrade it with cibadmin:
Call cib_upgrade failed (-62): Timer expired

Have permissions changed between 1.1.7 and 1.1.15? Looking at our
quite large /var/lib/heartbeat/crm/ folder, the ownership and mode of
some files changed:

-rw------- 1 hacluster root      80K Nov  1 16:56 cib-31.raw
-rw-r--r-- 1 hacluster root       32 Nov  1 16:56 cib-31.raw.sig
-rw------- 1 hacluster haclient  80K Nov  1 18:53 cib-32.raw
-rw------- 1 hacluster haclient   32 Nov  1 18:53 cib-32.raw.sig

cib-31 was written before the upgrade, cib-32 after starting the
upgraded pacemaker.
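
To see how many of the old files are affected, something like the
sketch below might help. It assumes that hacluster:haclient, as seen on
cib-32, is the intended ownership, and it relies on GNU stat (-c) as
shipped on Debian. The function name list_ownership_mismatch is made up
for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "owner:group path" for every regular file
# directly under DIR whose owner:group differs from WANT.
# Relies on GNU stat's -c option.
list_ownership_mismatch() {
    dir=$1
    want=$2
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip directories and unmatched globs
        have=$(stat -c '%U:%G' "$f")
        [ "$have" = "$want" ] || printf '%s %s\n' "$have" "$f"
    done
}
```

For our directory the call would be:

   list_ownership_mismatch /var/lib/heartbeat/crm hacluster:haclient

which should list cib-31.raw and cib-31.raw.sig, since their group is
still root.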

--
Kind regards

Toni Tschampke | tt at halle.it
bcs kommunikationslösungen
Inh. Dipl. Ing. Carsten Burkhardt
Harz 51 | 06108 Halle (Saale) | Germany
tel +49 345 29849-0 | fax +49 345 29849-22
www.b-c-s.de | www.halle.it | www.wivewa.de


On 03.11.2016 at 15:39, Ken Gaillot wrote:
> On 11/03/2016 05:51 AM, Toni Tschampke wrote:
>> Hi,
>>
>> we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
>> (pacemaker 1.1.15, corosync 2.3.6).
>> During the upgrade, pacemaker was removed (rc) and reinstalled afterwards
>> from jessie-backports; the same goes for crmsh.
>>
>> Now we are encountering multiple problems:
>>
>> First I checked the configuration on a single node running pacemaker &
>> corosync which dropped a strange error, followed by multiple lines
>> stating syntax is wrong. crm configure show then showed up a mixed view
>> of xml and crmsh singleline syntax.
>>
>>> ERROR: Cannot read schema file
>>> '/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
>>> directory: '/usr/share/pacemaker/pacemaker-1.1.rng'
>
> pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker 1.1.12,
> as it was used to hold experimental new features rather than as the
> actual next version of the schema. So, the schema skipped to 1.2.
>
> I'm going to guess you were using the experimental 1.1 schema as the
> "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
> changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
> you get better results. Don't edit the file directly though; use the
> cibadmin command so it signs the end result properly.
>
> After changing the validate-with, run:
>
>    crm_verify -x /var/lib/pacemaker/cib/cib.xml
>
> and fix any errors that show up.
>
>> When we looked into that folder there was pacemaker-1.0.rng, 1.2 and so
>> on. As a quick try we symlinked the 1.2 to 1.1 and the syntax errors
>> were gone. When running crm resource show, all resources showed up. When
>> running crm_mon -1fA, the output was unexpected: it showed all nodes
>> offline, with no DC elected:
>>
>>> Stack: corosync
>>> Current DC: NONE
>>> Last updated: Thu Nov  3 11:11:16 2016
>>> Last change: Thu Nov  3 09:54:52 2016 by root via cibadmin on nebel1
>>>
>>>               *** Resource management is DISABLED ***
>>>   The cluster will not attempt to start, stop or recover services
>>>
>>> 3 nodes and 73 resources configured:
>>> 5 resources DISABLED and 0 BLOCKED from being started due to failures
>>>
>>> OFFLINE: [ nebel1 nebel2 nebel3 ]
>>
>> we tried to manually change dc-version
>>
>> when issuing a simple cleanup command I got the following error:
>>
>>> crm resource cleanup DrbdBackuppcMs
>>> Error signing on to the CRMd service
>>> Error performing operation: Transport endpoint is not connected
>>
>> which looks like crmsh is not able to communicate with crmd and nothing
>> is logged in this case in corosync.log
>>
>> We experimented with several config changes: in corosync.conf, the
>> pacemaker ver from 0 to 1; in cib-bootstrap-options,
>> cluster-infrastructure from openais to corosync.
>>
>>> Package versions:
>>> cman 3.1.8-1.2+b1
>>> corosync 2.3.6-3~bpo8+1
>>> crmsh 2.2.0-1~bpo8+1
>>> csync2 1.34-2.3+b1
>>> dlm-pcmk 3.0.12-3.2+deb7u2
>>> libcman3 3.1.8-1.2+b1
>>> libcorosync-common4:amd64 2.3.6-3~bpo8+1
>>> munin-libvirt-plugins 0.0.6-1
>>> pacemaker 1.1.15-2~bpo8+1
>>> pacemaker-cli-utils 1.1.15-2~bpo8+1
>>> pacemaker-common 1.1.15-2~bpo8+1
>>> pacemaker-resource-agents 1.1.15-2~bpo8+1
>>
>>> Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
>>
>> I attached our cib before upgrade and after, as well as the one with the
>> mixed syntax and our corosync.conf.
>>
>> When we tried to connect a second node to the cluster, pacemaker starts
>> its daemons, starts corosync, and dies after 15 tries with the following
>> in the corosync log:
>>
>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>> crmd: info: do_cib_control: Could not connect to the CIB service:
>>> Transport endpoint is not connected
>>> crmd:  warning: do_cib_control:
>>> Couldn't complete CIB registration 15 times... pause and retry
>>> attrd: error: attrd_cib_connect: Signon to CIB failed:
>>> Transport endpoint is not connected (-107)
>>> attrd: info: main: Shutting down attribute manager
>>> attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>> attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>> pacemakerd:  warning: pcmk_child_exit:
>>> The attrd process (12761) can no longer be respawned,
>>> shutting the cluster down.
>>> pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
>>
>> A third node joins without above error, but crm_mon still shows all
>> nodes as offline.
>>
>> Thanks for any advice how to solve this, I'm out of ideas now.
>>
>> Regards, Toni
>>
>