[ClusterLabs] pacemaker after upgrade from wheezy to jessie
Toni Tschampke
tt at halle.it
Thu Nov 3 15:39:47 UTC 2016
> I'm going to guess you were using the experimental 1.1 schema as the
> "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
> changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
> you get better results. Don't edit the file directly though; use the
> cibadmin command so it signs the end result properly.
>
> After changing the validate-with, run:
>
> crm_verify -x /var/lib/pacemaker/cib/cib.xml
>
> and fix any errors that show up.
Strange, the location of our cib.xml differs from your path; our CIB is
located in /var/lib/heartbeat/crm/.
Running cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'
gave no output, but the change was logged in the corosync log:
cib: info: cib_perform_op: -- <cib num_updates="0"
validate-with="pacemaker-1.1"/>
cib: info: cib_perform_op: ++ <cib admin_epoch="0" epoch="8462"
num_updates="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6"
have-quorum="1" cib-last-written="Thu Nov 3 10:05:52 2016"
update-origin="nebel1" update-client="cibadmin" update-user="root"/>
I'm guessing this change should be written to the XML file immediately?
If so, something is wrong: grepping the file for "validate" still returns
the old string:
<cib admin_epoch="0" epoch="8462" num_updates="0"
validate-with="pacemaker-1.1" crm_feature_set="3.0.6" have-quorum="1"
cib-last-written="Thu Nov 3 16:19:51 2016" update-origin="nebel1"
update-client="cibadmin" update-user="root">
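The check described above (which schema does the on-disk CIB name, and does that schema file exist?) could be scripted roughly as follows. This is only a sketch: the function names cib_validate_with and schema_present are made up for illustration, the <name>.rng file layout is inferred from the error message quoted further down in this thread, and the matching is plain text matching rather than real XML parsing.

```shell
#!/bin/sh
# Print the validate-with value of the root <cib> element in a CIB XML file.
# Hypothetical helper: uses text matching, not an XML parser.
cib_validate_with() {
    # tr joins wrapped lines so a <cib ...> tag split over lines still matches
    tr '\n' ' ' < "$1" |
        sed -n 's/.*<cib [^>]*validate-with="\([^"]*\)".*/\1/p'
}

# Succeed only if the schema named by the CIB exists in the schema directory.
# Assumption: schemas are stored as <schema-dir>/<validate-with>.rng
schema_present() {
    schema=$(cib_validate_with "$1")
    [ -n "$schema" ] && [ -f "$2/$schema.rng" ]
}
```

With the paths from this thread, that would be called as cib_validate_with /var/lib/heartbeat/crm/cib.xml and schema_present /var/lib/heartbeat/crm/cib.xml /usr/share/pacemaker. Saving the output of cibadmin --query to a file and running the same function on it would show whether the live CIB and the file on disk disagree.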
pacemakerd --features
Pacemaker 1.1.15 (Build: e174ec8)
Supporting v3.0.10:
Should the crm_feature_set be updated this way too? I'm guessing this is
done when "cibadmin --upgrade" succeeds?
We just get a timeout error when trying to upgrade it with cibadmin:
Call cib_upgrade failed (-62): Timer expired
Have the permissions changed between 1.1.7 and 1.1.15? Looking at our
quite big /var/lib/heartbeat/crm/ folder, the group and mode of some files changed:
-rw------- 1 hacluster root 80K Nov 1 16:56 cib-31.raw
-rw-r--r-- 1 hacluster root 32 Nov 1 16:56 cib-31.raw.sig
-rw------- 1 hacluster haclient 80K Nov 1 18:53 cib-32.raw
-rw------- 1 hacluster haclient 32 Nov 1 18:53 cib-32.raw.sig
cib-31 was written before the upgrade, cib-32 after starting the upgraded pacemaker.
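To spot files like cib-31 in a large directory, something along these lines might help. It is a sketch only: list_wrong_owner is a made-up helper name, it relies on GNU stat (-c) as shipped on Debian, and the expected hacluster:haclient pair is simply taken from the post-upgrade files in the listing above.

```shell
#!/bin/sh
# List entries in a directory whose owner:group differs from the expected pair.
# Hypothetical helper; requires GNU stat for the -c format option.
list_wrong_owner() {
    dir=$1
    expected=$2    # e.g. hacluster:haclient
    for f in "$dir"/*; do
        [ -e "$f" ] || continue            # skip if the glob matched nothing
        actual=$(stat -c '%U:%G' "$f")
        [ "$actual" = "$expected" ] || printf '%s %s\n' "$actual" "$f"
    done
}
```

For the directory in question this would be run as list_wrong_owner /var/lib/heartbeat/crm hacluster:haclient. It only reports; whether the old files need to be changed at all is a separate question.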
--
Kind regards
Toni Tschampke | tt at halle.it
bcs kommunikationslösungen
Owner: Dipl.-Ing. Carsten Burkhardt
Harz 51 | 06108 Halle (Saale) | Germany
tel +49 345 29849-0 | fax +49 345 29849-22
www.b-c-s.de | www.halle.it | www.wivewa.de
On 03.11.2016 at 15:39, Ken Gaillot wrote:
> On 11/03/2016 05:51 AM, Toni Tschampke wrote:
>> Hi,
>>
>> we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
>> (pacemaker 1.1.15, corosync 2.3.6).
>> During the upgrade pacemaker was removed (rc) and afterwards reinstalled
>> from jessie-backports, same for crmsh.
>>
>> Now we are encountering multiple problems:
>>
>> First I checked the configuration on a single node running pacemaker &
>> corosync which dropped a strange error, followed by multiple lines
>> stating syntax is wrong. crm configure show then showed up a mixed view
>> of xml and crmsh singleline syntax.
>>
>>> ERROR: Cannot read schema file
>> '/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
>> directory: '/usr/share/pacemaker/pacemaker-1.1.rng'
>
> pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker 1.1.12,
> as it was used to hold experimental new features rather than as the
> actual next version of the schema. So, the schema skipped to 1.2.
>
> I'm going to guess you were using the experimental 1.1 schema as the
> "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
> changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
> you get better results. Don't edit the file directly though; use the
> cibadmin command so it signs the end result properly.
>
> After changing the validate-with, run:
>
> crm_verify -x /var/lib/pacemaker/cib/cib.xml
>
> and fix any errors that show up.
>
>> When we looked into that folder there was pacemaker-1.0.rng, 1.2 and so
>> on. As a quick try we symlinked the 1.2 to 1.1 and the syntax errors
>> were gone. When running crm resource show, all resources showed up, when
>> running crm_mon -1fA the output was unexpected as it showed all nodes
>> offline, with no DC elected:
>>
>>> Stack: corosync
>>> Current DC: NONE
>>> Last updated: Thu Nov 3 11:11:16 2016
>>> Last change: Thu Nov 3 09:54:52 2016 by root via cibadmin on nebel1
>>>
>>> *** Resource management is DISABLED ***
>>> The cluster will not attempt to start, stop or recover services
>>>
>>> 3 nodes and 73 resources configured:
>>> 5 resources DISABLED and 0 BLOCKED from being started due to failures
>>>
>>> OFFLINE: [ nebel1 nebel2 nebel3 ]
>>
>> we tried to manually change dc-version
>>
>> when issuing a simple cleanup command I got the following error:
>>
>>> crm resource cleanup DrbdBackuppcMs
>>> Error signing on to the CRMd service
>>> Error performing operation: Transport endpoint is not connected
>>
>> which looks like crmsh is not able to communicate with crmd and nothing
>> is logged in this case in corosync.log
>>
>> we experimented with multiple config changes (corosync.conf: pacemaker
>> ver 0 > 1)
>> cib-bootstrap-options: cluster-infrastructure from openais to corosync
>>
>>> Package versions:
>>> cman 3.1.8-1.2+b1
>>> corosync 2.3.6-3~bpo8+1
>>> crmsh 2.2.0-1~bpo8+1
>>> csync2 1.34-2.3+b1
>>> dlm-pcmk 3.0.12-3.2+deb7u2
>>> libcman3 3.1.8-1.2+b1
>>> libcorosync-common4:amd64 2.3.6-3~bpo8+1
>>> munin-libvirt-plugins 0.0.6-1
>>> pacemaker 1.1.15-2~bpo8+1
>>> pacemaker-cli-utils 1.1.15-2~bpo8+1
>>> pacemaker-common 1.1.15-2~bpo8+1
>>> pacemaker-resource-agents 1.1.15-2~bpo8+1
>>
>>> Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
>>
>> I attached our cib before upgrade and after, as well as the one with the
>> mixed syntax and our corosync.conf.
>>
>> When we tried to connect a second node to the cluster, pacemaker starts
>> its daemons, starts corosync and dies after 15 tries with the following in
>> the corosync log:
>>
>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>> crmd: info: do_cib_control: Could not connect to the CIB service:
>>> Transport endpoint is not connected
>>> crmd: warning: do_cib_control:
>>> Couldn't complete CIB registration 15 times... pause and retry
>>> attrd: error: attrd_cib_connect: Signon to CIB failed:
>>> Transport endpoint is not connected (-107)
>>> attrd: info: main: Shutting down attribute manager
>>> attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>> attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>> pacemakerd: warning: pcmk_child_exit:
>>> The attrd process (12761) can no longer be respawned,
>>> shutting the cluster down.
>>> pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
>>
>> A third node joins without above error, but crm_mon still shows all
>> nodes as offline.
>>
>> Thanks for any advice how to solve this, I'm out of ideas now.
>>
>> Regards, Toni
>>
>