[ClusterLabs] pacemaker after upgrade from wheezy to jessie
Toni Tschampke
tt at halle.it
Thu Nov 3 16:42:24 UTC 2016
> I'm guessing this change should be written to the XML file immediately?
> If that is the case, something is wrong: grepping for validate returns
> the old string.
We found some strange behavior when setting "validate-with" via
cibadmin: corosync.log shows the successful transaction, and issuing
cibadmin --query returns the correct value, but it is NOT written to
cib.xml.
We restarted Pacemaker and the value was reset to pacemaker-1.1.
If the signatures for cib.xml are generated by pacemaker/cib, which
algorithm is used? It looks like MD5 to me.
Would it be possible to manually edit cib.xml and generate a valid
cib.xml.sig, to get one step further in the debugging process?
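As a quick plausibility check (a sketch only; the exact input Pacemaker digests is an assumption here, since it may canonicalize the XML before hashing rather than hash the raw file bytes), the 32-character .sig files do at least look like MD5 hex digests:

```shell
# Sketch: does cib.xml.sig look like an MD5 hex digest?
# NOTE: /tmp/cib-demo.xml is a stand-in for the real cib.xml, and Pacemaker
# may digest a canonicalized form of the XML, so a plain md5sum of the raw
# file is not guaranteed to match the .sig contents.
printf '<cib/>\n' > /tmp/cib-demo.xml
sig=$(md5sum /tmp/cib-demo.xml | awk '{print $1}')
echo "$sig"    # 32 hex characters, matching the 32-byte .sig size above
```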
Regards, Toni
--
Kind regards
Toni Tschampke | tt at halle.it
bcs kommunikationslösungen
Inh. Dipl. Ing. Carsten Burkhardt
Harz 51 | 06108 Halle (Saale) | Germany
tel +49 345 29849-0 | fax +49 345 29849-22
www.b-c-s.de | www.halle.it | www.wivewa.de
SIMPLY MANAGE ADDRESSES, CALLS AND DOCUMENTS - WITH WIVEWA -
YOUR KNOWLEDGE MANAGER FOR YOUR BUSINESS!
Further information is available at www.wivewa.de
On 03.11.2016 at 16:39, Toni Tschampke wrote:
> > I'm going to guess you were using the experimental 1.1 schema as the
> > "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
> > changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
> > you get better results. Don't edit the file directly though; use the
> > cibadmin command so it signs the end result properly.
> >
> > After changing the validate-with, run:
> >
> > crm_verify -x /var/lib/pacemaker/cib/cib.xml
> >
> > and fix any errors that show up.
>
> Strange, the location of our cib.xml differs from your path; our CIB is
> located in /var/lib/heartbeat/crm/
>
> Running cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'
>
> produced no output, but the change was logged to corosync:
>
> cib: info: cib_perform_op: -- <cib num_updates="0"
> validate-with="pacemaker-1.1"/>
> cib: info: cib_perform_op: ++ <cib admin_epoch="0" epoch="8462"
> num_updates="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6"
> have-quorum="1" cib-last-written="Thu Nov 3 10:05:52 2016"
> update-origin="nebel1" update-client="cibadmin" update-user="root"/>
>
> I'm guessing this change should be written to the XML file immediately?
> If that is the case, something is wrong: grepping for validate returns
> the old string.
>
> <cib admin_epoch="0" epoch="8462" num_updates="0"
> validate-with="pacemaker-1.1" crm_feature_set="3.0.6" have-quorum="1"
> cib-last-written="Thu Nov 3 16:19:51 2016" update-origin="nebel1"
> update-client="cibadmin" update-user="root">
>
> pacemakerd --features
> Pacemaker 1.1.15 (Build: e174ec8)
> Supporting v3.0.10:
>
> Should crm_feature_set be updated this way too? I'm guessing this is
> done when "cibadmin --upgrade" succeeds?
>
> We just get a timeout error when trying to upgrade it with cibadmin:
> Call cib_upgrade failed (-62): Timer expired
>
> Have permissions changed between 1.1.7 and 1.1.15? Looking at our
> quite large /var/lib/heartbeat/crm/ folder, some permissions changed:
>
> -rw------- 1 hacluster root 80K Nov 1 16:56 cib-31.raw
> -rw-r--r-- 1 hacluster root 32 Nov 1 16:56 cib-31.raw.sig
> -rw------- 1 hacluster haclient 80K Nov 1 18:53 cib-32.raw
> -rw------- 1 hacluster haclient 32 Nov 1 18:53 cib-32.raw.sig
>
> cib-31 is from before the upgrade, cib-32 from after starting the
> upgraded Pacemaker.
>
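If the mixed ownership and modes are part of the problem, a cleanup along these lines might help (a sketch only: the hacluster:haclient owner/group pair is taken from the 1.1.15 files in the listing above, and the chown step is left commented out because it requires those cluster accounts to exist, so verify them locally first):

```shell
# Sketch: normalize CIB file permissions to the post-upgrade pattern
# (owner hacluster, group haclient, mode 0600), demonstrated on a
# throwaway directory so it can be run safely anywhere.
d=/tmp/crm-perms-demo
mkdir -p "$d"
touch "$d/cib-31.raw" "$d/cib-31.raw.sig"
chmod 600 "$d"/cib-31.raw*            # match the 1.1.15 modes seen above
# chown hacluster:haclient "$d"/*     # real step; needs the cluster accounts
stat -c '%a' "$d/cib-31.raw"          # prints 600
```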
>
>
> On 03.11.2016 at 15:39, Ken Gaillot wrote:
>> On 11/03/2016 05:51 AM, Toni Tschampke wrote:
>>> Hi,
>>>
>>> we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
>>> (pacemaker 1.1.15, corosync 2.3.6).
>>> During the upgrade, pacemaker was removed (rc) and afterwards reinstalled
>>> from jessie-backports; the same for crmsh.
>>>
>>> Now we are encountering multiple problems:
>>>
>>> First I checked the configuration on a single node running pacemaker &
>>> corosync, which dropped a strange error, followed by multiple lines
>>> stating that the syntax is wrong. crm configure show then displayed a
>>> mixed view of XML and crmsh single-line syntax.
>>>
>>>> ERROR: Cannot read schema file
>>> '/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
>>> directory: '/usr/share/pacemaker/pacemaker-1.1.rng'
>>
>> pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker 1.1.12,
>> as it was used to hold experimental new features rather than as the
>> actual next version of the schema. So, the schema skipped to 1.2.
>>
>> I'm going to guess you were using the experimental 1.1 schema as the
>> "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
>> changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
>> you get better results. Don't edit the file directly though; use the
>> cibadmin command so it signs the end result properly.
>>
>> After changing the validate-with, run:
>>
>> crm_verify -x /var/lib/pacemaker/cib/cib.xml
>>
>> and fix any errors that show up.
>>
>>> When we looked into that folder, there were pacemaker-1.0.rng, 1.2, and
>>> so on. As a quick try, we symlinked 1.2 to 1.1 and the syntax errors
>>> were gone. When running crm resource show, all resources showed up; when
>>> running crm_mon -1fA, the output was unexpected, as it showed all nodes
>>> offline with no DC elected:
>>>
>>>> Stack: corosync
>>>> Current DC: NONE
>>>> Last updated: Thu Nov 3 11:11:16 2016
>>>> Last change: Thu Nov 3 09:54:52 2016 by root via cibadmin on nebel1
>>>>
>>>> *** Resource management is DISABLED ***
>>>> The cluster will not attempt to start, stop or recover services
>>>>
>>>> 3 nodes and 73 resources configured:
>>>> 5 resources DISABLED and 0 BLOCKED from being started due to failures
>>>>
>>>> OFFLINE: [ nebel1 nebel2 nebel3 ]
>>>
>>> We tried to manually change the dc-version.
>>>
>>> When issuing a simple cleanup command, I got the following error:
>>>
>>>> crm resource cleanup DrbdBackuppcMs
>>>> Error signing on to the CRMd service
>>>> Error performing operation: Transport endpoint is not connected
>>>
>>> which looks like crmsh is not able to communicate with crmd; nothing
>>> is logged in corosync.log in this case
>>>
>>> We experimented with multiple config changes (corosync.conf: pacemaker
>>> ver 0 -> 1; cib-bootstrap-options: cluster-infrastructure from openais
>>> to corosync).
>>>
>>>> Package versions:
>>>> cman 3.1.8-1.2+b1
>>>> corosync 2.3.6-3~bpo8+1
>>>> crmsh 2.2.0-1~bpo8+1
>>>> csync2 1.34-2.3+b1
>>>> dlm-pcmk 3.0.12-3.2+deb7u2
>>>> libcman3 3.1.8-1.2+b1
>>>> libcorosync-common4:amd64 2.3.6-3~bpo8+1
>>>> munin-libvirt-plugins 0.0.6-1
>>>> pacemaker 1.1.15-2~bpo8+1
>>>> pacemaker-cli-utils 1.1.15-2~bpo8+1
>>>> pacemaker-common 1.1.15-2~bpo8+1
>>>> pacemaker-resource-agents 1.1.15-2~bpo8+1
>>>
>>>> Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
>>>
>>> I attached our cib before upgrade and after, as well as the one with the
>>> mixed syntax and our corosync.conf.
>>>
>>> When we tried to connect a second node to the cluster, Pacemaker starts
>>> its daemons, starts corosync, and dies after 15 tries with the following
>>> in the corosync log:
>>>
>>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>> crmd: info: do_cib_control: Could not connect to the CIB service:
>>>> Transport endpoint is not connected
>>>> crmd: warning: do_cib_control:
>>>> Couldn't complete CIB registration 15 times... pause and retry
>>>> attrd: error: attrd_cib_connect: Signon to CIB failed:
>>>> Transport endpoint is not connected (-107)
>>>> attrd: info: main: Shutting down attribute manager
>>>> attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>> attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
>>>> pacemakerd: warning: pcmk_child_exit:
>>>> The attrd process (12761) can no longer be respawned,
>>>> shutting the cluster down.
>>>> pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
>>>
>>> A third node joins without the above error, but crm_mon still shows all
>>> nodes as offline.
>>>
>>> Thanks for any advice on how to solve this; I'm out of ideas now.
>>>
>>> Regards, Toni
>>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>