[ClusterLabs] Pacemaker cluster not working after switching from 1.0 to 1.1

Ken Gaillot kgaillot at redhat.com
Thu Feb 9 22:50:04 UTC 2017


On 01/16/2017 01:16 PM, Rick Kint wrote:
> 
>> Date: Mon, 16 Jan 2017 09:15:44 -0600
>> From: Ken Gaillot <kgaillot at redhat.com>
>> To: users at clusterlabs.org
>> Subject: Re: [ClusterLabs] Pacemaker cluster not working after
>>     switching from 1.0 to 1.1 (resend as plain text)
>> Message-ID: <f51b9abd-df28-ec7b-6424-3c221a829b46 at redhat.com>
>> Content-Type: text/plain; charset=utf-8
>>
>> A preliminary question -- what cluster layer are you running?
>>
>> Pacemaker 1.0 worked with heartbeat or corosync 1, while Ubuntu 14.04
>> ships with corosync 2 by default, IIRC. There were major incompatible
>> changes between corosync 1 and 2, so it's important to get that right
>> before looking at pacemaker.
>>
>> A general note, when making such a big jump in the pacemaker version,
>> I'd recommend running "cibadmin --upgrade" both before exporting 
>> the
>> configuration from 1.0, and again after deploying it on 1.1. This will
>> apply any transformations needed in the CIB syntax. Pacemaker will do
>> this on the fly, but doing it manually lets you see any issues early, as
>> well as being more efficient.
> 
> TL;DR
> - Thanks.
> - Cluster mostly works so I don't think it's a corosync issue.
> - Configuration XML is actually created with crm shell.
> - Is there a summary of changes from 1.0 to 1.1?
> 
> 
> Thanks for the quick reply.
> 
> 
> Corosync is v2.3.3. We've already been through the issues getting corosync working. 
> 
> The cluster works in many ways:
> - Pacemaker sees both nodes.
> - Pacemaker starts all the resources.
> - Pacemaker promotes an instance of the stateful Encryptor resource to Master/active.
> - If the node running the active Encryptor goes down, the standby Encryptor is promoted and the DC changes.
> - Manual failover works (fiddling with the master-score attribute).
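As an aside, manual failover is usually scripted with crm_master rather than by editing the master-score attribute directly. A sketch, using the EncryptBase primitive name from the configuration below (adjust to your actual resource):

```shell
# On the node that should become Master: raise its local promotion score
crm_master -r EncryptBase -v 100

# On the node that should demote: delete its promotion score
crm_master -r EncryptBase -D
```

crm_master writes the same master-score node attribute under the hood, so the effect is identical, but it spares you from getting the attribute name and scoping wrong.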
> 
> The problem is that a failure in one of the dependencies doesn't cause promotion anymore.
> 
> Thanks for the cibadmin command, I missed that when reading the docs.
> 
> I omitted some detail. I didn't export the XML from the old cluster to the new cluster. We create the configuration with the crm shell, not with XML. The sequence of events is
> 
> 
> - install corosync, pacemaker, etc.
> - apply local config file changes.
> - start corosync and pacemaker on both nodes in cluster.
> - verify that cluster is formed (crm_mon shows both nodes online, but no resources).
> - create the cluster configuration by running a script which passes a here document to the crm shell.
> - verify that the configuration is applied and all resources start.
> 
> 
> The crm shell version is "1.2.5+hg1034-1ubuntu4". I've checked the XML against the "Pacemaker Configuration Explained" doc and it looks OK to my admittedly non-knowledgeable eye.
> 
> I tried the cibadmin command in hopes that it might tell me something, but it made no changes. "crm_verify --live-check" doesn't complain either.
> I copied the XML from a Pacemaker 1.0.X system to a Pacemaker 1.1.X system and ran "cibadmin --upgrade" on it. Nothing changed there either.
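Right; if neither tool changes or complains about anything, the CIB syntax itself is fine. For reference, the check sequence I had in mind is roughly (assuming the cluster tools are on the PATH):

```shell
# Save a copy of the live CIB first
cibadmin --query > cib-backup.xml

# Rewrite the configuration to the latest schema (a no-op if already current)
cibadmin --upgrade --force

# Validate the live configuration against the schema and sanity checks
crm_verify --live-check
```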
> 
> Is there a quick summary of changes from 1.0 to 1.1 somewhere? The "Pacemaker 1.1 Configuration Explained" doc has a section entitled "What is new in 1.0" but nothing for 1.1. I wouldn't be surprised if there is something obvious that I'm missing and it would help if I could limit my search space.

No, there's just the change log, which is quite detailed.

There was no defining change from 1.0 to 1.1. Originally, it was planned
that 1.1 would be a "development" branch with new features, and 1.0
would be a "production" branch with bugfixes only. It proved too much
work to maintain two separate branches, so the 1.0 line was ended, and
1.1 became the sole production branch.

> I've done quite a bit of experimentation: changed the syntax of the colocation constraints, added ordering constraints, and fiddled with timeouts. When I was doing the port to Ubuntu, I tested resource agent exit status but I'll go back and check that again. Any other suggestions?
> 
> 
> BTW, I've fixed some issues with the Pacemaker init script running on Ubuntu. Should these go to Clusterlabs or the Debian/Ubuntu maintainer?

It depends on whether they're using the init script provided upstream,
or their own (which I suspect is more likely).

> CONFIGURATION
> 
> 
> Here's the configuration again, hopefully with indentation preserved this time:
> 
> <configuration>
> <crm_config>
>   <cluster_property_set id="cib-bootstrap-options">
>    <nvpair name="stonith-enabled" value="false" id="cib-bootstrap-options-stonith-enabled"/>
>    <nvpair name="no-quorum-policy" value="ignore" id="cib-bootstrap-options-no-quorum-policy"/>
>    <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1484336062"/>
>   </cluster_property_set>
> </crm_config>
> <nodes>
>   <node id="3232262401" uname="encryptor4"/>
>   <node id="3232262402" uname="encryptor5"/>
> </nodes>
> <resources>
>   <master id="Encryptor">
>    <meta_attributes id="Encryptor-meta_attributes">
>     <nvpair name="clone-max" value="2" id="Encryptor-meta_attributes-clone-max"/>
>     <nvpair name="clone-node-max" value="1" id="Encryptor-meta_attributes-clone-node-max"/>
>     <nvpair name="master-max" value="1" id="Encryptor-meta_attributes-master-max"/>
>     <nvpair name="notify" value="false" id="Encryptor-meta_attributes-notify"/>
>     <nvpair name="target-role" value="Master" id="Encryptor-meta_attributes-target-role"/>
>    </meta_attributes>
>    <primitive id="EncryptBase" class="ocf" provider="fnord" type="encryptor">
>     <operations>
>      <op name="start" interval="0s" timeout="20s" id="EncryptBase-start-0s"/>
>      <op name="monitor" interval="1s" role="Master" timeout="2s" id="EncryptBase-monitor-1s"/>
>      <op name="monitor" interval="2s" role="Slave" timeout="2s" id="EncryptBase-monitor-2s"/>
>     </operations>
>    </primitive>
>   </master>
>   <clone id="CredProxy">
>    <meta_attributes id="CredProxy-meta_attributes">
>     <nvpair name="clone-max" value="2" id="CredProxy-meta_attributes-clone-max"/>
>     <nvpair name="clone-node-max" value="1" id="CredProxy-meta_attributes-clone-node-max"/>
>     <nvpair name="notify" value="false" id="CredProxy-meta_attributes-notify"/>
>     <nvpair name="target-role" value="Started" id="CredProxy-meta_attributes-target-role"/>
>    </meta_attributes>
>    <primitive id="CredBase" class="ocf" provider="fnord" type="credproxy">
>     <operations>
>      <op name="start" interval="0s" timeout="20s" id="CredBase-start-0s"/>
>      <op name="monitor" interval="1s" timeout="2s" id="CredBase-monitor-1s"/>
>     </operations>
>    </primitive>
>   </clone>
>   <clone id="Ingress">
>    <meta_attributes id="Ingress-meta_attributes">
>     <nvpair name="clone-max" value="2" id="Ingress-meta_attributes-clone-max"/>
>     <nvpair name="clone-node-max" value="1" id="Ingress-meta_attributes-clone-node-max"/>
>     <nvpair name="notify" value="false" id="Ingress-meta_attributes-notify"/>
>     <nvpair name="target-role" value="Started" id="Ingress-meta_attributes-target-role"/>
>    </meta_attributes>
>    <primitive id="IngressBase" class="ocf" provider="fnord" type="interface">
>     <operations>
>      <op name="start" interval="0s" timeout="5s" id="IngressBase-start-0s"/>
>      <op name="monitor" interval="1s" timeout="1s" id="IngressBase-monitor-1s"/>
>     </operations>
>     <instance_attributes id="IngressBase-instance_attributes">
>      <nvpair name="interface" value="em1" id="IngressBase-instance_attributes-interface"/>
>      <nvpair name="label" value="ingress" id="IngressBase-instance_attributes-label"/>
>      <nvpair name="min_retries" value="5" id="IngressBase-instance_attributes-min_retries"/>
>      <nvpair name="max_retries" value="100" id="IngressBase-instance_attributes-max_retries"/>
>     </instance_attributes>
>    </primitive>
>   </clone>
>   <clone id="Egress">
>    <meta_attributes id="Egress-meta_attributes">
>     <nvpair name="clone-max" value="2" id="Egress-meta_attributes-clone-max"/>
>     <nvpair name="clone-node-max" value="1" id="Egress-meta_attributes-clone-node-max"/>
>     <nvpair name="notify" value="false" id="Egress-meta_attributes-notify"/>
>     <nvpair name="target-role" value="Started" id="Egress-meta_attributes-target-role"/>
>    </meta_attributes>
>    <primitive id="EgressBase" class="ocf" provider="fnord" type="interface">
>     <operations>
>      <op name="start" interval="0s" timeout="5s" id="EgressBase-start-0s"/>
>      <op name="monitor" interval="1s" timeout="1s" id="EgressBase-monitor-1s"/>
>     </operations>
>     <instance_attributes id="EgressBase-instance_attributes">
>      <nvpair name="interface" value="em2" id="EgressBase-instance_attributes-interface"/>
>      <nvpair name="label" value="egress" id="EgressBase-instance_attributes-label"/>
>      <nvpair name="min_retries" value="5" id="EgressBase-instance_attributes-min_retries"/>
>      <nvpair name="max_retries" value="100" id="EgressBase-instance_attributes-max_retries"/>
>     </instance_attributes>
>    </primitive>
>   </clone>
> </resources>
> <constraints>
>   <rsc_colocation id="encryptor-with-credproxy" score="INFINITY" rsc="Encryptor" rsc-role="Master" with-rsc="CredProxy" with-rsc-role="Started"/>
>   <rsc_colocation id="encryptor-with-ingress" score="INFINITY" rsc="Encryptor" rsc-role="Master" with-rsc="Ingress" with-rsc-role="Started"/>
>   <rsc_colocation id="encryptor-with-egress" score="INFINITY" rsc="Encryptor" rsc-role="Master" with-rsc="Egress" with-rsc-role="Started"/>
> </constraints>
> <rsc_defaults>
>   <meta_attributes id="rsc-options">
>    <nvpair name="resource-stickiness" value="10" id="rsc-options-resource-stickiness"/>
>   </meta_attributes>
> </rsc_defaults>
> </configuration>
> 
> 
> 
> CONFIGURATION CREATION
> 
> Here is the script which initializes the configuration. The XML above results from running this script:
> 
> /usr/sbin/crm <<%
> configure
> erase
> property stonith-enabled="false"
> property no-quorum-policy="ignore"
> rsc_defaults  \
>   resource-stickiness="10"
> primitive EncryptBase ocf:fnord:encryptor \
>   op start interval="0s" timeout="20s" \
>   op monitor interval="1s" role="Master" timeout="2s" \
>   op monitor interval="2s" role="Slave" timeout="2s"
> ms Encryptor EncryptBase \
>   meta clone-max="2" clone-node-max="1" master-max="1" notify="false" \
>     target-role="Master"
> primitive CredBase ocf:fnord:credproxy \
>   op start interval="0s" timeout="20s" \
>   op monitor interval="1s" timeout="2s"
> clone CredProxy CredBase \
>   meta clone-max="2" clone-node-max="1" notify="false" target-role="Started"
> primitive IngressBase ocf:fnord:interface \
>   op start interval="0s" timeout="5s" \
>   op monitor interval="1s" timeout="1s" \
>   params interface="em1" label="ingress" min_retries="5" max_retries="100"
> clone Ingress IngressBase \
>   meta clone-max="2" clone-node-max="1" notify="false" target-role="Started"
> primitive EgressBase ocf:fnord:interface \
>   op start interval="0s" timeout="5s" \
>   op monitor interval="1s" timeout="1s" \
>   params interface="em2" label="egress" min_retries="5" max_retries="100"
> clone Egress EgressBase \
>   meta clone-max="2" clone-node-max="1" notify="false" target-role="Started"
> 
> colocation encryptor-with-credproxy inf: Encryptor:Master CredProxy:Started
> colocation encryptor-with-ingress inf: Encryptor:Master Ingress:Started
> colocation encryptor-with-egress inf: Encryptor:Master Egress:Started
> commit
> exit
> 
> %
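One thing worth double-checking: colocation alone does not force the dependencies to be started before a promote is attempted. Matching ordering constraints in the same crm shell syntax would look roughly like this (the constraint ids are made up):

```shell
# Hypothetical ordering constraints: start each dependency before Encryptor
/usr/sbin/crm <<%
configure
order credproxy-before-encryptor inf: CredProxy Encryptor
order ingress-before-encryptor inf: Ingress Encryptor
order egress-before-encryptor inf: Egress Encryptor
commit
exit
%
```

You mention having experimented with ordering constraints already, so this may just confirm what you tried; the exact syntax varies a bit between crmsh versions.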
> 
> 
> Thanks again for your help.

I don't see anything obvious. If you could paste the logs from both
nodes around the time of the problem and send a link, that might help.
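crm_report can collect the logs, CIB, and status from all nodes into one archive; roughly (the time window here is only an example):

```shell
# Gather logs and configuration from both nodes around the failure window
crm_report -f "2017-01-16 12:00:00" -t "2017-01-16 14:00:00" /tmp/encryptor-report
```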



