[ClusterLabs] Cloned resource is restarted on all nodes if one node fails

Vladislav Bogdanov bubble at hoster-ok.com
Mon Aug 9 07:44:52 EDT 2021


Hi.
I'd suggest setting your clone meta attribute 'interleave' to 'true'.
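
For example, on CentOS 7 with pcs this would be (a sketch only, using the clone id 'apache-clone' from the configuration quoted below; the crm shell or cibadmin work just as well):

  pcs resource meta apache-clone interleave=true

which ends up in the CIB roughly as a meta attribute on the clone:

  <nvpair id="apache-clone-meta_attributes-interleave" name="interleave" value="true"/>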

Best,
Vladislav

On August 9, 2021 1:43:16 PM Andreas Janning <andreas.janning at qaware.de> wrote:
> Hi all,
>
> we recently experienced an outage in our pacemaker cluster and I would like 
> to understand how we can configure the cluster to avoid this problem in the 
> future.
>
> First our basic setup:
> - CentOS7
> - Pacemaker 1.1.23
> - Corosync 2.4.5
> - Resource-Agents 4.1.1
>
> Our cluster is composed of multiple active/passive nodes. Each software 
> component runs on two nodes simultaneously and all traffic is routed to the 
> active node via Virtual IP.
> If the active node fails, the passive node grabs the Virtual IP and 
> immediately takes over all work of the failed node. Since the software is 
> already up and running on the passive node, there should be virtually no 
> downtime.
> We have tried to achieve this in Pacemaker by configuring clone sets for each 
> software component.
>
> Now the problem:
> When a software component fails on the active node, the Virtual IP is 
> correctly grabbed by the passive node. BUT the software component is also 
> immediately restarted on the passive node.
> That unfortunately defeats the purpose of the whole setup, since we now 
> have downtime until the software component is restarted on the passive 
> node, and the restart might even fail and lead to a complete outage.
> After some investigation I now understand that the cloned resource is 
> restarted on all nodes after a monitoring failure because the default 
> "on-fail" of "monitor" is restart. But that is not what I want.
>
> I have created a minimal setup that reproduces the problem:
>
> <configuration>
>   <crm_config>
>     <cluster_property_set id="cib-bootstrap-options">
>       <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
>       <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
>       <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
>       <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="pacemaker-test"/>
>       <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
>       <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="false"/>
>     </cluster_property_set>
>   </crm_config>
>   <nodes>
>     <node id="1" uname="active-node"/>
>     <node id="2" uname="passive-node"/>
>   </nodes>
>   <resources>
>     <primitive class="ocf" id="vip" provider="heartbeat" type="IPaddr2">
>       <instance_attributes id="vip-instance_attributes">
>         <nvpair id="vip-instance_attributes-ip" name="ip" value="{{infrastructure.virtual_ip}}"/>
>       </instance_attributes>
>       <operations>
>         <op id="psa-vip-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>         <op id="psa-vip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
>         <op id="psa-vip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
>       </operations>
>     </primitive>
>     <clone id="apache-clone">
>       <primitive class="ocf" id="apache" provider="heartbeat" type="apache">
>         <instance_attributes id="apache-instance_attributes">
>           <nvpair id="apache-instance_attributes-port" name="port" value="80"/>
>           <nvpair id="apache-instance_attributes-statusurl" name="statusurl" value="http://localhost/server-status"/>
>         </instance_attributes>
>         <operations>
>           <op id="apache-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>           <op id="apache-start-interval-0s" interval="0s" name="start" timeout="40s"/>
>           <op id="apache-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
>         </operations>
>       </primitive>
>       <meta_attributes id="apache-meta_attributes">
>         <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
>         <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
>       </meta_attributes>
>     </clone>
>   </resources>
>   <constraints>
>     <rsc_location id="location-apache-clone-active-node-100" node="active-node" rsc="apache-clone" score="100" resource-discovery="exclusive"/>
>     <rsc_location id="location-apache-clone-passive-node-0" node="passive-node" rsc="apache-clone" score="0" resource-discovery="exclusive"/>
>     <rsc_location id="location-vip-clone-active-node-100" node="active-node" rsc="vip" score="100" resource-discovery="exclusive"/>
>     <rsc_location id="location-vip-clone-passive-node-0" node="passive-node" rsc="vip" score="0" resource-discovery="exclusive"/>
>     <rsc_colocation id="colocation-vip-apache-clone-INFINITY" rsc="vip" score="INFINITY" with-rsc="apache-clone"/>
>   </constraints>
>   <rsc_defaults>
>     <meta_attributes id="rsc_defaults-options">
>       <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="50"/>
>     </meta_attributes>
>   </rsc_defaults>
> </configuration>
>
>
> When this configuration is started, httpd will be running on active-node 
> and passive-node. The VIP runs only on active-node.
> When httpd is crashed on active-node (with killall httpd), passive-node 
> immediately grabs the VIP and restarts its own httpd.
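> For reference, this is roughly how the failure can be reproduced and 
> observed (a sketch, assuming crm_mon is available on both nodes):
>
> [root@active-node ~]# killall httpd         # simulate the component crash
> [root@passive-node ~]# crm_mon --one-shot   # shows the VIP now here and the local httpd clone instance restarting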
>
> How can I change this configuration so that when the resource fails on 
> active-node:
> - passive-node immediately grabs the VIP (as it does now).
> - active-node tries to restart the failed resource, giving up after x attempts.
> - passive-node does NOT restart the resource.
>
> Regards
>
> Andreas Janning
>
>
>
> --
>
> Best Employers ITK 2021 - 1st place for QAware
> awarded by Great Place to Work
> Andreas Janning
> Expert Software Engineer
> QAware GmbH
> Aschauer Straße 32
> 81549 München, Germany
> Mobile +49 160 1492426
> andreas.janning at qaware.de
> www.qaware.de
> Managing directors: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
> Register court: Munich
> Commercial register number: HRB 163761
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
