[ClusterLabs] Antw: [EXT] Re: Cloned resource is restarted on all nodes if one node fails

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Aug 9 08:38:09 EDT 2021


>>> Andreas Janning <andreas.janning at qaware.de> wrote on 09.08.2021 at 14:07
in message
<CAGmA_=GCnHWJZiuUhoaafd=sd+oodgPfJbb=y56CXV52A5ARDA at mail.gmail.com>:
> Hi,
> 
> I have just tried your suggestion by adding
>                 <nvpair id="apache-clone-meta_attributes-interleave"
> name="interleave" value="true"/>
> to the clone configuration.
> Unfortunately, the behavior stays the same. The service is still restarted
> on the passive node when it is crashed on the active node.

Maybe try to find out from the logs what is happening (and why it is
happening).
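
For example (just a sketch; exact log locations and tool options depend on
the distribution and pacemaker version):

    # current status including fail counts
    crm_mon -1 -f

    # replay the live CIB and show allocation scores and planned actions
    crm_simulate -s -L

    # scheduler/controller messages around the failure (RHEL/CentOS 7)
    grep -E 'pengine|crmd' /var/log/messages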

Regards,
Ulrich

> 
> Regards
> 
> Andreas
> 
> On Mon, Aug 9, 2021 at 13:45, Vladislav Bogdanov <
> bubble at hoster-ok.com> wrote:
> 
>> Hi.
>> I'd suggest setting your clone meta attribute 'interleave' to 'true'.
>>
>> Best,
>> Vladislav
>>
>> On August 9, 2021 1:43:16 PM Andreas Janning <andreas.janning at qaware.de>
>> wrote:
>>
>>> Hi all,
>>>
>>> we recently experienced an outage in our pacemaker cluster and I would
>>> like to understand how we can configure the cluster to avoid this problem
>>> in the future.
>>>
>>> First our basic setup:
>>> - CentOS7
>>> - Pacemaker 1.1.23
>>> - Corosync 2.4.5
>>> - Resource-Agents 4.1.1
>>>
>>> Our cluster is composed of multiple active/passive nodes. Each software
>>> component runs on two nodes simultaneously and all traffic is routed to
>>> the active node via a Virtual IP.
>>> If the active node fails, the passive node grabs the Virtual IP and
>>> immediately takes over all work of the failed node. Since the software is
>>> already up and running on the passive node, there should be virtually no
>>> downtime.
>>> We have achieved this in pacemaker by configuring clone-sets for each
>>> software component.
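>>>
>>> (For reference, the setup is roughly what the following pcs commands
>>> would create; the IP address is a placeholder, location constraints and
>>> cluster properties are omitted, and the exact pcs syntax may differ
>>> between versions:
>>>
>>>   pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 \
>>>       op monitor interval=10s
>>>   pcs resource create apache ocf:heartbeat:apache port=80 \
>>>       statusurl="http://localhost/server-status" \
>>>       op monitor interval=10s --clone
>>>   pcs constraint colocation add vip with apache-clone INFINITY
>>> )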
>>>
>>> Now the problem:
>>> When a software component fails on the active node, the Virtual-IP is
>>> correctly grabbed by the passive node. BUT the software component is also
>>> immediately restarted on the passive node.
>>> That unfortunately defeats the purpose of the whole setup, since we now
>>> have downtime until the software component is restarted on the passive
>>> node, and the restart might even fail and lead to a complete outage.
>>> After some investigation I now understand that the cloned resource is
>>> restarted on all nodes after a monitoring failure, because the default
>>> "on-fail" of "monitor" is "restart". But that is not what I want.
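>>>
>>> As far as I understand, the place to change that would be the monitor
>>> operation itself, for example (just a sketch, values as listed in the
>>> Pacemaker documentation):
>>>
>>>   <op id="apache-monitor-interval-10s" interval="10s" name="monitor"
>>>       timeout="20s" on-fail="block"/>
>>>
>>> but it is not obvious to me which of the documented on-fail values
>>> ("ignore", "block", "stop", "restart", "fence", ...) would mean "restart
>>> it here but leave the other clone instances alone".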
>>>
>>> I have created a minimal setup that reproduces the problem:
>>>
>>>> <configuration>
>>>>   <crm_config>
>>>>     <cluster_property_set id="cib-bootstrap-options">
>>>>       <nvpair id="cib-bootstrap-options-have-watchdog"
>>>>               name="have-watchdog" value="false"/>
>>>>       <nvpair id="cib-bootstrap-options-dc-version"
>>>>               name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
>>>>       <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>>>>               name="cluster-infrastructure" value="corosync"/>
>>>>       <nvpair id="cib-bootstrap-options-cluster-name"
>>>>               name="cluster-name" value="pacemaker-test"/>
>>>>       <nvpair id="cib-bootstrap-options-stonith-enabled"
>>>>               name="stonith-enabled" value="false"/>
>>>>       <nvpair id="cib-bootstrap-options-symmetric-cluster"
>>>>               name="symmetric-cluster" value="false"/>
>>>>     </cluster_property_set>
>>>>   </crm_config>
>>>>   <nodes>
>>>>     <node id="1" uname="active-node"/>
>>>>     <node id="2" uname="passive-node"/>
>>>>   </nodes>
>>>>   <resources>
>>>>     <primitive class="ocf" id="vip" provider="heartbeat" type="IPaddr2">
>>>>       <instance_attributes id="vip-instance_attributes">
>>>>         <nvpair id="vip-instance_attributes-ip" name="ip"
>>>>                 value="{{infrastructure.virtual_ip}}"/>
>>>>       </instance_attributes>
>>>>       <operations>
>>>>         <op id="psa-vip-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>>>>         <op id="psa-vip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
>>>>         <op id="psa-vip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
>>>>       </operations>
>>>>     </primitive>
>>>>     <clone id="apache-clone">
>>>>       <primitive class="ocf" id="apache" provider="heartbeat" type="apache">
>>>>         <instance_attributes id="apache-instance_attributes">
>>>>           <nvpair id="apache-instance_attributes-port" name="port" value="80"/>
>>>>           <nvpair id="apache-instance_attributes-statusurl" name="statusurl"
>>>>                   value="http://localhost/server-status"/>
>>>>         </instance_attributes>
>>>>         <operations>
>>>>           <op id="apache-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>>>>           <op id="apache-start-interval-0s" interval="0s" name="start" timeout="40s"/>
>>>>           <op id="apache-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
>>>>         </operations>
>>>>       </primitive>
>>>>       <meta_attributes id="apache-meta_attributes">
>>>>         <nvpair id="apache-clone-meta_attributes-clone-max"
>>>>                 name="clone-max" value="2"/>
>>>>         <nvpair id="apache-clone-meta_attributes-clone-node-max"
>>>>                 name="clone-node-max" value="1"/>
>>>>       </meta_attributes>
>>>>     </clone>
>>>>   </resources>
>>>>   <constraints>
>>>>     <rsc_location id="location-apache-clone-active-node-100" node="active-node"
>>>>                   rsc="apache-clone" score="100" resource-discovery="exclusive"/>
>>>>     <rsc_location id="location-apache-clone-passive-node-0" node="passive-node"
>>>>                   rsc="apache-clone" score="0" resource-discovery="exclusive"/>
>>>>     <rsc_location id="location-vip-clone-active-node-100" node="active-node"
>>>>                   rsc="vip" score="100" resource-discovery="exclusive"/>
>>>>     <rsc_location id="location-vip-clone-passive-node-0" node="passive-node"
>>>>                   rsc="vip" score="0" resource-discovery="exclusive"/>
>>>>     <rsc_colocation id="colocation-vip-apache-clone-INFINITY" rsc="vip"
>>>>                     score="INFINITY" with-rsc="apache-clone"/>
>>>>   </constraints>
>>>>   <rsc_defaults>
>>>>     <meta_attributes id="rsc_defaults-options">
>>>>       <nvpair id="rsc_defaults-options-resource-stickiness"
>>>>               name="resource-stickiness" value="50"/>
>>>>     </meta_attributes>
>>>>   </rsc_defaults>
>>>> </configuration>
>>>>
>>>
>>>
>>> When this configuration is started, httpd will be running on active-node
>>> and passive-node. The VIP runs only on active-node.
>>> When I crash httpd on active-node (with killall httpd), passive-node
>>> immediately grabs the VIP and restarts its own httpd.
>>>
>>> How can I change this configuration so that when the resource fails on
>>> active-node:
>>> - passive-node immediately grabs the VIP (as it does now).
>>> - active-node tries to restart the failed resource, giving up after x
>>> attempts.
>>> - passive-node does NOT restart the resource.
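>>>
>>> For the "giving up after x attempts" part I would guess that something
>>> like migration-threshold (and possibly failure-timeout) on the clone is
>>> the right knob, e.g. (sketch only, the values are made up):
>>>
>>>   <nvpair id="apache-clone-meta_attributes-migration-threshold"
>>>           name="migration-threshold" value="3"/>
>>>   <nvpair id="apache-clone-meta_attributes-failure-timeout"
>>>           name="failure-timeout" value="600s"/>
>>>
>>> but I do not know whether that also prevents the restart on the passive
>>> node.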
>>>
>>> Regards
>>>
>>> Andreas Janning
>>>
>>>
>>>
>>>
>>
> 
> -- 
> ------------------------------
> 
> *Best Employers ITK 2021 - 1st place for QAware*
> awarded by Great Place to Work
> <https://www.qaware.de/news/platz-1-bei-beste-arbeitgeber-in-der-itk-2021/>
> ------------------------------
> 
> Andreas Janning
> Expert Software Engineer
> 
> QAware GmbH
> Aschauer Straße 32
> 81549 München, Germany
> Mobile +49 160 1492426
> andreas.janning at qaware.de 
> www.qaware.de 
> ------------------------------
> 
> Managing Directors: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
> Register court: München
> Commercial register number: HRB 163761




