[ClusterLabs] Cloned resource is restarted on all nodes if one node fails

Mon Aug 9 08:07:00 EDT 2021

Hi,

I have just tried your suggestion by adding
    <nvpair id="apache-clone-meta_attributes-interleave" name="interleave" value="true"/>
to the clone configuration.
Unfortunately, the behavior stays the same. The service is still restarted
on the passive node when I crash it on the active node.
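
For reference, a sketch of how the clone's meta attributes look with that
change applied. The clone-max and clone-node-max entries are copied from the
configuration quoted below; only the interleave entry is new.

```xml
<meta_attributes id="apache-meta_attributes">
  <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
  <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
  <!-- added as suggested; the restart on the passive node still occurred with this set -->
  <nvpair id="apache-clone-meta_attributes-interleave" name="interleave" value="true"/>
</meta_attributes>
```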

Regards

Andreas

On Mon, Aug 9, 2021 at 13:45, Vladislav Bogdanov
<bubble at hoster-ok.com> wrote:

> Hi.
> I'd suggest setting your clone meta attribute 'interleave' to 'true'.
>
> Best,
> Vladislav
>
> On August 9, 2021 1:43:16 PM Andreas Janning <andreas.janning at qaware.de>
> wrote:
>
>> Hi all,
>>
>> we recently experienced an outage in our pacemaker cluster and I would
>> like to understand how we can configure the cluster to avoid this problem
>> in the future.
>>
>> First, our basic setup:
>> - CentOS 7
>> - Pacemaker 1.1.23
>> - Corosync 2.4.5
>> - Resource-Agents 4.1.1
>>
>> Our cluster is composed of multiple active/passive nodes. Each software
>> component runs on two nodes simultaneously and all traffic is routed to the
>> active node via Virtual IP.
>> If the active node fails, the passive node grabs the Virtual IP and
>> immediately takes over all work of the failed node. Since the software is
>> already up and running on the passive node, there should be virtually no
>> downtime.
>> We have tried to achieve this in Pacemaker by configuring clone sets for
>> each software component.
>>
>> Now the problem:
>> When a software component fails on the active node, the virtual IP is
>> correctly grabbed by the passive node. BUT the software component is also
>> immediately restarted on the passive node.
>> That unfortunately defeats the purpose of the whole setup, since we now
>> have a downtime until the software component is restarted on the passive
>> node and the restart might even fail and lead to a complete outage.
>> After some investigation, I now understand that the cloned resource is
>> restarted on all nodes after a monitoring failure because the default
>> "on-fail" of "monitor" is "restart". But that is not what I want.
>>
>> I have created a minimal setup that reproduces the problem:
>>
>> <configuration>
>>   <crm_config>
>>     <cluster_property_set id="cib-bootstrap-options">
>>       <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
>>       <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
>>       <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
>>       <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="pacemaker-test"/>
>>       <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
>>       <nvpair id="cib-bootstrap-options-symmetric-cluster" name="symmetric-cluster" value="false"/>
>>     </cluster_property_set>
>>   </crm_config>
>>   <nodes>
>>     <node id="1" uname="active-node"/>
>>     <node id="2" uname="passive-node"/>
>>   </nodes>
>>   <resources>
>>     <primitive class="ocf" id="vip" provider="heartbeat" type="IPaddr2">
>>       <instance_attributes id="vip-instance_attributes">
>>         <nvpair id="vip-instance_attributes-ip" name="ip" value="{{infrastructure.virtual_ip}}"/>
>>       </instance_attributes>
>>       <operations>
>>         <op id="psa-vip-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>>         <op id="psa-vip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
>>         <op id="psa-vip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
>>       </operations>
>>     </primitive>
>>     <clone id="apache-clone">
>>       <primitive class="ocf" id="apache" provider="heartbeat" type="apache">
>>         <instance_attributes id="apache-instance_attributes">
>>           <nvpair id="apache-instance_attributes-port" name="port" value="80"/>
>>           <nvpair id="apache-instance_attributes-statusurl" name="statusurl" value="http://localhost/server-status"/>
>>         </instance_attributes>
>>         <operations>
>>           <op id="apache-monitor-interval-10s" interval="10s" name="monitor" timeout="20s"/>
>>           <op id="apache-start-interval-0s" interval="0s" name="start" timeout="40s"/>
>>           <op id="apache-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
>>         </operations>
>>       </primitive>
>>       <meta_attributes id="apache-meta_attributes">
>>         <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
>>         <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
>>       </meta_attributes>
>>     </clone>
>>   </resources>
>>   <constraints>
>>     <rsc_location id="location-apache-clone-active-node-100" node="active-node" rsc="apache-clone" score="100" resource-discovery="exclusive"/>
>>     <rsc_location id="location-apache-clone-passive-node-0" node="passive-node" rsc="apache-clone" score="0" resource-discovery="exclusive"/>
>>     <rsc_location id="location-vip-clone-active-node-100" node="active-node" rsc="vip" score="100" resource-discovery="exclusive"/>
>>     <rsc_location id="location-vip-clone-passive-node-0" node="passive-node" rsc="vip" score="0" resource-discovery="exclusive"/>
>>     <rsc_colocation id="colocation-vip-apache-clone-INFINITY" rsc="vip" score="INFINITY" with-rsc="apache-clone"/>
>>   </constraints>
>>   <rsc_defaults>
>>     <meta_attributes id="rsc_defaults-options">
>>       <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="50"/>
>>     </meta_attributes>
>>   </rsc_defaults>
>> </configuration>
>>
>>
>> When this configuration is started, httpd runs on both active-node and
>> passive-node. The VIP runs only on active-node.
>> When httpd is crashed on active-node (with killall httpd), passive-node
>> immediately grabs the VIP and restarts its own httpd.
>>
>> How can I change this configuration so that when the resource fails on
>> active-node:
>> - passive-node immediately grabs the VIP (as it does now).
>> - active-node tries to restart the failed resource, giving up after x
>> attempts.
>> - passive-node does NOT restart the resource.
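>>
>> To make the second point concrete, here is a minimal, untested sketch. It
>> assumes that Pacemaker's "migration-threshold" resource meta attribute is a
>> suitable way to express "giving up after x attempts". The value 3 and the
>> nvpair id are placeholders chosen for illustration, and this fragment is
>> not part of the configuration above. It does not address the restart on
>> passive-node.
>>
>> ```xml
>> <meta_attributes id="apache-meta_attributes">
>>   <nvpair id="apache-clone-meta_attributes-clone-max" name="clone-max" value="2"/>
>>   <nvpair id="apache-clone-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
>>   <!-- assumption: after 3 failures on a node, stop retrying the resource there -->
>>   <nvpair id="apache-clone-meta_attributes-migration-threshold" name="migration-threshold" value="3"/>
>> </meta_attributes>
>> ```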
>>
>> Regards
>>
>> Andreas Janning
>>
>>
>>
>> --
>> ------------------------------
>>
>> *Best Employers in ICT 2021 - 1st place for QAware*
>> awarded by Great Place to Work
>> <https://www.qaware.de/news/platz-1-bei-beste-arbeitgeber-in-der-itk-2021/>
>> ------------------------------
>>
>> Andreas Janning
>> Expert Software Engineer
>>
>> QAware GmbH
>> Aschauer Straße 32
>> 81549 München, Germany
>> Mobile +49 160 1492426
>> andreas.janning at qaware.de
>> www.qaware.de
>> ------------------------------
>>
>> Managing Directors: Christian Kamm, Johannes Weigend, Dr. Josef Adersberger
>> Registry court: München
>> Commercial register number: HRB 163761