[ClusterLabs] resource start after network reconnected
Andrei Borzenkov
arvidjaar at gmail.com
Sat Nov 20 02:59:19 EST 2021
On 19.11.2021 19:26, john tillman wrote:
...
>>>
>>> If pacemaker tries to stop resources due to an out-of-quorum condition,
>>> you could set a suitable failure-timeout; this has the same effect as
>>> running "pcs resource refresh". Keep in mind that pacemaker only checks
>>> for failure-timeout expiration every cluster-recheck-interval (15
>>> minutes by default). This still is not directly related to network
>>> availability, but if a network outage caused the node to go out of
>>> quorum, then once the network is back and the node has rejoined the
>>> cluster, resources will be allowed to start on it again.
>>>
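For a pcs-managed cluster, those two settings are normally applied with
something along these lines (the resource id is just a placeholder for
whatever your resource is called):

  # let recorded failures expire after 60 seconds
  pcs resource meta <your-resource-id> failure-timeout=60s
  # re-check for expired failures every minute instead of the 15-minute default
  pcs property set cluster-recheck-interval=60s
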
>>
>> When quorum is lost, I want all the resources to stop. The cluster is
>> performing this step correctly for me.
>>
>> That cluster-recheck-interval would explain the intermittent behavior I
>> saw this morning. If I set it to 1 minute, would that cause any gross
>> negative issues?
>>
>
>
> I tried setting cluster-recheck-interval to 1 minute and saw no change
> to the resources after reconnecting the network. They were still listed
> as stopped. However, "pcs resource refresh" started them, as usual in
> this scenario.
>
> Anyone have any other ideas for a configuration setting that will
> effectively do what 'pcs resource refresh' does when quorum is
> restored?
>
I already told you above, and it most certainly works here.
Without a failure-timeout the resource is stuck in the blocked state:
Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:48:48 2021
  * Last change:  Sat Nov 20 10:46:55 2021 by root via cibadmin on ha1
  * 3 nodes configured
  * 3 resource instances configured (1 BLOCKED from further action due to failure)

Node List:
  * Online: [ ha1 ha2 qnetd ]

Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * rsc_Test  (ocf::_local:Dummy):   FAILED ha1 (blocked)
    * Started: [ ha2 ]
    * Stopped: [ qnetd ]

Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=1000000:
      * (10) start
      * (11) monitor: interval="10000ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=1000000 fail-count=1000000 last-failure='Sat Nov 20 10:47:14 2021':
      * (18) start
      * (30) stop

Failed Resource Actions:
  * rsc_Test_stop_0 on ha1 'error' (1): call=30, status='complete', exitreason='forced to fail stop operation', last-rc-change='2021-11-20 10:47:14 +03:00', queued=0ms, exec=27ms
With a failure-timeout the resource is restarted after the timeout expires:
Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:53:51 2021
  * Last change:  Sat Nov 20 10:50:37 2021 by root via cibadmin on ha2
  * 3 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ ha1 ha2 qnetd ]

Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * Started: [ ha1 ha2 ]
    * Stopped: [ qnetd ]

Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=1000000:
      * (18) probe
      * (18) probe
      * (19) monitor: interval="10000ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=1000000:
      * (40) probe
      * (40) probe
      * (41) monitor: interval="10000ms"
Configuration:

node 1: ha1 \
        attributes pingd=1 \
        utilization cpu=20
node 2: ha2 \
        attributes pingd=1 \
        utilization cpu=20
node 3: qnetd
primitive rsc_Test ocf:_local:Dummy \
        meta failure-timeout=30s \
        op monitor interval=10s
clone cln_Test rsc_Test
location not_on_qnetd cln_Test -inf: qnetd
property cib-bootstrap-options: \
        cluster-infrastructure=corosync \
        cluster-name=ha \
        dc-version="2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c" \
        last-lrm-refresh=1637394576 \
        stonith-enabled=false \
        have-watchdog=true \
        stonith-watchdog-timeout=0 \
        placement-strategy=balanced
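
For anyone following along with pcs rather than crmsh, a roughly
equivalent test setup could be created along these lines (I use the
stock ocf:pacemaker:Dummy agent here instead of my local copy, and
rsc_Test-clone is just the default clone id pcs picks):

  # cloned Dummy resource with a failure-timeout, as in the crm config above
  pcs resource create rsc_Test ocf:pacemaker:Dummy \
      op monitor interval=10s \
      meta failure-timeout=30s \
      clone
  # keep the clone off the quorum-only node
  pcs constraint location rsc_Test-clone avoids qnetd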