[ClusterLabs] resource start after network reconnected
    Andrei Borzenkov 
    arvidjaar at gmail.com
       
    Sat Nov 20 02:59:19 EST 2021
    
    
  
On 19.11.2021 19:26, john tillman wrote:
...
>>>
>>> If pacemaker tries to stop resources due to an out-of-quorum condition,
>>> you could set a suitable failure-timeout; its expiration is equivalent
>>> to running "pcs resource refresh". Keep in mind that pacemaker only
>>> checks for failure-timeout expiration every cluster-recheck-interval
>>> (15 minutes by default). This is still not directly tied to network
>>> availability, but if the network outage caused the node to drop out of
>>> quorum, resources will be allowed to start on the node again once the
>>> network is back and the node has rejoined the cluster.
>>>
>>
>> When quorum is lost I want all the resources to stop.  The cluster is
>> performing this step correctly for me.
>>
>> That cluster-recheck-interval would explain the intermittent behavior I
>> saw this morning.  If I set it to 1 minute, would that cause any gross
>> negative issues?
>>
> 
> 
> I tried setting cluster-recheck-interval to 1 minute and I saw no change
> to the resources after reconnecting the network.  They were still listed
> as stopped.  However, "pcs resource refresh" started them, as usual in
> this scenario.
> 
> Anyone have any other ideas for a configuration setting that will
> effectively do whatever 'pcs resource refresh' is doing when quorum is
> restored?
> 
I already told you above and it most certainly works here.
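If you are on pcs, the two settings I mean can be applied roughly like
this (a sketch, not tested against your setup; "my_resource" is a
placeholder for your resource name, and 30s/60s are just the values I use
in my test below):

# set failure-timeout on the resource so the failure expires on its own
pcs resource meta my_resource failure-timeout=30s
# optionally make pacemaker re-check expired failures more often than
# the default 15 minutes
pcs property set cluster-recheck-interval=60s
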
Without a failure-timeout, the resource is stuck in the blocked state:
Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:48:48 2021
  * Last change:  Sat Nov 20 10:46:55 2021 by root via cibadmin on ha1
  * 3 nodes configured
  * 3 resource instances configured (1 BLOCKED from further action due to failure)
Node List:
  * Online: [ ha1 ha2 qnetd ]
Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * rsc_Test	(ocf::_local:Dummy):	 FAILED ha1 (blocked)
    * Started: [ ha2 ]
    * Stopped: [ qnetd ]
Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=1000000:
      * (10) start
      * (11) monitor: interval="10000ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=1000000 fail-count=1000000 last-failure='Sat Nov 20 10:47:14 2021':
      * (18) start
      * (30) stop
Failed Resource Actions:
  * rsc_Test_stop_0 on ha1 'error' (1): call=30, status='complete', exitreason='forced to fail stop operation', last-rc-change='2021-11-20 10:47:14 +03:00', queued=0ms, exec=27ms
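For reference, the fail count that failure-timeout is supposed to expire
can also be inspected and cleared by hand (a sketch; the resource and node
names are the ones from my test):

# show the current fail count for the resource on ha1
crm_failcount --query -r rsc_Test -N ha1
# clearing the failure by hand has the same end result as waiting for
# failure-timeout (and is roughly what "pcs resource refresh" ends up
# doing for a failed resource)
crm_resource --cleanup -r rsc_Test -N ha1
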
With a failure-timeout, the resource is restarted after the timeout expires:
Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:53:51 2021
  * Last change:  Sat Nov 20 10:50:37 2021 by root via cibadmin on ha2
  * 3 nodes configured
  * 3 resource instances configured
Node List:
  * Online: [ ha1 ha2 qnetd ]
Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * Started: [ ha1 ha2 ]
    * Stopped: [ qnetd ]
Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=1000000:
      * (18) probe
      * (18) probe
      * (19) monitor: interval="10000ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=1000000:
      * (40) probe
      * (40) probe
      * (41) monitor: interval="10000ms"
Configuration:
node 1: ha1 \
	attributes pingd=1 \
	utilization cpu=20
node 2: ha2 \
	attributes pingd=1 \
	utilization cpu=20
node 3: qnetd
primitive rsc_Test ocf:_local:Dummy \
	meta failure-timeout=30s \
	op monitor interval=10s
clone cln_Test rsc_Test
location not_on_qnetd cln_Test -inf: qnetd
property cib-bootstrap-options: \
	cluster-infrastructure=corosync \
	cluster-name=ha \
	dc-version="2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c" \
	last-lrm-refresh=1637394576 \
	stonith-enabled=false \
	have-watchdog=true \
	stonith-watchdog-timeout=0 \
	placement-strategy=balanced
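
If you want to reproduce this test with pcs instead of crmsh, something
like the following should be close (a sketch assuming pcs 0.10 or newer;
ocf:pacemaker:Dummy stands in for my local copy of the Dummy agent, and
rsc_Test-clone is the clone id pcs creates by default):

pcs resource create rsc_Test ocf:pacemaker:Dummy \
	op monitor interval=10s \
	meta failure-timeout=30s \
	clone
# keep the clone off node qnetd, matching the location constraint above
pcs constraint location rsc_Test-clone avoids qnetd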
    
    