[ClusterLabs] resource start after network reconnected
Andrei Borzenkov
arvidjaar at gmail.com
Sat Nov 20 02:59:19 EST 2021
On 19.11.2021 19:26, john tillman wrote:
...
>>>
>>> If pacemaker tries to stop resources due to an out-of-quorum condition,
>>> you could set a suitable failure-timeout; this has the same effect as
>>> running "pcs resource refresh". Keep in mind that pacemaker only checks
>>> for failure-timeout expiration every cluster-recheck-interval (15
>>> minutes by default). This still is not directly related to network
>>> availability, but if a network outage caused the node to go out of
>>> quorum, then once the network is back and the node has rejoined the
>>> cluster, resources will be allowed to start on it again.
>>>
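For a pcs-managed cluster, those two settings are normally applied with
something along these lines (the resource id is just a placeholder for
whatever your resource is called):

  # let recorded failures expire after 60 seconds
  pcs resource meta <your-resource-id> failure-timeout=60s
  # re-check for expired failures every minute instead of the 15-minute default
  pcs property set cluster-recheck-interval=60s
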
>>
>> When quorum is lost, I want all the resources to stop. The cluster is
>> performing this step correctly for me.
>>
>> That cluster-recheck-interval would explain the intermittent behavior I
>> saw this morning. If I set it to 1 minute, would that cause any gross
>> negative issues?
>>
>
>
> I tried setting cluster-recheck-interval to 1 minute and saw no change
> to the resources after reconnecting the network. They were still listed
> as stopped. However, "pcs resource refresh" started them, as usual in
> this scenario.
>
> Anyone have any other ideas for a configuration setting that will
> effectively do what 'pcs resource refresh' does when quorum is
> restored?
>
I already told you above, and it most certainly works here.
Without a failure-timeout the resource is stuck in the blocked state:
Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:48:48 2021
  * Last change:  Sat Nov 20 10:46:55 2021 by root via cibadmin on ha1
  * 3 nodes configured
  * 3 resource instances configured (1 BLOCKED from further action due to failure)

Node List:
  * Online: [ ha1 ha2 qnetd ]

Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * rsc_Test  (ocf::_local:Dummy):   FAILED ha1 (blocked)
    * Started: [ ha2 ]
    * Stopped: [ qnetd ]

Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=1000000:
      * (10) start
      * (11) monitor: interval="10000ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=1000000 fail-count=1000000 last-failure='Sat Nov 20 10:47:14 2021':
      * (18) start
      * (30) stop

Failed Resource Actions:
  * rsc_Test_stop_0 on ha1 'error' (1): call=30, status='complete', exitreason='forced to fail stop operation', last-rc-change='2021-11-20 10:47:14 +03:00', queued=0ms, exec=27ms
With a failure-timeout the resource is restarted after the timeout expires:
Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:53:51 2021
  * Last change:  Sat Nov 20 10:50:37 2021 by root via cibadmin on ha2
  * 3 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ ha1 ha2 qnetd ]

Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * Started: [ ha1 ha2 ]
    * Stopped: [ qnetd ]

Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=1000000:
      * (18) probe
      * (18) probe
      * (19) monitor: interval="10000ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=1000000:
      * (40) probe
      * (40) probe
      * (41) monitor: interval="10000ms"
Configuration:

node 1: ha1 \
        attributes pingd=1 \
        utilization cpu=20
node 2: ha2 \
        attributes pingd=1 \
        utilization cpu=20
node 3: qnetd
primitive rsc_Test ocf:_local:Dummy \
        meta failure-timeout=30s \
        op monitor interval=10s
clone cln_Test rsc_Test
location not_on_qnetd cln_Test -inf: qnetd
property cib-bootstrap-options: \
        cluster-infrastructure=corosync \
        cluster-name=ha \
        dc-version="2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c" \
        last-lrm-refresh=1637394576 \
        stonith-enabled=false \
        have-watchdog=true \
        stonith-watchdog-timeout=0 \
        placement-strategy=balanced
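
For anyone following along with pcs rather than crmsh, a roughly
equivalent test setup could be created along these lines (I use the
stock ocf:pacemaker:Dummy agent here instead of my local copy, and
rsc_Test-clone is just the default clone id pcs picks):

  # cloned Dummy resource with a failure-timeout, as in the crm config above
  pcs resource create rsc_Test ocf:pacemaker:Dummy \
      op monitor interval=10s \
      meta failure-timeout=30s \
      clone
  # keep the clone off the quorum-only node
  pcs constraint location rsc_Test-clone avoids qnetd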