[ClusterLabs] resource start after network reconnected

john tillman johnt at panix.com
Fri Nov 19 11:26:01 EST 2021


>> On 19.11.2021 17:36, john tillman wrote:
>>>> On 18.11.2021 22:33, john tillman wrote:
>>>>>
>>>>> Greetings all,
>>>>>
>>>>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>>>>
>>>>> I have a mysql resource, cloned, that is behaving the way I want.
>>>>> When the node it is on is unplugged from the network, quorum is
>>>>> lost and the mysqld service stops.  Great.  Oh, and fencing is
>>>>> disabled.
>>>>>
>>>>> When network connectivity is restored I'd like it to restart, but
>>>>> it doesn't.  What needs to be done to make this happen
>>>>> automatically?  Or what section of the docs should I reread more
>>>>> thoroughly?
>>>>>
>>>>> When mysql is stopped because of the above, if I run "pcs resource
>>>>> refresh" it starts.  Any idea why the "refresh" would do that?
>>>>>
>>>>
>>>> You provided zero information about your setup and how you configured
>>>> pacemaker to stop mysqld on network connectivity loss, so it is rather
>>>> hard to guess.
>>>>
>> Logs covering the period when you unplug the network, and later plug
>> it back in, would also be helpful.
>>>>
>>>
>>> Fair point.  I didn't want to put too much into the first email.
>>> There are 3 nodes, but 2 nodes are actually used for processing and
>>> the 3rd node is there just for quorum purposes.  When quorum is lost
>>> my resources stop.  There are 3 resources: a VIP, MySQL service, and
>>> controld (a project-specific service).
>>>
>>> And this problem has now become intermittent: 1 in 4 tests this
>>> morning succeeded in starting mysqld when the network was
>>> reconnected.  Figures :-/
>>>
>>> More info.  After reconnecting the network on spm238, the mysql
>>> resource was listed as:
>>>   * spmDB   (systemd:mysqld):   FAILED spm238 (blocked)
>>>
>>> This was cleared and mysqld started after issuing a "pcs resource
>>> refresh".
>>>
>>
>> pcs resource refresh deletes the failure history, so pacemaker tries
>> to start the resource again.  It is completely unrelated to network
>> interface conditions.
>>
>> "blocked" is the default when a resource's stop operation fails and
>> stonith is disabled.
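>>
>> For example, the failure history can be inspected and cleared per
>> resource (a sketch, using the resource name from your config):
>>
>> pcs resource failcount show spmDB
>> pcs resource cleanup spmDB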
>>
>>> So as requested, here's how I set up my cluster.  It's copied from
>>> an ansible playbook, so some variables are shown, but it should be
>>> easy enough to understand.  If not, I will gladly clarify anything.
>>>
>>> My 3 resources:
>>>
>>> pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }}
>>> cidr_netmask=24 op monitor interval=10s
>>> pcs resource create spmControl systemd:controld op monitor interval=10s
>>> pcs resource create spmDB systemd:mysqld op monitor interval=10s clone
>>>
>>> My constraints:
>>> pcs constraint colocation add spmControl with spmVIP INFINITY
>>> pcs constraint colocation add spmVIP with spmDB-clone 200
>>> crm_resource -r spmVIP -p resource-stickiness -m -v 100
>>> crm_resource -r spmControl -p resource-stickiness -m -v 100
>>>
>>> Don't run resources on the quorum only node:
>>> pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
>>> pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
>>> pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY
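>>>
>>> For reference, the two crm_resource calls above should be equivalent
>>> to this pcs form (assuming pcs 0.10 syntax):
>>>
>>> pcs resource meta spmVIP resource-stickiness=100
>>> pcs resource meta spmControl resource-stickiness=100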
>>>
>>
>> I have no idea what QOnlynode means here.
>>
>
> This is my quorum-only node.  Resources are not run on it, and the 3
> constraints above are how I configured that.
>
>>> and stonith is false:
>>> pcs property set stonith-enabled=false
>>>
>>
>> I do not see anything in your configuration that would cause mysql to
>> be stopped on network connectivity issues.  Either mysql does it on
>> its own, or pacemaker attempts to stop all resources on the node when
>> it goes out of quorum.
>>
>> If mysql does it on its own, there is nothing that can be done from
>> the pacemaker side.  Pacemaker is not aware of the network state at
>> all and certainly cannot initiate actions when the network becomes
>> available again.
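>>
>> (Aside: if you did want pacemaker to track connectivity, the usual
>> approach is an ocf:pacemaker:ping clone plus a location rule on the
>> pingd node attribute it maintains.  A minimal sketch, where the
>> resource name and the 192.0.2.1 ping target are placeholders:
>>
>> pcs resource create netping ocf:pacemaker:ping host_list=192.0.2.1 dampen=5s multiplier=1000 op monitor interval=10s clone
>> pcs constraint location spmDB-clone rule score=-INFINITY pingd lt 1 or not_defined pingd
>>
>> That keeps the clone off any node that cannot reach the listed host.)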
>>
>> If pacemaker tries to stop resources due to the out-of-quorum
>> condition, you could set a suitable failure-timeout; this will be
>> equivalent to running "pcs resource refresh".  Keep in mind that
>> pacemaker only checks for failure-timeout expiration every
>> cluster-recheck-interval (15 minutes by default).  This is still not
>> directly related to network availability, but if the network outage
>> resulted in the node going out of quorum, then once the network is
>> back and the node has joined the cluster again, resources will be
>> allowed to start on it.
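>>
>> A sketch of that, as a per-resource meta attribute plus a
>> cluster-wide property (the values are only examples):
>>
>> pcs resource meta spmDB failure-timeout=120s
>> pcs property set cluster-recheck-interval=2min
>>
>> One caveat, as far as I know: failure-timeout does not apply to a
>> failed *stop* that left the resource blocked; that still needs a
>> manual cleanup/refresh.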
>>
>
> When quorum is lost I want all the resources to stop.  The cluster is
> performing this step correctly for me.
>
> That cluster-recheck-interval would explain the intermittence I saw
> this morning.  If I set it to 1 minute, would that cause any gross
> negative issues?
>


I tried setting cluster-recheck-interval to 1 minute and I saw no change
to the resources after reconnecting the network.  They were still listed
as blocked.  However, "pcs resource refresh" started mysqld, as usual in
this scenario.
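
For reference, the manual workaround that clears the blocked state is
simply (it can also be scoped to a single resource):

pcs resource refresh spmDB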

Anyone have any other ideas for a configuration setting that will
effectively do whatever 'pcs resource refresh' is doing when quorum is
restored?

-John


> Is there another setting besides cluster-recheck-interval to consider
> adjusting, to start mysql when quorum returns?
>
> Thank you for the feedback.
>
> -John
>
>
>>> If you'd rather see the cib file I can supply that.
>>>
>>> With respect to logs, pacemaker.log has the most relevant info,
>>> right?  But there's a lot.  It's 900+ lines from the time I unplug
>>> the network until mysql is restarted by the 'pcs resource refresh'.
>>> Any suggestions for how to present the info here?  Maybe grep for
>>> some keywords and include those lines here?
>>>
>>>
>>>>> It is definitely the call to refresh that triggers the start,
>>>>> because I've run a handful of tests and the time between
>>>>> reconnecting the network and the "pcs resource refresh" call
>>>>> varied by as much as 10 minutes.
>>>>>
>>>>> Any suggestion would be appreciated.
>>>>>
>>>>> Regards,
>>>>> -John
>>>>>
>>>>>
>>>>>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/