[ClusterLabs] resource start after network reconnected

Fri Nov 19 10:40:21 EST 2021

> On 19.11.2021 17:36, john tillman wrote:
>>> On 18.11.2021 22:33, john tillman wrote:
>>>>
>>>> Greetings all,
>>>>
>>>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>>>
>>>> I have a mysql resource, cloned, that is behaving the way I wanted.  When
>>>> the node it is on is unplugged from the network, quorum is lost and the
>>>> mysqld service stops.  Great.  Oh, and fencing is disabled.
>>>>
>>>> When the network connectivity is restored I'd like it to restart, but it
>>>> doesn't.  What needs to be done to make this happen automatically?  Or
>>>> what section of the doc should I reread more thoroughly?
>>>>
>>>> When mysql is stopped because of the above, if I run "pcs resource
>>>> refresh" it starts.  Any ideas why the "refresh" would do that?
>>>>
>>>
>>> You provided zero information about your setup and how you configured
>>> pacemaker to stop mysqld on network connectivity loss, so it is rather
>>> hard to guess.
>>>
>>> Logs covering the period when you unplug the network, and later plug it
>>> in again, could also be helpful.
>>>
>>
>> Fair point.  I didn't want to put too much into the first email.  There
>> are 3 nodes, but only 2 are actually used for processing; the 3rd node
>> is there just for quorum purposes.  When quorum is lost my resources stop.
>> There are 3 resources: a VIP, the MySQL service, and controld (a
>> project-specific service).
>>
>> And this problem has now become intermittent: 1 in 4 tests this morning
>> succeeded in starting mysqld when the network was reconnected.  Figures :-/
>>
>> More info.  After reconnecting the network on spm238 the mysql resource
>> was listed as:
>>   * spmDB   (systemd:mysqld):   FAILED spm238 (blocked)
>>
>> This was cleared and mysqld started after issuing a "pcs resource
>> refresh".
>>
>
> pcs resource refresh deletes failure history so pacemaker tries to start
> resource again. It is completely unrelated to network interface
> conditions.
>
> "blocked" is default when resource stop operation fails and stonith is
> disabled.
>
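For anyone following along, a minimal sketch of how the recorded failure
could be inspected and cleared by hand.  The helper name is made up for
illustration; the pcs subcommands are the stock ones, and spmDB is the
resource from this thread.

```shell
# Hypothetical helper: show the recorded failures for one resource, then
# clear them so pacemaker is free to try the resource again.
# "pcs resource cleanup" forgets failed operations only; "pcs resource
# refresh" forgets the complete operation history.  Both re-detect the
# resource's current state afterwards.
show_and_clear_failures() {
    resource="$1"
    pcs resource failcount show "$resource"
    pcs resource cleanup "$resource"
}
```

Usage would be "show_and_clear_failures spmDB" on any cluster node.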
>> So as requested, here's how I set up my cluster.  It's copied from an
>> ansible playbook, so there are some variables shown, but it should be easy
>> enough to understand.  If not, I will gladly clarify anything.
>>
>> My 3 resources:
>>
>> pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }} \
>>     cidr_netmask=24 op monitor interval=10s
>> pcs resource create spmControl systemd:controld op monitor interval=10s
>> pcs resource create spmDB systemd:mysqld op monitor interval=10s clone
>>
>> My constraints:
>> pcs constraint colocation add spmControl with spmVIP INFINITY
>> pcs constraint colocation add spmVIP with spmDB-clone 200
>> crm_resource -r spmVIP -p resource-stickiness -m -v 100
>> crm_resource -r spmControl -p resource-stickiness -m -v 100
>>
>> Don't run resources on the quorum only node:
>> pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
>> pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
>> pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY
>>
>
> I have no idea what QOnlynode means here.
>

QOnlynode is my quorum-only node.  No resources run on it, and the
3 constraints above are how I configured that.

>> and stonith is false:
>> pcs property set stonith-enabled=false
>>
>
> I do not see anything in your configuration that would cause mysql to be
> stopped on network connectivity issues. Either mysql does it on its own,
> or pacemaker attempts to stop all resources on node when it goes out of
> quorum.
>
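A rough sketch of how to check which of the two cases applies.  With no
no-quorum-policy configured, pacemaker's default is "stop", which would
match the behaviour described here.  The helper name is hypothetical.

```shell
# Hypothetical helper: report whether this node currently has quorum and
# which no-quorum-policy is in effect ("stop" is pacemaker's default).
check_quorum_policy() {
    pcs quorum status
    # --all also lists properties that are unset, with their defaults
    pcs property list --all | grep no-quorum-policy
}
```

Running "check_quorum_policy" on the unplugged node, and again after
reconnecting it, should show the quorum state changing.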
> If mysql does it on its own, there is nothing that can be done from
> pacemaker side. Pacemaker is not aware of network state at all and
> certainly cannot initiate actions when network becomes available.
>
> If pacemaker tries to stop resources due to an out-of-quorum condition, you
> could set a suitable failure-timeout; this will be equivalent to using "pcs
> resource refresh". Keep in mind that pacemaker only checks for
> failure-timeout expiration every cluster-recheck-interval (15 minutes by
> default). This still is not directly related to network availability, but
> if the network outage resulted in the node going out of quorum, then once
> the network is back and the node has joined the cluster again, the expired
> failure will allow resources to be started on the node.
>
>
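To make the suggestion concrete, a sketch of what setting those two options
might look like.  The values are placeholders, not tested recommendations,
and the helper and variable names are made up for illustration.

```shell
# Placeholder values; pick ones that suit the cluster.
FAILURE_TIMEOUT="${FAILURE_TIMEOUT:-60s}"
RECHECK_INTERVAL="${RECHECK_INTERVAL:-1min}"

# Hypothetical helper: let recorded failures of spmDB expire on their own,
# and make the cluster re-evaluate its state more often than the default
# 15 minutes so that the expiry is noticed sooner.
enable_auto_retry() {
    pcs resource meta spmDB failure-timeout="$FAILURE_TIMEOUT"
    pcs property set cluster-recheck-interval="$RECHECK_INTERVAL"
}
```

Calling "enable_auto_retry" once is enough, since both settings are stored
in the CIB.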

When quorum is lost I want all the resources to stop.  The cluster is
performing this step correctly for me.

That cluster-recheck-interval would explain the intermittence I saw this
morning.  If I set that to 1 minute would that cause any gross negative
issues?

Is there another setting besides cluster-recheck-interval to consider
adjusting to start mysql when quorum is returned?

Thank you for the feedback.

-John

>> If you'd rather see the cib file I can supply that.
>>
>> With respect to logs, pacemaker.log has the most relevant info, right?  But
>> there's a lot: 900+ lines from the time I unplug the network until
>> mysql is restarted by the 'pcs resource refresh'.  Any suggestions for how
>> to present the info here?  Maybe use grep for some key words and include
>> those lines here?
>>
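The grep idea could be sketched roughly as follows.  The keyword list is
only a guess at what is relevant, and the helper name is hypothetical.

```shell
# Hypothetical helper: keep only the log lines that mention the resource,
# quorum changes, or errors and warnings (matched case-insensitively).
filter_cluster_log() {
    logfile="$1"
    resource="$2"
    grep -E -i "${resource}|quorum|error|warning" "$logfile"
}
```

For example: "filter_cluster_log /var/log/pacemaker/pacemaker.log spmDB".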
>>
>>>> It is definitely that call to refresh that triggers the start, because
>>>> I've run a handful of tests and the time between reconnecting the network
>>>> and the pcs resource refresh call varied by as much as 10 minutes.
>>>>
>>>> Any suggestion would be appreciated.
>>>>
>>>> Regards,
>>>> -John
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Manage your subscription:
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> ClusterLabs home: https://www.clusterlabs.org/
>>>>