[ClusterLabs] resource start after network reconnected

Fri Nov 19 09:57:46 EST 2021

On 19.11.2021 17:36, john tillman wrote:
>> On 18.11.2021 22:33, john tillman wrote:
>>>
>>> Greetings all,
>>>
>>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>>
>>> I have a mysql resource, cloned, that is behaving the way I wanted.
>>> When the node it is on is unplugged from the network, quorum is lost
>>> and the mysqld service stops.  Great.  Oh, and fencing is disabled.
>>>
>>> When the network connectivity is restored I'd like it to restart, but it
>>> doesn't.  What needs to be done to make this happen automatically?  Or
>>> what section of the doc should I reread more thoroughly?
>>>
>>> When mysql is stopped because of the above, if I run "pcs resource
>>> refresh" it starts.  Any ideas why the "refresh" would do that?
>>>
>>
>> You provided zero information about your setup and how you configured
>> pacemaker to stop mysqld on network connectivity loss, so it is rather
>> hard to guess.
>>
>> Logs covering the period when you unplug the network and later plug it
>> in again could also be helpful.
>>
> 
> Fair point.  I didn't want to put too much into the first email.  There
> are 3 nodes but 2 nodes are actually used for processing and the 3rd node
> is there just for quorum purposes.  When quorum is lost my resources stop.
> There are 3 resources: a VIP, MySQL service, and controld (a
> project-specific service).
> 
> And this problem has now become intermittent as 1 in 4 tests this morning
> succeeded in starting mysqld when the network was reconnected.  Figures
> :-/
> 
> More info.  After reconnecting the network on spm238 the mysql resource
> was listed as:
>   * spmDB   (systemd:mysqld):   FAILED spm238 (blocked)
> 
> This was cleared and mysqld started after issuing a "pcs resource refresh".
> 

"pcs resource refresh" deletes the resource's operation history, including recorded failures, so Pacemaker re-probes the resource and tries to start it again. It is completely unrelated to the state of the network interface.
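
As a rough sketch (the resource name spmDB is taken from the configuration quoted below; the actual output depends on the cluster), the recorded failures can be inspected before they are cleared:

```shell
# Show the fail counts Pacemaker has recorded for the resource
pcs resource failcount show spmDB

# One-shot cluster status that also lists fail counts
crm_mon --one-shot --failcounts

# Delete the operation history (including failures) and re-probe the resource
pcs resource refresh spmDB
```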

"blocked" is the default outcome when a resource's stop operation fails and stonith is disabled.

> So as requested here's how I set up my cluster.  It's copied from an
> ansible playbook so there are some variables shown but should be easy
> enough to understand.  If not, I will gladly clarify anything.
> 
> My 3 resources:
> 
> pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }} \
>     cidr_netmask=24 op monitor interval=10s
> pcs resource create spmControl systemd:controld op monitor interval=10s
> pcs resource create spmDB systemd:mysqld op monitor interval=10s clone
> 
> My constraints:
> pcs constraint colocation add spmControl with spmVIP INFINITY
> pcs constraint colocation add spmVIP with spmDB-clone 200
> crm_resource -r spmVIP -p resource-stickiness -m -v 100
> crm_resource -r spmControl -p resource-stickiness -m -v 100
> 
> Don't run resources on the quorum only node:
> pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
> pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
> pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY
> 

I have no idea what QOnlynode means here.

> and stonith is false:
> pcs property set stonith-enabled=false
> 

I do not see anything in your configuration that would cause mysql to be stopped on network connectivity issues. Either mysql does it on its own, or Pacemaker attempts to stop all resources on the node when it goes out of quorum.
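
The second case corresponds to the no-quorum-policy cluster property, whose default value "stop" makes Pacemaker stop all resources in a partition that has lost quorum. A minimal way to check which value is in effect and whether the node currently has quorum, assuming the stock pcs and corosync tools on RHEL 8:

```shell
# List all cluster properties, including unset ones with their defaults,
# and pick out the quorum policy
pcs property show --all | grep no-quorum-policy

# Quorum state as seen by corosync on this node
corosync-quorumtool -s

# The same information through pcs
pcs quorum status
```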

If mysql does it on its own, nothing can be done from the Pacemaker side. Pacemaker is not aware of network state at all and certainly cannot initiate actions when the network becomes available.

If Pacemaker stops the resources because the node is out of quorum, you could set a suitable failure-timeout; once it expires, the effect is equivalent to running "pcs resource refresh". Keep in mind that Pacemaker only checks for failure-timeout expiration every cluster-recheck-interval (15 minutes by default). This is still not directly related to network availability, but if the network outage caused the node to lose quorum, then once the network is back and the node has rejoined the cluster, the expired failure will allow resources to be started on that node again.
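
A sketch of what that could look like for the spmDB resource. The 60s and 2min values are only illustrative assumptions, not recommendations; pick values that suit your environment:

```shell
# Let recorded failures of the database resource expire after 60 seconds
# (example value)
pcs resource update spmDB meta failure-timeout=60s

# Optionally re-evaluate the cluster more often than the 15-minute default
# (example value)
pcs property set cluster-recheck-interval=2min

# Confirm that the meta attribute was stored
pcs resource config spmDB
```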

> If you'd rather see the cib file I can supply that.
> 
> With respect to logs, pacemaker.log has the most relevant info, right, but
> there's a lot.  It's 900+ lines from the time I unplug the network until
> mysql is restarted by the 'pcs resource refresh'.  Any suggestions for how
> to present the info here?  Maybe use grep for some key words and include
> those lines here?
> 
> 
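
One possible way to trim the log before posting it, as a sketch only: filter_cluster_log is a hypothetical helper, and the keywords (the resource name plus "quorum") and the log path in the usage comment are assumptions to adjust as needed.

```shell
# Hypothetical helper: keep only the lines that mention the given resource
# or the word "quorum" (case-insensitive).
filter_cluster_log() {
    logfile=$1
    resource=$2
    grep -iE -- "${resource}|quorum" "$logfile"
}

# Example usage (path assumes the default RHEL 8 log location):
#   filter_cluster_log /var/log/pacemaker/pacemaker.log spmDB > spmDB-excerpt.log
```
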
>>> It is definitely that call to refresh that triggers the start because
>>> I've run a handful of tests and the time between reconnecting the
>>> network and the pcs resource refresh call varied by as much as 10
>>> minutes.
>>>
>>> Any suggestion would be appreciated.
>>>
>>> Regards,
>>> -John
>>>
>>>
>>>
>>> _______________________________________________
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/
>>>
>>
> 
> 