[ClusterLabs] resource start after network reconnected

Fri Nov 19 09:36:34 EST 2021

> On 18.11.2021 22:33, john tillman wrote:
>>
>> Greetings all,
>>
>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>
>> I have a cloned mysql resource that is behaving the way I wanted.
>> When the node it is on is unplugged from the network, quorum is lost
>> and the mysqld service stops.  Great.  Oh, and fencing is disabled.
>>
>> When network connectivity is restored I'd like it to restart, but it
>> doesn't.  What needs to be done to make this happen automatically?  Or
>> which section of the docs should I reread more thoroughly?
>>
>> When mysql has been stopped this way and I run "pcs resource refresh",
>> it starts.  Any ideas why the "refresh" would do that?
>>
>
> You provided zero information about your setup and how you configured
> pacemaker to stop mysqld on network connectivity loss, so it is rather
> hard to guess.
>
> Logs covering the period when you unplug the network, and later plug it
> in again, could also be helpful.
>

Fair point.  I didn't want to put too much into the first email.  There
are 3 nodes, but only 2 are actually used for processing; the 3rd node is
there just for quorum.  When quorum is lost my resources stop.  There are
3 resources: a VIP, the MySQL service, and controld (a project-specific
service).
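The quorum arithmetic behind that layout, as a minimal sketch assuming corosync votequorum's default of one vote per node and no special options such as two_node or wait_for_all: a partition is quorate only with a strict majority of the expected votes, so the unplugged node (1 of 3 votes) loses quorum while the two still-connected nodes (2 of 3) keep it. The function name `is_quorate` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a partition keeps quorum, assuming
# one vote per node (votequorum default) and no special quorum options.
# Usage: is_quorate <votes_in_partition> <expected_votes>
is_quorate() {
    votes=$1
    expected=$2
    # Quorum is a strict majority: floor(expected / 2) + 1 votes.
    needed=$(( expected / 2 + 1 ))
    if [ "$votes" -ge "$needed" ]; then
        echo "quorate ($votes of $expected votes, $needed needed)"
        return 0
    fi
    echo "NOT quorate ($votes of $expected votes, $needed needed)"
    return 1
}
```

For example, `is_quorate 1 3` prints "NOT quorate (1 of 3 votes, 2 needed)" and returns non-zero, which matches the isolated node stopping its resources.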

And this problem has now become intermittent: 1 in 4 tests this morning
succeeded in starting mysqld when the network was reconnected.  Figures :-/

More info.  After reconnecting the network on spm238 the mysql resource
was listed as:
  * spmDB   (systemd:mysqld):   FAILED spm238 (blocked)

This was cleared and mysqld started after issuing a "pcs resource refresh".
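One plausible reading of the `(blocked)` state, based on Pacemaker's documented `on-fail` defaults and assuming the failed operation was a stop: a stop failure is escalated to fencing when STONITH is enabled and to `block` otherwise, so with `stonith-enabled=false` the resource would stay blocked until its operation history is cleared, which is what `pcs resource refresh` does. The sketch below only lists blocked resources from `crm_mon -1` output so they can be refreshed one at a time; the function name `list_blocked` and the line layout it expects (copied from the status line above) are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print "<resource> <node>" for every resource that
# crm_mon reports as FAILED ... (blocked). Reads crm_mon text on stdin,
# assuming the layout shown in the status line above:
#   * spmDB   (systemd:mysqld):   FAILED spm238 (blocked)
list_blocked() {
    awk '/FAILED/ && /\(blocked\)/ {
        # Skip the leading "*" bullet if present.
        id = ($1 == "*") ? $2 : $1
        node = ""
        # The node name follows the FAILED keyword.
        for (i = 1; i < NF; i++) if ($i == "FAILED") node = $(i + 1)
        print id, node
    }'
}

# Possible use (assumption, not tested against a live cluster):
#   crm_mon -1 | list_blocked | while read -r res node; do
#       pcs resource refresh "$res"
#   done
```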

So, as requested, here's how I set up my cluster.  It's copied from an
Ansible playbook, so some variables are shown, but it should be easy
enough to understand.  If not, I will gladly clarify anything.

My 3 resources:

pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }} cidr_netmask=24 op monitor interval=10s
pcs resource create spmControl systemd:controld op monitor interval=10s
pcs resource create spmDB systemd:mysqld op monitor interval=10s clone
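The same three resources can also be staged in an offline copy of the CIB with `pcs -f` and pushed in one step, so the cluster never acts on a half-built configuration. A minimal sketch, assuming `SPMVIP` stands in for the Ansible `{{ spmvip }}` value (192.0.2.10 is a documentation placeholder), `CIB_FILE` is any scratch path, and `DRY_RUN=1` only prints the commands; the helper names `run` and `stage_spm_resources` are hypothetical.

```shell
#!/bin/sh
# Hypothetical settings; override them from the environment.
SPMVIP="${SPMVIP:-192.0.2.10}"           # stand-in for {{ spmvip }}
CIB_FILE="${CIB_FILE:-/tmp/spm-cib.xml}" # scratch copy of the CIB
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

stage_spm_resources() {
    # Dump the live CIB to a file, edit the file, then push it back.
    run pcs cluster cib "$CIB_FILE" || return 1
    run pcs -f "$CIB_FILE" resource create spmVIP ocf:heartbeat:IPaddr2 \
        ip="$SPMVIP" cidr_netmask=24 op monitor interval=10s || return 1
    run pcs -f "$CIB_FILE" resource create spmControl systemd:controld \
        op monitor interval=10s || return 1
    run pcs -f "$CIB_FILE" resource create spmDB systemd:mysqld \
        op monitor interval=10s clone || return 1
    run pcs cluster cib-push "$CIB_FILE"
}
```

Running `stage_spm_resources` with the defaults prints the five pcs commands; setting `DRY_RUN=0` executes them instead.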

My constraints:
pcs constraint colocation add spmControl with spmVIP INFINITY
pcs constraint colocation add spmVIP with spmDB-clone 200
crm_resource -r spmVIP -p resource-stickiness -m -v 100
crm_resource -r spmControl -p resource-stickiness -m -v 100
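pcs can set the same resource-stickiness meta attribute as the two `crm_resource` calls, which keeps the whole setup in one tool. A dry-run sketch of this constraint block, assuming the helper names `run` and `apply_spm_constraints` (both hypothetical) and `DRY_RUN=1` to print rather than execute:

```shell
#!/bin/sh
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

apply_spm_constraints() {
    # spmControl must run where spmVIP runs; spmVIP prefers a node with spmDB.
    run pcs constraint colocation add spmControl with spmVIP INFINITY || return 1
    run pcs constraint colocation add spmVIP with spmDB-clone 200 || return 1
    # pcs counterpart of: crm_resource -r <rsc> -p resource-stickiness -m -v 100
    for rsc in spmVIP spmControl; do
        run pcs resource meta "$rsc" resource-stickiness=100 || return 1
    done
}
```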

Don't run resources on the quorum-only node:
pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY
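The three avoid constraints differ only in the resource ID, so a loop keeps them consistent if a resource is added later. A sketch, assuming `QONLY_NODE` stands in for the Ansible `{{ QOnlynode }}` value (the default `spm-quorum` is a made-up placeholder) and that `run` and `ban_from_quorum_node` are hypothetical helper names:

```shell
#!/bin/sh
QONLY_NODE="${QONLY_NODE:-spm-quorum}"  # hypothetical stand-in for {{ QOnlynode }}
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Keep every resource off the quorum-only node with an INFINITY avoid score.
ban_from_quorum_node() {
    for rsc in spmVIP spmControl spmDB-clone; do
        run pcs constraint location "$rsc" avoids "$QONLY_NODE=INFINITY" || return 1
    done
}
```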

And STONITH is disabled:
pcs property set stonith-enabled=false

If you'd rather see the CIB file, I can supply that.

With respect to logs, pacemaker.log has the most relevant info, right?  But
there's a lot: 900+ lines from the time I unplug the network until mysql
is restarted by the 'pcs resource refresh'.  Any suggestions for how to
present the info here?  Maybe grep for some keywords and include those
lines here?
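One way to cut the 900+ lines down before posting, as a sketch: keep only lines that mention the resource or one of a few keywords, with line numbers so readers can ask for surrounding context. The keyword list (quorum, blocked, error, warning) and the function name `pacemaker_excerpt` are assumptions to adjust; /var/log/pacemaker/pacemaker.log is the usual location with Pacemaker 2.0 on RHEL 8, but check your system.

```shell
#!/bin/sh
# Hypothetical helper: print pacemaker.log lines that mention a resource or
# one of a few keywords (case-insensitive), prefixed with line numbers.
# The keyword list is an assumption; extend it as needed.
# Usage: pacemaker_excerpt <logfile> <resource-id>
pacemaker_excerpt() {
    logfile=$1
    rsc=$2
    grep -n -i -E "quorum|blocked|error|warning|$rsc" "$logfile"
}
```

For example, `pacemaker_excerpt /var/log/pacemaker/pacemaker.log spmDB` would list the quorum changes and every spmDB action in the test window.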

>> It is definitely that call to refresh that triggers the start, because
>> I've run a handful of tests and the time between reconnecting the
>> network and the pcs resource refresh call varied by as much as 10
>> minutes.
>>
>> Any suggestion would be appreciated.
>>
>> Regards,
>> -John
>>
>>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
>