[ClusterLabs] Issue with Pacemaker config related to VIP and an LSB resource

Andrei Borzenkov arvidjaar at gmail.com
Wed Jun 16 00:55:42 EDT 2021


On 16.06.2021 01:49, Michael Romero wrote:
> Hello,
> 
> I currently have Pacemaker v2.0.3-3ubuntu4.2 running on two Ubuntu 20.04
> LTS systems. My config consists of two service groups, each of which has
> an LSB resource and a floating IP resource. The LSB resource is
> configured with a monitor operation, so that
> "/etc/init.d/<lsb-resource-name> status" is run at 30-second intervals. The
> "status" portion of the script only returns a healthy exit code when it
> determines that the PID behind a PID file is active. I have also set an
> 'rsc_location' constraint so that the service group for VIP A prefers node
> A and VIP B prefers node B, so that ideally, with both nodes active and
> healthy, VIP A will always be running on node A, and VIP B on node B.
> 
> 
> The problem I'm having is that if I intentionally shut down the service
> that my "/etc/init.d/<lsb-resource-name> status" script is checking
> against, I get the following behavior:
> - I shut down the backing service on node B.
> - Pacemaker performs a status check, which returns a bad result.
> - Pacemaker then correctly migrates the VIP and the LSB resource for the
> now 'offline' service group from node B to node A.
> - Pacemaker 'failure-timeout' interval expires.
> - Pacemaker shuts down the VIP B service group on node A.
> - Pacemaker attempts to start the VIP B service group on node B, which
> fails.
> - Pacemaker starts the VIP B service group on node A.
> - Pacemaker 'failure-timeout' interval expires.
> - Pacemaker shuts down the VIP B service group on node A.
> - Pacemaker attempts to start the VIP B service group on node B, which
> fails.
> - Pacemaker starts the VIP B service group on node A.
> - .... and so on
> 

Set a positive resource stickiness so the resource remains where it is
currently active even after the failure-timeout expires. You may need to
adjust the actual value depending on how resource placement is computed in
your case.
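
With crmsh this could be a cluster-wide default along the lines below (the
value 100 is purely illustrative; for the resource to actually stay put it
has to outweigh the score of your rsc_location preference):

    crm configure rsc_defaults resource-stickiness=100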

> What I would LIKE to happen is for Pacemaker to attempt to run a "status"
> on node B PRIOR to stopping the service group on node A and attempting to
> start the service group on node B. Something like this:
> - Pacemaker 'failure-timeout' interval expires.
> - Pacemaker checks the status of the LSB service (/etc/init.d/<lsb resource
> name> status) which returns a bad error code.
> - Pacemaker 'failure-timeout' interval expires.
> - Pacemaker checks the status of the LSB service (/etc/init.d/<lsb resource
> name> status) which returns a bad error code.
> 

You can also monitor an inactive resource (this is off by default). It
sounds like you want your monitor to return something like
OCF_ERR_INSTALLED ("not installed"), which is a hard error and should
prevent the resource from being started on this node.
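
In configuration terms, monitoring the inactive resource means adding a
second recurring monitor with role=Stopped. A minimal crmsh sketch with a
hypothetical resource name (the two monitor intervals just have to differ):

    crm configure primitive p_my_service lsb:my-backing-service \
        op monitor interval=30s \
        op monitor interval=45s role=Stopped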

An LSB init script is simply not versatile enough to do that; you would
need an OCF resource agent for it.
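
For illustration, the monitor action of a hypothetical OCF wrapper agent
could distinguish "not running" from "cannot run on this node at all"
roughly as follows (the daemon path, PID file and prerequisite check are
made up; only the return codes come from the OCF API):

    #!/bin/sh
    # Pull in the OCF shell functions, which define the OCF_* return codes.
    : ${OCF_FUNCTIONS_DIR:=${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    DAEMON=/usr/sbin/my-backing-service          # hypothetical path
    PIDFILE=/var/run/my-backing-service.pid      # hypothetical path

    my_service_monitor() {
        # Hard error: the service cannot run on this node at all, so the
        # cluster should not even try to start it here.
        if [ ! -x "$DAEMON" ]; then
            return $OCF_ERR_INSTALLED
        fi
        # Ordinary "stopped": no PID file, or the recorded PID is gone.
        if [ ! -f "$PIDFILE" ] || ! kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
            return $OCF_NOT_RUNNING
        fi
        return $OCF_SUCCESS
    }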

> At which point an administrator or an automated script could intervene and
> bring the backing service online, at which point we would have this
> behavior:
> - Pacemaker 'failure-timeout' interval expires.
> - Pacemaker checks the status of the LSB service (/etc/init.d/<lsb resource
> name> status) which returns a HEALTHY error code.

You seem to be confusing two independent conditions: "the resource is
healthy (started, active)" and "it is possible to start the resource on
this node". The only possible output of an LSB status action is "started"
or "stopped"; it carries no information about what would happen if you
tried to start it.

But in general, I guess the idea of rechecking a resource once after the
failure timeout (similar to the initial probe) sounds interesting. It
could be more robust in that the resource agent could check whether
starting the resource is possible at all right now, and so prevent an
unsuccessful attempt to migrate the resource back to its original node.

> - Pacemaker shuts down the VIP B service group on node A.
> - Pacemaker starts the VIP B service group on node B.
> 
> I have attached an obfuscated pastebin of my current Pacemaker
> configuration, as well as a copy of the logs for the pacemaker service
> from when the initial failure occurs, also capturing the repetitive
> failed attempts to start the LSB resource.
> 
> 
> Obfuscated "crm configure show"
> 
> https://pastebin.com/emAw8juQ
> 
> 
> Obfuscated "journalctl -fu pacemaker"
> 
> https://pastebin.com/kcnfCrjf
> 
> 
> 
> Please let me know if there is a configuration parameter I can place in my
> config that would tell Pacemaker to perform a status check on the LSB
> resource PRIOR to attempting to start the service group on its preferred
> node.
> 
> 


