[ClusterLabs] Issue with Pacemaker config related to VIP and an LSB resource

Michael Romero mromero at convoso.com
Tue Jun 15 18:49:16 EDT 2021


Hello,

I currently have Pacemaker v2.0.3-3ubuntu4.2 running on two Ubuntu 20.04
LTS systems. My config consists of two service groups, each containing an
LSB resource and a floating IP resource. The LSB resource is configured
with a monitor operation, so that "/etc/init.d/<lsb-resource-name> status"
is run at 30-second intervals; the "status" portion of the script only
returns a healthy exit code when it determines that the PID behind a
pidfile is active. I have also set 'rsc_location' constraints so that the
service group for VIP A prefers node A and the group for VIP B prefers
node B, so that ideally, with both nodes active and healthy, VIP A is
always running on node A and VIP B on node B. A rough sketch of this
layout follows.
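
For illustration, the layout for the VIP B group looks roughly like the
following (the resource names, the LSB script name, the IP, and the node
name here are placeholders; the real, obfuscated values are in the
"crm configure show" pastebin linked below):

    primitive vip_b ocf:heartbeat:IPaddr2 \
        params ip=192.0.2.11 cidr_netmask=24 \
        op monitor interval=30s
    primitive svc_b lsb:my-backing-service \
        op monitor interval=30s
    group grp_b svc_b vip_b
    location grp_b_prefers_node_b grp_b 100: node-b

The group for VIP A mirrors this, with its location constraint preferring
node A.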


The problem that I'm having is that if I intentionally shut down the
service that my "/etc/init.d/<lsb-resource-name> status" script is checking
against, I get the following behavior (a sketch of the meta attribute
involved follows the list):
- I shut down the backing service on node B.
- Pacemaker performs a status check, which returns a bad result.
- Pacemaker then correctly migrates the VIP and the LSB resource for the
now 'offline' service group from node B to node A.
- Pacemaker 'failure-timeout' interval expires.
- Pacemaker shuts down the VIP B service group on node A.
- Pacemaker attempts to start the VIP B service group on node B, which
fails.
- Pacemaker starts the VIP B service group on node A.
- Pacemaker 'failure-timeout' interval expires.
- Pacemaker shuts down the VIP B service group on node A.
- Pacemaker attempts to start the VIP B service group on node B, which
fails.
- Pacemaker starts the VIP B service group on node A.
- ... and so on.
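
For context, the flip-flop above seems to be tied to the 'failure-timeout'
meta attribute on the LSB resource, set along these lines (the value and
names here are illustrative rather than my exact config; the real values
are in the pastebin below):

    primitive svc_b lsb:my-backing-service \
        op monitor interval=30s \
        meta failure-timeout=60s

As I understand it, once 'failure-timeout' expires the failures recorded
against node B are cleared, so node B is no longer blocked from running the
group; the location preference then pulls the group back to node B, the
start fails again, and the cycle repeats.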

What I would LIKE to happen is for Pacemaker to attempt to run a "status"
on node B PRIOR to stopping the service group on node A and attempting to
start the service group on node B. Something like this:
- Pacemaker 'failure-timeout' interval expires.
- Pacemaker checks the status of the LSB service
(/etc/init.d/<lsb-resource-name> status), which returns a failing exit code.
- Pacemaker 'failure-timeout' interval expires.
- Pacemaker checks the status of the LSB service
(/etc/init.d/<lsb-resource-name> status), which returns a failing exit code.

At that point an administrator or an automated script could intervene and
bring the backing service online, after which we would have this behavior:
- Pacemaker 'failure-timeout' interval expires.
- Pacemaker checks the status of the LSB service
(/etc/init.d/<lsb-resource-name> status), which returns a healthy exit code.
- Pacemaker shuts down the VIP B service group on node A.
- Pacemaker starts the VIP B service group on node B.

I have linked obfuscated pastebins of my current Pacemaker configuration
and of the pacemaker service logs, covering both the initial failure and
the repeated failed attempts to start the LSB resource.


Obfuscated "crm configure show"

https://pastebin.com/emAw8juQ


Obfuscated "journalctl -fu pacemaker"

https://pastebin.com/kcnfCrjf



Please let me know if there is a configuration parameter I can place in my
config that would tell Pacemaker to perform a status check on the LSB
resource PRIOR to attempting to start the service group on its preferred
node.

-- 
Michael Romero

Lead Infrastructure Engineer

Engineering | Convoso
562-338-9868
mromero at convoso.com
www.convoso.com
https://linkedin.com/in/romerom

