[ClusterLabs] How to make Pacemaker less trigger-happy

Mon Oct 28 13:26:23 EDT 2019

I'm seeing a couple of different situations where Pacemaker (using the PostgreSQL Automatic Failover resource agent) decides that the master node is not responding and fences it, when in fact the node was up and running fine.  We run on a VMware ESXi infrastructure that is fairly overcommitted, especially in our lower environments.  Many of these fencing events coincide exactly with a VMware vMotion, which seems to delay the response to one of Pacemaker's health checks.  In other cases, I have seen logind get restarted by an apt update, and that seems to trigger a failover even though PostgreSQL never went down.

I'm looking for potential solutions to both of these.  Is there a way to increase the number of failures tolerated, or to lengthen the timeouts, so that we avoid unnecessary failovers?

Thank you for any advice!
-- 
Casey