[ClusterLabs] How to make Pacemaker less trigger-happy

Fri Nov 1 20:30:26 EDT 2019

Your e-mail ended up in my spam mailbox.

Yes, you can do the following:
1. Increase the corosync token and consensus timeouts (check the corosync.conf man page for the proper ratio between them).
2. Always put the node or the whole cluster into maintenance mode and stop the cluster stack before patching.
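
For point 1, a rough sketch of what the totem section in corosync.conf could look like. The values are only illustrative assumptions, not tuned recommendations; pick them based on how long your vMotion stalls actually last. As far as I recall, the corosync.conf man page asks for consensus to be at least 1.2 * token, and corosync calculates it that way if you leave consensus out.

```
# Fragment of /etc/corosync/corosync.conf; merge into the existing
# totem section and keep all other settings as they are.
totem {
    # Time in milliseconds without receiving the token before a
    # token loss is declared (illustrative value).
    token: 10000

    # Time in milliseconds to wait for consensus before starting a
    # new membership round; at least 1.2 * token (illustrative value).
    consensus: 12000
}
```

The file has to be the same on every node, and the change only takes effect after corosync reloads or restarts with the new configuration.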

Best Regards,
Strahil Nikolov

On Monday, 28 October 2019, 19:26:39 GMT+2, Casey Allen Shobe <casey.allen.shobe at icloud.com> wrote:

I'm seeing a couple of different situations where Pacemaker (using the PostgreSQL Automated Failover resource) ends up thinking that the master node is not responding and fences it, when in fact the node was up and running fine.  We are using a VMware ESXi infrastructure, which is fairly overcommitted, especially in our lower environments. Many times the fencing correlates exactly with a VMware vMotion, which seems to delay the response to one of Pacemaker's health checks.  In other cases, I have seen logind get restarted by an apt update, and that seems to trigger a failover even though PostgreSQL never went down.

I'm looking for potential solutions to these. Is there a way to increase the tolerance for the number of failures, or the timeout length, to avoid unnecessary failovers?

Thank you for any advice!
-- 
Casey