[ClusterLabs] Users Digest, Vol 44, Issue 11

Wed Sep 5 20:59:23 EDT 2018

Greetings from a confused user,

We are running Pacemaker as part of a load-balanced cluster of two members, both VMware VMs, with both acting as stepping-stones to our DNS recursive resolvers (RRs).  The use is simple: /etc/resolv.conf on the *NIX boxes points at both cluster IPs, and the cluster forwards each query to one of multiple RRs for DNS resolution.

Today, for an as-yet undetermined reason, one of the two members started failing to connect to the RRs. The failures are intermittent, and quite annoying, as they have affected data center operations.  No matter what we've tried, one member fails intermittently while the other is fine.
So far we have tried:
 - rebooted the affected member; it came back up clean and fine, but the issue remained.
 - failed the cluster, moving both IPs to the second member server; failover was successful, but the problem remained.
   -- this moved the entire cluster to a different VM on a different VMware host server, so a different NIC, etc.
 - failed the cluster back to the original server; both IPs appeared on the 'suspect' VM, and the problem remained.
 - restored the cluster; both IPs are on the proper VMs, but the one still fails intermittently while the second just chugs along.
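
To put numbers on "fails intermittently", a probe along the following lines could be run from one of the client boxes against each cluster IP. This is only a rough sketch using the Python standard library: it sends a plain DNS A-record query over UDP and counts how many go unanswered. The two addresses (192.0.2.10 and 192.0.2.11) and the test name example.com are placeholders, not our real values, and the function names are made up for illustration.

```python
import socket
import struct
import time


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers the query within the timeout."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable errors
            return False
    # A matching transaction ID shows the reply belongs to our query
    return len(reply) >= 2 and reply[:2] == query[:2]


def failure_rate(server, name, attempts=20, delay=0.5, probe_fn=probe):
    """Probe the server repeatedly and return the fraction of failures."""
    failures = 0
    for _ in range(attempts):
        if not probe_fn(server, name):
            failures += 1
        time.sleep(delay)
    return failures / attempts


if __name__ == "__main__":
    # Placeholder addresses standing in for the two cluster IPs
    for ip in ("192.0.2.10", "192.0.2.11"):
        rate = failure_rate(ip, "example.com")
        print(f"{ip}: {rate:.0%} of queries unanswered")
```

Running this from several clients at once, and logging timestamps alongside the failures, might show whether the failures follow the IP, the VM, or the time of day.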

Any ideas what could be causing this?  Is this something that could be caused by the cluster config?  Anybody ever seen anything similar?

Our current, unsustainable workaround is to remove the affected member's IP from /etc/resolv.conf on the *NIX boxes.

I appreciate any reasonable suggestions.  (I am not the creator of the cluster, just the guy trying to figure it out. Unfortunately the creator, who was also my mentor, is dearly departed and, in times like this, sorely missed.)

Any replies will be read and responded to early tomorrow AM.  Thanks for understanding.
--
Jeff Westgate