[ClusterLabs] Users Digest, Vol 44, Issue 11

Thu Jul 2 10:45:20 EDT 2020

On Thu, 2018-09-06 at 00:59 +0000, Jeffrey Westgate wrote:
> Greetings from a confused user;
> 
> We are running pacemaker as part of a load-balanced cluster of two
> members, both VMWare VMs, with both acting as stepping-stones to our
> DNS recursive resolvers (RR).  Simple use - the /etc/resolv.conf
> on the *NIX boxes points at both IPs, and the cluster forwards to one
> of multiple RRs for DNS resolution.

I'm not sure about your specific issue, but generally it's a bad idea
to round-robin DNS servers due to TTL/caching issues. The client should
know it's contacting the same server at the same IP each time, to have
a correct idea of how long entries can be cached.

My personal preferred HA approach for DNS is:

* Put the DNS servers in containers or VMs that are the pacemaker
resources, each bound to a specific floating IP (even better, make the
container a bundle, or the VM a guest node, to run the DNS server as a
resource inside it for monitoring/restarting purposes)

* List the floating IPs as multiple DNS servers on the client side
(whether configured statically, as in resolv.conf, or handed out via
DHCP). This is for resolvers; you could do the same for domain servers
by listing them as multiple NS records for the domains.
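
To illustrate the client side of the second point, here is a minimal
Python sketch that renders a resolv.conf listing each floating IP as its
own nameserver. The function name is made up for this example, the
addresses in the usage note are placeholders from the documentation
range, and the timeout/attempts options are an assumption about what a
client might want so that it moves on quickly from an unresponsive
server; they are not part of the setup described in this thread.

```python
import ipaddress


def render_resolv_conf(nameservers, timeout=1, attempts=2):
    """Return resolv.conf text with one nameserver line per floating IP.

    Hypothetical helper for illustration only.
    """
    # glibc's stub resolver uses at most three nameserver entries (MAXNS).
    if not 1 <= len(nameservers) <= 3:
        raise ValueError("expected between 1 and 3 nameservers")
    # Reject anything that is not a valid IPv4/IPv6 address.
    for ip in nameservers:
        ipaddress.ip_address(ip)
    lines = [f"nameserver {ip}" for ip in nameservers]
    # Assumed tuning: short per-server timeout, limited retry rounds.
    lines.append(f"options timeout:{timeout} attempts:{attempts}")
    return "\n".join(lines) + "\n"
```

Calling render_resolv_conf(["192.0.2.10", "192.0.2.11"]) yields two
nameserver lines followed by the options line. Each client always sees
the same server behind the same IP, and falls through to the next entry
only when one stops answering.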

> Today, for an as-yet undetermined reason, one of the two members
> started failing to connect to the RRs. Intermittently. And quite 
> annoyingly, as this has affected data center operations.  No matter
> what we've tried, one member fails intermittently, the other is
> fine.  
> And we've tried - 
>  - reboot of the affected member - it came back up clean and fine,
> but the issue remained.
>  - fail the cluster, moving both IPs to the second member server;
> failover was successful, problem remained.
>   -- this moved the entire cluster to a different VM on a different
> VMWare host server, so different NIC, etc...
> - failed the cluster back to the original server; both IPs appeared on
> the 'suspect' VM, and the problem remained
> - restore the cluster; both IPs are on the proper VMs, but the one
> still fails intermittently while the second just chugs along.

Sounds networking related ... could something else on the network be
claiming that IP? Or something wrong with the switch?
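
One way to narrow this down is to query each cluster IP directly, many
times, and count the failures per IP, rather than relying on whichever
server the stub resolver happens to pick. Below is a rough sketch using
only the Python standard library. The function names and the addresses
in the usage note are illustrative assumptions, and the check only
detects missing or mismatched replies; it does not validate the answer
itself.

```python
import socket
import struct


def build_query(name, txid):
    """Build a minimal DNS query for the A record of `name` (RFC 1035)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server, name, attempts=20, timeout=1.0, port=53):
    """Return how many of `attempts` UDP queries got no matching reply."""
    failures = 0
    for i in range(attempts):
        txid = (0x1000 + i) & 0xFFFF
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_query(name, txid), (server, port))
                reply, _ = sock.recvfrom(512)
                # A valid reply echoes the transaction ID and sets QR=1.
                rid, flags = struct.unpack("!HH", reply[:4])
                if rid != txid or not flags & 0x8000:
                    failures += 1
            except (OSError, struct.error):
                # Timeouts, ICMP errors, and truncated replies all count.
                failures += 1
    return failures
```

For example, running probe("192.0.2.10", "example.com") and
probe("192.0.2.11", "example.com") from several clients, and from each
cluster member, can show whether the failures follow the IP, the VM, or
the network path.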

> Any ideas what could be causing this?  Is this something that could
> be caused by the cluster config?  Anybody ever seen anything similar?
> 
> Our current unsustainable workaround is to remove the IP for the
> affected member from the *NIX resolv.conf file.
> 
> I appreciate any reasonable suggestions.  (I am not the creator of
> the cluster, just the guy trying to figure it out. Unfortunately the
> creator and my mentor is dearly departed and, in times like this,
> sorely missed.)

My condolences ...

> Any replies will be read and responded to early tomorrow AM.  thanks
> for understanding.
> --
> Jeff Westgate
-- 
Ken Gaillot <kgaillot at redhat.com>