[ClusterLabs] Users Digest, Vol 44, Issue 11
Ken Gaillot
kgaillot at redhat.com
Thu Jul 2 10:50:40 EDT 2020
LOL, somehow I clicked on an ancient message in my list folder ... well
the advice stands if anyone has a similar issue ;)
I plead a migraine, they make me miss little details like dates ...
On Thu, 2020-07-02 at 09:45 -0500, Ken Gaillot wrote:
> On Thu, 2018-09-06 at 00:59 +0000, Jeffrey Westgate wrote:
> > Greetings from a confused user;
> >
> > We are running pacemaker as part of a load-balanced cluster of two
> > members, both VMware VMs, with both acting as stepping-stones to
> > our DNS recursive resolvers (RR). Simple use - the /etc/resolv.conf
> > on the *NIX boxes points at both IPs, and the cluster forwards to
> > one of multiple RRs for DNS resolution.
>
> I'm not sure about your specific issue, but generally it's a bad idea
> to round-robin DNS servers due to TTL/caching issues. The client
> should know it's contacting the same server at the same IP each time,
> to have a correct idea of how long entries can be cached.
>
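You can watch this caching behavior directly. The resolver address below is a documentation placeholder, not one from this thread; the point is that the TTL field in the answer counts down on repeated queries to the same resolver, but jumps back up if round-robin silently hands you a different one:

```shell
# 203.0.113.10 is a placeholder resolver IP; substitute your own.
# The fourth field of each answer line is the TTL remaining in that
# resolver's cache. Query the SAME resolver twice a few seconds apart
# and the TTL decreases; a different resolver's cache shows its own
# (likely higher) TTL, so the client's freshness estimate is wrong.
dig +noall +answer example.com @203.0.113.10
sleep 5
dig +noall +answer example.com @203.0.113.10
```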
> My personal preferred HA approach for DNS is:
>
> * Put the DNS servers in containers or VMs that are the pacemaker
> resources, each bound to a specific floating IP (even better, make
> the container a bundle, or the VM a guest node, so the DNS server
> can run as a resource inside it for monitoring/restarting purposes)
>
> * List the floating IPs as multiple DNS servers on the client side
> (whether statically in resolv.conf or via DHCP). This covers
> resolvers; you could do the same for authoritative servers by
> listing them as multiple NS records for the domains.
>
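A minimal sketch of the VM/guest-node variant of this layout using pcs. All names and addresses here are hypothetical (192.0.2.x are documentation addresses), and the exact resource options depend on your agents and environment; this is an illustration of the shape, not a drop-in config:

```shell
# 1. The resolver VM itself, managed as a cluster resource
#    (assumes a libvirt domain defined in dns1.xml):
pcs resource create dns1-vm ocf:heartbeat:VirtualDomain \
    config=/etc/libvirt/qemu/dns1.xml op monitor interval=30s

# 2. Make the VM a guest node so the DNS daemon can itself be
#    a monitored resource running inside it:
pcs cluster node add-guest dns1 dns1-vm

# 3. The floating IP clients will use, kept with the VM:
pcs resource create dns1-ip ocf:heartbeat:IPaddr2 \
    ip=192.0.2.53 cidr_netmask=24 op monitor interval=10s
pcs constraint colocation add dns1-ip with dns1-vm INFINITY
```

On the client side, each floating IP then appears as its own `nameserver` line in /etc/resolv.conf, so the stock resolver fallback logic handles failover rather than round-robin.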
> > Today, for an as-yet undetermined reason, one of the two members
> > started failing to connect to the RRs. Intermittently. And quite
> > annoyingly, as this has affected data center operations. No matter
> > what we've tried, one member fails intermittently, the other is
> > fine.
> > And we've tried -
> > - reboot of the affected member - it came back up clean and fine,
> > but the issue remained.
> > - fail the cluster, moving both IPs to the second member server;
> > failover was successful, problem remained.
> > -- this moved the entire cluster to a different VM on a different
> > VMware host server, so different NIC, etc...
> > - failed the cluster back to the original server; both IPs appeared
> > on the 'suspect' VM, and the problem remained
> > - restore the cluster; both IPs are on the proper VMs, but the one
> > still fails intermittently while the second just chugs along.
>
> Sounds networking related ... could something else on the network be
> claiming that IP? Or something wrong with the switch?
>
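One quick way to check the duplicate-IP theory from another host on the same layer-2 segment (interface and address below are placeholders for the floating IP in question):

```shell
# Duplicate Address Detection mode: any reply means some machine on
# the segment is answering ARP for this IP. Run it while the cluster
# holds the address and see whether more than one MAC responds.
arping -D -c 3 -I eth0 192.0.2.53

# Then compare the MAC your host has cached for that IP against the
# cluster node's real interface MAC:
ip neigh show 192.0.2.53
```

If the cached MAC flaps between two values, something else is claiming the address, which would explain intermittent failures that follow the IP rather than the VM.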
> > Any ideas what could be causing this? Is this something that could
> > be caused by the cluster config? Anybody ever seen anything
> > similar?
> >
> > Our current unsustainable workaround is to remove the IP for the
> > affected member from the *NIX resolver.conf file.
> >
> > I appreciate any reasonable suggestions. (I am not the creator of
> > the cluster, just the guy trying to figure it out. Unfortunately the
> > creator and my mentor is dearly departed and, in times like this,
> > sorely missed.)
>
> My condolences ...
>
> > Any replies will be read and responded to early tomorrow
> > AM. thanks
> > for understanding.
> > --
> > Jeff Westgate
--
Ken Gaillot <kgaillot at redhat.com>