[ClusterLabs] Antw: Re: Antw: [EXT] Re: Order set troubles

Fri Mar 26 15:18:21 EDT 2021

On Fri, Mar 26, 2021 at 6:27 AM Andrei Borzenkov <arvidjaar at gmail.com>
wrote:

> On Fri, Mar 26, 2021 at 10:17 AM Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >
> > >>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 26.03.2021 at 06:19
> > in message <534274b3-a6de-5fac-0ae4-d02c305f1a3f at gmail.com>:
> > > On 25.03.2021 21:45, Reid Wahl wrote:
> > >> FWIW we have this KB article (I seem to remember Strahil is a Red Hat
> > >> customer):
> > >>   - How do I configure SAP HANA Scale-Up System Replication in a Pacemaker
> > >> cluster when the HANA filesystems are on NFS shares?
> > >> (https://access.redhat.com/solutions/5156571)
> > >>
> > >
> > > "How do I make the cluster resources recover when one node loses access
> > > to the NFS server?"
> > >
> > > If a node loses access to the NFS server, then monitor operations for
> > > resources that depend on NFS availability will fail or time out, and
> > > Pacemaker will recover (likely by rebooting this node). That's how similar
> > > configurations have been handled for the past 20 years in other HA
> > > managers. I am genuinely interested: have you encountered a case where
> > > that was not enough?
> >
> > That's a big problem with the SAP design (basically it's just too complex).
> > In the past I had written a kind of resource agent that worked without that
> > overly complex overhead, but since those days SAP has added much more
> > complexity.
> > If the NFS server is external, Pacemaker could fence your nodes when the NFS
> > server is down: first the monitor operation will fail (hanging on NFS), then
> > the recovery (stop/start) will fail (also hanging on NFS).
>
> And how exactly is placing the NFS resource under Pacemaker control going
> to change that?
>

I noted earlier based on the old case notes:

"Apparently there were situations in which the SAPHana resource wasn't
failing over when connectivity was lost with the NFS share that contained
the hdb* binaries and the HANA data. I don't remember the exact details
(whether demotion was failing, or whether it wasn't even trying to demote
on the primary and promote on the secondary, or what). Either way, I was
surprised that this procedure was necessary, but it seemed to be."

Strahil may be dealing with a similar situation, not sure. I get where
you're coming from -- I too would expect the application that depends on
NFS to simply fail when NFS connectivity is lost, which in turn leads to
failover and recovery. For whatever reason, due to some weirdness of the
SAPHana resource agent, that didn't happen.
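
To make that expectation concrete, here is a rough sketch of the escalation
path described in this thread: a failed monitor leads to a stop/start
recovery, and a failed stop escalates to fencing. This is only a simplified
model, not Pacemaker code. The function name, its parameters, and the return
strings are made up for illustration, and it assumes the default on-fail
handling for monitor and stop operations.

```python
# Simplified model of the recovery escalation discussed in this thread.
# Everything here (function name, parameters, return strings) is
# hypothetical and only illustrates the expected decision path.

def expected_recovery(monitor_ok: bool, stop_ok: bool,
                      fencing_enabled: bool = True) -> str:
    """Return the recovery step one would expect for a single resource."""
    if monitor_ok:
        # Monitor succeeded: nothing to recover.
        return "none"
    if stop_ok:
        # Monitor failed but the resource stopped cleanly, so it can be
        # restarted locally or failed over to another node.
        return "restart-or-failover"
    # Monitor failed and stop also failed (for example, both hanging on NFS).
    # Assumption: default handling fences the node when fencing is enabled
    # and otherwise leaves the resource blocked.
    return "fence" if fencing_enabled else "blocked"


if __name__ == "__main__":
    # NFS gone: monitor hangs, then stop hangs as well.
    print(expected_recovery(monitor_ok=False, stop_ok=False))
```

The case I described above is the one where the monitor apparently never
reported the failure in the first place, so none of this escalation kicked in.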

> > Even fencing the node would not help (resources cannot start) if the NFS
> > server is still down.
>
> And how exactly is placing the NFS resource under Pacemaker control going
> to change that?
>
> > So you may end up with all your nodes being fenced and the fail counts
> > disabling any automatic resource restart.
> >
>
> And how exactly is placing the NFS resource under Pacemaker control going
> to change that?

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA