[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Order set troubles

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Mar 29 04:11:22 EDT 2021


>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 27.03.2021 at 06:37 in
message <7c294034-56c3-baab-73c6-7909ab554555 at gmail.com>:
> On 26.03.2021 22:18, Reid Wahl wrote:
>> On Fri, Mar 26, 2021 at 6:27 AM Andrei Borzenkov <arvidjaar at gmail.com>
>> wrote:
>> 
>>> On Fri, Mar 26, 2021 at 10:17 AM Ulrich Windl
>>> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>>>
>>>>>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 26.03.2021 at 06:19 in
>>>> message <534274b3-a6de-5fac-0ae4-d02c305f1a3f at gmail.com>:
>>>>> On 25.03.2021 21:45, Reid Wahl wrote:
>>>>>> FWIW we have this KB article (I seem to remember Strahil is a Red Hat
>>>>>> customer):
>>>>>>   - How do I configure SAP HANA Scale-Up System Replication in a
>>> Pacemaker
>>>>>> cluster when the HANA filesystems are on NFS shares? (
>>>>>> https://access.redhat.com/solutions/5156571)
>>>>>>
>>>>>
>>>>> "How do I make the cluster resources recover when one node loses access
>>>>> to the NFS server?"
>>>>>
>>>>> If a node loses access to the NFS server, then monitor operations for
>>> resources
>>>>> that depend on NFS availability will fail or time out, and pacemaker will
>>>>> recover (likely by rebooting this node). That's how similar
>>>>> configurations have been handled for the past 20 years in other HA
>>>>> managers. I am genuinely interested, have you encountered the case
>>> where
>>>>> it was not enough?
>>>>
>>>> That's a big problem with the SAP design (basically it's just too
>>> complex).
>>>> In the past I had written a kind of resource agent that worked without
>>> that
>>>> overly complex overhead, but since those days SAP has added much more
>>>> complexity.
>>>> If the NFS server is external, pacemaker could fence your nodes when the
>>> NFS
>>>> server is down, as first the monitor operation will fail (hanging on
>>> NFS), then
>>>> the recovery (stop/start) will fail (also hanging on NFS).
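
For reference, the workaround being discussed (putting the NFS mount under
pacemaker control and making the SAP resources depend on it) would look
roughly like the sketch below. The resource and clone names, the
SAPHanaTopology placeholder, the export path, mount options and timeouts are
all invented for illustration, not taken from the KB article or the thread:

    pcs resource create fs_hana_shared ocf:heartbeat:Filesystem \
        device="nfs-server:/export/hana_shared" directory="/hana/shared" \
        fstype="nfs" options="vers=4,hard" \
        op monitor interval=20s timeout=40s on-fail=fence \
        clone interleave=true
    pcs constraint order start fs_hana_shared-clone then SAPHanaTopology_HDB_00-clone
    pcs constraint colocation add SAPHanaTopology_HDB_00-clone with fs_hana_shared-clone INFINITY

That way the mount has a monitor of its own, instead of its failure only
showing up indirectly as hangs in the SAP agents.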
>>>
>>> And how exactly placing NFS resource under pacemaker control is going
>>> to change it?
>>>
>> 
>> I noted earlier based on the old case notes:
>> 
>> "Apparently there were situations in which the SAPHana resource wasn't
>> failing over when connectivity was lost with the NFS share that contained
>> the hdb* binaries and the HANA data. I don't remember the exact details
>> (whether demotion was failing, or whether it wasn't even trying to demote
>> on the primary and promote on the secondary, or what). Either way, I was
>> surprised that this procedure was necessary, but it seemed to be."
>> 
>> Strahil may be dealing with a similar situation, not sure. I get where
>> you're coming from -- I too would expect the application that depends on
>> NFS to simply fail when NFS connectivity is lost, which in turn leads to
>> failover and recovery. For whatever reason, due to some weirdness of the
>> SAPHana resource agent, that didn't happen.
>> 
> 
> Yes. The only reason to use this workaround would be if the resource agent's
> monitor still believes that the application is up when the required NFS is down.
> Which is a bug in the resource agent, or possibly in the application itself.

I think it's getting philosophical now.
Take, for example, a web server serving documents from an NFS share:
Is the web server down when access to NFS hangs? Would restarting ("recovering")
the web server help in that situation?
Maybe OCF_CHECK_LEVEL could be used here: higher check levels could verify not
only that the resource is "running", but also that it is actually responding, etc.
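
The Filesystem agent already supports such deeper checks on the mount itself:
OCF_CHECK_LEVEL=10 does a read test and OCF_CHECK_LEVEL=20 a write test, so a
hung NFS mount makes that monitor fail directly rather than only indirectly
through the dependent application. A rough sketch (resource name, paths and
timeouts are invented; this assumes pcs accepts OCF_CHECK_LEVEL as an operation
option, otherwise the same nvpair can be set as an operation instance attribute):

    pcs resource create fs_www ocf:heartbeat:Filesystem \
        device="nfs-server:/export/www" directory="/srv/www" fstype="nfs" \
        op monitor interval=20s timeout=40s OCF_CHECK_LEVEL=20 on-fail=fence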

> 
> While using this workaround in this case is perfectly reasonable, none
> of the reasons listed in the message I was replying to are applicable.
> 
> So far the only reason the OP wanted to do it was some obscure race
> condition on startup outside of pacemaker. In which case this workaround
> simply delays the NFS mount, sidestepping the race.
> 
> I also remember something about racing with dnsmasq, at which point I'd
> say that making the cluster depend on the availability of DNS is e-h-h-h unwise.
