[ClusterLabs] Synchronous primary doesn't switch to async mode on replica power off

Fri Oct 6 06:10:27 EDT 2023

The approach with an alert agent is working now. It requires running
"pcs resource cleanup" as root via sudo, adding "sleep 120" before the pcs
call in the alert agent script, and increasing the alert agent timeout
accordingly.
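
For illustration, the workaround described above could be sketched as an
alert agent roughly like the one below. This is only a sketch, not the
actual script: it assumes the agent is written in Python, that Pacemaker
passes the event through the CRM_alert_kind and CRM_alert_desc environment
variables (with "node" and "lost" for a lost node, as described in the
Pacemaker alerts documentation), and that sudo is already configured to let
the agent's user run pcs without a password. The names should_cleanup and
main are made up for this example.

```python
#!/usr/bin/env python3
"""Sketch of an alert agent that runs a resource cleanup when a node is lost."""
import os
import subprocess
import sys
import time

# Delay and command taken from the workaround described in this message.
CLEANUP_DELAY_SECONDS = 120
CLEANUP_COMMAND = ["sudo", "pcs", "resource", "cleanup"]


def should_cleanup(env):
    """Return True only for node alerts that report a lost node."""
    return (
        env.get("CRM_alert_kind") == "node"
        and env.get("CRM_alert_desc") == "lost"
    )


def main(env=None, sleep=time.sleep, run=subprocess.run):
    """Wait, then run the cleanup; return the exit code of the command."""
    env = os.environ if env is None else env
    if not should_cleanup(env):
        return 0  # not a "node lost" alert: nothing to do
    sleep(CLEANUP_DELAY_SECONDS)
    return run(CLEANUP_COMMAND, check=False).returncode


if __name__ == "__main__":
    exit_code = main()
    if exit_code:
        sys.exit(exit_code)
```

With a delay like this, the alert timeout has to be longer than the sleep
plus the run time of pcs itself, otherwise the agent may be killed before
the cleanup runs.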

But I don't like this workaround: it takes too long to switch the
primary node to async. A 60s timeout was not enough, so I increased it twice.
But even a 60s timeout is too long.

Maybe experienced Pacemaker users can point me to configuration options
that I don't know about and that would solve this problem.
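
As a starting point for discussion, the partner-node check suggested in the
reply quoted below could look roughly like this. It is a sketch under
assumptions: it relies on "crm_node -p" printing the names of the nodes in
the current partition separated by whitespace, the node name "node2" is
made up, and the actual switch to async mode is left as a callback because
how to do that safely alongside the PostgreSQL resource agent is exactly
the open question.

```python
import subprocess


def partition_members(crm_node_output):
    """Parse whitespace-separated node names into a set."""
    return set(crm_node_output.split())


def partner_is_missing(crm_node_output, partner):
    """Return True if the partner node is not in the current partition."""
    return partner not in partition_members(crm_node_output)


def check_partner(partner, on_missing, run=subprocess.run):
    """Query the partition and call on_missing(partner) if it is absent."""
    result = run(
        ["crm_node", "-p"], capture_output=True, text=True, check=True
    )
    if partner_is_missing(result.stdout, partner):
        on_missing(partner)  # hypothetical hook: switch the primary to async
        return True
    return False


# Example call (hypothetical node name; the callback only reports here):
# check_partner("node2", lambda name: print(f"{name} is gone"))
```

Something like check_partner could be called from a monitor operation or
from a small daemon running next to the primary.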

Best regards,
Sergey Cherukhin

On Fri, Oct 6, 2023 at 16:08, Klaus Wenninger <kwenning at redhat.com> wrote:

>
>
> On Fri, Oct 6, 2023 at 8:46 AM Sergey Cherukhin <
> sergey.cherukhin at gmail.com> wrote:
>
>> Hello!
>>
>> I used Microsoft Outlook to send this message and it was sent in the
>> wrong format. I'm sorry. I won't do it again.
>>
>> I use a PostgreSQL+Pacemaker+Corosync cluster with 2 PostgreSQL instances
>> in synchronous replication mode. The parameter "rep_mode" is set to "sync",
>> and when I shut down the replica the normal way, the primary node switches
>> to async mode. But when I shut down the replica by powering it off to
>> emulate a power unit failure, the primary remains in sync mode and clients
>> hang on INSERT operations until "pcs resource cleanup" is performed. I
>> created an alert agent to run "pcs resource cleanup" when any node is lost,
>> but this approach doesn't work.
>>
>> What should I do to be sure the primary node will switch to async mode if
>> the replica is lost for any reason?
>>
>
> One idea might be running (a) small daemon(s) colocated with the
> Postgresql instance(s) that uses pacemaker-tooling to check
> for the state of the partner-node and if it isn't there switches to async
> mode. You can solve this as a small custom Resource-Agent.
> Actually it wouldn't even be necessary to have a persistently running
> process - could be done in the monitoring as well.
> Of course you could also enhance the monitoring of the PostgreSQL
> Resource-Agent so that it supports this switching.
> As this would be quite a generic change it would probably be interesting
> for the community as well.
>
> On the other hand I would have considered this issue so generic that it is
> hard to believe that there is no ready made / tested
> solution around already.
>
> To make it more reactive (without setting the monitoring interval to
> incredibly low values), using an alert-agent (as you already tried),
> but maybe switching directly to async mode, might be worth trying.
> Did you investigate what actually went wrong in your experiments
> with the alert-agent? Interesting that the
> resource cleanup that obviously works from the cmdline doesn't do the
> trick when run as alert-agent - maybe an SELinux issue ...
>
> Regards,
> Klaus
>
>>
>>
>> Best regards,
>> Sergey Cherukhin
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>