[ClusterLabs] Synchronous primary doesn't switch to async mode on replica power off

Fri Oct 6 05:07:46 EDT 2023

On Fri, Oct 6, 2023 at 8:46 AM Sergey Cherukhin <sergey.cherukhin at gmail.com>
wrote:

> Hello!
>
> I used Microsoft Outlook to send this message and it was sent in the wrong
> format. I'm sorry. I won't do it again.
>
> I use a PostgreSQL + Pacemaker + Corosync cluster with two PostgreSQL
> instances in synchronous replication mode. The "rep_mode" parameter is set
> to "sync". When I shut down the replica the normal way, the primary node
> switches to async mode. But when I shut down the replica by powering it off
> to emulate a power supply failure, the primary remains in sync mode and
> clients hang on INSERT operations until "pcs resource cleanup" is performed.
> I created an alert agent to run "pcs resource cleanup" when any node is
> lost, but this approach doesn't work.
>
> What should I do to be sure the primary node will switch to async mode if
> the replica is lost for any reason?
>

One idea would be to run one or more small daemons, colocated with the
PostgreSQL instance(s), that use the Pacemaker tooling to check the state of
the partner node and switch to async mode if it isn't there. You could
implement this as a small custom resource agent.
Actually, it wouldn't even be necessary to have a persistently running
process; the check could be done in the monitor operation as well.
Of course, you could also enhance the monitor operation of the PostgreSQL
resource agent so that it supports this switching.
As this would be quite a generic change, it would probably be interesting
for the community as well.
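
To make the monitor idea concrete, here is a minimal Python sketch of one check
cycle. It is only an illustration under several assumptions: that `crm_node -p`
prints the names of the nodes in the local partition separated by whitespace,
that `psql` can connect locally without a password, and that it is acceptable
to set `synchronous_standby_names` via ALTER SYSTEM (the pgsql resource agent
may manage this setting itself, so a manual change could conflict with it).
The function names and the node name "node2" are hypothetical.

```python
import subprocess


def partition_members(crm_node_output: str) -> set:
    """Parse the whitespace-separated node names printed by `crm_node -p`."""
    return set(crm_node_output.split())


def desired_standby_names(partner: str, members: set, sync_names: str) -> str:
    """Keep sync replication while the partner is a member, else go async."""
    return sync_names if partner in members else ""


def psql_command(standby_names: str) -> list:
    """Build the psql call that applies the setting and reloads the config.

    Two separate -c options are used because ALTER SYSTEM cannot run
    inside a transaction block. Connection options are assumptions.
    """
    quoted = standby_names.replace("'", "''")  # escape single quotes for SQL
    return [
        "psql", "-X", "-d", "postgres",
        "-c", f"ALTER SYSTEM SET synchronous_standby_names = '{quoted}'",
        "-c", "SELECT pg_reload_conf()",
    ]


def check_once(partner: str, sync_names: str) -> None:
    """One monitor cycle: read membership, then apply the matching mode.

    A real agent would first compare against the current value to avoid
    reloading the configuration on every cycle.
    """
    out = subprocess.run(
        ["crm_node", "-p"], check=True, capture_output=True, text=True
    ).stdout
    target = desired_standby_names(partner, partition_members(out), sync_names)
    subprocess.run(psql_command(target), check=True)
```

Calling `check_once("node2", "node2")` from a monitor operation on the primary
would clear `synchronous_standby_names` as soon as "node2" drops out of the
partition, and restore it once the node is back.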

On the other hand, I would have considered this issue so generic that it is
hard to believe there is no ready-made, tested solution around already.

To make it more reactive (without setting the monitoring interval to
incredibly low values), using an alert agent (as you already tried), but
having it switch directly to async mode, might be worth trying.
Did you investigate what actually went wrong in your experiments with the
alert agent? It is interesting that the resource cleanup, which obviously
works from the command line, doesn't do the trick when run from an alert
agent. Maybe it is an SELinux issue ...
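
As a rough sketch of such an alert agent in Python: Pacemaker passes event
details to alert agents in environment variables such as CRM_alert_kind,
CRM_alert_node and CRM_alert_desc (for node events the latter is the node
state, "member" or "lost"). Note that Pacemaker runs alert agents as the
hacluster user, which might also explain why a command that works from a root
shell fails there. The partner name "node2", the log path and the psql
connection options below are assumptions for illustration only.

```python
import os
import subprocess

PARTNER = "node2"                   # hypothetical name of the replica node
LOG_PATH = "/tmp/async_alert.log"   # hypothetical; must be writable by hacluster


def replica_lost(env, partner: str) -> bool:
    """True only for a node alert reporting that the partner is lost."""
    return (
        env.get("CRM_alert_kind") == "node"
        and env.get("CRM_alert_node") == partner
        and env.get("CRM_alert_desc") == "lost"
    )


def main(env=os.environ) -> int:
    """Switch the local primary to async mode when the replica is lost."""
    if not replica_lost(env, PARTNER):
        return 0  # ignore every other alert
    result = subprocess.run(
        [
            "psql", "-X", "-d", "postgres",
            # Separate -c options: ALTER SYSTEM cannot run in a transaction block.
            "-c", "ALTER SYSTEM SET synchronous_standby_names = ''",
            "-c", "SELECT pg_reload_conf()",
        ],
        capture_output=True,
        text=True,
    )
    # Keep stderr so that permission or SELinux problems become visible.
    with open(LOG_PATH, "a") as log:
        log.write(f"rc={result.returncode} stderr={result.stderr}\n")
    return result.returncode


if __name__ == "__main__":
    main()
```

Logging the return code and stderr is deliberate: it gives a first hint about
what goes wrong when the agent runs under the cluster rather than from a shell.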

Regards,
Klaus

>
>
> Best regards,
> Sergey Cherukhin
>