[ClusterLabs] PostgreSQL HA on EL9
Ken Gaillot
kgaillot at redhat.com
Wed Sep 13 13:13:55 EDT 2023
On Wed, 2023-09-13 at 16:45 +0000, Larry G. Mills via Users wrote:
> Hello Pacemaker community,
>
> I have several two-node postgres 14 clusters that I am migrating from
> EL7 (Scientific Linux 7) to EL9 (AlmaLinux 9.2).
>
> My configuration:
>
> Cluster size: two nodes
> Postgres version: 14
> Corosync version: 3.1.7-1.el9
> Pacemaker version: 2.1.5-9.el9_2
> pcs version: 0.11.4-7.el9_2
>
> The migration has mostly gone smoothly, but I did notice one non-
> trivial change in recovery behavior between EL7 and EL9. The
> recovery scenario is:
>
> With the cluster running normally with one primary DB (i.e. Promoted)
> and one standby (i.e. Unpromoted), reboot one of the cluster nodes
> without first shutting down the cluster on that node. The reboot is
> a “clean” system shutdown done via either the “reboot” or “shutdown”
> OS commands.
On my RHEL 9 test cluster, both "reboot" and "systemctl reboot" wait
for the cluster to stop everything.
I think in some environments "reboot" is equivalent to "systemctl
reboot --force" (kill all processes immediately), so maybe see if
"systemctl reboot" is better.
>
> On EL7, this scenario caused the cluster to shut itself down on the
> node before the OS shutdown completed, and the DB resource was
> stopped/shut down before the OS stopped. On EL9, this is not the
> case: the DB resource is not stopped before the OS shutdown
> completes. This leads to errors being thrown when the cluster is
> started back up on the rebooted node, similar to the following:
>
> * pgsql probe on mynode returned 'error' (Instance "pgsql"
> controldata indicates a running secondary instance, the instance has
> probably crashed)
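You can confirm what the agent is seeing with pg_controldata. A
minimal check, assuming a default data directory of
/var/lib/pgsql/14/data (substitute your actual PGDATA):

    # Show the on-disk state the resource agent inspects at probe time
    pg_controldata /var/lib/pgsql/14/data | grep 'Database cluster state'

A cleanly stopped standby should report "shut down in recovery";
anything else after a reboot suggests the instance was killed rather
than shut down.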
>
> This is not too serious for a standby DB instance, since the cluster
> is able to recover it back to the standby/Unpromoted state. However,
> if you reboot the Primary/Promoted DB node, the cluster is not able
> to recover it (because that DB still thinks it’s a primary), and the
> node is fenced.
>
> Is this an intended behavior for the versions of pacemaker/corosync
> that I’m running, or a regression? It may be possible to put an
> override into the systemd unit file for corosync to force the cluster
> to shut down before the OS stops, but I’d rather not do that if
> there’s a better way to handle this recovery scenario.
>
> Thanks for any advice,
>
> Larry
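Regarding the systemd override: before adding one, it's worth
confirming the ordering systemd already has. pacemaker.service should
be ordered after corosync.service, so at shutdown it (and the
resources it manages) should be stopped first. A couple of checks,
nothing here is specific to your configuration:

    # Show the shipped unit files plus any local drop-ins
    systemctl cat corosync.service pacemaker.service

    # Show what depends on corosync, i.e. what systemd stops before it
    systemctl list-dependencies --reverse corosync.service

If you do decide an override is needed, "systemctl edit
corosync.service" creates a drop-in under
/etc/systemd/system/corosync.service.d/ rather than modifying the
packaged unit file.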
--
Ken Gaillot <kgaillot at redhat.com>