[ClusterLabs] PostgreSQL HA on EL9
Ken Gaillot
kgaillot at redhat.com
Wed Sep 13 13:13:55 EDT 2023
On Wed, 2023-09-13 at 16:45 +0000, Larry G. Mills via Users wrote:
> Hello Pacemaker community,
>
> I have several two-node postgres 14 clusters that I am migrating from
> EL7 (Scientific Linux 7) to EL9 (AlmaLinux 9.2).
>
> My configuration:
>
> Cluster size: two nodes
> Postgres version: 14
> Corosync version: 3.1.7-1.el9
> Pacemaker version: 2.1.5-9.el9_2
> pcs version: 0.11.4-7.el9_2
>
> The migration has mostly gone smoothly, but I did notice one non-
> trivial change in recovery behavior between EL7 and EL9. The
> recovery scenario is:
>
> With the cluster running normally with one primary DB (i.e. Promoted)
> and one standby (i.e. Unpromoted), reboot one of the cluster nodes
> without first shutting down the cluster on that node. The reboot is
> a “clean” system shutdown done via either the “reboot” or “shutdown”
> OS commands.
On my RHEL 9 test cluster, both "reboot" and "systemctl reboot" wait
for the cluster to stop everything.
I think in some environments "reboot" is equivalent to "systemctl
reboot --force" (kill all processes immediately), so maybe see if
"systemctl reboot" is better.
>
> On EL7, this scenario caused the cluster to shut itself down on the
> node before the OS shutdown completed, and the DB resource was
> stopped/shut down before the OS stopped. On EL9, this is not the
> case: the DB resource is not stopped before the OS shutdown
> completes. This leads to errors being thrown when the cluster is
> started back up on the rebooted node, similar to the following:
>
> * pgsql probe on mynode returned 'error' (Instance "pgsql"
> controldata indicates a running secondary instance, the instance has
> probably crashed)
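You can confirm what the agent is seeing with pg_controldata. A
minimal check, assuming a default data directory of
/var/lib/pgsql/14/data (substitute your actual PGDATA):

    # Show the on-disk state the resource agent inspects at probe time
    pg_controldata /var/lib/pgsql/14/data | grep 'Database cluster state'

A cleanly stopped standby should report "shut down in recovery";
anything else after a reboot suggests the instance was killed rather
than shut down.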
>
> This is not too serious for a standby DB instance, since the cluster
> is able to recover it back to the standby/Unpromoted state. However,
> if you reboot the Primary/Promoted DB node, the cluster is not able
> to recover it (because that DB still thinks it’s a primary), and the
> node is fenced.
>
> Is this an intended behavior for the versions of pacemaker/corosync
> that I’m running, or a regression? It may be possible to put an
> override into the systemd unit file for corosync to force the cluster
> to shut down before the OS stops, but I’d rather not do that if
> there’s a better way to handle this recovery scenario.
>
> Thanks for any advice,
>
> Larry
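Regarding the systemd override: before adding one, it's worth
confirming the ordering systemd already has. pacemaker.service should
be ordered after corosync.service, so at shutdown it (and the
resources it manages) should be stopped first. A couple of checks,
nothing here is specific to your configuration:

    # Show the shipped unit files plus any local drop-ins
    systemctl cat corosync.service pacemaker.service

    # Show what depends on corosync, i.e. what systemd stops before it
    systemctl list-dependencies --reverse corosync.service

If you do decide an override is needed, "systemctl edit
corosync.service" creates a drop-in under
/etc/systemd/system/corosync.service.d/ rather than modifying the
packaged unit file.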
--
Ken Gaillot <kgaillot at redhat.com>