[ClusterLabs] PAF / pgSQL fails after OS/system shutdown
lejeczek
peljasz at yahoo.co.uk
Thu Nov 9 15:44:33 EST 2023
On 07/11/2023 17:57, lejeczek via Users wrote:
> hi guys
>
> Having a 3-node pgSQL cluster with PAF - when all three
> systems are shut down at virtually the same time, PAF
> fails to start once the HA cluster is operational again.
>
> from status:
> ...
> Migration Summary:
> * Node: ubusrv2 (2):
> * PGSQL-PAF-5433: migration-threshold=1000000 fail-count=1000000 last-failure='Tue Nov 7 17:52:38 2023'
> * Node: ubusrv3 (3):
> * PGSQL-PAF-5433: migration-threshold=1000000 fail-count=1000000 last-failure='Tue Nov 7 17:52:38 2023'
> * Node: ubusrv1 (1):
> * PGSQL-PAF-5433: migration-threshold=1000000 fail-count=1000000 last-failure='Tue Nov 7 17:52:38 2023'
>
> Failed Resource Actions:
> * PGSQL-PAF-5433_stop_0 on ubusrv2 'error' (1): call=90, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=84ms
> * PGSQL-PAF-5433_stop_0 on ubusrv3 'error' (1): call=82, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=82ms
> * PGSQL-PAF-5433_stop_0 on ubusrv1 'error' (1): call=86, status='complete', exitreason='Unexpected state for instance "PGSQL-PAF-5433" (returned 1)', last-rc-change='Tue Nov 7 17:52:38 2023', queued=0ms, exec=108ms
>
> and all three pgSQLs show virtually identical logs:
> ...
> 2023-11-07 16:54:45.532 UTC [24936] LOG: starting PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
> 2023-11-07 16:54:45.532 UTC [24936] LOG: listening on IPv4 address "0.0.0.0", port 5433
> 2023-11-07 16:54:45.532 UTC [24936] LOG: listening on IPv6 address "::", port 5433
> 2023-11-07 16:54:45.535 UTC [24936] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
> 2023-11-07 16:54:45.547 UTC [24938] LOG: database system was interrupted while in recovery at log time 2023-11-07 15:30:56 UTC
> 2023-11-07 16:54:45.547 UTC [24938] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> 2023-11-07 16:54:45.819 UTC [24938] LOG: entering standby mode
> 2023-11-07 16:54:45.824 UTC [24938] FATAL: could not open directory "/var/run/postgresql/14-paf.pg_stat_tmp": No such file or directory
> 2023-11-07 16:54:45.825 UTC [24936] LOG: startup process (PID 24938) exited with exit code 1
> 2023-11-07 16:54:45.825 UTC [24936] LOG: aborting startup due to startup process failure
> 2023-11-07 16:54:45.826 UTC [24936] LOG: database system is shut down
>
> Is this "test" case's result, as shown above, expected?
> It reproduces every time.
> If not - what might I be missing?
>
> many thanks, L.
>
Actually, the resource fails to start on a single node - as
opposed to the entire-cluster shutdown I described originally -
after that node was powered down in an orderly fashion and
powered back on.
At the time of the power-cycle that node was the PAF resource
master; after power-on the resource fails on it:
...
2023-11-09 20:35:04.439 UTC [17727] LOG: starting PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2023-11-09 20:35:04.439 UTC [17727] LOG: listening on IPv4 address "0.0.0.0", port 5433
2023-11-09 20:35:04.439 UTC [17727] LOG: listening on IPv6 address "::", port 5433
2023-11-09 20:35:04.442 UTC [17727] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2023-11-09 20:35:04.452 UTC [17731] LOG: database system was interrupted while in recovery at log time 2023-11-09 20:25:21 UTC
2023-11-09 20:35:04.452 UTC [17731] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2023-11-09 20:35:04.809 UTC [17731] LOG: entering standby mode
2023-11-09 20:35:04.813 UTC [17731] FATAL: could not open directory "/var/run/postgresql/14-paf.pg_stat_tmp": No such file or directory
2023-11-09 20:35:04.814 UTC [17727] LOG: startup process (PID 17731) exited with exit code 1
2023-11-09 20:35:04.814 UTC [17727] LOG: aborting startup due to startup process failure
2023-11-09 20:35:04.815 UTC [17727] LOG: database system is shut down
The master role did get moved over to a standby/slave node
properly when that node was shut down.
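If I read that FATAL line right, the instance only fails because its stats temp directory is gone after the reboot: on Debian/Ubuntu, stats_temp_directory usually points under /var/run/postgresql (tmpfs), and it is normally pg_ctlcluster that recreates that subdirectory, while PAF's pgsqlms starts the instance with pg_ctl. That is only my guess, though - a quick way to check it, assuming the paths from my config below (sketch only):

-> $ grep stats_temp_directory /etc/postgresql/14/paf/postgresql.conf   # does it point under /var/run (tmpfs)?
-> $ ls -ld /var/run/postgresql/14-paf.pg_stat_tmp                      # gone after the reboot?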
I'm on Ubuntu with:
ii  corosync                   3.1.6-1ubuntu1    amd64  cluster engine daemon and utilities
ii  pacemaker                  2.1.2-1ubuntu3.1  amd64  cluster resource manager
ii  pacemaker-cli-utils        2.1.2-1ubuntu3.1  amd64  cluster resource manager command line utilities
ii  pacemaker-common           2.1.2-1ubuntu3.1  all    cluster resource manager common files
ii  pacemaker-resource-agents  2.1.2-1ubuntu3.1  all    cluster resource manager general resource agents
ii  pcs                        0.10.11-2ubuntu3  all    Pacemaker Configuration System
And here is the resource:
-> $ pcs resource config PGSQL-PAF-5433-clone
Clone: PGSQL-PAF-5433-clone
 Meta Attrs: failure-timeout=20s master-max=1 notify=true promotable=true
 Resource: PGSQL-PAF-5433 (class=ocf provider=heartbeat type=pgsqlms)
  Attributes: bindir=/usr/lib/postgresql/14/bin datadir=/var/lib/postgresql/14/paf pgdata=/etc/postgresql/14/paf pgport=5433
  Operations: demote interval=0s timeout=120s (PGSQL-PAF-5433-demote-interval-0s)
              methods interval=0s timeout=5 (PGSQL-PAF-5433-methods-interval-0s)
              monitor interval=15s role=Master timeout=10s (PGSQL-PAF-5433-monitor-interval-15s)
              monitor interval=16s role=Slave timeout=10s (PGSQL-PAF-5433-monitor-interval-16s)
              notify interval=0s timeout=60s (PGSQL-PAF-5433-notify-interval-0s)
              promote interval=0s timeout=30s (PGSQL-PAF-5433-promote-interval-0s)
              reload interval=0s timeout=20 (PGSQL-PAF-5433-reload-interval-0s)
              start interval=0s timeout=60s (PGSQL-PAF-5433-start-interval-0s)
              stop interval=0s timeout=60s (PGSQL-PAF-5433-stop-interval-0s)
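Assuming the missing pg_stat_tmp directory really is the culprit, one workaround I'm tempted to try is having systemd-tmpfiles recreate it at boot - a sketch only, the drop-in name below is my own invention, and pointing stats_temp_directory at a persistent location would be the alternative:

# /etc/tmpfiles.d/pgsql-paf-5433.conf (hypothetical drop-in)
# type  path                                     mode  user      group     age  argument
d       /var/run/postgresql/14-paf.pg_stat_tmp   0750  postgres  postgres  -    -

-> $ systemd-tmpfiles --create /etc/tmpfiles.d/pgsql-paf-5433.conf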
Is this down to my setup/config, or might there actually be an
issue with PAF and/or the HA stack not handling a node's OS shutdown?
Any and all thoughts are much appreciated.
Thanks, L.