[ClusterLabs] PAF fails to promote slave: Can not get current node LSN location

Tiemen Ruiten t.ruiten at tech-lab.io
Sat Jul 6 06:30:34 EDT 2019


On Fri, Jul 5, 2019 at 5:09 PM Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
wrote:

> It seems to me the problem comes from here:
>
>   Jul 03 19:31:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com crmd: notice:
>     te_rsc_command: Initiating notify operation
>     pgsqld_pre_notify_promote_0 on ph-sql-05 | action 67
>   Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com lrmd: warning:
>     operation_finished: pgsqld_notify_0:30939 - timed out after 60000ms
>
> and here:
>
>   Jul 03 19:31:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com lrmd: info:
>     log_execute: executing - rsc:pgsqld action:notify call_id:38
>     pgsqlms(pgsqld)[20881]:
>   Jul 03 19:32:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com lrmd: warning:
>     operation_finished: pgsqld_notify_0:20881 - timed out after 60000ms
>
> The pgsql election occurs during the pre-promote action, where all remaining
> nodes set their LSN location. During the promote action, the designated
> primary checks that its LSN location is the highest one. If it is not, it
> just cancels the promotion so the next round will elect the best one.
>
> Back to your issue. According to the logs, both standby nodes timed out
> during the pre-promote action. No LSN location has been set across the
> whole cluster. I couldn't see any messages from the attrd daemon related
> to the lsn_location attribute or other cleanup actions.
>
> I couldn't even find the INFO message from pgsqlms giving its current
> status before actually setting it ("Current node TL#LSN: %s"). But this
> message appears soon after a... checkpoint. See:
> https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017
>
> Maybe you have very long checkpoints on both nodes that made the
> pre-promote action time out? Do you have some PostgreSQL logs showing
> useful information around this time frame?
>
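
To make the election mechanism described above concrete: during the pre-promote
notify, each remaining standby records its current timeline and WAL position,
and the designated primary later cancels its promotion if its own position is
not the highest. The exact implementation is in pgsqlms (linked above); purely
as an illustrative sketch, the kind of position each standby reports can be
inspected by hand with:

    # On a standby: last WAL position received from the primary and last replayed.
    psql -At -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"

    # PAF stores the result through attrd as the lsn_location attribute; a rough
    # manual equivalent (the value shown here is made up) would be:
    crm_attribute --lifetime reboot --node "$(uname -n)" \
        --name lsn_location --update "445/1D3B8D20"

Per the pgsqlms link above, that position is logged soon after a CHECKPOINT,
which fits the CHECKPOINT statements still running when the databases were shut
down in the attached logs.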

Yes, I have logs that show at what time checkpoints were taken; they're
attached. checkpoint_timeout is currently set to 15 minutes; I had already
lowered it from the previous value of 60 minutes. I could experiment with even
lower values for checkpoint_timeout, since the problem appears to be that
forced checkpoints take too long (more than 60 s). Perhaps I should also
increase the notify timeout?
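
For reference, a minimal sketch of both knobs, assuming pcs is the cluster
shell in use and taking the pgsqld resource name from the logs; the values are
illustrative, not recommendations:

    # Give the notify operation more headroom than the 60s it currently times
    # out at; pick something larger than the worst-case forced checkpoint.
    pcs resource update pgsqld op notify timeout=180s

    # Shorten the checkpoint interval so an immediate ("forced") checkpoint has
    # less dirty data to flush when a promotion happens.
    psql -c "ALTER SYSTEM SET checkpoint_timeout = '5min';"
    psql -c "SELECT pg_reload_conf();"

Raising the notify timeout only buys time; if forced checkpoints routinely take
more than a minute, reducing the amount of dirty data accumulated between
checkpoints is probably the more durable fix.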

Thank you very much for your time.
-------------- next part --------------
2019-07-03 19:30:10.887 CEST [1214]: LOG:  restartpoint complete: wrote 345241 buffers (32.9%); 0 WAL file(s) added, 0 removed, 9 recycled; write=810.270 s, sync=0.973 s, total=811.445 s; sync files=594, longest=0.064 s, average=0.001 s; distance=151568 kB, estimate=2426859 kB
2019-07-03 19:30:10.887 CEST [1214]: LOG:  recovery restart point at 444/DB777640
2019-07-03 19:30:10.887 CEST [1214]: DETAIL:  Last completed transaction was at log time 2019-07-03 19:29:45.48093+02.
2019-07-03 19:31:38.647 CEST [1214]: LOG:  restartpoint starting: immediate wait
2019-07-03 19:32:19.721 CEST [30315]: FATAL:  terminating walreceiver due to timeout
2019-07-03 19:32:19.813 P00   INFO: archive-get command begin 2.15: [00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress --db-timeout=3600 --log-level-console=info --log-level-file=detail --pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data --pg2-path=/var/lib/pgsql/11/data --process-max=4 --repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest --stanza=pgdb2
2019-07-03 19:32:19.821 P00   INFO: unable to find 00000005.history in the archive
2019-07-03 19:32:19.821 P00   INFO: archive-get command end: completed successfully (9ms)
2019-07-03 19:32:19.831 P00   INFO: archive-get command begin 2.15: [0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress --db-timeout=3600 --log-level-console=info --log-level-file=detail --pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data --pg2-path=/var/lib/pgsql/11/data --process-max=4 --repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest --stanza=pgdb2
2019-07-03 19:32:19.941 P00   INFO: unable to find 0000000400000445000000C7 in the archive
2019-07-03 19:32:19.941 P00   INFO: archive-get command end: completed successfully (110ms)
2019-07-03 19:32:23.955 CEST [1714]: FATAL:  could not connect to the primary server: could not connect to server: No route to host
2019-07-03 19:32:23.966 P00   INFO: archive-get command begin 2.15: [00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress --db-timeout=3600 --log-level-console=info --log-level-file=detail --pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data --pg2-path=/var/lib/pgsql/11/data --process-max=4 --repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest --stanza=pgdb2
2019-07-03 19:32:23.970 P00   INFO: unable to find 00000005.history in the archive
2019-07-03 19:32:23.970 P00   INFO: archive-get command end: completed successfully (5ms)
2019-07-03 19:32:24.835 P00   INFO: archive-get command begin 2.15: [0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress --db-timeout=3600 --log-level-console=info --log-level-file=detail --pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data --pg2-path=/var/lib/pgsql/11/data --process-max=4 --repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest --stanza=pgdb2
2019-07-03 19:32:24.938 P00   INFO: unable to find 0000000400000445000000C7 in the archive
2019-07-03 19:32:24.938 P00   INFO: archive-get command end: completed successfully (104ms)
2019-07-03 19:32:26.959 CEST [2580]: FATAL:  could not connect to the primary server: could not connect to server: No route to host
2019-07-03 19:32:26.971 P00   INFO: archive-get command begin 2.15: [00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress --db-timeout=3600 --log-level-console=info --log-level-file=detail --pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data --pg2-path=/var/lib/pgsql/11/data --process-max=4 --repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest --stanza=pgdb2
2019-07-03 19:32:26.976 P00   INFO: unable to find 00000005.history in the archive
2019-07-03 19:32:26.976 P00   INFO: archive-get command end: completed successfully (6ms)
2019-07-03 19:32:29.838 P00   INFO: archive-get command begin 2.15: [0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress --db-timeout=3600 --log-level-console=info --log-level-file=detail --pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data --pg2-path=/var/lib/pgsql/11/data --process-max=4 --repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest --stanza=pgdb2
2019-07-03 19:32:29.941 P00   INFO: unable to find 0000000400000445000000C7 in the archive
2019-07-03 19:32:29.941 P00   INFO: archive-get command end: completed successfully (104ms)
2019-07-03 19:32:29.965 CEST [3823]: FATAL:  could not connect to the primary server: could not connect to server: No route to host
2019-07-03 19:32:29.975 P00   INFO: archive-get command begin 2.15: [00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress --db-timeout=3600 --log-level-console=info --log-level-file=detail --pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data --pg2-path=/var/lib/pgsql/11/data --process-max=4 --repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest --stanza=pgdb2
2019-07-03 19:32:29.980 P00   INFO: unable to find 00000005.history in the archive
2019-07-03 19:32:29.980 P00   INFO: archive-get command end: completed successfully (6ms)
2019-07-03 19:32:34.844 P00   INFO: archive-get command begin 2.15: [0000000400000445000000C7, pg_wal/RECOVERYXLOG] --archive-async --no-compress --db-timeout=3600 --log-level-console=info --log-level-file=detail --pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data --pg2-path=/var/lib/pgsql/11/data --process-max=4 --repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest --stanza=pgdb2
2019-07-03 19:32:34.946 P00   INFO: unable to find 0000000400000445000000C7 in the archive
2019-07-03 19:32:34.946 P00   INFO: archive-get command end: completed successfully (103ms)
2019-07-03 19:32:37.956 CEST [4903]: FATAL:  could not connect to the primary server: could not connect to server: No route to host
2019-07-03 19:32:37.967 P00   INFO: archive-get command begin 2.15: [00000005.history, pg_wal/RECOVERYHISTORY] --archive-async --no-compress --db-timeout=3600 --log-level-console=info --log-level-file=detail --pg1-host=10.100.130.20 --pg1-path=/var/lib/pgsql/11/data --pg2-path=/var/lib/pgsql/11/data --process-max=4 --repo1-path=/var/lib/pgbackrest/mnt --spool-path=/var/spool/pgbackrest --stanza=pgdb2
2019-07-03 19:32:37.971 P00   INFO: unable to find 00000005.history in the archive
2019-07-03 19:32:37.971 P00   INFO: archive-get command end: completed successfully (5ms)
2019-07-03 19:32:39.682 CEST [702]: LOG:  received fast shutdown request
2019-07-03 19:32:39.692 CEST [702]: LOG:  aborting any active transactions
2019-07-03 19:32:39.693 CEST [30963]: FATAL:  terminating connection due to administrator command
2019-07-03 19:32:39.693 CEST [30963]: STATEMENT:  CHECKPOINT
2019-07-03 19:32:39.693 CEST [30963]: LOG:  could not send data to client: Broken pipe
2019-07-03 19:32:39.693 CEST [30963]: STATEMENT:  CHECKPOINT
2019-07-03 19:32:39.694 CEST [5868]: FATAL:  terminating connection due to administrator command
2019-07-03 19:32:39.694 CEST [5868]: STATEMENT:  SELECT SUM(idx_scan::bigint) FROM pg_stat_all_tables WHERE     schemaname <> 'pg_catalog'     AND schemaname <> 'information_schema'     AND schemaname !~ '^pg_toast'
2019-07-03 19:32:39.699 CEST [5886]: FATAL:  the database system is shutting down
-------------- next part --------------
2019-07-03 19:29:52.438 CEST [23255]: LOG:  checkpoint complete: wrote 91743 buffers (8.7%); 0 WAL file(s) added, 0 removed, 66 recycled; write=809.647 s, sync=5.
2019-07-03 19:31:11.526 CEST [23255]: LOG:  checkpoint starting: time
-------------- next part --------------
2019-07-03 19:31:20.048 CEST [26823]: LOG:  restartpoint complete: wrote 362269 buffers (34.5%); 0 WAL file(s) added, 0 removed, 9 recycled; write=809.797 s, sync=0.823 s, total=810.785 s; sync files=597, longest=0.146 s, average=0.001 s; distance=151568 kB, estimate=2426859 kB
2019-07-03 19:31:20.048 CEST [26823]: LOG:  recovery restart point at 444/DB777640
2019-07-03 19:31:20.048 CEST [26823]: DETAIL:  Last completed transaction was at log time 2019-07-03 19:31:19.511618+02.
2019-07-03 19:31:38.628 CEST [26823]: LOG:  restartpoint starting: immediate wait
2019-07-03 19:32:19.715 CEST [9468]: FATAL:  terminating walreceiver due to timeout
2019-07-03 19:32:22.720 CEST [34253]: FATAL:  could not connect to the primary server: could not connect to server: No route to host
...
2019-07-03 19:33:52.948 CEST [24754]: FATAL:  could not connect to the primary server: could not connect to server: No route to host
2019-07-03 19:33:53.204 CEST [26574]: LOG:  received fast shutdown request
2019-07-03 19:33:53.212 CEST [26574]: LOG:  aborting any active transactions
2019-07-03 19:33:53.212 CEST [20902]: FATAL:  terminating connection due to administrator command
2019-07-03 19:33:53.212 CEST [20902]: STATEMENT:  CHECKPOINT
2019-07-03 19:33:53.212 CEST [20902]: LOG:  could not send data to client: Broken pipe
2019-07-03 19:33:53.212 CEST [20902]: STATEMENT:  CHECKPOINT
2019-07-03 19:33:53.240 CEST [24843]: FATAL:  the database system is shutting down
...
2019-07-03 19:38:02.296 CEST [27169]: FATAL:  the database system is shutting down
2019-07-03 19:38:20.448 CEST [26823]: LOG:  restartpoint complete: wrote 745698 buffers (71.1%); 0 WAL file(s) added, 16 removed, 50 recycled; write=400.770 s, sync=0.002 s, total=401.819 s; sync files=538, longest=0.000 s, average=0.000 s; distance=1077509 kB, estimate=2291924 kB
2019-07-03 19:38:20.448 CEST [26823]: LOG:  recovery restart point at 445/1D3B8D20
2019-07-03 19:38:20.448 CEST [26823]: DETAIL:  Last completed transaction was at log time 2019-07-03 19:31:19.511618+02.
2019-07-03 19:38:20.450 CEST [26823]: LOG:  shutting down
2019-07-03 19:38:21.560 CEST [26574]: LOG:  database system is shut down

