[ClusterLabs] PAF fails to promote slave: Can not get current node LSN location

Mon Jul 8 05:56:28 EDT 2019

On Sat, 6 Jul 2019 12:30:34 +0200
Tiemen Ruiten <t.ruiten at tech-lab.io> wrote:

> On Fri, Jul 5, 2019 at 5:09 PM Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> wrote:
> 
> > It seems to me the problem comes from here:
> >
> >   Jul 03 19:31:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com       crmd:
> >  notice:
> >     te_rsc_command:     Initiating notify operation
> >     pgsqld_pre_notify_promote_0 on ph-sql-05 | action 67
> >   Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com       lrmd:
> > warning:
> >     operation_finished: pgsqld_notify_0:30939 - timed out after 60000ms
> >
> > and here:
> >
> >   Jul 03 19:31:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com       lrmd:
> >  info:
> >     log_execute:        executing - rsc:pgsqld action:notify call_id:38
> >     pgsqlms(pgsqld)[20881]:
> >   Jul 03 19:32:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com       lrmd:
> >     warning: operation_finished:        pgsqld_notify_0:20881 - timed out
> > after
> >     60000ms
> >
> > The pgsql election occurs during the pre-promote action where all remaining
> > nodes set their LSN location. During the promote action, the designed
> > primary
> > checks its LSN location is the highest one. If it is not, it just cancel
> > the
> > promotion so the next round will elect the best one.
> >
> > Back to your issue. According to the logs, both standby nodes timed out
> > during
> > the pre-promote action. No LSN location has been set accros the whole
> > cluster.
> > I couldn't see any messages from the attrd daemon related to the
> > lsn_location
> > attribute or other cleanup actions.
> >
> > I couldn't even find the INFO message from pgsqlms giving its current
> > status
> > before actually setting it ("Current node TL#LSN: %s"). But this message
> > appears soon after a...checkpoint. See:
> > https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017
> >
> > Maybe you have very long checkpoints on both nodes that timed out the
> > pre-promote action? Do you have some PostgreSQL logs showing some useful
> > info
> > around this timelapse?
> >  
> 
> Yes I have logs that show at what time checkpoints were taken. They're
> attached. Currently checkpoint_timeout is set to 15 minutes. I already
> lowered it from the previous value of 60 minutes. I could experiment with
> even lower values for checkpoint_timeout as it seems the problem is that
> forced checkpoints are taking too long (more than 60s).

So it seems the theory is right. The "immediate wait" restartpoint is the
CHECKPOINT issued by pgsqlms during he pre-promote action. Timestamps match.

  2019-07-03 19:31:38.647 CEST [1214]: LOG: restartpoint starting: immediate
    wait

One minute later, we reach the LRM timeout and about one more second later, the
CRM decided to shutdown PgSQL, canceling the CHECKPOINT statement from session
30963:

  2019-07-03 19:32:39.693 CEST [30963]: FATAL:  terminating connection due to
    administrator command
  2019-07-03 19:32:39.693 CEST [30963]: STATEMENT: CHECKPOINT

> Perhaps I should also increase the notify timeout?

The checkpoints issued from pgsqlms are supposed to be "forced", which means
they are not throttled and should sync dirty buffers as fast as possible. Worst
case scenario is to sync the whole shared buffers. This shouldn't take a full
minute on modern servers, unless:

* you have a huge shared buffers
* your disk bandwidth or IOPS are saturated or very poor
* you have disks issues

A side note about checkpoints:

15 minutes for checkpoint_timeout seems reasonable to me, but it really depend
on your workload. You should just make sure most of your checkpoint are
timed triggered, but not occurring too frequently to improve write cache and
avoid full page writes. According to your logs, last timed checkpoint synced
about 33% of your shared buffers, 2.7GB, in 811s (checkpoint throttled at
3.2MB/s). This is perfectly fine.

The amount of data to sync during a failover really depend on your workload.
Syncing 2.7GB in 60s would require a bandwidth of 45MB/s. Syncing 8GB in 60s
would require 136MB/s. Both are quite easy to achieve on a decent server setup.

Back to the advice and conclusion:

Lowering the checkpoint_timeout will not help. You can always find a workload
that will fill your whole shared_buffers with dirty pages in less than this GUC
value. Moreover, syncing 8GB of data (worst case scenario) should not take
ages...

However, can raise the notify action timeout to something way larger. The
timeout should be set depending on your RTO (warning, you will have to sum
various actions timeouts). You can even raise the notify action to 300s or
larger if it makes sense to you. The point is to give enough time to the
CHECKPOINT to finish, no matter what.

Unfortunately, we still need this checkpoint to be able to compare safely the
TL of nodes :( Even the "pg_control_checkpoint()" function appearing in 9.6
doesn't help as it reads the controldata file from disk where data can be
outdated with the one in memory :/
The only way to avoid this might be to rely on a dedicated C extension, which
seems too invasive in my opinion.

If you need something faster, you might have to setup your hardware to match
your performance requirement.

Regards,