[ClusterLabs] PAF fails to promote slave: Can not get current node LSN location

Tiemen Ruiten t.ruiten at tech-lab.io
Mon Jul 8 07:56:49 EDT 2019


Thank you for the clear explanation and advice.

Hardware is adequate: 8x SSD and 20 cores per node. I should note, though, that
the filesystem is ZFS (a stripe of mirrors), and there seems to be evidence
that the way the WAL writer allocates space and ZFS's copy-on-write nature
don't play nicely together. A patch that adds several GUCs to improve the
situation has recently been merged (at least it's worth trying; there was some
disagreement on the pgsql-general list over whether it would help in my
situation): https://postgrespro.com/list/thread-id/2393057 but it won't be
available in the 11.x releases. So until I can upgrade to PostgreSQL 12, I'll
increase the notify timeout. A larger RTO is far preferable to manual
intervention in the middle of the night!
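
If that thread is the wal_recycle / wal_init_zero work that landed in 12, the
settings to try on ZFS would presumably look something like this (just a
sketch, untested here):

  # postgresql.conf, PostgreSQL 12+
  wal_recycle = off      # create new WAL segments instead of recycling old ones
  wal_init_zero = off    # skip zero-filling new WAL segments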

Thanks again!

On Mon, Jul 8, 2019 at 11:56 AM Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
wrote:

> On Sat, 6 Jul 2019 12:30:34 +0200
> Tiemen Ruiten <t.ruiten at tech-lab.io> wrote:
>
> > On Fri, Jul 5, 2019 at 5:09 PM Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> > wrote:
> >
> > > It seems to me the problem comes from here:
> > >
> > >   Jul 03 19:31:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com       crmd:
> > >  notice:
> > >     te_rsc_command:     Initiating notify operation
> > >     pgsqld_pre_notify_promote_0 on ph-sql-05 | action 67
> > >   Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com       lrmd:
> > > warning:
> > >     operation_finished: pgsqld_notify_0:30939 - timed out after 60000ms
> > >
> > > and here:
> > >
> > >   Jul 03 19:31:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com       lrmd:
> > >  info:
> > >     log_execute:        executing - rsc:pgsqld action:notify call_id:38
> > >     pgsqlms(pgsqld)[20881]:
> > >   Jul 03 19:32:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com       lrmd:
> > >     warning: operation_finished:        pgsqld_notify_0:20881 - timed out
> > >     after 60000ms
> > >
> > > The pgsql election occurs during the pre-promote action, where all
> > > remaining nodes set their LSN location. During the promote action, the
> > > designated primary checks that its LSN location is the highest one. If
> > > it is not, it simply cancels the promotion so the next round will elect
> > > the best one.
> > >
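> > > For a quick manual comparison of where each standby stands -- this is
> > > not how pgsqlms does it internally (it reads the controldata after a
> > > forced checkpoint) -- something like the following on each node is
> > > close enough, assuming PostgreSQL 10+ function names:
> > >
> > >   -- rough manual check of a standby's position
> > >   SELECT pg_last_wal_receive_lsn() AS received_lsn,
> > >          pg_last_wal_replay_lsn()  AS replayed_lsn;
> > >
> > > Roughly speaking, the node with the highest LSN is the one the election
> > > should pick.
> > >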
> > > Back to your issue. According to the logs, both standby nodes timed out
> > > during the pre-promote action. No LSN location was set across the whole
> > > cluster. I couldn't see any messages from the attrd daemon related to
> > > the lsn_location attribute or other cleanup actions.
> > >
> > > I couldn't even find the INFO message from pgsqlms giving its current
> > > status before actually setting it ("Current node TL#LSN: %s"). But this
> > > message appears soon after a... checkpoint. See:
> > > https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017
> > >
> > > Maybe you have very long checkpoints on both nodes that made the
> > > pre-promote action time out? Do you have PostgreSQL logs showing some
> > > useful info around that time frame?
> > >
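> > > One way to see how long the forced restartpoint actually takes on a
> > > standby, and what TL#LSN would be read afterwards, is a manual run
> > > (the binary and data paths below are assumptions for a PGDG 11 layout,
> > > adjust them):
> > >
> > >   # as the postgres user, on a standby
> > >   time psql -c "CHECKPOINT;"
> > >   /usr/pgsql-11/bin/pg_controldata /var/lib/pgsql/11/data \
> > >     | grep -E "TimeLineID|Latest checkpoint location"
> > >
> > > If the CHECKPOINT alone takes anywhere near 60s, the notify action is
> > > bound to time out.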
> >
> > Yes, I have logs that show at what time checkpoints were taken; they're
> > attached. Currently checkpoint_timeout is set to 15 minutes. I already
> > lowered it from the previous value of 60 minutes. I could experiment with
> > even lower values for checkpoint_timeout, as it seems the problem is that
> > forced checkpoints are taking too long (more than 60s).
>
> So it seems the theory is right. The "immediate wait" restartpoint is the
> CHECKPOINT issued by pgsqlms during the pre-promote action. The timestamps
> match.
>
>   2019-07-03 19:31:38.647 CEST [1214]: LOG: restartpoint starting:
>     immediate wait
>
> One minute later, the LRM timeout is reached, and about one second after
> that the CRM decides to shut down PostgreSQL, canceling the CHECKPOINT
> statement from session 30963:
>
>   2019-07-03 19:32:39.693 CEST [30963]: FATAL:  terminating connection due to
>     administrator command
>   2019-07-03 19:32:39.693 CEST [30963]: STATEMENT: CHECKPOINT
>
> > Perhaps I should also increase the notify timeout?
>
> The checkpoints issued from pgsqlms are supposed to be "forced", which means
> they are not throttled and should sync dirty buffers as fast as possible. The
> worst-case scenario is syncing the whole of shared_buffers. This shouldn't
> take a full minute on modern servers, unless (a quick check is sketched
> below the list):
>
> * your shared_buffers is very large
> * your disk bandwidth or IOPS are saturated or very poor
> * you have disk issues
>
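> A rough way to gauge the worst case (this assumes the pg_buffercache contrib
> extension is installed -- it is not required by PAF, this is just a one-off
> check -- and assumes the default 8kB block size):
>
>   SHOW shared_buffers;
>   -- how much is currently dirty in shared_buffers
>   SELECT count(*) AS dirty_buffers,
>          pg_size_pretty(count(*) * 8192) AS dirty_size
>   FROM pg_buffercache
>   WHERE isdirty;
>
> dirty_size is roughly what an immediate checkpoint would have to flush.
>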
> A side note about checkpoints:
>
> 15 minutes for checkpoint_timeout seems reasonable to me, but it really
> depends on your workload. You should just make sure most of your checkpoints
> are time-triggered, but not occurring too frequently, to improve write
> caching and avoid excessive full-page writes. According to your logs, the
> last timed checkpoint synced about 33% of your shared buffers, 2.7GB, in
> 811s (checkpoint throttled at 3.2MB/s). This is perfectly fine.
>
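> To check the timed vs. requested ratio, pg_stat_bgwriter is a standard view,
> nothing PAF-specific:
>
>   SELECT checkpoints_timed, checkpoints_req,
>          round(100.0 * checkpoints_timed
>                / nullif(checkpoints_timed + checkpoints_req, 0), 1)
>            AS pct_timed
>   FROM pg_stat_bgwriter;
>
> Most of them being timed is what you want.
>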
> The amount of data to sync during a failover really depends on your
> workload. Syncing 2.7GB in 60s would require a bandwidth of 45MB/s; syncing
> 8GB in 60s would require 136MB/s. Both are quite easy to achieve on a decent
> server setup.
>
> Back to the advice and conclusion:
>
> Lowering checkpoint_timeout will not help. You can always find a workload
> that fills your whole shared_buffers with dirty pages in less than this GUC
> value. Moreover, syncing 8GB of data (the worst-case scenario) should not
> take ages...
>
> However, you can raise the notify action timeout to something much larger.
> The timeout should be set depending on your RTO (warning: you will have to
> sum the timeouts of the various actions). You can even raise the notify
> timeout to 300s or more if that makes sense to you. The point is to give
> the CHECKPOINT enough time to finish, no matter what.
>
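> For instance, assuming pcs and a resource named pgsqld (adjust the names,
> and double-check the update semantics of your pcs version before running
> it), something along these lines:
>
>   pcs resource update pgsqld op notify timeout=300s
>   pcs resource show pgsqld     # confirm the new notify timeout
>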
> Unfortunately, we still need this checkpoint to be able to safely compare
> the TL of the nodes :( Even the "pg_control_checkpoint()" function, which
> appeared in 9.6, doesn't help, as it reads the controldata file from disk,
> where the data can be stale compared to what is in memory :/
> The only way to avoid this might be to rely on a dedicated C extension,
> which seems too invasive in my opinion.
>
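> For reference, that function looks like this (but, as said, it only
> reflects the on-disk control file, which may lag behind memory):
>
>   SELECT timeline_id, checkpoint_lsn
>   FROM pg_control_checkpoint();
>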
> If you need something faster, you might have to set up your hardware to
> match your performance requirements.
>
> Regards,
>