<div dir="ltr"><div dir="ltr">Thank you for the clear explanation and advice. <div><br></div><div>Hardware is adequate: 8x SSD and 20 cores per node, but I should note that the filesystem is ZFS (stripe of mirrors) and there seems to be evidence that the way the WAL writer allocates space and ZFS' Copy-on-Write nature don't play nice. A patch that adds several GUCs to improve the situation (at least it's worth trying, there was some disagreement on the pgsql-general list over whether it would be helpful in my situation) has recently been merged: <a href="https://postgrespro.com/list/thread-id/2393057" target="_blank">https://postgrespro.com/list/thread-id/2393057</a> but it won't be available in the 11.x release . So while I'm waiting until I can upgrade to PostgreSQL 12, I'll increase the notify timeout. A larger RTO is much more preferrable over manual actions in the middle of the night!</div><div><br></div><div>Thanks again!</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 8, 2019 at 11:56 AM Jehan-Guillaume de Rorthais <<a href="mailto:jgdr@dalibo.com" target="_blank">jgdr@dalibo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Sat, 6 Jul 2019 12:30:34 +0200<br>

Tiemen Ruiten <<a href="mailto:t.ruiten@tech-lab.io" target="_blank">t.ruiten@tech-lab.io</a>> wrote:<br>

<br>

> On Fri, Jul 5, 2019 at 5:09 PM Jehan-Guillaume de Rorthais <<a href="mailto:jgdr@dalibo.com" target="_blank">jgdr@dalibo.com</a>><br>

> wrote:<br>

> <br>

> > It seems to me the problem comes from here:<br>

> ><br>

> >   Jul 03 19:31:38 [30151] <a href="http://ph-sql-03.prod.ams.i.rdmedia.com" rel="noreferrer" target="_blank">ph-sql-03.prod.ams.i.rdmedia.com</a>       crmd:<br>

> >  notice:<br>

> >     te_rsc_command:     Initiating notify operation<br>

> >     pgsqld_pre_notify_promote_0 on ph-sql-05 | action 67<br>

> >   Jul 03 19:32:38 [30148] <a href="http://ph-sql-03.prod.ams.i.rdmedia.com" rel="noreferrer" target="_blank">ph-sql-03.prod.ams.i.rdmedia.com</a>       lrmd:<br>

> > warning:<br>

> >     operation_finished: pgsqld_notify_0:30939 - timed out after 60000ms<br>

> ><br>

> > and here:<br>

> ><br>

> >   Jul 03 19:31:38 [11914] <a href="http://ph-sql-05.prod.ams.i.rdmedia.com" rel="noreferrer" target="_blank">ph-sql-05.prod.ams.i.rdmedia.com</a>       lrmd:<br>

> >  info:<br>

> >     log_execute:        executing - rsc:pgsqld action:notify call_id:38<br>

> >     pgsqlms(pgsqld)[20881]:<br>

> >   Jul 03 19:32:38 [11914] <a href="http://ph-sql-05.prod.ams.i.rdmedia.com" rel="noreferrer" target="_blank">ph-sql-05.prod.ams.i.rdmedia.com</a>       lrmd:<br>

> >     warning: operation_finished:        pgsqld_notify_0:20881 - timed out<br>

> > after<br>

> >     60000ms<br>

> ><br>

> > The pgsql election occurs during the pre-promote action where all remaining<br>

> > nodes set their LSN location. During the promote action, the designed<br>

> > primary<br>

> > checks its LSN location is the highest one. If it is not, it just cancel<br>

> > the<br>

> > promotion so the next round will elect the best one.<br>

> ><br>

> > Back to your issue. According to the logs, both standby nodes timed out<br>

> > during<br>

> > the pre-promote action. No LSN location has been set accros the whole<br>

> > cluster.<br>

> > I couldn't see any messages from the attrd daemon related to the<br>

> > lsn_location<br>

> > attribute or other cleanup actions.<br>

> ><br>

> > I couldn't even find the INFO message from pgsqlms giving its current<br>

> > status<br>

> > before actually setting it ("Current node TL#LSN: %s"). But this message<br>

> > appears soon after a...checkpoint. See:<br>

> > <a href="https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017" rel="noreferrer" target="_blank">https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017</a><br>

> ><br>

> > Maybe you have very long checkpoints on both nodes that timed out the<br>

> > pre-promote action? Do you have some PostgreSQL logs showing some useful<br>

> > info<br>

> > around this timelapse?<br>

> >  <br>

> <br>

> Yes I have logs that show at what time checkpoints were taken. They're<br>

> attached. Currently checkpoint_timeout is set to 15 minutes. I already<br>

> lowered it from the previous value of 60 minutes. I could experiment with<br>

> even lower values for checkpoint_timeout as it seems the problem is that<br>

> forced checkpoints are taking too long (more than 60s).<br>

<br>

So it seems the theory is right. The "immediate wait" restartpoint is the<br>

CHECKPOINT issued by pgsqlms during he pre-promote action. Timestamps match.<br>

<br>

  2019-07-03 19:31:38.647 CEST [1214]: LOG: restartpoint starting: immediate<br>

    wait<br>

<br>

One minute later, we reach the LRM timeout and about one more second later, the<br>

CRM decided to shutdown PgSQL, canceling the CHECKPOINT statement from session<br>

30963:<br>

<br>

  2019-07-03 19:32:39.693 CEST [30963]: FATAL:  terminating connection due to<br>

    administrator command<br>

  2019-07-03 19:32:39.693 CEST [30963]: STATEMENT: CHECKPOINT<br>

<br>

> Perhaps I should also increase the notify timeout?<br>

<br>

The checkpoints issued from pgsqlms are supposed to be "forced", which means<br>

they are not throttled and should sync dirty buffers as fast as possible. Worst<br>

case scenario is to sync the whole shared buffers. This shouldn't take a full<br>

minute on modern servers, unless:<br>

<br>

* you have a huge shared buffers<br>

* your disk bandwidth or IOPS are saturated or very poor<br>

* you have disks issues<br>

<br>

A side note about checkpoints:<br>

<br>

15 minutes for checkpoint_timeout seems reasonable to me, but it really depend<br>

on your workload. You should just make sure most of your checkpoint are<br>

timed triggered, but not occurring too frequently to improve write cache and<br>

avoid full page writes. According to your logs, last timed checkpoint synced<br>

about 33% of your shared buffers, 2.7GB, in 811s (checkpoint throttled at<br>

3.2MB/s). This is perfectly fine.<br>

<br>

The amount of data to sync during a failover really depend on your workload.<br>

Syncing 2.7GB in 60s would require a bandwidth of 45MB/s. Syncing 8GB in 60s<br>

would require 136MB/s. Both are quite easy to achieve on a decent server setup.<br>

<br>

Back to the advice and conclusion:<br>

<br>

Lowering the checkpoint_timeout will not help. You can always find a workload<br>

that will fill your whole shared_buffers with dirty pages in less than this GUC<br>

value. Moreover, syncing 8GB of data (worst case scenario) should not take<br>

ages...<br>

<br>

However, can raise the notify action timeout to something way larger. The<br>

timeout should be set depending on your RTO (warning, you will have to sum<br>

various actions timeouts). You can even raise the notify action to 300s or<br>

larger if it makes sense to you. The point is to give enough time to the<br>

CHECKPOINT to finish, no matter what.<br>

<br>

Unfortunately, we still need this checkpoint to be able to compare safely the<br>

TL of nodes :( Even the "pg_control_checkpoint()" function appearing in 9.6<br>

doesn't help as it reads the controldata file from disk where data can be<br>

outdated with the one in memory :/<br>

The only way to avoid this might be to rely on a dedicated C extension, which<br>

seems too invasive in my opinion.<br>

<br>

If you need something faster, you might have to setup your hardware to match<br>

your performance requirement.<br>

<br>

Regards,<br>

</blockquote></div>

</div>