[ClusterLabs] PAF fails to promote slave: Can not get current node LSN location

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Mon Jul 8 10:58:58 EDT 2019


On Mon, 8 Jul 2019 13:56:49 +0200
Tiemen Ruiten <t.ruiten at tech-lab.io> wrote:

> Thank you for the clear explanation and advice.
> 
> Hardware is adequate: 8x SSD and 20 cores per node, but I should note that
> the filesystem is ZFS (stripe of mirrors) and there seems to be evidence
> that the way the WAL writer allocates space and ZFS' Copy-on-Write nature
> don't play nice. A patch that adds several GUCs to improve the situation

Wait, how would better WAL write performance help you there? Checkpoints do
not write to WAL; they sync data from shared buffers to the data files (heap,
toast, indexes, internal stuff, etc.). WAL write performance is related to the
number of transactions per second you can achieve (if you have
synchronous_commit >= local), not to your checkpoint writes.
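
To make the distinction concrete, here is a minimal, purely illustrative
postgresql.conf sketch of which GUCs drive each side (values are examples,
not recommendations):

  # commit latency: each COMMIT waits for its WAL records to be flushed
  synchronous_commit = on            # or remote_apply/remote_write/local/off

  # checkpoints: flushing dirty shared buffers to the data files
  checkpoint_timeout = 15min
  checkpoint_completion_target = 0.9
  log_checkpoints = on               # log duration and volume of each checkpoint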

> (at least it's worth trying, there was some disagreement on the
> pgsql-general list over whether it would be helpful in my situation)

Do you have a link to this thread?

> has recently been merged: https://postgrespro.com/list/thread-id/2393057 but
> it won't be available in the 11.x release. So while I'm waiting until I can
> upgrade to PostgreSQL 12, I'll increase the notify timeout.

Do not hold your breath until you upgrade to 12; I'm not convinced (but I
might be missing something) that this patch will be useful to you.

> A larger RTO is much preferable to manual actions in the middle of
> the night!

Sure... but it depends on the use case :)

> Thanks again!

You're welcome and good luck!

> On Mon, Jul 8, 2019 at 11:56 AM Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> wrote:
> 
> > On Sat, 6 Jul 2019 12:30:34 +0200
> > Tiemen Ruiten <t.ruiten at tech-lab.io> wrote:
> >  
> > > On Fri, Jul 5, 2019 at 5:09 PM Jehan-Guillaume de Rorthais
> > > <jgdr at dalibo.com> wrote:
> > >  
> > > > It seems to me the problem comes from here:
> > > >
> > > >   Jul 03 19:31:38 [30151] ph-sql-03.prod.ams.i.rdmedia.com  crmd: notice:
> > > >     te_rsc_command: Initiating notify operation
> > > >     pgsqld_pre_notify_promote_0 on ph-sql-05 | action 67
> > > >   Jul 03 19:32:38 [30148] ph-sql-03.prod.ams.i.rdmedia.com  lrmd: warning:
> > > >     operation_finished: pgsqld_notify_0:30939 - timed out after 60000ms
> > > >
> > > > and here:
> > > >
> > > >   Jul 03 19:31:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com  lrmd: info:
> > > >     log_execute: executing - rsc:pgsqld action:notify call_id:38
> > > >     pgsqlms(pgsqld)[20881]:
> > > >   Jul 03 19:32:38 [11914] ph-sql-05.prod.ams.i.rdmedia.com  lrmd: warning:
> > > >     operation_finished: pgsqld_notify_0:20881 - timed out after 60000ms
> > > >
> > > > The pgsql election occurs during the pre-promote action, where all
> > > > remaining nodes set their LSN location. During the promote action, the
> > > > designated primary checks that its LSN location is the highest one. If
> > > > it is not, it just cancels the promotion so the next round will elect
> > > > the best one.
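> > > >
> > > > To illustrate the idea (a simplified sketch, not the exact pgsqlms
> > > > code), each standby roughly reads its last received LSN:
> > > >
> > > >   SELECT pg_last_wal_receive_lsn();
> > > >
> > > > and publishes the result as a transient node attribute (via
> > > > attrd_updater) so the nodes can be compared during the election.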
> > > >
> > > > Back to your issue. According to the logs, both standby nodes timed
> > > > out during the pre-promote action. No LSN location has been set across
> > > > the whole cluster. I couldn't see any messages from the attrd daemon
> > > > related to the lsn_location attribute or other cleanup actions.
> > > >
> > > > I couldn't even find the INFO message from pgsqlms giving its current
> > > > status before actually setting it ("Current node TL#LSN: %s"). But this
> > > > message appears soon after a... checkpoint. See:
> > > > https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L2017
> > > >
> > > > Maybe you have very long checkpoints on both nodes that timed out the
> > > > pre-promote action? Do you have some PostgreSQL logs showing useful
> > > > info around this time frame?
> > > >  
> > >
> > > Yes, I have logs that show at what time checkpoints were taken. They're
> > > attached. Currently checkpoint_timeout is set to 15 minutes; I already
> > > lowered it from the previous value of 60 minutes. I could experiment with
> > > even lower values for checkpoint_timeout, as it seems the problem is that
> > > forced checkpoints are taking too long (more than 60s).
> >
> > So it seems the theory is right. The "immediate wait" restartpoint is the
> > CHECKPOINT issued by pgsqlms during the pre-promote action. Timestamps
> > match.
> >
> >   2019-07-03 19:31:38.647 CEST [1214]: LOG: restartpoint starting:
> >     immediate wait
> >
> > One minute later, we reach the LRM timeout, and about one second later the
> > CRM decides to shut down PgSQL, canceling the CHECKPOINT statement from
> > session 30963:
> >
> >   2019-07-03 19:32:39.693 CEST [30963]: FATAL:  terminating connection
> >     due to administrator command
> >   2019-07-03 19:32:39.693 CEST [30963]: STATEMENT: CHECKPOINT
> >  
> > > Perhaps I should also increase the notify timeout?  
> >
> > The checkpoints issued from pgsqlms are supposed to be "forced", which
> > means they are not throttled and should sync dirty buffers as fast as
> > possible. The worst-case scenario is to sync the whole shared buffers.
> > This shouldn't take a full minute on modern servers, unless:
> >
> > * you have huge shared buffers
> > * your disk bandwidth or IOPS are saturated or very poor
> > * you have disk issues
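> >
> > As a quick way to gauge the worst case (illustrative SQL; the statistics
> > columns are as of PostgreSQL 11), you can compare your shared buffers size
> > with what past checkpoints actually wrote:
> >
> >   SHOW shared_buffers;
> >
> >   SELECT checkpoints_timed, checkpoints_req,
> >          checkpoint_write_time, checkpoint_sync_time,
> >          buffers_checkpoint
> >     FROM pg_stat_bgwriter;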
> >
> > A side note about checkpoints:
> >
> > 15 minutes for checkpoint_timeout seems reasonable to me, but it really
> > depends on your workload. You should just make sure most of your
> > checkpoints are time-triggered, but not occurring too frequently, to
> > improve write caching and avoid full-page writes. According to your logs,
> > the last timed checkpoint synced about 33% of your shared buffers, 2.7GB,
> > in 811s (checkpoint throttled at 3.2MB/s). This is perfectly fine.
> >
> > The amount of data to sync during a failover really depends on your
> > workload. Syncing 2.7GB in 60s would require a bandwidth of 45MB/s.
> > Syncing 8GB in 60s would require 136MB/s. Both are quite easy to achieve
> > on a decent server setup.
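> >
> > (As a quick sanity check on those figures: 2.7GB is roughly 2765MB, and
> > 2765MB / 60s is about 46MB/s; 8GB is 8192MB, and 8192MB / 60s is about
> > 137MB/s, matching the numbers above within rounding.)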
> >
> > Back to the advice and conclusion:
> >
> > Lowering checkpoint_timeout will not help. You can always find a workload
> > that will fill your whole shared_buffers with dirty pages in less time
> > than this GUC value. Moreover, syncing 8GB of data (the worst-case
> > scenario) should not take ages...
> >
> > However, you can raise the notify action timeout to something much larger.
> > The timeout should be set depending on your RTO (warning, you will have to
> > sum various action timeouts). You can even raise the notify timeout to
> > 300s or more if it makes sense to you. The point is to give the CHECKPOINT
> > enough time to finish, no matter what.
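> >
> > For example, assuming you manage the cluster with pcs and the resource is
> > named pgsqld as in your logs, raising the notify timeout could look like
> > this (editing the CIB or using the crm shell works just as well):
> >
> >   pcs resource update pgsqld op notify timeout=300s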
> >
> > Unfortunately, we still need this checkpoint to be able to safely compare
> > the TL of the nodes :( Even the "pg_control_checkpoint()" function, which
> > appeared in 9.6, doesn't help, as it reads the controldata file from disk,
> > where data can be outdated compared to what is in memory :/
> > The only way to avoid this might be to rely on a dedicated C extension,
> > which seems too invasive in my opinion.
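> >
> > For reference, this is roughly what it exposes; the values come from the
> > pg_control file on disk, hence the staleness issue mentioned above:
> >
> >   SELECT timeline_id, checkpoint_lsn, redo_lsn
> >     FROM pg_control_checkpoint();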
> >
> > If you need something faster, you might have to set up your hardware to
> > match your performance requirements.
> >
> > Regards,
> >  



-- 
Jehan-Guillaume de Rorthais
Dalibo

