[ClusterLabs] PAF fails to promote slave: Can not get current node LSN location
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Wed Jul 10 08:47:18 EDT 2019
On Tue, 9 Jul 2019 22:22:47 +0200
Tiemen Ruiten <t.ruiten at tech-lab.io> wrote:
> On Tue, Jul 9, 2019 at 4:21 PM Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> wrote:
>
> > On Tue, 9 Jul 2019 13:22:06 +0200
> > Tiemen Ruiten <t.ruiten at tech-lab.io> wrote:
> >
> > > On Mon, Jul 8, 2019 at 10:01 PM Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> > ...
> > > > I dug into xlog.c today. Maybe I can write a small extension to get the
> > > > timeline from shared memory directly and make pgsqlms use it if it
> > > > detects it. That way people can decide if they feel like it is too
> > > > invasive or really needed for their use case. Maybe in the next release.
> > > > What do you think? Would it be useful to you?
> > > >
> > >
> > > Yes, that would be a really useful addition IMO. I would definitely use it.
> > > If we can avoid taking a checkpoint, that will save precious minutes during
> > > a failover, and the risk of timeouts would be drastically reduced. Would be
> > > happy to test it if you want!
> >
> > OK, thanks. Not sure when I'll have time to work on this. But I'll stay in
> > touch with you then.
> >
>
> Great!
>
>
> >
> > I have to work on the v12 support as well :/
> >
> > > > > I managed to improve the average time checkpoints are taking already
> > > > > from what I mentioned in that thread, mainly by decreasing
> > > > > checkpoint_timeout and setting full_page_writes = off; ostensibly not
> > > > > necessary on ZFS.
> > > >
> > > > Setting "full_page_writes" to off helps lower the amount of WAL produced,
> > > > not the amount of writes to sync during the checkpoint. But I am sure it
> > > > helps your performance :)
> > >
> > > If I'm saturating the IO capacity of my system during a forced checkpoint
> > > and full_page_writes = off reduces IO by reducing the amount of WAL, then
> > > it should help in an indirect way?
> >
> > The master is supposed to be gone during a failover, serving neither reads
> > nor writes.
>
>
> OK, I didn't consider this.
>
>
> > The checkpoint occurs on each standby to force sync their controldata. The
> > checkpoint itself does not write to the WALs or read them. Am I forgetting
> > something obvious?
> >
> > Maybe you can have some writes if the standby needs to sync the last
> > received WALs, and some reads if the standby was lagging on replay... But it
> > shouldn't be much...
> >
>
> I double-checked monitoring data: there was approximately one minute of
> replication lag on one slave and two minutes of replication lag on the
> other slave when the original issue occurred.
What lag? Current primary LSN versus sent, received, synced or replayed?
> By the way, I'm still seeing worrying amounts of replication lag on both
> slaves at times (usually not on both at the same time) so that's really
> puzzling: all hardware and configuration is identical.
Same question. What metric do you look at exactly?
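To make the distinction concrete, here is a minimal sketch of how each lag
could be computed in bytes. It assumes PostgreSQL 10 or later, where the
primary exposes sent_lsn, write_lsn, flush_lsn and replay_lsn for each standby
in pg_stat_replication, and where pg_current_wal_lsn() gives the current
primary LSN. The helper names lsn_to_int and lag_bytes are hypothetical, and
the LSN values you feed them would come from your own monitoring queries.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '2/A0000000' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits and
    the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, sent_lsn: str, write_lsn: str,
              flush_lsn: str, replay_lsn: str) -> dict:
    """Return the lag in bytes between the primary and each standby stage.

    (Hypothetical helper.) The four stages are different metrics:
      sent   - WAL the primary has sent over the connection
      write  - WAL the standby has received and written (not yet synced)
      flush  - WAL the standby has synced to disk
      replay - WAL the standby has actually applied
    """
    current = lsn_to_int(primary_lsn)
    return {
        "sent": current - lsn_to_int(sent_lsn),
        "write": current - lsn_to_int(write_lsn),
        "flush": current - lsn_to_int(flush_lsn),
        "replay": current - lsn_to_int(replay_lsn),
    }


if __name__ == "__main__":
    # Made-up values, for illustration only.
    lags = lag_bytes("2/A0000000", "2/9F000000", "2/9F000000",
                     "2/9E000000", "1/FF000000")
    for stage, value in lags.items():
        print(f"{stage:>6}: {value} bytes behind")
```

A large replay lag with a small flush lag means the standby has the WAL on
disk but is slow to apply it, which is not the same problem as a standby that
is not receiving WAL fast enough.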
> Anyway, that's something for another thread/mailinglist I suppose :)
Indeed. This should be discussed on pgsql-general rather than on clusterlabs :)