[ClusterLabs] PAF fails to promote slave: Can not get current node LSN location

Tiemen Ruiten t.ruiten at tech-lab.io
Tue Jul 9 16:22:47 EDT 2019


On Tue, Jul 9, 2019 at 4:21 PM Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
wrote:

> On Tue, 9 Jul 2019 13:22:06 +0200
> Tiemen Ruiten <t.ruiten at tech-lab.io> wrote:
>
> > On Mon, Jul 8, 2019 at 10:01 PM Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> ...
> > > I dug into xlog.c today. Maybe I can write a small extension to get the
> > > timeline from shared memory directly and make pgsqlms use it if it
> > > detects it. So people can decide whether they feel it is too invasive or
> > > really needed for their use case. Maybe in the next release. What do you
> > > think? Would it be useful to you?
> > >
> >
> > Yes, that would be a really useful addition IMO. I would definitely use
> > it. If we can avoid taking a checkpoint, that will save precious minutes
> > during a failover and the risk of timeouts would be drastically reduced.
> > Would be happy to test it if you want!
>
> OK, thanks. Not sure when I'll have time to work on this. But I'll stay in
> touch with you then.
>

Great!


>
> I have to work on the v12 support as well :/
>
> > > > I have already managed to improve the average time checkpoints take,
> > > > compared to what I mentioned in that thread, mainly by decreasing
> > > > checkpoint_timeout and setting full_page_writes = off (ostensibly not
> > > > necessary on ZFS).
> > >
> > > Turning "full_page_writes" off helps lower the amount of WAL produced,
> > > not the amount of writes to sync during the checkpoint. But I am sure it
> > > helps your performance :)
> >
> > If I'm saturating the IO capacity of my system during a forced checkpoint
> > and full_page_writes = off reduces IO by reducing the amount of WAL, then
> > it should help in an indirect way?
>
> The master is supposed to be gone during a failover, so it generates
> neither reads nor writes.


OK, I didn't consider this.


> The checkpoint occurs on each standby to force sync their
> controldata. The checkpoint itself does not write to WALs or read them.
> Am I forgetting something obvious?
>
> Maybe you can have some writes if the standby needs to sync the last
> received WALs and some reads if the standby was lagging on replay... But it
> shouldn't be much...
>
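
To make the controldata point concrete, here is a minimal sketch of reading
the timeline and checkpoint location that a standby last recorded in its
control file. It assumes the usual English `pg_controldata` text output
(the labels are localized, so run it with LC_ALL=C). The function name and
the sample values are hypothetical, and this is not how pgsqlms does it.

```python
def parse_controldata(text):
    """Turn `pg_controldata` text output into a {label: value} dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only: some values (dates) contain colons.
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


# Hypothetical excerpt of `pg_controldata $PGDATA` output on a standby.
SAMPLE = """\
Database cluster state:               in archive recovery
Latest checkpoint location:           0/3000098
Latest checkpoint's TimeLineID:       1
Minimum recovery ending location:     0/3000140
"""

info = parse_controldata(SAMPLE)
timeline = int(info["Latest checkpoint's TimeLineID"])
checkpoint_lsn = info["Latest checkpoint location"]
```

These values only change when the standby completes a checkpoint, which is
why a forced one is needed before they can be trusted.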

I double-checked monitoring data: there was approximately one minute of
replication lag on one slave and two minutes of replication lag on the
other slave when the original issue occurred. By the way, I'm still seeing
worrying amounts of replication lag on both slaves at times (usually not on
both at the same time), which is really puzzling: all hardware and
configuration are identical. Anyway, that's something for another
thread/mailing list I suppose :)
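
For comparing nodes, lag can also be expressed in bytes rather than minutes.
A minimal sketch, assuming LSNs in the usual "hi/lo" hexadecimal text form,
where the part before the slash is the upper 32 bits of the WAL position.
The helper names and the example values are hypothetical.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '0/3000060' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lag_bytes(primary_lsn, standby_lsn):
    """Bytes of WAL the standby is behind the primary (negative if ahead)."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn)


# Hypothetical positions: the standby is 0x40 bytes behind.
behind = lag_bytes("0/3000140", "0/3000100")
```

The same comparison can be used to pick the most advanced standby when both
report the same timeline.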


>
> --
> Jehan-Guillaume de Rorthais
> Dalibo
>