[ClusterLabs] Replicated PGSQL woes

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Wed Oct 19 08:12:00 EDT 2016


On Wed, 19 Oct 2016 19:44:14 +0900
Keisuke MORI <keisuke.mori+ha at gmail.com> wrote:

> 2016-10-14 18:39 GMT+09:00 Jehan-Guillaume de Rorthais <jgdr at dalibo.com>:
> > On Thu, 13 Oct 2016 14:11:06 -0800
> > Israel Brewster <israel at ravnalaska.net> wrote:
> >  
> >> On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> >> wrote:  
> >> >
> >> > On Thu, 13 Oct 2016 10:05:33 -0800
> >> > Israel Brewster <israel at ravnalaska.net> wrote:
> >> >  
> >> >> On Oct 13, 2016, at 9:41 AM, Ken Gaillot <kgaillot at redhat.com> wrote:  
> >> >>>
> >> >>> On 10/13/2016 12:04 PM, Israel Brewster wrote:  
> >> > [...]
> >> >  
> >> >>>> But whatever- this is a cluster, it doesn't really matter which node
> >> >>>> things are running on, as long as they are running. So the cluster is
> >> >>>> working - postgresql starts, the master process is on the same node as
> >> >>>> the IP, you can connect, etc, everything looks good. Obviously the
> >> >>>> next thing to try is failover - should the master node fail, the
> >> >>>> slave node should be promoted to master. So I try testing this by
> >> >>>> shutting down the cluster on the primary server: "pcs cluster stop"
> >> >>>> ...and nothing happens. The master shuts down (uncleanly, I might add
> >> >>>> - it leaves behind a lock file that prevents it from starting again
> >> >>>> until I manually remove said lock file), but the slave is never
> >> >>>> promoted to  
> >> >>>
> >> >>> This definitely needs to be corrected. What creates the lock file, and
> >> >>> how is that entity managed?  
> >> >>
> >> >> The lock file entity is created/managed by the postgresql process
> >> >> itself. On launch, postgres creates the lock file to say it is running,
> >> >> and deletes said lock file when it shuts down. To my understanding, its
> >> >> role in life is to prevent a restart after an unclean shutdown so the
> >> >> admin is reminded to make sure that the data is in a consistent state
> >> >> before starting the server again.  
> >> >
> >> > What is the name of this lock file? Where is it?
> >> >
> >> > PostgreSQL does not create a lock file. It creates a "postmaster.pid"
> >> > file, but it does not forbid startup if the new process doesn't find
> >> > another process matching the pid and shm recorded in postmaster.pid.
> >> >
> >> > As far as I know, the pgsql resource agent creates such a lock file on
> >> > promote and deletes it on graceful stop. If the PostgreSQL instance
> >> > couldn't be stopped correctly, the lock file stays and the RA refuses to
> >> > start it the next time.
> >>
> >> Ah, you're right. Looking at the RA I see where it creates the file in
> >> question. The delete appears to be in the pgsql_real_stop() function (which
> >> makes sense), wrapped in an if block that checks for $1 being master and
> >> $OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little
> >> debugging code in there I see that when it hits that block on a cluster
> >> stop, $OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net,
> >> not a space, so the lock file is not removed:
> >>
> >>     if [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; then
> >>         ocf_log info "Removing $PGSQL_LOCK."
> >>         rm -f $PGSQL_LOCK
> >>     fi
> >>
> >> It doesn't look like there is anywhere else where the file would be
> >> removed.  
> >
> > This seems quite wrong to me, for two reasons (I'll try to be clear):
> >
> > 1) the resource agent (RA) makes sure the timeline (TL) will not be
> > incremented during promotion.
> >
> > As there is no documentation about that, I'm pretty sure this contortion
> > comes from limitations in very old versions of PostgreSQL (<= 9.1):
> >
> >   * a slave wasn't able to cross a timeline (TL) from streaming replication,
> >     only from WAL archives. That meant crossing a TL required restarting the
> >     slave or cutting the streaming replication temporarily to force it back
> >     to the archives
> >   * moreover, it was possible for a standby to miss some transactions after
> >     a clean master shutdown. That meant the old master couldn't get back to
> >     the cluster as a slave safely, as the TL was still the same...
> >
> > See slides 35-37:
> > http://www.slideshare.net/takmatsuo/2012929-pg-study-16012253
> >
> > In my understanding, that's why we make sure there's no slave around before
> > shutting down the master: should the master come back cleanly later, we
> > make sure no one could have been promoted in the meantime.
> 
> Yes, that is correct, but the issue described in the slide is not
> relevant to the Timeline ID issue,

Really? That was pretty much the point of this slide as I understand it. But
as I didn't attend this conference, I don't have the spoken explanation and I
might be wrong. Anyway, consider this scenario (a rough shell sketch follows
the list):

  * slaves are not connected
  * the master receives some transactions
  * a clean shutdown occurs on the master
  * a slave is promoted **without TL increment**
  * the old master is started as a slave with a recovery.conf file pointing to
    the new master
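
For what it's worth, here is a minimal sketch to observe the timeline side of
this from the shell. The data directories /tmp/pgm (old master) and /tmp/pgs
(slave) are hypothetical, and I assume 9.x binaries in the PATH:

    # clean shutdown of the master while the slave is disconnected
    pg_ctl -D /tmp/pgm stop -m fast

    # a regular promotion increments the TL; the RA trick suppresses this
    pg_ctl -D /tmp/pgs promote

    # the promoted node now reports a new TL, the old master still
    # reports the previous one
    pg_controldata /tmp/pgs | grep TimeLineID
    pg_controldata /tmp/pgm | grep TimeLineID

When the promotion is done without the TL increment, both greps report the
same TL and nothing flags the divergence between the two nodes.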

Because the timeline did not change during the slave promotion, we will get
corruption on the old master (and missing transactions, even in sync rep, as
the RA is able to relax it to async rep when the slaves are gone).

If the timeline is actually incremented, the old master will not be able to
catch up with the new master if its current LSN is ahead of the TL fork LSN,
which is the case in the scenario I described here.
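
To illustrate, the fork LSN is recorded in the timeline history file written
at promotion; with the same hypothetical directories as above (and the
pg_xlog layout of 9.x):

    # each line gives: parent TL, fork LSN, reason; the file name
    # 00000002.history assumes the promotion created timeline 2
    cat /tmp/pgs/pg_xlog/00000002.history

    # if the old master's last checkpoint location is past that fork
    # LSN, it cannot simply follow the new timeline
    pg_controldata /tmp/pgm | grep 'Latest checkpoint location'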

So it seems to me this is quite related to the TL issue here. But again, I
might be wrong, or I might be missing something, or missing another issue. I
would be **very**, very interested in more details about the issue described
in this slide; could you share them?

> and the issue in the slide could still possibly happen in recent
> PostgreSQL releases too, as far as I understand.

As I wrote, I might be missing something, and I would really like you to
explain the issue and why it might still happen in recent PostgreSQL
releases.

First because it might help PostgreSQL development itself, second because I
work on PAF, obviously.

> > Note that considering this issue and how the RA tries to avoid it, this
> > test on slaves being shut down before the master is quite weak anyway...
> >
> > Last but not least, the two PostgreSQL limitations the RA is messing with
> > were fixed a long time ago, in 9.3:
> >   * https://www.postgresql.org/docs/current/static/release-9-3.html#AEN138909
> >   * https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=985bd7d49726c9f178558491d31a570d47340459
> >
> > ...but it requires PostgreSQL 9.3+ for the timeline issue. By the way, I
> > suspect this is related to the "restart_on_promote" parameter of the RA.  
> 
> Yes, the "restart_on_promote" parameter was introduced upon user requests
> to avoid the Timeline ID issue with PostgreSQL 9.1 (I've never used
> the option though), and could be deprecated as of 9.3+,

Deprecated in favor of what? Always increment the TL?
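
For the record, since 9.3 a standby can follow a timeline switch over
streaming replication, so always incrementing the TL at promotion looks
workable. As a sketch, the old master could then rejoin with a recovery.conf
like this (the host name is hypothetical):

    # recovery.conf on the old master restarted as a slave (9.3+)
    standby_mode = 'on'
    primary_conninfo = 'host=new-master port=5432'
    # follow the newest timeline found on the upstream node
    recovery_target_timeline = 'latest'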

> but that's a different issue from the lock file and I think that the lock file
> handling is still valid.

I look forward to having some more input about this lock file, then :)

Thank you Keisuke!

Regards,
-- 
Jehan-Guillaume de Rorthais
Dalibo



