[ClusterLabs] Pacemaker: pgsql

Fri Sep 27 18:10:11 EDT 2019

On Fri, 27 Sep 2019 12:14:09 -0500
Ken Gaillot <kgaillot at redhat.com> wrote:

> On Fri, 2019-09-27 at 19:03 +0530, Shital A wrote:
> > 
> > 
> > On Tue, 24 Sep 2019, 22:20 Shital A, <brightuser2019 at gmail.com>
> > wrote:  
> > > Hello,
> > > 
> > > We have set up an active-passive cluster using streaming replication on
> > > RHEL 7.5. We are testing Pacemaker for automated failover.
> > > We are seeing the issues below with this setup:
> > > 
> > > 1. When a failover is triggered while data is being added to the
> > > primary, by killing the primary (killall -9 postgres), the standby
> > > doesn't come up in sync.
> > > In Pacemaker, crm_mon -Afr shows the standby in the disconnected and
> > > HS:alone state.
> > > 
> > > In the PostgreSQL log, we see the errors below:
> > > 
> > > < 2019-09-20 17:07:46.266 IST > LOG:  entering standby mode
> > > < 2019-09-20 17:07:46.267 IST > LOG:  database system was not
> > > properly shut down; automatic recovery in progress
> > > < 2019-09-20 17:07:46.270 IST > LOG:  redo starts at 1/680A2188
> > > < 2019-09-20 17:07:46.370 IST > LOG:  consistent recovery state
> > > reached at 1/6879D9F8
> > > < 2019-09-20 17:07:46.370 IST > LOG:  database system is ready to
> > > accept read only connections
> > > cp: cannot stat
> > > '/var/lib/pgsql/9.6/data/archivedir/000000010000000100000068': No
> > > such file or directory
> > > < 2019-09-20 17:07:46.751 IST > LOG:  statement: select
> > > pg_is_in_recovery()
> > > < 2019-09-20 17:07:46.782 IST > LOG:  statement: show
> > > synchronous_standby_names
> > > < 2019-09-20 17:07:50.993 IST > LOG:  statement: select
> > > pg_is_in_recovery()
> > > < 2019-09-20 17:07:53.395 IST > LOG:  started streaming WAL from
> > > primary at 1/68000000 on timeline 1
> > > < 2019-09-20 17:07:53.436 IST > LOG:  invalid contrecord length
> > > 2662 at 1/6879D9F8
> > > < 2019-09-20 17:07:53.438 IST > FATAL:  terminating walreceiver
> > > process due to administrator command
> > > cp: cannot stat
> > > '/var/lib/pgsql/9.6/data/archivedir/00000002.history': No such file
> > > or directory
> > > cp: cannot stat
> > > '/var/lib/pgsql/9.6/data/archivedir/000000010000000100000068': No
> > > such file or directory
> > > 
> > > When we restart postgres on the standby using pg_ctl restart,
> > > the standby starts syncing.
> > > 
> > > 
> > > 2. After standby syncs using pg_ctl restart as mentioned above, we
> > > found out that 1-2 records are missing on the standby.
> > > 
> > > Need help to check:
> > > 1. Why does the standby start in the disconnected, HS:alone state?
> > > 
> > > If you have faced this issue or have any knowledge of it, please let
> > > us know.
> > > 
> > > Thanks.  
> > 
> > 
> > Hello,
> > 
> > I didn't receive any reply on this issue. I am wondering whether there
> > are no opinions, or whether Pacemaker with pgsql is not recommended?

I did not read your mail because my experience with the pgsql resource agent
is quite old and I lost interest in it. Now that I focus on the details of your
original mail, something looks strange to me: how could a standby lose records?

In a normal situation, a standby is more or less a clone of the primary, no
matter how you kill it. At worst, the clone is just lagging behind, but it
cannot "lose records".
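
To put a number on "lagging behind", one can compare two WAL positions (LSNs).
Below is a minimal sketch in Python, assuming the usual textual LSN form
"X/Y" (the high and low 32 bits of a 64-bit byte position, in hexadecimal) as
it appears in the log above. The function names are hypothetical, chosen only
for this illustration.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '1/6879D9F8' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_distance_bytes(later: str, earlier: str) -> int:
    """Number of WAL bytes between two LSNs (negative if 'later' is behind)."""
    return parse_lsn(later) - parse_lsn(earlier)


# Two positions taken from the standby log above:
# consistent recovery point and redo start.
print(lsn_distance_bytes("1/6879D9F8", "1/680A2188"))
```

In practice the two inputs would come from the primary and the standby, for
example pg_current_xlog_location() on the primary and
pg_last_xlog_replay_location() on the standby with 9.6. A positive result means
the standby is behind by that many bytes of WAL, which is lag, not lost data.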

Are you able to reproduce this behavior outside Pacemaker? Just build your
primary and standby, wait for them to replicate, create some activity, then kill
the primary and restart it. If you lose records, then please provide more
information about your whole procedure so we can reproduce it and investigate
what's going on.
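
While investigating, it can help to map an LSN from the log to the WAL segment
file that restore_command tries to fetch from the archive. Here is a rough
sketch in Python, assuming the default 16 MB WAL segment size (your build may
differ) and the standard file name layout of timeline, high 32 bits, and
segment number, each as 8 hex digits. The function name is hypothetical.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def wal_segment_name(lsn: str, timeline: int = 1) -> str:
    """Return the WAL segment file name containing the given LSN."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    segment = low // WAL_SEGMENT_SIZE  # segment index within this 4 GB range
    return f"{timeline:08X}{high:08X}{segment:08X}"


# LSN where the walreceiver reported "invalid contrecord length".
print(wal_segment_name("1/6879D9F8", timeline=1))
```

For the LSN above on timeline 1, this gives the same segment name that the
"cp: cannot stat" messages complain about, so checking whether that file ever
reached archivedir is a reasonable first step.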

> There are quite a few pacemaker+pgsql users active on this list, but
> they may not have time to respond at the moment. Most are using the PAF
> agent rather than the pgsql agent (see 
> https://github.com/ClusterLabs/PAF ).

This might not be a bug in the pgsql RA. My first guess would be the
procedure. But indeed, the OP might want to have a look at the PAF resource
agent.

Regards,