[ClusterLabs] pg cluster secondary not syncing after failover

Fri Jul 15 14:38:47 UTC 2016

On 07/15/2016 08:58 AM, Peter Brunnengräber wrote:
> Hello all,
>   My apologies for cross-posting this from the postgresql admins list.  I am beginning to think this may have more to do with the postgresql cluster script.
> 
>   I'm having an issue with a postgresql 9.2 cluster after failover and hope someone might be able to help me out.  I have been trying to follow the after-failover procedure in the ClusterLabs guide (1), but I am not having much luck and I don't quite understand where the issue is.  I'm running on Debian wheezy.
> 
>   I have my crm_mon output below.  One server is PRI and operating normally after taking over.  I have pg set up to do WAL archiving via rsync to the opposite node.  <archive_command = 'rsync -a %p test-node2:/db/data/postgresql/9.2/pg_archive/%f'>  The rsync is working and I do see WAL files going to the other host appropriately.
> 
>   Node2 was the PRI.  Node1, which was previously in HS:sync, was promoted to PRI last night, and node2 is now stopped.  The WAL files are arriving from node1 on node2.  I cleaned up the /tmp/PGSQL.lock file and proceeded with a pg_basebackup restore from node1.  This all went well without error in the node1 postgresql log.
> 
>   After running a crm cleanup on the msPostgresql resource, node2 keeps showing 'LATEST' but gets hung up at HS:alone.  Plus I don't understand why the xlog-loc of node2 shows 0000001EB9053DD8, which is further ahead than node1's master-baseline of 0000001EB2000080.  I saw the 'cannot stat ... 000000010000001E000000BB' error, but that seems to always happen for the current xlog filename.  Manually copying the missing WAL file from the PRI does not help.
> 
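On the xlog-loc question above: the two attribute values can at least be
decoded and compared by hand. What follows is a minimal Python sketch, not
part of the resource agent. It assumes the 16-digit value is the usual
PostgreSQL location X/Y printed as two zero-padded 8-digit hex halves, that
WAL segments are the default 16 MB, and that timeline 1 applies (as in the
file names in your log). The function names are made up for illustration.

```python
# Sketch: decode the 16-hex-digit xlog values shown in crm_mon.
# Assumptions: value = two zero-padded 8-digit hex halves of "X/Y",
# default 16 MB WAL segments, timeline 1 unless told otherwise.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size in bytes


def parse_loc(value):
    """Split '0000001EB9053DD8' into (xlogid, xrecoff) integers."""
    if len(value) != 16:
        raise ValueError("expected 16 hex digits, got %r" % value)
    return int(value[:8], 16), int(value[8:], 16)


def as_pg_location(value):
    """Render the value the way PostgreSQL logs it, e.g. '1E/B9053DD8'."""
    xlogid, xrecoff = parse_loc(value)
    return "%X/%X" % (xlogid, xrecoff)


def wal_file_name(value, timeline=1):
    """Name of the WAL segment file that contains this location."""
    xlogid, xrecoff = parse_loc(value)
    return "%08X%08X%08X" % (timeline, xlogid, xrecoff // WAL_SEGMENT_SIZE)


def byte_gap(a, b):
    """Bytes from b to a; only meaningful when both share the same xlogid."""
    a_id, a_off = parse_loc(a)
    b_id, b_off = parse_loc(b)
    if a_id != b_id:
        raise ValueError("locations are in different xlogids")
    return a_off - b_off


if __name__ == "__main__":
    node2 = "0000001EB9053DD8"  # pgsql-xlog-loc on test-node2
    node1 = "0000001EB2000080"  # pgsql-master-baseline on test-node1
    print(as_pg_location(node2), wal_file_name(node2))
    print(as_pg_location(node1), wal_file_name(node1))
    print(byte_gap(node2, node1), "bytes apart")
```

With the values from your crm_mon output, node2's location decodes to
1E/B9053DD8, inside segment 000000010000001E000000B9, and node1's baseline
decodes to 1E/B2000080.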
>   And if I wasn't confused enough, the pg log on node2 says "streaming replication successfully connected to primary" and the pg_stat_replication query on node1 shows connected, but ASYNC.
> 
> 
> Any ideas?

Hopefully, someone with pgsql experience can comment -- I can only give
some general pointers.

The cluster software versions in wheezy are considered quite old at this
point, though I'm not aware of anything in particular that would affect
this scenario.

Pacemaker was dropped from jessie due to an unfortunate missed deadline,
but the Debian HA team has gotten recent versions of everything working
and deployed to jessie-backports (as well as stretch and sid), so it is
easy to get a cluster going on jessie now.

You might also compare your installed pgsql resource agent against the
latest upstream to see if any changes might be relevant:

https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
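
One way to do that comparison is sketched below in Python, using only the
standard library. It assumes you have already saved a copy of the upstream
script locally; the installed agent normally lives under the OCF root,
typically /usr/lib/ocf/resource.d/heartbeat/pgsql, but treat that path as an
assumption and check your own system. The function name and the labels in
the diff header are made up for illustration.

```python
import difflib
import sys


def agent_diff(installed_text, upstream_text):
    """Return a unified diff (list of lines) between two agent versions."""
    return list(difflib.unified_diff(
        installed_text.splitlines(),
        upstream_text.splitlines(),
        fromfile="installed/pgsql",   # label only, not a real path
        tofile="upstream/pgsql",      # label only, not a real path
        lineterm="",
    ))


# Usage: python agent_diff.py INSTALLED_COPY UPSTREAM_COPY
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as installed, open(sys.argv[2]) as upstream:
        for line in agent_diff(installed.read(), upstream.read()):
            print(line)
```

An empty result means the two files are identical; otherwise the lines
starting with "-" are only in your installed copy and those starting with
"+" are only in upstream.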

> 
> Very much appreciated!
> -With kind regards,
>  Peter Brunnengräber
> 
> 
> 
> References:
> (1) http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#after_fail-over
> 
> 
> ###
> ============
> Last updated: Wed Jul 13 14:51:53 2016
> Last change: Wed Jul 13 14:49:17 2016 via crmd on test-node2
> Stack: openais
> Current DC: test-node1 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
> 
> Online: [ test-node1 test-node2 ]
> 
> Full list of resources:
> 
>  Resource Group: g_master
>      ClusterIP-Net1     (ocf::heartbeat:IPaddr2):       Started test-node1
>      ReplicationIP-Net2 (ocf::heartbeat:IPaddr2):       Started test-node1
>  Master/Slave Set: msPostgresql [pgsql]
>      Masters: [ test-node1 ]
>      Slaves: [ test-node2 ]
> 
> Node Attributes:
> * Node test-node1:
>     + master-pgsql:0                    : 1000
>     + master-pgsql:1                    : 1000
>     + pgsql-data-status                 : LATEST
>     + pgsql-master-baseline             : 0000001EB2000080
>     + pgsql-status                      : PRI
> * Node test-node2:
>     + master-pgsql:0                    : -INFINITY
>     + master-pgsql:1                    : -INFINITY
>     + pgsql-data-status                 : LATEST
>     + pgsql-status                      : HS:alone
>     + pgsql-xlog-loc                    : 0000001EB9053DD8
> 
> Migration summary:
> * Node test-node2:
> * Node test-node1:
> 
> 
> #### Node2
> 2016-07-13 14:55:09 UTC LOG:  database system was interrupted; last known up at 2016-07-13 14:54:27 UTC
> 2016-07-13 14:55:09 UTC LOG:  creating missing WAL directory "pg_xlog/archive_status"
> cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
> 2016-07-13 14:55:09 UTC LOG:  entering standby mode
> 2016-07-13 14:55:09 UTC LOG:  restored log file "000000010000001E000000BA" from archive
> 2016-07-13 14:55:09 UTC FATAL:  the database system is starting up
> 2016-07-13 14:55:09 UTC LOG:  redo starts at 1E/BA000020
> 2016-07-13 14:55:09 UTC LOG:  consistent recovery state reached at 1E/BA05FED8
> 2016-07-13 14:55:09 UTC LOG:  database system is ready to accept read only connections
> cp: cannot stat `/db/data/postgresql/9.2/pg_archive/000000010000001E000000BB': No such file or directory
> cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
> 2016-07-13 14:55:09 UTC LOG:  streaming replication successfully connected to primary
> 
> 
> #### Node1
> postgres=# select application_name,upper(state),upper(sync_state) from pg_stat_replication;
> +------------------+-----------+-------+
> | application_name |   upper   | upper |
> +------------------+-----------+-------+
> | test-node2       | STREAMING | ASYNC |
> +------------------+-----------+-------+
> (1 row)
> 