[ClusterLabs] pg cluster secondary not syncing after failover
Peter Brunnengräber
pbrunnen at bccglobal.com
Fri Jul 15 13:58:30 UTC 2016
Hello all,
My apologies for cross-posting this from the postgresql admins list. I am beginning to think this may have more to do with the postgresql cluster script.
I'm having an issue with a PostgreSQL 9.2 cluster after a failover and hope someone might be able to help me out. I have been trying to follow the ClusterLabs wiki guide(1) for recovery after a failover, but I'm not having much luck and don't quite understand where the issue is. I'm running on Debian Wheezy.
I have my crm_mon output below. One server is PRI and operating normally after taking over. I have PostgreSQL set up to do the WAL archiving via rsync to the opposite node: <archive_command = 'rsync -a %p test-node2:/db/data/postgresql/9.2/pg_archive/%f'>. The rsync is working and I do see WAL files arriving on the other host as expected.
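For reference, this is roughly how I verify the shipping is working (the archive path is the one from my archive_command above):
# on the current primary (test-node1): confirm new WAL segments are landing on the standby
ssh test-node2 'ls -lt /db/data/postgresql/9.2/pg_archive/ | head'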
Node2 was the PRI... so last night node1, which had been in HS:sync, was promoted to PRI, and node2 is now stopped. The WAL files are arriving on node2 from node1. I cleaned up the /tmp/PGSQL.lock file and proceeded with a pg_basebackup restore from node1. This all went through without error in the node1 PostgreSQL log.
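Roughly, the resync steps I ran on node2 were along these lines (the data directory path and the replication user are placeholders from my environment, not exact values):
# on test-node2, with the pgsql resource stopped
rm /tmp/PGSQL.lock                                                  # stale lock file left by the pgsql RA after the failover
mv /db/data/postgresql/9.2/main /db/data/postgresql/9.2/main.old    # data directory path is a placeholder
pg_basebackup -h test-node1 -U postgres -D /db/data/postgresql/9.2/main -X stream -P -v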
After running a crm cleanup on the msPostgresql resource, node2 keeps showing 'LATEST' but gets hung up at HS:alone. I also don't understand why node2's pgsql-xlog-loc shows 0000001EB9053DD8, which is ahead of node1's master-baseline of 0000001EB2000080. I do see the 'cannot stat ... 000000010000001E000000BB' error, but that seems to always happen for the current xlog filename. Manually copying the missing WAL file from the PRI does not help.
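For clarity, the cleanup and the manual copy were along these lines (the pg_xlog path on node1 is a placeholder; the segment name is the one from the error in the node2 log below):
# let Pacemaker forget the failure history and re-probe the resource
crm resource cleanup msPostgresql
# copy the segment the standby complains about from the primary's pg_xlog into the standby's archive dir
scp test-node1:/path/to/pgdata/pg_xlog/000000010000001E000000BB /db/data/postgresql/9.2/pg_archive/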
And as if I wasn't confused enough, the PostgreSQL log on node2 says "streaming replication successfully connected to primary", and the pg_stat_replication query on node1 shows the standby connected, but ASYNC.
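The only sync-related thing I know to check on the primary is this (the pgsql RA is supposed to manage the setting itself, so I may be looking in the wrong place):
# on test-node1
psql -U postgres -c 'SHOW synchronous_standby_names;'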
Any ideas?
Very much appreciated!
-With kind regards,
Peter Brunnengräber
References:
(1) http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#after_fail-over
###
============
Last updated: Wed Jul 13 14:51:53 2016
Last change: Wed Jul 13 14:49:17 2016 via crmd on test-node2
Stack: openais
Current DC: test-node1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ test-node1 test-node2 ]
Full list of resources:
Resource Group: g_master
ClusterIP-Net1 (ocf::heartbeat:IPaddr2): Started test-node1
ReplicationIP-Net2 (ocf::heartbeat:IPaddr2): Started test-node1
Master/Slave Set: msPostgresql [pgsql]
Masters: [ test-node1 ]
Slaves: [ test-node2 ]
Node Attributes:
* Node test-node1:
+ master-pgsql:0 : 1000
+ master-pgsql:1 : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0000001EB2000080
+ pgsql-status : PRI
* Node test-node2:
+ master-pgsql:0 : -INFINITY
+ master-pgsql:1 : -INFINITY
+ pgsql-data-status : LATEST
+ pgsql-status : HS:alone
+ pgsql-xlog-loc : 0000001EB9053DD8
Migration summary:
* Node test-node2:
* Node test-node1:
#### Node2
2016-07-13 14:55:09 UTC LOG: database system was interrupted; last known up at 2016-07-13 14:54:27 UTC
2016-07-13 14:55:09 UTC LOG: creating missing WAL directory "pg_xlog/archive_status"
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
2016-07-13 14:55:09 UTC LOG: entering standby mode
2016-07-13 14:55:09 UTC LOG: restored log file "000000010000001E000000BA" from archive
2016-07-13 14:55:09 UTC FATAL: the database system is starting up
2016-07-13 14:55:09 UTC LOG: redo starts at 1E/BA000020
2016-07-13 14:55:09 UTC LOG: consistent recovery state reached at 1E/BA05FED8
2016-07-13 14:55:09 UTC LOG: database system is ready to accept read only connections
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/000000010000001E000000BB': No such file or directory
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
2016-07-13 14:55:09 UTC LOG: streaming replication successfully connected to primary
#### Node1
postgres=# select application_name,upper(state),upper(sync_state) from pg_stat_replication;
+------------------+-----------+-------+
| application_name | upper | upper |
+------------------+-----------+-------+
| test-node2 | STREAMING | ASYNC |
+------------------+-----------+-------+
(1 row)