[ClusterLabs] Replicated PGSQL woes
Israel Brewster
israel at ravnalaska.net
Fri Oct 14 16:25:51 UTC 2016
On Oct 14, 2016, at 12:30 AM, Keisuke MORI <keisuke.mori+ha at gmail.com> wrote:
>
> 2016-10-14 2:04 GMT+09:00 Israel Brewster <israel at ravnalaska.net>:
>> Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql
>> starts initially, but failover never happens.
>
>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: INFO: Master does not
>> exist.
>> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: WARNING: My data is
>> out-of-date. status=DISCONNECT
>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: INFO: Master does not
>> exist.
>> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: WARNING: My data is
>> out-of-date. status=DISCONNECT
>>
>> Those last two lines repeat indefinitely, but there is no indication that
>> the cluster ever tries to promote centtest1 to master. Even if I completely
>> shut down the cluster, and bring it back up only on centtest1, pacemaker
>> refuses to start postgresql on centtest1 as a master.
>
> This is because the data on centtest1 is considered "out-of-date"-ed
> (as it says :) and and promoting the node to master might corrupt your
> database.
Ok, that makes sense. So the problem is why the cluster thinks the data is out-of-date
>
>>
>> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
>>
>
> It seems that the latest data should be only on centtest2 so the
> recovering steps should be something like:
> - start centtest2 as master
> - take the basebackup from centtest2 to centtest1
> - start centtest1 as slave
> - make sure the replications is working properly
I've done that. Several times. The replication works properly with either node as the master. Initially I had started centtest1 as master, because that's where I was planning to *have* the master, however when pacemaker keep insisting on starting centtest2 as the master, I also tried setting things up that way. No luck: everything works fine, but no failover.
>
> see below for details.
> http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster <http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster>
Yep, that's where I started from on this little adventure :-)
>
>
> Also, it would be helpful to check 'pgsql-data-status' and
> 'pgsql-status' attributes displayed by 'crm_mon -A' to diagnose
> whether the replications is going well or not.
>
> The slave node should have the attributes like below, otherwise the
> replications is going something wrong and the node will never be
> promoted because it does not have the proper data.
>
> ```
> * Node node2:
> + master-pgsql : 100
> + pgsql-data-status : STREAMING|SYNC
> + pgsql-status : HS:sync
> ```
Now THAT is interesting. I get this:
Node Attributes:
* Node centtest1.ravnalaska.net:
+ master-pgsql_96 : -INFINITY
+ pgsql_96-data-status : DISCONNECT
+ pgsql_96-status : HS:alone
* Node centtest2.ravnalaska.net:
+ master-pgsql_96 : 1000
+ pgsql_96-data-status : LATEST
+ pgsql_96-master-baseline : 00000000070171D0
+ pgsql_96-status : PRI
...Which seems to indicate that pacemaker doesn't think centtest1 is connected to or replicating centtest2 (if I am interpreting that correctly). And yet, it is: From postgres itself:
[root at CentTest2 ~]# /usr/pgsql-9.6/bin/psql -h centtest2 -U postgres
psql (9.6.0)
Type "help" for help.
postgres=# SELECT * FROM pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
centtest_2_slot | | physical | | | t | 27230 | 1685 | | 0/7017438 |
(1 row)
postgres=#
Notice that "active" is true, indicating that the slot is connected and, well, active. Plus, from the postgresql log on centtest1:
< 2016-10-14 08:19:38.278 AKDT > LOG: entering standby mode
< 2016-10-14 08:19:38.285 AKDT > LOG: consistent recovery state reached at 0/7017358
< 2016-10-14 08:19:38.285 AKDT > LOG: redo starts at 0/7017358
< 2016-10-14 08:19:38.285 AKDT > LOG: invalid record length at 0/7017438: wanted 24, got 0
< 2016-10-14 08:19:38.286 AKDT > LOG: database system is ready to accept read only connections
< 2016-10-14 08:19:38.292 AKDT > LOG: started streaming WAL from primary at 0/7000000 on timeline 1
And furthermore, if I insert/change records on centtest2, those changes *do* show up on centtest1. So everything I can see says postgresql on centtest1 *is* connected and replicating properly, but the data status shows DISCONNECT and the service status shows HS:alone. So obviously something is wrong here.
-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------
>
>
>
> --
> Keisuke MORI
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20161014/70f9e5ae/attachment.htm>
More information about the Users
mailing list