<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">On Oct 14, 2016, at 12:30 AM, Keisuke MORI &lt;<a href="mailto:keisuke.mori+ha@gmail.com" class="">keisuke.mori+ha@gmail.com</a>&gt; wrote:<br class=""><div><blockquote type="cite" class=""><br class="Apple-interchange-newline"><div class=""><div class="">2016-10-14 2:04 GMT+09:00 Israel Brewster &lt;<a href="mailto:israel@ravnalaska.net" class="">israel@ravnalaska.net</a>&gt;:<br class=""><blockquote type="cite" class="">Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql<br class="">starts initially, but failover never happens.<br class=""></blockquote><br class=""><blockquote type="cite" class="">Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: INFO: Master does not<br class="">exist.<br class="">Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: WARNING: My data is<br class="">out-of-date. status=DISCONNECT<br class="">Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: INFO: Master does not<br class="">exist.<br class="">Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: WARNING: My data is<br class="">out-of-date. status=DISCONNECT<br class=""><br class="">Those last two lines repeat indefinitely, but there is no indication that<br class="">the cluster ever tries to promote centtest1 to master. Even if I completely<br class="">shut down the cluster, and bring it back up only on centtest1, pacemaker<br class="">refuses to start postgresql on centtest1 as a master.<br class=""></blockquote><br class="">This is because the data on centtest1 is considered "out-of-date"-ed<br class="">(as it says :) and and promoting the node to master might corrupt your<br class="">database.<br class=""></div></div></blockquote><div><br class=""></div><div>Ok, that makes sense. 
So the problem is why the cluster thinks the data is out-of-date.</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class=""><blockquote type="cite" class=""><br class="">What can I do to fix this? What troubleshooting steps can I follow? Thanks.<br class=""><br class=""></blockquote><br class="">It seems that the latest data should be only on centtest2 so the<br class="">recovering steps should be something like:<br class=""> - start centtest2 as master<br class=""> - take the basebackup from centtest2 to centtest1<br class=""> - start centtest1 as slave<br class=""> - make sure the replications is working properly<br class=""></div></div></blockquote><div><br class=""></div><div>I've done that. Several times. The replication works properly with either node as the master. Initially I had started centtest1 as master, because that's where I was planning to *have* the master; however, when pacemaker kept insisting on starting centtest2 as the master, I also tried setting things up that way. 
No luck: everything works fine, but no failover.</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class="">see below for details.<br class=""><a href="http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster" class="">http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster</a></div></div></blockquote><br class=""></div><div>Yep, that's where I started from on this little adventure :-)</div><div><br class=""></div><div><blockquote type="cite" class=""><div class=""><div class=""><br class=""><br class="">Also, it would be helpful to check 'pgsql-data-status' and<br class="">'pgsql-status' attributes displayed by 'crm_mon -A' to diagnose<br class="">whether the replications is going well or not.<br class=""><br class="">The slave node should have the attributes like below, otherwise the<br class="">replications is going something wrong and the node will never be<br class="">promoted because it does not have the proper data.<br class=""><br class="">```<br class="">* Node node2:<br class=""> + master-pgsql : 100<br class=""> + pgsql-data-status : STREAMING|SYNC<br class=""> + pgsql-status : HS:sync<br class="">```<br class=""></div></div></blockquote><div><br class=""></div><div>Now THAT is interesting. I get this:</div><div><br class=""></div><div>Node Attributes:<br class="">* Node <a href="http://centtest1.ravnalaska.net" class="">centtest1.ravnalaska.net</a>:<br class=""> + master-pgsql_96 : -INFINITY <br class=""> + pgsql_96-data-status : DISCONNECT<br class=""> + pgsql_96-status : HS:alone <br class="">* Node <a href="http://centtest2.ravnalaska.net" class="">centtest2.ravnalaska.net</a>:<br class=""> + master-pgsql_96 : 1000<br class=""> + pgsql_96-data-status : LATEST <br class=""> + pgsql_96-master-baseline : 00000000070171D0<br class=""> + pgsql_96-status : PRI<br class=""><br class="">...Which seems to indicate that pacemaker doesn't think centtest1 is connected to or replicating centtest2 (if I am interpreting that correctly). 
And yet, it is: From postgres itself:</div><div><br class=""></div><div>[root@CentTest2 ~]# /usr/pgsql-9.6/bin/psql -h centtest2 -U postgres<br class="">psql (9.6.0)<br class="">Type "help" for help.<br class=""><br class="">postgres=# SELECT * FROM pg_replication_slots;<br class=""> slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn <br class="">-----------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------<br class=""> centtest_2_slot | | physical | | | t | 27230 | 1685 | | 0/7017438 | <br class="">(1 row)<br class=""><br class="">postgres=# </div><div><br class=""></div><div>Notice that "active" is true, indicating that the slot is connected and, well, active. Plus, from the postgresql log on centtest1:</div><div><br class=""></div><div>< 2016-10-14 08:19:38.278 AKDT > LOG: entering standby mode<br class="">< 2016-10-14 08:19:38.285 AKDT > LOG: consistent recovery state reached at 0/7017358<br class="">< 2016-10-14 08:19:38.285 AKDT > LOG: redo starts at 0/7017358<br class="">< 2016-10-14 08:19:38.285 AKDT > LOG: invalid record length at 0/7017438: wanted 24, got 0<br class="">< 2016-10-14 08:19:38.286 AKDT > LOG: database system is ready to accept read only connections<br class="">< 2016-10-14 08:19:38.292 AKDT > LOG: started streaming WAL from primary at 0/7000000 on timeline 1</div><div><br class=""></div><div>And furthermore, if I insert/change records on centtest2, those changes *do* show up on centtest1. So everything I can see says postgresql on centtest1 *is* connected and replicating properly, but the data status shows DISCONNECT and the service status shows HS:alone. 
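<div class=""><br class=""></div><div class="">To make the mismatch easier to spot on repeated runs, here is a minimal Python sketch (standard library only) that parses the "Node Attributes" block printed by 'crm_mon -A' and lists the nodes pacemaker will not promote. The helper names parse_node_attributes and unpromotable are hypothetical, not part of pacemaker or the pgsql resource agent, and the parsing assumes the exact layout pasted above. It only checks the two conditions visible in my output: a master score of -INFINITY and a data status of DISCONNECT.</div>
<pre class="">
```python
import re

# Sample copied from the 'crm_mon -A' output quoted in this message.
SAMPLE = """\
Node Attributes:
* Node centtest1.ravnalaska.net:
    + master-pgsql_96                 : -INFINITY
    + pgsql_96-data-status            : DISCONNECT
    + pgsql_96-status                 : HS:alone
* Node centtest2.ravnalaska.net:
    + master-pgsql_96                 : 1000
    + pgsql_96-data-status            : LATEST
    + pgsql_96-master-baseline        : 00000000070171D0
    + pgsql_96-status                 : PRI
"""


def parse_node_attributes(text):
    """Return {node: {attribute: value}} from a 'Node Attributes' block."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # A node header looks like "* Node some.host.name:"
        node = re.match(r"\*\s+Node\s+(\S+?):\s*$", line)
        if node:
            current = nodes.setdefault(node.group(1), {})
            continue
        # An attribute looks like "+ name : value"
        attr = re.match(r"\+\s+(\S+)\s*:\s*(\S*)\s*$", line)
        if attr and current is not None:
            current[attr.group(1)] = attr.group(2)
    return nodes


def unpromotable(nodes, resource="pgsql_96"):
    """Return {node: [reasons]} for nodes that show a blocking attribute.

    Assumption: attribute names follow the pattern seen above,
    i.e. master-RESOURCE and RESOURCE-data-status.
    """
    problems = {}
    for name, attrs in nodes.items():
        reasons = []
        if attrs.get("master-" + resource) == "-INFINITY":
            reasons.append("master score is -INFINITY")
        if attrs.get(resource + "-data-status") == "DISCONNECT":
            reasons.append("data status is DISCONNECT")
        if reasons:
            problems[name] = reasons
    return problems


if __name__ == "__main__":
    for node, reasons in unpromotable(parse_node_attributes(SAMPLE)).items():
        print(node, "-", "; ".join(reasons))
```
</pre>
<div class="">Fed the output above, it flags only centtest1, which matches what pacemaker is doing even though postgres itself reports an active replication slot.</div><div class=""><br class=""></div>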
So obviously something is wrong here.</div><div><br class=""></div><div><br class=""></div><div><div style="text-align: -webkit-auto; font-variant-ligatures: normal; font-variant-position: normal; font-variant-numeric: normal; font-variant-alternates: normal; font-variant-east-asian: normal; line-height: normal; orphans: 2; widows: 2; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">
<div style="font-family: Helvetica, sans-serif;" class=""><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class="">-----------------------------------------------</span></div></div>
<div style="font-family: Helvetica, sans-serif;" class=""><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class="">Israel Brewster</span></div></div>
<div style="font-family: Helvetica, sans-serif;" class=""><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class="">Systems Analyst II</span></div></div>
<div style="font-family: Helvetica, sans-serif;" class=""><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class="">Ravn Alaska</span></div></div>
<div style="font-family: Helvetica, sans-serif;" class=""><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class="">5245 Airport Industrial Rd</span></div></div>
<div style="font-family: Helvetica, sans-serif;" class=""><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class="">Fairbanks, AK 99709</span></div></div>
<div style="font-family: Helvetica, sans-serif;" class=""><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class="">(907) 450-7293</span></div></div>
<div style="font-family: Helvetica, sans-serif;" class=""><div style="margin: 0in 0in 0.0001pt; font-size: 12pt; font-family: 'Times New Roman', serif;" class=""><span style="font-size: 9pt; font-family: Helvetica, sans-serif;" class="">-----------------------------------------------</span></div></div>
</div></div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class=""><br class=""><br class="">-- <br class="">Keisuke MORI<br class=""><br class="">_______________________________________________<br class="">Users mailing list: <a href="mailto:Users@clusterlabs.org" class="">Users@clusterlabs.org</a><br class=""><a href="http://clusterlabs.org/mailman/listinfo/users" class="">http://clusterlabs.org/mailman/listinfo/users</a><br class=""><br class="">Project Home: http://www.clusterlabs.org<br class="">Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf<br class="">Bugs: http://bugs.clusterlabs.org<br class=""></div></div></blockquote></div><br class=""></body></html>