On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> On Thu, 13 Oct 2016 10:05:33 -0800
> Israel Brewster <israel@ravnalaska.net> wrote:
>
>> On Oct 13, 2016, at 9:41 AM, Ken Gaillot <kgaillot@redhat.com> wrote:
>>>
>>> On 10/13/2016 12:04 PM, Israel Brewster wrote:
> [...]
>>>> But whatever - this is a cluster, it doesn't really matter which node
>>>> things are running on, as long as they are running. So the cluster is
>>>> working - postgresql starts, the master process is on the same node as
>>>> the IP, you can connect, etc., everything looks good. Obviously the next
>>>> thing to try is failover: should the master node fail, the slave node
>>>> should be promoted to master. So I try testing this by shutting down the
>>>> cluster on the primary server: "pcs cluster stop"
>>>> ...and nothing happens. The master shuts down (uncleanly, I might add -
>>>> it leaves behind a lock file that prevents it from starting again until
>>>> I manually remove said lock file), but the slave is never promoted to
>>>
>>> This definitely needs to be corrected. What creates the lock file, and
>>> how is that entity managed?
>>
>> The lock file is created/managed by the postgresql process itself. On
>> launch, postgres creates the lock file to say it is running, and deletes
>> said lock file when it shuts down. To my understanding, its role in life
>> is to prevent a restart after an unclean shutdown, so the admin is
>> reminded to make sure the data is in a consistent state before starting
>> the server again.
>
> What is the name of this lock file? Where is it?
>
> PostgreSQL does not create a lock file. It creates a "postmaster.pid" file,
> but that does not forbid a startup if the new process doesn't find another
> process with the pid and shm shown in the postmaster.pid.
>
> As far as I know, the pgsql resource agent creates such a lock file on
> promote and deletes it on graceful stop. If the PostgreSQL instance couldn't
> be stopped correctly, the lock file stays and the RA refuses to start it the
> next time.

Ah, you're right. Looking at the RA, I see where it creates the file in question. The delete appears to be in the pgsql_real_stop() function (which makes sense), wrapped in an if block that checks for $1 being "master" and $OCF_RESKEY_CRM_meta_notify_slave_uname being a space.
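To see what that test is actually getting, I threw a little debugging code in there: nothing fancy, just an extra ocf_log line ahead of the check, roughly along these lines (a sketch, not my exact line; it assumes only the standard ocf_log helper the RA already uses):

    # Hypothetical debug line, dropped into pgsql_real_stop() just above the
    # lock-file test, to dump what the condition is actually comparing:
    ocf_log info "DEBUG pgsql_real_stop: \$1='$1' slave_uname='$OCF_RESKEY_CRM_meta_notify_slave_uname'"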
With that in place, I can see that when it hits that block on a cluster stop, $OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net, not a space, so the lock file is never removed:

    if [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; then
        ocf_log info "Removing $PGSQL_LOCK."
        rm -f $PGSQL_LOCK
    fi

It doesn't look like there is anywhere else the file would be removed.

> [...]
>>>> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
>
> I cannot find the result of the stop operation in your log files; maybe the
> log from CentTest2 would be more useful.

Sure. I was looking at centtest1 because I was trying to figure out why it wouldn't promote, but if centtest2 never properly stopped, that could explain things. Here's the log from centtest2 when calling pcs cluster stop:

Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sending flush op to all hosts for: standby (true)
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sent update 26: standby=true
Oct 13 14:05:14 CentTest2 pacemaker: Waiting for shutdown of managed resources
Oct 13 14:05:14 CentTest2 crmd[9426]: notice: Operation pgsql_96_notify_0: ok (node=centtest2.ravnalaska.net, call=21, rc=0, cib-update=0, confirmed=true)
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sending flush op to all hosts for: master-pgsql_96 (-INFINITY)
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sent update 28: master-pgsql_96=-INFINITY
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sending flush op to all hosts for: pgsql_96-master-baseline (<null>)
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sent delete 30: node=centtest2.ravnalaska.net, attr=pgsql_96-master-baseline, id=<n/a>, set=(null), section=status
Oct 13 14:05:14 CentTest2 attrd[9424]: notice: Sent delete 32: node=centtest2.ravnalaska.net, attr=pgsql_96-master-baseline, id=<n/a>, set=(null), section=status
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: Stopping PostgreSQL on demote.
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: stop_escalate(or stop_escalate_in_slave) time is adjusted to 50 based on the configured timeout.
Oct 13 14:05:14 CentTest2 pgsql(pgsql_96)[5107]: INFO: server shutting down
Oct 13 14:05:15 CentTest2 pgsql(pgsql_96)[5107]: INFO: PostgreSQL is down
Oct 13 14:05:15 CentTest2 pgsql(pgsql_96)[5107]: INFO: Changing pgsql_96-status on centtest2.ravnalaska.net : PRI->STOP.
Oct 13 14:05:15 CentTest2 attrd[9424]: notice: Sending flush op to all hosts for: pgsql_96-status (STOP)
Oct 13 14:05:15 CentTest2 attrd[9424]: notice: Sent update 34: pgsql_96-status=STOP
Oct 13 14:05:15 CentTest2 crmd[9426]: notice: Operation pgsql_96_demote_0: ok (node=centtest2.ravnalaska.net, call=23, rc=0, cib-update=20, confirmed=true)
Oct 13 14:05:15 CentTest2 crmd[9426]: notice: Operation pgsql_96_notify_0: ok (node=centtest2.ravnalaska.net, call=24, rc=0, cib-update=0, confirmed=true)
Oct 13 14:05:15 CentTest2 IPaddr2(virtual_ip)[5360]: INFO: IP status = ok, IP_CIP=
Oct 13 14:05:15 CentTest2 crmd[9426]: notice: Operation virtual_ip_stop_0: ok (node=centtest2.ravnalaska.net, call=26, rc=0, cib-update=21, confirmed=true)
Oct 13 14:05:15 CentTest2 crmd[9426]: notice: Operation pgsql_96_notify_0: ok (node=centtest2.ravnalaska.net, call=27, rc=0, cib-update=0, confirmed=true)
Oct 13 14:05:15 CentTest2 pgsql(pgsql_96)[5497]: INFO: PostgreSQL is already stopped.
Oct 13 14:05:15 CentTest2 crmd[9426]: notice: Operation pgsql_96_stop_0: ok (node=centtest2.ravnalaska.net, call=28, rc=0, cib-update=22, confirmed=true)
Oct 13 14:05:16 CentTest2 pacemaker: Leaving fence domain
Oct 13 14:05:16 CentTest2 pacemaker: Stopping fenced 9227
Oct 13 14:05:16 CentTest2 pacemaker: Signaling Pacemaker Cluster Manager to terminate
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Shutting down Pacemaker
Oct 13 14:05:16 CentTest2 pacemaker: Waiting for cluster services to unload
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping crmd: Sent -15 to process 9426
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: Requesting shutdown, upper limit is 1200000ms
Oct 13 14:05:16 CentTest2 attrd[9424]: notice: Update relayed from centtest1.ravnalaska.net
Oct 13 14:05:16 CentTest2 attrd[9424]: notice: Sending flush op to all hosts for: shutdown (1476396316)
Oct 13 14:05:16 CentTest2 attrd[9424]: notice: Sent update 37: shutdown=1476396316
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: State transition S_NOT_DC -> S_STOPPING [ input=I_STOP cause=C_HA_MESSAGE origin=route_message ]
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: Stopped 0 recurring operations at shutdown... waiting (0 ops remaining)
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: Disconnected from the LRM
Oct 13 14:05:16 CentTest2 crmd[9426]: notice: Disconnecting from Corosync
Oct 13 14:05:16 CentTest2 pengine[9425]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping pengine: Sent -15 to process 9425
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping attrd: Sent -15 to process 9424
Oct 13 14:05:16 CentTest2 attrd[9424]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 attrd[9424]: notice: Exiting...
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping lrmd: Sent -15 to process 9423
Oct 13 14:05:16 CentTest2 lrmd[9423]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping stonith-ng: Sent -15 to process 9422
Oct 13 14:05:16 CentTest2 stonith-ng[9422]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Stopping cib: Sent -15 to process 9421
Oct 13 14:05:16 CentTest2 cib[9421]: warning: new_event_notification (9421-9422-12): Broken pipe (32)
Oct 13 14:05:16 CentTest2 cib[9421]: warning: Notification of client crmd/efe7e6d2-3a51-4e0d-a4c5-cb986a8711d8 failed
Oct 13 14:05:16 CentTest2 cib[9421]: notice: Invoking handler for signal 15: Terminated
Oct 13 14:05:16 CentTest2 cib[9421]: notice: Disconnecting from Corosync
Oct 13 14:05:16 CentTest2 cib[9421]: notice: Disconnecting from Corosync
Oct 13 14:05:16 CentTest2 pacemakerd[9415]: notice: Shutdown complete
Oct 13 14:05:17 CentTest2 kernel: dlm: closing connection to node 2
Oct 13 14:05:17 CentTest2 kernel: dlm: closing connection to node 1
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Unloading all Corosync service engines.
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync configuration service
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync profile loading service
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
Oct 13 14:05:18 CentTest2 corosync[9163]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Oct 13 14:05:18 CentTest2 corosync[9163]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:2055.
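Note that every step of the shutdown above reports ok (notify, demote, stop), yet, as I mentioned, the lock file is still left behind on the old master, and the RA won't start PostgreSQL there again until it's removed by hand. For completeness, that manual cleanup is just the following; the path is my assumption from the RA's tmpdir default, so adjust if you've set tmpdir:

    # $PGSQL_LOCK lives under the RA's tmpdir, /var/lib/pgsql/tmp by default
    # (if I'm reading the RA right):
    ls -l /var/lib/pgsql/tmp/PGSQL.lock
    rm /var/lib/pgsql/tmp/PGSQL.lock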
class=""> Oct 13 08:29:41 CentTest1 pengine[30095]: notice: Scheduling Node<br class=""> <a href="http://centtest2.ravnalaska.net" class="">centtest2.ravnalaska.net</a> for shutdown<br class=""><br class="">Which means the stop operation probably raised an error, leading to a fencing<br class="">of the node. In this circumstance, I bet PostgreSQL wasn't able to stop<br class="">correctly and the lock file stayed in place.<br class=""><br class="">Could you please show us your full cluster setup?<br class=""></div></div></blockquote><div><br class=""></div><div>Sure: how? pcs status shows this, but I suspect that's not what you are asking about:</div><div><br class=""></div>Cluster name: cluster_test<br class="">Last updated: Thu Oct 13 14:09:29 2016<span class="Apple-tab-span" style="white-space:pre"> </span>Last change: Thu Oct 13 12:26:55 2016 by root via crmd on <a href="http://centtest1.ravnalaska.net" class="">centtest1.ravnalaska.net</a><br class="">Stack: cman<br class="">Current DC: <a href="http://centtest1.ravnalaska.net" class="">centtest1.ravnalaska.net</a> (version 1.1.14-8.el6_8.1-70404b0) - partition with quorum<br class="">2 nodes and 3 resources configured<br class=""><br class="">Online: [ <a href="http://centtest1.ravnalaska.net" class="">centtest1.ravnalaska.net</a> <a href="http://centtest2.ravnalaska.net" class="">centtest2.ravnalaska.net</a> ]<br class=""><br class="">Full list of resources:<br class=""><br class=""> virtual_ip<span class="Apple-tab-span" style="white-space:pre"> </span>(ocf::heartbeat:IPaddr2):<span class="Apple-tab-span" style="white-space:pre"> </span>Started <a href="http://centtest2.ravnalaska.net" class="">centtest2.ravnalaska.net</a><br class=""> Master/Slave Set: msPostgresql [pgsql_96]<br class=""> Masters: [ <a href="http://centtest2.ravnalaska.net" class="">centtest2.ravnalaska.net</a> ]<br class=""> Slaves: [ <a href="http://centtest1.ravnalaska.net" class="">centtest1.ravnalaska.net</a> ]<br class=""><br class="">PCSD Status:<br class=""> <a href="http://centtest1.ravnalaska.net" class="">centtest1.ravnalaska.net</a>: Online<br class=""> <a href="http://centtest2.ravnalaska.net" class="">centtest2.ravnalaska.net</a>: Online</div><div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><br class="">By the way, did you had a look to the PAF project? <br class=""><br class=""> <a href="http://dalibo.github.io/PAF/" class="">http://dalibo.github.io/PAF/</a><br class=""> <a href="http://dalibo.github.io/PAF/documentation.html" class="">http://dalibo.github.io/PAF/documentation.html</a><br class=""><br class="">The v1.1 version for EL6 is not ready yet, but you might want to give it a<br class="">try: <a href="https://github.com/dalibo/PAF/tree/v1.1" class="">https://github.com/dalibo/PAF/tree/v1.1</a><br class=""><br class="">I would recommend EL7 and PAF 2.0, published, packaged, ready to use.<br class=""></div></div></blockquote><div><br class=""></div>First I've heard of it. I'll give it a look.</div><div><br class=""></div><div>Thanks!<br class=""><div><br class=""></div><br class=""><blockquote type="cite" class=""><div class=""><div class="">Regards,<br class=""><br class="">-- <br class="">Jehan-Guillaume (ioguix) de Rorthais<br class="">Dalibo<br class=""></div></div></blockquote></div><br class=""></body></html>