[ClusterLabs] Replicated PGSQL woes

Thu Oct 13 21:56:06 UTC 2016

On Thu, 13 Oct 2016 10:05:33 -0800
Israel Brewster <israel at ravnalaska.net> wrote:

> On Oct 13, 2016, at 9:41 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> > 
> > On 10/13/2016 12:04 PM, Israel Brewster wrote:  
[...]

> >> But whatever- this is a cluster, it doesn't really matter which node
> >> things are running on, as long as they are running. So the cluster is
> >> working - postgresql starts, the master process is on the same node as
> >> the IP, you can connect, etc, everything looks good. Obviously the next
> >> thing to try is failover - should the master node fail, the slave node
> >> should be promoted to master. So I try testing this by shutting down the
> >> cluster on the primary server: "pcs cluster stop"
> >> ...and nothing happens. The master shuts down (uncleanly, I might add -
> >> it leaves behind a lock file that prevents it from starting again until
> >> I manually remove said lock file), but the slave is never promoted to  
> > 
> > This definitely needs to be corrected. What creates the lock file, and
> > how is that entity managed?  
> 
> The lock file entity is created/managed by the postgresql process itself. On
> launch, postgres creates the lock file to say it is running, and deletes said
> lock file when it shuts down. To my understanding, its role in life is to
> prevent a restart after an unclean shutdown so the admin is reminded to make
> sure that the data is in a consistent state before starting the server again.

What is the name of this lock file? Where is it?

PostgreSQL does not create a lock file. It creates a "postmaster.pid" file, but
that file does not prevent a startup if the new process finds no other process
with the pid and shm recorded in postmaster.pid.
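
A minimal sketch of the pid half of that check, assuming a POSIX system and
that the first line of postmaster.pid holds the postmaster's PID. The function
names are made up for illustration, and the shared memory check that PostgreSQL
also performs is left out:

```python
import os
from pathlib import Path


def read_postmaster_pid(pidfile):
    """Return the positive PID on the first line of the file, or None."""
    try:
        first_line = Path(pidfile).read_text().splitlines()[0]
        pid = int(first_line.strip())
    except (OSError, IndexError, ValueError):
        return None  # missing, unreadable, empty or malformed file
    return pid if pid > 0 else None


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def pidfile_is_stale(pidfile):
    """True if the file holds a valid PID that names no running process."""
    pid = read_postmaster_pid(pidfile)
    return pid is not None and not pid_is_alive(pid)
```

Calling pidfile_is_stale() on the postmaster.pid in your data directory (for
example /var/lib/pgsql/data/postmaster.pid, which is only a guess at your
layout) tells you whether the file is a leftover from a dead instance.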

As far as I know, the pgsql resource agent creates such a lock file on promote
and deletes it on graceful stop. If the PostgreSQL instance couldn't be stopped
correctly, the lock file stays and the RA refuses to start it the next time.
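
To make that behaviour concrete, here is a toy model of such a lock file
protocol. It is not the resource agent's actual code (the RA is a shell
script); the class name, method names and lock path are all hypothetical:

```python
from pathlib import Path


class LockFileGuard:
    """Toy model of a promote-time lock file (not the pgsql RA's real code)."""

    def __init__(self, lock_path):
        self.lock_path = Path(lock_path)  # hypothetical location

    def promote(self):
        # Becoming master: leave a marker behind.
        self.lock_path.touch()

    def stop(self, graceful):
        # Only a clean stop removes the marker.
        if graceful:
            self.lock_path.unlink(missing_ok=True)

    def can_start(self):
        # A leftover marker means the last master did not stop cleanly.
        return not self.lock_path.exists()
```

In this model, promote() followed by stop(graceful=False) leaves can_start()
returning False until someone removes the file by hand, which matches the
symptom described above.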

[...]
> >> What can I do to fix this? What troubleshooting steps can I follow? Thanks.

I cannot find the result of the stop operation in your log files; maybe the
log from CentTest2 would be more useful. But I can find this:

  Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
  centtest2.ravnalaska.net for shutdown
  ...
  Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
  centtest2.ravnalaska.net for shutdown

This means the stop operation probably raised an error, leading to the node
being fenced. In that case, I bet PostgreSQL wasn't able to stop correctly and
the lock file stayed in place.

Could you please show us your full cluster setup?

By the way, did you have a look at the PAF project?

  http://dalibo.github.io/PAF/
  http://dalibo.github.io/PAF/documentation.html

The v1.1 version for EL6 is not ready yet, but you might want to give it a
try: https://github.com/dalibo/PAF/tree/v1.1

I would recommend EL7 and PAF 2.0, which is published, packaged and ready to
use.

Regards,

-- 
Jehan-Guillaume (ioguix) de Rorthais
Dalibo