[ClusterLabs] [EXTERNAL] Re: Postgres PAF setup

Mon Apr 30 15:21:01 UTC 2018

Sorry for the delay.  Here is my corosync.log file.
I have tried making the changes that you requested, but still no luck. When I configured the cluster to use pgsqld instead of pgsqlms, I could at least get the cluster to start, but it was starting the instance as a master on both nodes.

Thanks for your help...

Andrew A Edenburn
General Motors
Hyperscale Computing & Core Engineering
Mobile Phone: +01-810-410-6008
30009 Van Dyke Ave
Warren, MI. 48090-9026
Cube: 2w05-21
mailto:andrew.edenburn at gm.com
Web Connect SoftPhone 586-986-4864

-----Original Message-----
From: Jehan-Guillaume (ioguix) de Rorthais [mailto:ioguix at free.fr]
Sent: Tuesday, April 24, 2018 11:09 AM
To: Andrew Edenburn <andrew.edenburn at gm.com>
Cc: pgsql-general at postgresql.org; users at clusterlabs.org
Subject: [EXTERNAL] Re: Postgres PAF setup

On Mon, 23 Apr 2018 18:09:43 +0000
Andrew Edenburn <andrew.edenburn at gm.com> wrote:

> I am having issues with my PAF setup.  I am new to Postgres and have
> setup the cluster as seen below. I am getting this error when trying
> to start my cluster resources.
> [...]
>
> cleanup and clear is not fixing any issues and I am not seeing
> anything in the logs.  Any help would be greatly appreciated.

This lacks a lot of information.

According to the PAF resource agent, your instances are in an "unexpected state" on both nodes while PAF was actually trying to stop them.

Pacemaker might decide to stop a resource if the start operation fails.
Stopping it after a failed start gives the resource agent a chance to stop the resource gracefully if that is still possible.

I suspect you have some setup mistake on both nodes, maybe the exact same one...

You should probably provide your full logs from pacemaker/corosync with timing information so we can check all the messages coming from PAF from the very beginning of the startup attempt.

>         have-watchdog=false \

You should probably consider setting up a watchdog in your cluster.

>         stonith-enabled=false \

This is really bad. Your cluster will NOT work as expected. PAF **requires** Stonith to be enabled and working properly. Without it, sooner or later, you will experience some unexpected reaction from the cluster (freezing all actions, etc).

>         no-quorum-policy=ignore \

You should not ignore quorum, even in a two node cluster. See "two_node"
parameter in the manual of corosync.conf.
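
As a rough sketch of what that looks like, assuming corosync 2.x with the votequorum provider, the quorum section of corosync.conf would be something like the following. Check the votequorum manual page for your version before relying on it.

```
quorum {
    provider: corosync_votequorum
    two_node: 1
}
```

With two_node set, votequorum also enables wait_for_all by default, so after a full shutdown the cluster first becomes quorate only once both nodes have been seen.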

>         migration-threshold=1 \
> rsc_defaults rsc_defaults-options: \
>         migration-threshold=5 \

The latter is the supported way to set migration-threshold. Your "migration-threshold=1" should not be a cluster property but a default resource option.
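
For illustration only, and in the same crm syntax as your configuration, the options discussed so far might end up looking like this. It is a sketch, not a complete configuration: your other cluster properties are omitted, and it assumes you have also defined and tested a fencing resource, which is not shown here.

```
# no-quorum-policy is left unset so that the default applies
# migration-threshold is no longer listed as a cluster property
property cib-bootstrap-options: \
        stonith-enabled=true
rsc_defaults rsc_defaults-options: \
        migration-threshold=5
```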

> My pcs Config
> Corosync Nodes:
> dcmilphlum223 dcmilphlum224
> Pacemaker Nodes:
> dcmilphlum223 dcmilphlum224
>
> Resources:
> Master: pgsql-ha
>   Meta Attrs: notify=true target-role=Stopped

This target-role might have been set by the cluster because it can not fence nodes (which might be easier to deal with in your situation btw). That means the cluster will keep this resource down because of previous errors.

> recovery_template=/pgsql/data/pg7000/recovery.conf.pcmk

You should probably not put your recovery.conf.pcmk in your PGDATA. The file is different on each node. As you might want to rebuild the standby or the old master after some failures, you would have to correct it each time. Keep it outside of the PGDATA to avoid this useless step.
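
As an example of why the file differs, a template for node dcmilphlum223 might look roughly like this. The host value is a placeholder for whatever address your standbys use to reach the master, and a location such as /pgsql/recovery.conf.pcmk is only a hypothetical choice outside the PGDATA. Neither comes from your setup.

```
# recovery.conf.pcmk for dcmilphlum223 (sketch, values are placeholders)
standby_mode = on
primary_conninfo = 'host=<master-address> application_name=dcmilphlum223'
recovery_target_timeline = 'latest'
```

On dcmilphlum224, application_name would be dcmilphlum224 instead, since PAF expects application_name to match the node name.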

> dcmilphlum224: pgsqld-data-status=LATEST

I suppose this comes from the "pgsql" resource agent, definitely not from PAF...

Regards,

-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log
Type: application/octet-stream
Size: 399338 bytes
Desc: corosync.log
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180430/6686f991/attachment-0001.obj>