[ClusterLabs] Frequent PAF log messages - Forbidding promotion on <node> in state "startup"

Tue May 15 05:50:08 EDT 2018

On Mon, 14 May 2018 19:08:47 +0000
"Shobe, Casey" <Casey.Shobe at sling.com> wrote:

> > We do not trigger an error in such a scenario because it would require the
> > cluster to react, and there's really no way the cluster can solve such an
> > issue. So we just set a negative score, which is already strange enough to
> > be noticed in most situations.
> 
> Where is this negative score to be noticed?

I usually use "crm_mon -frnAo"

* f: show failcounts
* r: show all resources, even inactive ones
* n: group by node instead of resource
* A: show node attributes <- this one should show you the scores
* o: show operation history

Note that you can toggle these options interactively while crm_mon is already
running. Hit 'h' for help.
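
For a non-interactive check, something like the sketch below might help. It
only filters text, so you can feed it the output of "crm_mon -1 -A" (one-shot
mode with node attributes). It assumes the promotion scores show up as node
attributes whose name starts with "master-" (e.g. "master-pgsqld" for a
resource named "pgsqld", a hypothetical name here) and that node headers are
printed as "* Node <name>:". Adjust the patterns if your output differs.

```shell
#!/bin/sh
# Keep only the node headers and the promotion-score attributes found in
# crm_mon output read from stdin.
# Assumption: scores are node attributes prefixed with "master-".
show_scores() {
    grep -E '^\* Node |master-'
}

# Usage on a cluster node:
#   crm_mon -1 -A | show_scores
```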

[...]
> > I advise you to put the recovery.conf.pcmk outside of the PGDATA and use the
> > resource parameter "recovery_template". It would save you one step when
> > dealing with the recovery.conf. But this is the simplest procedure, yes.
> 
> I do this (minus the .pcmk suffix) already, but was just being overly
> paranoid about avoiding a multi-master situation.  I guess there is no need
> for me to manually copy in the recovery.conf.

When you clone from the primary, the copy should not contain a "recovery.conf"
file. It may contain a "recovery.done", but this is not a problem.

When cloning from a standby, I can understand that you might want to be extra
careful and delete the recovery.conf file.

But in either case, on resource start, PAF will create the
"PGDATA/recovery.conf" file based on your template anyway. No need to create it
yourself.
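
If you still want the extra safety after cloning from a standby, a minimal
sketch could look like the following. The function name is made up for the
example, and the PGDATA path you pass to it is yours to fill in. It only
removes a leftover "recovery.conf" and leaves any "recovery.done" alone.

```shell
#!/bin/sh
# Remove a recovery.conf inherited from a cloned standby, if there is one.
# $1: path to the freshly cloned PGDATA (hypothetical helper, not part of PAF)
drop_cloned_recovery_conf() {
    pgdata=$1
    if [ -f "$pgdata/recovery.conf" ]; then
        rm -f -- "$pgdata/recovery.conf"
        echo "removed $pgdata/recovery.conf"
    else
        echo "no recovery.conf in $pgdata"
    fi
}
```

Run it before starting the cluster on the rebuilt node. PAF then creates a
fresh recovery.conf from the template on resource start.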

> > Should you keep the cluster up on this node for some other resources, you
> > could temporarily exclude your pgsql-ha from this node so the cluster stops
> > considering it for this particular node while you rebuild your standby.
> > Here is some inspiration:
> > https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#forbidding-a-paf-resource-on-a-node  
> 
> I was just reading that page before I saw this E-mail.  Another question I
> had though - is how I could deploy a change to the PostgreSQL configuration
> that requires a restart of the service, with minimal service interruption.
> For the moment, I'm assuming I need to, on each standby node, do a `pcs
> cluster stop; pcs cluster start` on each standby, then the same on the master
> which should cause a failover to one of the standby nodes.

According to the pcs manpage, you can restart a resource on one node using:

  pcs resource restart <resource id> <node>

> If I need to change max_connections, though, I'm really not sure what to do,
> since the standby nodes will refuse to replicate from a master with a
> different max_connections setting.

You are missing a subtle detail here: a standby will refuse to start only if
its max_connections is lower than the value on the primary.

So you can change your max_connections:

* to a higher value, starting with the standbys, then the primary
* to a lower value, starting with the primary, then the standbys
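
The ordering rule above can be written down as a tiny helper, e.g. to embed
in a deployment script. This is only a sketch: the function name and the
wording of its output are made up for the example.

```shell
#!/bin/sh
# Print which side to restart first when changing max_connections.
# $1: current max_connections, $2: new max_connections
restart_order() {
    old=$1
    new=$2
    if [ "$new" -gt "$old" ]; then
        # Raising: a standby may have a higher value than its primary.
        echo "standbys then primary"
    elif [ "$new" -lt "$old" ]; then
        # Lowering: the primary must go down first.
        echo "primary then standbys"
    else
        echo "no change"
    fi
}
```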

> On a related note, is there perhaps a pcs command that would issue a sighup
> to the master postgres process across all nodes, for when I change a
> configuration option that only requires a reload?

No. There are old discussions and a patch about such a feature in Pacemaker,
but nothing ended up in core. See:
https://lists.clusterlabs.org/pipermail/pacemaker/2014-February/044686.html

Note that PAF uses a dummy function for the reload action anyway. But we could
easily add a "pg_ctl reload" to it if pcs (or crmsh) allowed triggering it
manually.

Here again, you can rely on Ansible, Salt, plain ssh commands, etc. Either use
"pg_ctl -D <PGDATA> reload" or a simple query like "SELECT pg_reload_conf()".
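
As a rough sketch of the plain ssh variant: the node names below are
placeholders, and it assumes you can ssh to each node as the "postgres" user
and that psql connects locally without a password. Set RUN=echo to print the
commands instead of running them.

```shell
#!/bin/sh
# Ask PostgreSQL to reload its configuration on every node over ssh.
# Assumptions: node names are placeholders, ssh access as "postgres" works.
NODES=${NODES:-"srv1 srv2 srv3"}
RUN=${RUN:-}    # set RUN=echo for a dry run

reload_all() {
    for node in $NODES; do
        $RUN ssh "postgres@$node" "psql -Atc 'SELECT pg_reload_conf()'" \
            || echo "reload failed on $node" >&2
    done
}

# Usage:
#   RUN=echo; reload_all    # show what would be executed
#   RUN=; reload_all        # actually reload
```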

> I was hoping optimistically that pcs+paf included more administrative
> functionality, since the systemctl commands such as reload can no longer be
> used.

It would be nice, I agree.

> Thank you for your assistance!

You are very welcome.