[ClusterLabs] Frequent PAF log messages - Forbidding promotion on <node> in state "startup"

Mon May 14 18:50:08 UTC 2018

On Mon, 14 May 2018 16:43:52 +0000
"Shobe, Casey" <Casey.Shobe at sling.com> wrote:

> Thanks, I should have seen that.  I just assumed that everything was working
> fine because `pcs status` shows no errors.

We do not raise an error in this scenario because that would require the cluster
to react, and there is really no way the cluster can solve such an issue. So we
just set a negative score, which is already unusual enough to be noticed in most
situations.

> This leads me to another question - is there a way to trigger a rebuild of a
> slave with pcs?

Nope, pcs/pacemaker has no such feature. You can either write a solid and
detailed manual procedure or use an automation tool, e.g. ansible, salt, etc.

>  Or do I need to use `pcs cluster stop`, then manually do a
> new pg_basebackup, copy in the recovery.conf, and `pcs cluster start` for
> each standby node needing rebuilt?

I advise you to put recovery.conf.pcmk outside of PGDATA and use the
resource parameter "recovery_template". It would save you the step of dealing
with recovery.conf. But otherwise, yes, this is the simplest procedure.
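
For illustration, here is a rough dry-run sketch of that procedure in Python.
It only builds the list of commands and executes nothing. The host name, the
PGDATA path and the replication role are placeholders I made up, and the
pg_basebackup options are one plausible choice, not a recommendation. It
assumes "recovery_template" points outside of PGDATA, so there is no
recovery.conf copy step.

```python
import shlex

def rebuild_plan(primary_host, pgdata, repl_user="replication"):
    """Return the shell commands to rebuild one standby, in order.

    Nothing is executed. All argument values are placeholders.
    """
    basebackup = [
        "pg_basebackup",
        "-h", primary_host,   # primary instance to copy from
        "-U", repl_user,      # role allowed to connect for replication
        "-D", pgdata,         # target directory, must be empty
        "-X", "stream",       # stream WAL while the backup runs
        "-P",                 # report progress
    ]
    return [
        "pcs cluster stop",   # stop cluster services on the local node
        # The old PGDATA content has to be moved away before this step.
        " ".join(shlex.quote(arg) for arg in basebackup),
        "pcs cluster start",  # let the cluster start the standby again
    ]

for cmd in rebuild_plan("primary.example", "/var/lib/postgresql/10/main"):
    print(cmd)
```

Review the printed commands and adapt them to your setup before running
anything by hand on the standby being rebuilt.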

Should you keep the cluster up on this node for some other resources, you could
temporarily exclude your pgsql-ha from this node so the cluster stops
considering it for this particular node while you rebuild your standby. Here is
some inspiration:
https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#forbidding-a-paf-resource-on-a-node
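
As a hedged sketch of what that exclusion could look like, the small Python
helper below only formats the two pcs commands involved. It assumes the
resource is named "pgsql-ha" as above and reuses a node name from your logs.
Check the cookbook for the exact form that fits your cluster.

```python
def ban_commands(resource, node):
    """Return the commands to forbid a resource on a node, then allow it again."""
    return (
        # Adds a -INFINITY location constraint for the resource on this node.
        f"pcs resource ban {resource} {node}",
        # Removes that constraint once the standby is rebuilt.
        f"pcs resource clear {resource} {node}",
    )

ban, clear = ban_commands("pgsql-ha", "d-gp2-dbp63-1")
print(ban)
print(clear)
```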

> > On May 13, 2018, at 5:58 AM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> > wrote:
> > 
> > On Fri, 11 May 2018 16:25:18 +0000
> > "Shobe, Casey" <Casey.Shobe at sling.com> wrote:
> >   
> >> I'm using PAF and my corosync log ends up filled with messages like this
> >> (about 3 times per minute for each standby node):
> >> 
> >> pgsqlms(postgresql-10-main)[26822]:     2018/05/11_06:47:08  INFO:
> >> Forbidding promotion on "d-gp2-dbp63-1" in state "startup"
> >> pgsqlms(postgresql-10-main)[26822]:     2018/05/11_06:47:08  INFO:
> >> Forbidding promotion on "d-gp2-dbp63-2" in state "startup"
> >> 
> >> What is the cause of this logging and does it indicate something is wrong
> >> with my setup?  
> > 
> > Yes, something is wrong with your setup. When a PostgreSQL standby is
> > starting up, it tries to hook replication with the primary instance: this
> > is the "startup" state. As soon as it is connected, it starts replicating
> > and tries to catch up with the master location: this is the "catchup" state.
> > As soon as the standby is in sync with the master, it enters the "streaming"
> > state. See column "state" in the doc:
> > https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW
> > 
> > If you have one standby stuck in "startup" state, that means it was able to
> > connect to the master but is not replicating with it for some reason
> > (different/incompatible/non catchable timeline?).
> > 
> > Look for errors in your PostgreSQL logs on the primary and the standby.
> > 
> >   
> 
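
To spot the states described in the quoted message without waiting for the
log messages, something like the following Python sketch could be used. The
query string is meant to be run on the primary with whatever client you
prefer. The helper and the sample rows are purely illustrative, and the rows
assume application_name is set to the node name.

```python
# Query to run on the primary; "state" is a column of pg_stat_replication.
QUERY = "SELECT application_name, state FROM pg_stat_replication"

def not_streaming(rows):
    """Return the (name, state) pairs of standbys not in "streaming" state."""
    return [(name, state) for name, state in rows if state != "streaming"]

# Hypothetical rows, shaped like the result of the query above.
sample = [
    ("d-gp2-dbp63-1", "startup"),
    ("d-gp2-dbp63-2", "streaming"),
]
print(not_streaming(sample))
```

Any standby reported here for more than a short while deserves a look at the
PostgreSQL logs on both sides.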

-- 
Jehan-Guillaume de Rorthais
Dalibo