[ClusterLabs] Fwd: Postgres pacemaker cluster failure
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Sat Apr 27 18:27:13 EDT 2019
On Sat, 27 Apr 2019 09:15:29 +0300
Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> On 27.04.2019 1:04, Danka Ivanović wrote:
> > Hi, here is a complete cluster configuration:
> >
> > node 1: master
> > node 2: secondary
> > primitive AWSVIP awsvip \
> > params secondary_private_ip=10.x.x.x api_delay=5
> > primitive PGSQL pgsqlms \
> > params pgdata="/var/lib/postgresql/9.5/main"
> > bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/"
> > recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk"
> > start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
> > op start timeout=60s interval=0 \
> > op stop timeout=60s interval=0 \
> > op promote timeout=15s interval=0 \
> > op demote timeout=120s interval=0 \
> > op monitor interval=15s timeout=10s role=Master \
> > op monitor interval=16s timeout=10s role=Slave \
> > op notify timeout=60 interval=0
> > primitive fencing-postgres-ha-2 stonith:external/ec2 \
> > params port=master \
> > op start interval=0s timeout=60s \
> > op monitor interval=360s timeout=60s \
> > op stop interval=0s timeout=60s
> > primitive fencing-test-rsyslog stonith:external/ec2 \
> > params port=secondary \
> > op start interval=0s timeout=60s \
> > op monitor interval=360s timeout=60s \
> > op stop interval=0s timeout=60s
> > ms PGSQL-HA PGSQL \
> > meta notify=true
> > colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
> > order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop
> > symmetrical=false
> > location loc-fence-master fencing-postgres-ha-2 -inf: master
> > location loc-fence-secondary fencing-test-rsyslog -inf: secondary
> > order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start
> > symmetrical=false
> > property cib-bootstrap-options: \
> > have-watchdog=false \
> > dc-version=1.1.14-70404b0 \
> > cluster-infrastructure=corosync \
> > cluster-name=psql-ha \
> > stonith-enabled=true \
> > no-quorum-policy=ignore \
> > last-lrm-refresh=1556315444 \
> > maintenance-mode=false
> > rsc_defaults rsc-options: \
> > resource-stickiness=10 \
> > migration-threshold=2
> >
> > I tried to start Postgres manually to make sure it is OK. There are no
> > errors in the Postgres log. I also tried different meta parameters, but
> > always with notify=true.
> > I also tried this:
> > ms PGSQL-HA PGSQL \
> > meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
> > notify=true interleave=true
> > I have followed this link:
> > https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html
> > Once stonith was enabled and working, I imported all the other resources
> > and constraints together at the same time.
> >
> > On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> > wrote:
> >
> >> Hi,
> >>
> >> On Thu, 25 Apr 2019 18:57:55 +0200
> >> Danka Ivanović <danka.ivanovic at gmail.com> wrote:
> >>
> >>> Apr 25 16:39:50 [4213] master lrmd: notice:
> >>> operation_finished: PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You
> >>> must set meta parameter notify=true for your master resource ]
> >>
> >> The resource agent pgsqlms refuses to start PostgreSQL because your
> >> configuration lacks the "notify=true" meta attribute in your master
> >> definition.
> >>
>
> PAF pgsqlms contains:
>
> # check notify=true
> $ans = qx{ $CRM_RESOURCE --resource "$OCF_RESOURCE_INSTANCE" \\
>     --meta --get-parameter notify 2>/dev/null };
> chomp $ans;
> unless ( lc($ans) =~ /^true$|^on$|^yes$|^y$|^1$/ ) {
>     ocf_exit_reason(
>         'You must set meta parameter notify=true for your master resource'
>     );
>     exit $OCF_ERR_INSTALLED;
> }
>
> but that is wrong - "notify" is set on the ms definition, while
> $OCF_RESOURCE_INSTANCE refers to an individual clone member. There is no
> notify option on the PGSQL primitive.
Interesting... and disturbing. I wonder why I have never faced a bug related
to this after so many tests on various OSes and a number of running clusters
in various environments. Moreover, nobody has reported it sooner.
Is it possible that the clone members inherit this from the master definition,
or that "crm_resource" looks at this higher level?
If I set a meta attribute at master level, it appears on clones as well:
> crm_resource --resource pgsql-ha --meta --get-parameter=clone-max
pgsql-ha is active on more than one node, returning the default value for
clone-max
Attribute 'clone-max' not found for 'pgsql-ha'
Error performing operation: No such device or address
> crm_resource --resource pgsqld --meta --get-parameter=clone-max
Attribute 'clone-max' not found for 'pgsqld:0'
Error performing operation: No such device or address
> crm_resource --resource=pgsql-ha --meta --set-parameter=clone-max \
--parameter-value=3
Set 'pgsql-ha' option: id=pgsql-ha-meta_attributes-clone-max
set=pgsql-ha-meta_attributes name=clone-max=3
> crm_resource --resource pgsql-ha --meta --get-parameter=clone-max
pgsql-ha is active on more than one node, returning the default value for
clone-max
3
> crm_resource --resource pgsqld --meta --get-parameter=clone-max
3
If this behavior is not expected, maybe Danka's Pacemaker version acts
differently because of it?
> Why does it not check OCF_RESKEY_CRM_meta_notify?
I was simply not aware of this environment variable. Sadly, it is not
documented anywhere :(
I'll run some tests with it. It would save a call to crm_resource and all the
related machinery, and it sounds safer...
Thanks for the hint!
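For what it's worth, a minimal sketch of what such an environment-based check
could look like (written in Python rather than the agent's Perl, purely for
illustration; it assumes Pacemaker exports the notify meta attribute as
OCF_RESKEY_CRM_meta_notify in the operation environment, and reuses the same
truthy values as the current crm_resource-based check):

```python
import os
import re

# Truthy values accepted by PAF's current check, applied here to the
# notify meta attribute exported in the operation environment.
TRUTHY = re.compile(r'^(true|on|yes|y|1)$', re.IGNORECASE)

def notify_enabled(env=os.environ):
    """Return True if the 'notify' meta attribute looks enabled."""
    return bool(TRUTHY.match(env.get('OCF_RESKEY_CRM_meta_notify', '')))

# Simulated environments, standing in for what Pacemaker would set:
print(notify_enabled({'OCF_RESKEY_CRM_meta_notify': 'true'}))   # True
print(notify_enabled({'OCF_RESKEY_CRM_meta_notify': 'false'}))  # False
print(notify_enabled({}))                                       # False
```

Reading the variable avoids forking crm_resource entirely and cannot be
confused by the clone-member vs. ms-definition distinction discussed above.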