[ClusterLabs] Fwd: Postgres pacemaker cluster failure
Ken Gaillot
kgaillot at redhat.com
Mon Apr 29 11:05:47 EDT 2019
On Sun, 2019-04-28 at 00:27 +0200, Jehan-Guillaume de Rorthais wrote:
> On Sat, 27 Apr 2019 09:15:29 +0300
> Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>
> > On 27.04.2019 1:04, Danka Ivanović wrote:
> > > Hi, here is a complete cluster configuration:
> > >
> > > node 1: master
> > > node 2: secondary
> > > primitive AWSVIP awsvip \
> > > params secondary_private_ip=10.x.x.x api_delay=5
> > > primitive PGSQL pgsqlms \
> > > params pgdata="/var/lib/postgresql/9.5/main" bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \
> > > op start timeout=60s interval=0 \
> > > op stop timeout=60s interval=0 \
> > > op promote timeout=15s interval=0 \
> > > op demote timeout=120s interval=0 \
> > > op monitor interval=15s timeout=10s role=Master \
> > > op monitor interval=16s timeout=10s role=Slave \
> > > op notify timeout=60 interval=0
> > > primitive fencing-postgres-ha-2 stonith:external/ec2 \
> > > params port=master \
> > > op start interval=0s timeout=60s \
> > > op monitor interval=360s timeout=60s \
> > > op stop interval=0s timeout=60s
> > > primitive fencing-test-rsyslog stonith:external/ec2 \
> > > params port=secondary \
> > > op start interval=0s timeout=60s \
> > > op monitor interval=360s timeout=60s \
> > > op stop interval=0s timeout=60s
> > > ms PGSQL-HA PGSQL \
> > > meta notify=true
> > > colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master
> > > order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop symmetrical=false
> > > location loc-fence-master fencing-postgres-ha-2 -inf: master
> > > location loc-fence-secondary fencing-test-rsyslog -inf: secondary
> > > order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start symmetrical=false
> > > property cib-bootstrap-options: \
> > > have-watchdog=false \
> > > dc-version=1.1.14-70404b0 \
> > > cluster-infrastructure=corosync \
> > > cluster-name=psql-ha \
> > > stonith-enabled=true \
> > > no-quorum-policy=ignore \
> > > last-lrm-refresh=1556315444 \
> > > maintenance-mode=false
> > > rsc_defaults rsc-options: \
> > > resource-stickiness=10 \
> > > migration-threshold=2
> > >
> > > I tried to start postgres manually to be sure it is OK. There are
> > > no errors in the postgres log. I also tried with different meta
> > > parameters, but always with notify=true.
> > > I also tried this:
> > > ms PGSQL-HA PGSQL \
> > > meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true
> > > I have followed this link:
> > > https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-crm.html
> > > Once stonith was enabled and working, I imported all the other
> > > resources and constraints together at the same time.
> > >
> > > On Fri, 26 Apr 2019 at 13:46, Jehan-Guillaume de Rorthais <
> > > jgdr at dalibo.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > On Thu, 25 Apr 2019 18:57:55 +0200
> > > > Danka Ivanović <danka.ivanovic at gmail.com> wrote:
> > > >
> > > > > Apr 25 16:39:50 [4213] master lrmd: notice: operation_finished: PGSQL_monitor_0:5849:stderr [ ocf-exit-reason:You must set meta parameter notify=true for your master resource ]
> > > >
> > > > The pgsqlms resource agent refuses to start PgSQL because your
> > > > configuration lacks the "notify=true" attribute in your master
> > > > definition.
> > > >
> >
> > PAF pgsqlms contains:
> >
> > # check notify=true
> > $ans = qx{ $CRM_RESOURCE --resource "$OCF_RESOURCE_INSTANCE" \\
> >     --meta --get-parameter notify 2>/dev/null };
> > chomp $ans;
> > unless ( lc($ans) =~ /^true$|^on$|^yes$|^y$|^1$/ ) {
> >     ocf_exit_reason(
> >         'You must set meta parameter notify=true for your master resource' );
> >     exit $OCF_ERR_INSTALLED;
> > }
> >
> > but that is wrong - "notify" is set on the ms definition, while
> > $OCF_RESOURCE_INSTANCE refers to an individual clone member. There is
> > no notify option on the PGSQL primitive.
>
> Interesting... and disturbing. I wonder why I never faced a bug related
> to this after so many tests on various OSes and a bunch of running
> clusters in various environments. Plus, it hasn't been reported by
> anyone sooner.
>
> Is it possible that the clone members inherit this from the master
> definition, or that "crm_resource" looks at this higher level?
That's correct. For clone/master/group/bundle resources, setting meta-
attributes on the collective resource makes them effective for the
inner resources as well. So I don't think that's causing any issues
here.
> If I set a meta attribute at the master level, it appears on the clones
> as well:
>
> > crm_resource --resource pgsql-ha --meta --get-parameter=clone-max
> pgsql-ha is active on more than one node, returning the default value for clone-max
> Attribute 'clone-max' not found for 'pgsql-ha'
> Error performing operation: No such device or address
>
> > crm_resource --resource pgsqld --meta --get-parameter=clone-max
> Attribute 'clone-max' not found for 'pgsqld:0'
> Error performing operation: No such device or address
>
> > crm_resource --resource=pgsql-ha --meta --set-parameter=clone-max \
>   --parameter-value=3
>
> Set 'pgsql-ha' option: id=pgsql-ha-meta_attributes-clone-max set=pgsql-ha-meta_attributes name=clone-max=3
>
> > crm_resource --resource pgsql-ha --meta --get-parameter=clone-max
> pgsql-ha is active on more than one node, returning the default value for clone-max
> 3
>
> > crm_resource --resource pgsqld --meta --get-parameter=clone-max
> 3
>
> If this behavior is not expected, maybe Danka's Pacemaker version acts
> differently because of this?
>
> > Why doesn't it check OCF_RESKEY_CRM_meta_notify?
>
> I was just not aware of this env variable. Sadly, it is not documented
> anywhere :(
It's not a Pacemaker-created value like the other notify variables --
all user-specified meta-attributes are passed that way. We do need to
document that.
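
For example, the check quoted earlier in this thread could presumably read
the environment directly instead of shelling out to crm_resource. A rough,
untested sketch along the lines of the existing pgsqlms code (whether the
variable is also populated the same way during probes is something to
verify in testing):

    # check notify=true, using the meta-attribute Pacemaker already
    # passes in the agent's environment instead of calling crm_resource
    my $notify = $ENV{'OCF_RESKEY_CRM_meta_notify'} || '';
    unless ( lc($notify) =~ /^true$|^on$|^yes$|^y$|^1$/ ) {
        ocf_exit_reason(
            'You must set meta parameter notify=true for your master resource' );
        exit $OCF_ERR_INSTALLED;
    }
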
>
> I'll do some tests with it. It will save a call to crm_resource and all
> that machinery, and sounds safer...
>
> Thanks for the hint!
--
Ken Gaillot <kgaillot at redhat.com>