[ClusterLabs] multiple resources - pgsqlms - and IP(s)

Reid Wahl nwahl at redhat.com
Fri Jan 6 19:36:19 EST 2023


"

On Fri, Jan 6, 2023 at 3:26 PM Jehan-Guillaume de Rorthais via Users
<users at clusterlabs.org> wrote:
>
> On Wed, 4 Jan 2023 11:15:06 +0100
> Tomas Jelinek <tojeline at redhat.com> wrote:
>
> > > On 04. 01. 23 at 8:29, Reid Wahl wrote:
> > > On Tue, Jan 3, 2023 at 10:53 PM lejeczek via Users
> > > <users at clusterlabs.org> wrote:
> > >>
> > >>
> > >>
> > >> On 03/01/2023 21:44, Ken Gaillot wrote:
> > >>> On Tue, 2023-01-03 at 18:18 +0100, lejeczek via Users wrote:
> > >>>> On 03/01/2023 17:03, Jehan-Guillaume de Rorthais wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> On Tue, 3 Jan 2023 16:44:01 +0100
> > >>>>> lejeczek via Users <users at clusterlabs.org> wrote:
> > >>>>>
> > >>>>>> To run a PostgreSQL cluster with the 'pgsqlms' resource, the
> > >>>>>> cluster needs a 'master' IP - what do you do when you have
> > >>>>>> multiple resources using this agent?
> > >>>>>> I wonder if it is possible to keep just one IP and have all
> > >>>>>> those resources go to it - probably the 'scoring' would be
> > >>>>>> very tricky then, or perhaps not?
> > >>>>> That would mean all promoted pgsql instances MUST be on the same
> > >>>>> node at any time.
> > >>>>> If one of your instances ran into trouble and needed to fail over,
> > >>>>> *ALL* of them would fail over.
> > >>>>>
> > >>>>> This implies a failure time window not just for one instance, but
> > >>>>> for all of them, and for all the users.
> > >>>>>
> > >>>>>> Or do you use a separate IP for each 'pgsqlms' resource - the
> > >>>>>> easiest way out?
> > >>>>> That looks like a better option to me, yes.
> > >>>>>
> > >>>>> Regards,
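
For context, the usual PAF-style layout gives each promotable pgsqlms clone
its own IPaddr2 resource colocated with the promoted instance. A rough
sketch, with hypothetical names and addresses (older "with master" pcs
syntax, to match the role names used in this thread):

    pcs resource create pgsqld-apps-ip ocf:heartbeat:IPaddr2 \
        ip=192.168.122.50 cidr_netmask=24 op monitor interval=10s
    pcs constraint colocation add pgsqld-apps-ip with master \
        pgsqld-apps-clone INFINITY
    pcs constraint order promote pgsqld-apps-clone then start \
        pgsqld-apps-ip symmetrical=false

Repeat with a distinct IP for each pgsqlms instance.
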
> > >>>> Not related - Is this an old bug?:
> > >>>>
> > >>>> -> $ pcs resource create pgsqld-apps ocf:heartbeat:pgsqlms
> > >>>> bindir=/usr/bin pgdata=/apps/pgsql/data op start timeout=60s
> > >>>> op stop timeout=60s op promote timeout=30s op demote
> > >>>> timeout=120s op monitor interval=15s timeout=10s
> > >>>> role="Master" op monitor interval=16s timeout=10s
> > >>>> role="Slave" op notify timeout=60s meta promotable=true
> > >>>> notify=true master-max=1 --disable
> > >>>> Error: Validation result from agent (use --force to override):
> > >>>>      ocf-exit-reason:You must set meta parameter notify=true
> > >>>> for your master resource
> > >>>> Error: Errors have occurred, therefore pcs is unable to continue
> > >>> pcs now runs an agent's validate-all action before creating a resource.
> > >>> In this case it's detecting a real issue in your command. The options
> > >>> you have after "meta" are clone options, not meta options of the
> > >>> resource being cloned. If you just change "meta" to "clone" it should
> > >>> work.
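
(Following Ken's suggestion literally, the tail of that command would become
something like

    pcs resource create pgsqld-apps ocf:heartbeat:pgsqlms \
        bindir=/usr/bin pgdata=/apps/pgsql/data \
        [same op definitions as above] \
        clone promotable=true notify=true master-max=1 --disable

i.e. "clone" in place of "meta", with everything else unchanged.)
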
> > >> Nope. Exact same error message.
> > >> If I remember correctly, there was a bug specifically
> > >> pertaining to 'notify=true'
> > >
> > > The only recent one I can remember was a core dump.
> > > - Bug 2039675 - pacemaker coredump with ocf:heartbeat:mysql resource
> > > (https://bugzilla.redhat.com/show_bug.cgi?id=2039675)
> > >
> > >  From a quick inspection of the pcs resource validation code
> > > (lib/pacemaker/live.py:validate_resource_instance_attributes_via_pcmk()),
> > > it doesn't look like it passes the meta attributes. It only passes the
> > > instance attributes. (I could be mistaken.)
> > >
> > > The pgsqlms resource agent checks the notify meta attribute's value as
> > > part of the validate-all action. If pcs doesn't pass the meta
> > > attributes to crm_resource, then the check will fail.
> > >
> >
> > Pcs cannot pass meta attributes to crm_resource, because there is
> > nowhere to pass them to.
>
> But they are passed as environment variables by Pacemaker, so why couldn't
> pcs set them as well when running the agent?

pcs uses crm_resource to run the validate-all action. crm_resource
doesn't provide a way to pass in meta attributes -- only instance
attributes. Whether crm_resource should provide that is another
question...
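
For illustration, what pcs runs under the hood is roughly this (simplified;
exact flags may vary across pcs/Pacemaker versions):

    crm_resource --validate --output-as xml \
        --class ocf --provider heartbeat --agent pgsqlms \
        --option bindir=/usr/bin --option pgdata=/apps/pgsql/data

There's an --option switch for instance attributes, but nothing equivalent
for meta attributes.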

>
> > As defined in OCF 1.1, only instance attributes
> > matter for validation, see
> > https://github.com/ClusterLabs/OCF-spec/blob/main/ra/1.1/resource-agent-api.md#check-levels
>
> It doesn't state clearly that meta attributes must be ignored by the agent
> during these actions.

This section says validate-all "...should validate the instance
parameters provided. The thoroughness of the check may optionally be
influenced by Check Levels.":
https://github.com/ClusterLabs/OCF-spec/blob/main/ra/1.1/resource-agent-api.md#optional-actions

The term "parameter" is used throughout the document to mean "instance
parameters". For example:
https://github.com/ClusterLabs/OCF-spec/blob/main/ra/1.1/resource-agent-api.md#resource-parameters

On the other hand, the meta attributes are also exposed via
environment variables that begin with OCF_RESKEY -- specifically,
OCF_RESKEY_CRM_meta. So perhaps some clarification is in order that
meta attributes are not included (unless we decide to include them).
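
To make the failure mode concrete: pgsqlms is written in Perl, but in
shell-agent terms the check it performs during validate-all amounts to
something like

    # ocf_exit_reason and OCF_ERR_CONFIGURED come from the standard
    # ocf-shellfuncs / ocf-returncodes helpers
    if [ "${OCF_RESKEY_CRM_meta_notify}" != "true" ]; then
        ocf_exit_reason "You must set meta parameter notify=true for your master resource"
        exit $OCF_ERR_CONFIGURED
    fi

Since crm_resource --validate never sets OCF_RESKEY_CRM_meta_notify, that
check can only fail there.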

>
> And one could argue checking a meta attribute is a purely internal setup check,
> at level 0.
>
> > The agents are buggy - they depend on meta attributes being passed to
> > validation. This is already tracked and being worked on:
> >
> > https://github.com/ClusterLabs/resource-agents/pull/1826
>
> The pgsqlms resource agent checks the OCF_RESKEY_CRM_meta_notify environment
> variable before raising this error.
>
> The pgsqlms resource agent relies on the notify action to perform some
> important checks and actions. Without notifications, the resource will
> simply behave incorrectly. This is an essential check.

I don't have an opinion right now on whether validate-all should check
meta attributes in cases like this. Regardless, I think it would be a
good idea to add a note to the pgsqlms metadata that says notify must
be set to true. I don't see such a note, except that `notify=true` is
part of the example commands.

>
> However, I've been considering moving some of these checks so they run only
> during the probe action. Would that make sense? The notify check could move
> there, as there's no need to check it on a regular basis.

IIRC, validate-all (or the logic that it calls) typically doesn't run
during recurring monitors for the reason you described. It should run
for the validate-all action and perhaps a probe and/or start.

Also, if we keep the OCF validate-all scheme the same (check only the
instance parameters), then I think probe and/or start would be a good
place to put the meta attribute validation.
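
In shell-agent terms, that gating could look like this (ocf_is_probe is the
standard ocf-shellfuncs helper that detects a probe, i.e. a monitor with
interval 0; _check_notify_meta is a hypothetical helper wrapping the
notify=true check sketched above):

    if ocf_is_probe; then
        _check_notify_meta
    fi

That way a missing notify=true is still caught when the resource is first
probed, without re-checking it on every recurring monitor.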

>
> Thanks,


-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker


