[ClusterLabs] Resource validation [was: multiple resources - pgsqlms - and IP(s)]
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Mon Jan 9 08:44:46 EST 2023
Hi,
I definitely have some work/improvements to do on the pgsqlms agent, but
there's still some details I'm interested to discuss below.
On Fri, 6 Jan 2023 16:36:19 -0800
Reid Wahl <nwahl at redhat.com> wrote:
> On Fri, Jan 6, 2023 at 3:26 PM Jehan-Guillaume de Rorthais via Users
> <users at clusterlabs.org> wrote:
>>
>> On Wed, 4 Jan 2023 11:15:06 +0100
>> Tomas Jelinek <tojeline at redhat.com> wrote:
>>
>>> Dne 04. 01. 23 v 8:29 Reid Wahl napsal(a):
>>>> On Tue, Jan 3, 2023 at 10:53 PM lejeczek via Users
>>>> <users at clusterlabs.org> wrote:
>>>>>
>>>>> On 03/01/2023 21:44, Ken Gaillot wrote:
>>>>>> On Tue, 2023-01-03 at 18:18 +0100, lejeczek via Users wrote:
[...]
>>>>>>> Not related - Is this an old bug?:
>>>>>>>
>>>>>>> -> $ pcs resource create pgsqld-apps ocf:heartbeat:pgsqlms
>>>>>>> bindir=/usr/bin pgdata=/apps/pgsql/data op start timeout=60s
>>>>>>> op stop timeout=60s op promote timeout=30s op demote
>>>>>>> timeout=120s op monitor interval=15s timeout=10s
>>>>>>> role="Master" op monitor interval=16s timeout=10s
>>>>>>> role="Slave" op notify timeout=60s meta promotable=true
>>>>>>> notify=true master-max=1 --disable
>>>>>>> Error: Validation result from agent (use --force to override):
>>>>>>> ocf-exit-reason:You must set meta parameter notify=true
>>>>>>> for your master resource
>>>>>>> Error: Errors have occurred, therefore pcs is unable to continue
>>>>>> pcs now runs an agent's validate-all action before creating a
>>>>>> resource. In this case it's detecting a real issue in your command.
>>>>>> The options you have after "meta" are clone options, not meta options
>>>>>> of the resource being cloned. If you just change "meta" to "clone" it
>>>>>> should work.
>>>>> Nope. Exact same error message.
>>>>> If I remember correctly there was a bug specifically
>>>>> pertained to 'notify=true'
>>>>
>>>> The only recent one I can remember was a core dump.
>>>> - Bug 2039675 - pacemaker coredump with ocf:heartbeat:mysql resource
>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=2039675)
>>>>
>>>> From a quick inspection of the pcs resource validation code
>>>> (lib/pacemaker/live.py:validate_resource_instance_attributes_via_pcmk()),
>>>> it doesn't look like it passes the meta attributes. It only passes the
>>>> instance attributes. (I could be mistaken.)
>>>>
>>>> The pgsqlms resource agent checks the notify meta attribute's value as
>>>> part of the validate-all action. If pcs doesn't pass the meta
>>>> attributes to crm_resource, then the check will fail.
>>>
>>> Pcs cannot pass meta attributes to crm_resource, because there is
>>> nowhere to pass them to.
>>
>> But, they are passed as environment variable by Pacemaker, why pcs couldn't
>> set them as well when running the agent?
>
> pcs uses crm_resource to run the validate-all action. crm_resource
> doesn't provide a way to pass in meta attributes -- only instance
> attributes. Whether crm_resource should provide that is another
> question...
But crm_resource can set them as environment variable, they are inherited to
the resource agent when executing it:
# This fails
# crm_resource --validate \
--class ocf --agent pgsqlms --provider heartbeat \
--option pgdata=/var/lib/pgsql/15/data \
--option bindir=/usr/pgsql-15/bin
Operation validate (ocf:heartbeat:pgsqlms) returned 5 (not installed:
You must set meta parameter notify=true for your "master" resource)
ocf-exit-reason:You must set meta parameter notify=true for your "master"
resource
crm_resource: Error performing operation: Not installed
# This fails on a different mandatory setup
# OCF_RESKEY_CRM_meta_notify=1 \
crm_resource --validate \
--class ocf --agent pgsqlms --provider heartbeat \
--option pgdata=/var/lib/pgsql/15/data \
--option bindir=/usr/pgsql-15/bin
Operation validate (ocf:heartbeat:pgsqlms) returned 5 (not installed:
You must set meta parameter master-max=1 for your "master" resource)
ocf-exit-reason:You must set meta parameter master-max=1 for your "master"
resource
crm_resource: Error performing operation: Not installed
# This succeed
# OCF_RESKEY_CRM_meta_notify=1 \
OCF_RESKEY_CRM_meta_master_max=1 \
crm_resource --validate \
--class ocf --agent pgsqlms --provider heartbeat \
--option pgdata=/var/lib/pgsql/15/data \
--option bindir=/usr/pgsql-15/bin
Operation validate (ocf:heartbeat:pgsqlms) returned 0 (ok)
>>> As defined in OCF 1.1, only instance attributes
>>> matter for validation, see
>>> https://github.com/ClusterLabs/OCF-spec/blob/main/ra/1.1/resource-agent-api.md#check-levels
>>
>> It doesn't state clearly that meta attributes must be ignored by the agent
>> during these actions.
>
> This section says validate-all "...should validate the instance
> parameters provided. The thoroughness of the check may optionally be
> influenced by Check Levels.":
> https://github.com/ClusterLabs/OCF-spec/blob/main/ra/1.1/resource-agent-api.md#optional-actions
>
> The term "parameter" is used throughout the document to mean "instance
> parameters". For example:
> https://github.com/ClusterLabs/OCF-spec/blob/main/ra/1.1/resource-agent-api.md#resource-parameters
>
> On the other hand, the meta attributes are also exposed via
> environment variables that begin with OCF_RESKEY -- specifically,
> OCF_RESKEY_CRM_meta. So perhaps some clarification is in order that
> meta attributes are not included (unless we decide to include them).
Well, it seems some resource agent really need them to work correctly. Checking
them early is probably best to avoid bad consequences and user experience
during the cluster setup. Repeated failures and/or fencing can be discouraging.
The soon they get a clear warning or error message, the best it is.
>> And one could argue checking a meta attribute is a purely internal setup
>> check, at level 0.
>>
>>> The agents are bugged - they depend on meta data being passed to
>>> validation. This is already tracked and being worked on:
>>>
>>> https://github.com/ClusterLabs/resource-agents/pull/1826
>>
>> The pgsqlms resource agent checks the OCF_RESKEY_CRM_meta_notify environment
>> variable before raising this error.
>>
>> The pgsqlms resource agent is relying on notify action to make some
>> important checks and actions. Without notifies, the resource will just
>> behave wrongly. This is an essential check.
>
> I don't have an opinion right now on whether validate-all should check
> meta attributes in cases like this. Regardless, I think it would be a
> good idea to add a note to the pgsqlms metadata that says notify must
> be set to true. I don't see such a note, except that `notify=true` is
> part of the example commands.
Indeed, I'll add this. Thanks for checking it.
>> However, I've been considering moving some of these checks only during the
>> probe action. Would it make sense? The notify check could move there as
>> there's no need to check it on a regular basis.
>
> IIRC, validate-all (or the logic that it calls) typically doesn't run
> during recurring monitors for the reason you described. It should run
> for the validate-all action and perhaps a probe and/or start.
Originally, the resource agent was started by following the RA dev guide, where
the "validate-all" action is called on every action, even on monitor. See:
https://github.com/ClusterLabs/resource-agents/blob/main/doc/dev-guides/ra-dev-guide.asc
« validate-all is usually wrapped in a function that is not only called when
explicitly invoking the corresponding action, but also — as a sanity check —
from just about any other function. Therefore, the resource agent author
must keep in mind that the function may be invoked during the start, stop,
and monitor operations, and also during probes. »
> Also, if we keep the OCF validate-all scheme the same (check only the
> instance parameters), then I think probe and/or start would be a good
> place to put the meta attribute validation.
Ok, I'll study this approach.
Thanks!
More information about the Users
mailing list