[ClusterLabs] OCF_ERR_CONFIGURED (was: Virtual ip resource restarted on node with down network device)

Tue Sep 20 10:43:23 EDT 2016

On 09/20/2016 07:38 AM, Lars Ellenberg wrote:
> From the point of view of the resource agent,
> you configured it to use a non-existing network.
> Which it considers to be a configuration error,
> which is treated by pacemaker as
> "don't try to restart anywhere
> but let someone else configure it properly, first".
> 
> I think the OCF_ERR_CONFIGURED is good, though, otherwise 
> configuration errors might go unnoticed for quite some time.
> A network interface is not supposed to "vanish".
> 
> You may disagree with that choice,

This is a point we should settle in the upcoming changes to the OCF
standard.

The OCF 1.0 standard
(https://github.com/ClusterLabs/OCF-spec/blob/master/ra/resource-agent-api.md)
merely says OCF_ERR_CONFIGURED means "Program is not configured". That
is open to interpretation.

Pacemaker
(http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-ocf-return-codes)
takes a narrower view: "The resource's configuration is invalid. E.g.
required parameters are missing."

The reason Pacemaker considers OCF_ERR_CONFIGURED a fatal error is that
it expects the code to be returned only for an error in the resource
agent's configuration *in the cluster*. If the cluster config is bad,
the resource will fail the same way on every node, so there is no point
trying it elsewhere. For example, if an agent takes a parameter
"frobble" with valid values from 1 to 10, and the user supplies
"frobble=-1", that would be a configuration error.
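
To make the "frobble" example concrete, here is a minimal sketch of
such a check, written in Python purely for illustration. The function
name validate_frobble is hypothetical, and treating a missing value as
a configuration error is my assumption; the only things taken from the
OCF conventions are the OCF_RESKEY_ environment variable prefix and the
numeric return codes.

```python
import os

# Return codes as defined by the OCF resource agent API.
OCF_SUCCESS = 0
OCF_ERR_CONFIGURED = 6


def validate_frobble(env=None):
    """Check the hypothetical 'frobble' parameter (valid values: 1 to 10).

    Resource parameters reach the agent as OCF_RESKEY_<name> environment
    variables. A bad value here is bad on every node, which is the case
    OCF_ERR_CONFIGURED is meant for.
    """
    if env is None:
        env = os.environ
    raw = env.get("OCF_RESKEY_frobble")
    if raw is None:
        # Assumption: 'frobble' is a required parameter.
        return OCF_ERR_CONFIGURED
    try:
        value = int(raw)
    except ValueError:
        return OCF_ERR_CONFIGURED
    if not 1 <= value <= 10:
        return OCF_ERR_CONFIGURED
    return OCF_SUCCESS
```

Note that nothing in this check looks at the local host. It depends
only on what the user put in the cluster configuration.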

I think in OCF 2.0 we should distinguish "supplied RA parameters are
bad" from "service's configuration on this host is bad". Currently,
Pacemaker expects the latter kind of error to be reported as
OCF_ERR_GENERIC, OCF_ERR_ARGS, OCF_ERR_PERM, or OCF_ERR_INSTALLED, any
of which allows it to try the resource on another node.
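
As a rough illustration of the current behavior, the sketch below maps
the return codes mentioned here to the recovery type given in the
Pacemaker Explained table linked above, as I read it for 1.1; please
check the documentation for your version. The names RECOVERY and
may_try_another_node are made up for this example and are not part of
Pacemaker.

```python
# Return code -> (name, recovery type). Numeric values are from the OCF
# return code list; recovery types are as I read Pacemaker Explained 1.1.
RECOVERY = {
    1: ("OCF_ERR_GENERIC", "soft"),
    2: ("OCF_ERR_ARGS", "hard"),
    4: ("OCF_ERR_PERM", "hard"),
    5: ("OCF_ERR_INSTALLED", "hard"),
    6: ("OCF_ERR_CONFIGURED", "fatal"),
}


def may_try_another_node(rc):
    """Return True if the cluster may still try the resource elsewhere.

    'soft' and 'hard' errors both leave other nodes eligible; only a
    'fatal' error stops the resource everywhere.
    """
    _name, recovery = RECOVERY[rc]
    return recovery != "fatal"
```

With this mapping, an agent that hits a host-local problem and returns
OCF_ERR_INSTALLED leaves other nodes eligible, while one that returns
OCF_ERR_CONFIGURED does not.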