[ClusterLabs] Fuzzy/misleading references to "restart" of a resource (Was: When does pacemaker call 'restart'/'force-reload' operations on LSB resource?)

Wed Dec 4 15:19:28 EST 2019

On 04/12/19 14:53 +0900, Ondrej wrote:
> When adding 'LSB' script to pacemaker cluster I can see that
> pacemaker advertises 'restart' and 'force-reload' operations to be
> present - regardless if the LSB script supports it or not.  This
> seems to be coming from following piece of code.
> 
> https://github.com/ClusterLabs/pacemaker/blob/92b0c1d69ab1feb0b89e141b5007f8792e69655e/lib/services/services_lsb.c#L39-L40
> 
> Questions:
> 1.  When the 'restart' and 'force-reload' operations are called on
>     the LSB script cluster resource?

[reordered]

> I would have expected that 'restart' operation would be called when
> using 'crm_resource --restart --resource myResource', but I can see
> that 'stop' and 'start' operations are used in that case instead.

This is due to how "crm_resource --restart" is arranged,
directly in the implementation of this CLI tool itself
(see tools/crm_resource_runtime.c:cli_resource_restart):

- first, target-role meta-attribute for resource is set to Stopped

- then, once the activity has settled, it is set back to the target-role
  it was originally at
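
To make the two steps concrete, here is a rough shell approximation.
It is only a sketch under assumptions, not what cli_resource_restart
does internally: the "run" helper, the DRY_RUN switch and the
restart_by_target_role name are made up for illustration, the original
target-role is assumed to be known to the caller (defaulting to
Started), and "crm_resource --wait" merely stands in for the tool's
own waiting logic.

```shell
# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical approximation of the two-step "restart".
# $1 = resource id, $2 = target-role to restore (assumed: Started)
restart_by_target_role() {
    rsc=$1
    original_role=${2:-Started}

    # step 1: ask the cluster to stop the resource
    run crm_resource --resource "$rsc" --meta \
        --set-parameter target-role --parameter-value Stopped || return

    # let the cluster settle before going on
    run crm_resource --wait || return

    # step 2: put target-role back to where it was
    run crm_resource --resource "$rsc" --meta \
        --set-parameter target-role --parameter-value "$original_role"
}
```

With DRY_RUN=1 set, "restart_by_target_role myResource" only prints
the three commands, which shows that nothing resembling a single
"restart" ever reaches the resource agent.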

Done stepwise like this, there's no reasonably implementable mapping
back to a single composite step (stop, start -> restart), since the
plan is not shared in full in advance with the respective moving
parts.  And plain common sense would still preclude it (below).

Hence, it is in actuality a great discovery that the "restart"
triggering verb/action is in fact completely neglected and bogus when
it comes to handling by pacemaker.  If an agent implements any
optimizations there (thanks to having the intimate knowledge of the
resource at hand, plus knowing before-after state combo and possibly
how to transition in one go), cluster resource management won't
benefit from that in any way.

Interestingly, such optimizations are exactly what the original
OCF draft had in mind :-)
https://github.com/ClusterLabs/OCF-spec/blob/start/resource_agent/API/02#L225
(even more interestingly, only to be reconsidered again some decades
later: https://github.com/ClusterLabs/OCF-spec/issues/10;
yeah, aren't we masters of following targets moving to the extent they
are sometimes contradictory?  I'd blame a desperate lack of written
[and easily obtainable] design decisions made in the past for that)

They are mandated by LSB as well, but hey, in systemd era, we are
now _free_ to call LSB severely broken as it (shamefully, I'd say)
never even tried to accommodate proper dealing with dependency
chains (and actual serializability thereof!), as explained
in an example below.  Or put in other words, LSB was never meant
to stand for a holistic resource management, something both systemd
and pacemaker attempt to cover (single/multi-machine wide).

OTOH, this enforced split of state transitions is perhaps what makes
the transaction (comprising perhaps countless other interdependent
resources) serializable and thus feasible at all (think: you cannot
nest any further handling -- so as to satisfy given constraints -- in
between stop and start when that's an atom, otherwise), and that's
exactly how, say, systemd approaches that, likely for that very reason:
https://github.com/systemd/systemd/commit/6539dd7c42946d9ba5dc43028b8b5785eb2db3c5
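
A toy illustration of why the split matters, assuming a simple chain
in which each resource depends on the one listed before it.  The
plan_restart name and the resource names are hypothetical, and this
is in no way how pacemaker's scheduler is implemented; it only prints
the order of steps such a chain would need.

```shell
nl='
'

# plan_restart BASE [DEPENDENT...]
# Each later argument is assumed to depend on the one before it.
plan_restart() {
    stops=
    starts=
    for rsc in "$@"; do
        # dependents have to stop before what they depend on ...
        stops="stop $rsc${stops:+$nl$stops}"
        # ... and may start only after it is back up
        starts="${starts:+$starts$nl}start $rsc"
    done
    if [ -n "$stops" ]; then
        printf '%s\n%s\n' "$stops" "$starts"
    fi
}
```

"plan_restart db app" prints "stop app", "stop db", "start db",
"start app", one per line: the stop and the start of app are separated
by both steps of db, which a single atomic "restart" of app could not
express.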

So I see room for improvement here as our takeaway:

* resource agents:

  - some agents declare/implement "restart" action when there is
    no practical reason to (AudibleAlarm, Xinetd, dhcpd, etc.)
    [as a side note, there are nonsensical considerations, such as
    when default "start" and "stop" timeouts for dhcpd are 20 seconds
    each, how come, then, that "restart" defined as "stop; start"
    would also make do with 20 seconds altogether, unless there is
    some amortized work I fail to see :-)]

* pacemaker:

  - artificially generated meta-data mention "restart" action when
    there is no good reason to (lib/services/services_lsb.c)

  - there are some correct clues in Pacemaker Explained, but perhaps
    it should take the time to emphasize that whenever "restart" is
    referred to, it is never an atomic step, but always a sequence
    of two steps that may be considered atomic on their own,
    but possibly interleaved with other steps so as to retain
    soundness wrt. the imposed constraints and/or changes made
    in parallel

  - the same gist of "restart" shall be sketched in a help screen
    of crm_resource

> For 'force-reload' I have no idea on how to try trigger it looking
> at 'crm_resource --help' output.

Sorry, that's even more bogus, as there's no relevance whatsoever.
It needs to either be dropped from artificially generated meta-data
as well, or it needs to be investigated further whether there's any
reason to make such an operation triggerable by users, and if so, how
much impact spread is to be expected when implemented (do the
dependent services need to be reloaded or "restarted" as well,
since the change might be non-local? any precedent there?
again, hard to analyse in the lack of written design decisions
that would provide an immediate frame for thinking about this)

[reordered]

> 2. How can I trigger 'restart' and 'force-reload' operation on LSB
>    script cluster resource in pacemaker?
> 
> Cluster resource definition looks like this:
> <primitive class="lsb" id="myResource" type="script.sh">
>   <operations>
>     <op id="myResource-force-reload-interval-0s" interval="0s"
>         name="force-reload" timeout="15s"/>
>     <op id="myResource-monitor-interval-15" interval="15" name="monitor"
>         timeout="15"/>
>     <op id="myResource-restart-interval-0s" interval="0s" name="restart"
>         timeout="15"/>
>     <op id="myResource-start-interval-0s" interval="0s" name="start"
>         timeout="15"/>
>     <op id="myResource-stop-interval-0s" interval="0s" name="stop"
>         timeout="15"/>
>   </operations>
>   <instance_attributes id="myResource-instance_attributes"/>
>   <meta_attributes id="myResource-meta_attributes"/>
> </primitive>
> 
> [...]
> 
> I want to make sure that cluster will not attempt running 'restart'
> nor 'force-reload' on script that is not implementing them.

Understood, I am reasonably sure about the former and definitely sure
about the latter, in the current state of implementation anyway.
That you even need to stress about these bogus circumstances doesn't
put us in a good light, but that makes this feedback loop all the more
important.

> As for now I'm considering to return exit code '3' from script when
> these actions are called to indicate that they are 'unimplemented
> feature' as suggested by LSB specification below. However I would
> like to verify that this works as expected.
> http://refspecs.linuxfoundation.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

If your resource is solely to be run under pacemaker, I'd prune
all those quirks altogether, to make one's life easier.
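
For illustration, a minimal sketch of what such a pruned script could
boil down to.  Everything service-specific here is an assumption: the
lsb_action name is made up, and a plain state file (STATE_FILE) stands
in for a real daemon.  Only start, stop and status do anything;
"restart" and "force-reload" report exit code 3 ("unimplemented
feature") as you were considering, and anything else is rejected with
2 ("invalid or excess arguments").

```shell
# Hypothetical state file standing in for a real daemon.
STATE_FILE=${STATE_FILE:-/tmp/myservice.state}

lsb_action() {
    case "$1" in
        start)
            # succeeds also when already "running"
            touch "$STATE_FILE" ;;
        stop)
            # succeeds also when already "stopped"
            rm -f "$STATE_FILE" ;;
        status)
            # LSB status codes: 0 = running, 3 = not running
            [ -e "$STATE_FILE" ] && return 0
            return 3 ;;
        restart|force-reload)
            # LSB: 3 = unimplemented feature
            echo "$1: not implemented" >&2
            return 3 ;;
        *)
            # LSB: 2 = invalid or excess argument(s)
            echo "Usage: $0 {start|stop|status}" >&2
            return 2 ;;
    esac
}
```

The script itself would then end with a line along the lines of
'lsb_action "$1"; exit $?'.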

-- 
Jan (Poki)