[ClusterLabs Developers] RA as a systemd wrapper -- the right way?

Thu Sep 22 00:05:32 CEST 2016

On 09/21/2016 03:25 PM, Adam Spiers wrote:
> Hi Jan,
> 
> Jan Pokorný <jpokorny at redhat.com> wrote:
>> Hello,
>>
>> https://github.com/ClusterLabs/resource-agents/pull/846 seems to be
>> a first crack at integrating systemd into the otherwise
>> init-system-unaware resource-agents.
>>
>> As pacemaker already handles native systemd integration, I wonder if
>> it wouldn't be better to just build on top of that, perhaps with a
>> special "systemd+hooks" class of resources that would also accept a
>> "hooks" (meta) attribute pointing to an executable implementing a
>> formalized API akin to OCF (say, on-start, on-stop and meta-data
>> actions). That executable would take care of the initial reflecting
>> on the rest of the parameters, plus possibly a cleanup later on.

I can see the usefulness of having "hooks" for OS resources
(systemd/lsb/upstart/service). Let pacemaker start and stop the resource
via the OS mechanism, but do a little bit of extra housekeeping.

It could easily get ugly, though: version dependencies, extra overhead, etc.

>> Technically, something akin to injecting Environment, ExecStartPre
>> and ExecStopPost into the service definition might also achieve the
>> same goal if there's a transparent way to do it from pacemaker using
>> just the systemd API (I don't know).

Sure, pacemaker already creates a unit override before starting a
systemd resource. It would be trivial to add this. It could even simply
be configured as meta-attributes of systemd resources.

However, that wouldn't let you change the behavior of a status call, for
example.
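
To make the override idea concrete, here is a rough sketch of what
rendering such a drop-in might look like. It only builds the text of a
[Service] section; the function name, its parameters, and the idea of
feeding it from meta-attributes are assumptions for illustration, not
anything pacemaker does today. It also assumes the values contain no
characters that need extra quoting.

```python
def render_dropin(environment=None, exec_start_pre=None, exec_stop_post=None):
    """Build the text of a systemd drop-in that adds hooks to a service.

    Hypothetical helper: the arguments stand in for values that might come
    from meta-attributes of a systemd resource.
    """
    lines = ["[Service]"]
    # One Environment= line per variable keeps each assignment separate.
    for key, value in (environment or {}).items():
        lines.append(f'Environment="{key}={value}"')
    if exec_start_pre:
        lines.append(f"ExecStartPre={exec_start_pre}")
    if exec_stop_post:
        lines.append(f"ExecStopPost={exec_stop_post}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The paths below are made up for the example.
    print(render_dropin(
        environment={"APP_MODE": "clustered"},
        exec_start_pre="/usr/local/bin/prepare-grounds",
        exec_stop_post="/usr/local/bin/cleanup",
    ), end="")
```

Note that nothing here touches the status path, which is exactly the
limitation mentioned above.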

>> Indeed, the scenario I have in mind would make do with a separate
>> "prepare grounds" agent, suitably grouped with such a systemd-class
>> resource, but that seems more fragile configuration-wise (this
>> is not the granularity a cluster administrator would be supposed
>> to be thinking in, IMHO, just as with the ocf class).

That isn't pretty either, but it's probably the best approach currently.

There are some non-obvious pitfalls when writing a "secondary" OCF agent
like this, but it's easy to document what they are and how to avoid them.

Nagios agents are another possibility; essentially, they implement a
status action and nothing else. So, a systemd resource + nagios resource
would provide an application-aware status.
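
As an illustration of how small such a status-only check can be, here
is a sketch of an HTTP probe that follows the usual Nagios plugin exit
code convention (0 for OK, 2 for CRITICAL). The function name, the
injectable "opener" parameter (there only so the check can be exercised
without a network), and the example URL are my own assumptions, not
part of any existing agent.

```python
import sys
import urllib.error
import urllib.request

# Conventional Nagios plugin exit codes.
NAGIOS_OK = 0
NAGIOS_CRITICAL = 2


def check_http(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe `url` once and return (exit_code, message)."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except (urllib.error.URLError, OSError) as exc:
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx replies,
        # so error statuses normally end up here as well.
        return NAGIOS_CRITICAL, f"CRITICAL: {url} failed ({exc})"
    if 200 <= status < 300:
        return NAGIOS_OK, f"OK: {url} returned {status}"
    return NAGIOS_CRITICAL, f"CRITICAL: {url} returned {status}"


if __name__ == "__main__":
    # Hypothetical endpoint; substitute the service's real health URL.
    code, message = check_http("http://localhost:8080/health")
    print(message)
    sys.exit(code)
```

The systemd resource would still own start and stop; a check like this
only answers "is the application actually responding?".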

Constraints and failure handling become trickier with this "two agents"
approach.

>> Just thinking aloud before the can is open.
> 
> Thanks for sharing - I'm very interested to hear your ideas on this,
> because I was thinking along somewhat similar lines for the
> openstack-resource-agents repository which I maintain.
> 
> Currently the OpenStack RAs duplicate much of the logic and config of
> corresponding systemd / LSB init scripts for starting / stopping
> OpenStack services and checking their status.  The main difference is
> that RAs also have a "monitor" action which can check the health of
> the service at application level, e.g. via HTTP rather than a naive
> "is this pid running" kind of check.
> 
> This duplication causes issues with portability between Linux
> distributions, since each distribution has a slightly different way of
> starting and stopping the services.  It also results in subtly
> different behaviour for OpenStack clouds depending on whether or not
> they are deployed in HA mode using Pacemaker.
> 
> As a result I have been thinking about the idea of changing the
> start/stop/status actions of these RAs so that they wrap around
> service(8) (which would be even more portable across distros than
> systemctl).
> 
> The primary difference with your approach is that we probably wouldn't
> need to make the RAs dynamically create any systemd configuration, since
> that would already be provided by the packages which install the OpenStack
> services.  But then AFAIK none of the OpenStack services use the
> multi-instance feature of systemd (foo@{one,two,three,etc}.service).
> 
> Cheers,
> Adam

The main complication I see is that pacemaker expects OCF agents to
return success only after an action is complete. For example, start
should not return until the service is fully active. I believe systemctl
does not behave this way; rather, it initiates the action and returns
immediately.
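
A wrapper agent would therefore need something like the following:
issue the start, then poll until the unit reports active or a timeout
expires. This is only a sketch under my own assumptions. The function
names and the injectable run/check/sleep/clock parameters are
illustrative, "active" according to "systemctl is-active" is taken as
good enough, and none of the workarounds in pacemaker's native
integration are reproduced.

```python
import subprocess
import time


def is_active(unit):
    """True if `systemctl is-active --quiet UNIT` exits with status 0."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def start_and_wait(unit, timeout=60.0, interval=1.0,
                   run=subprocess.run, check=is_active,
                   sleep=time.sleep, clock=time.monotonic):
    """Start `unit`, then block until it is active or `timeout` elapses.

    Returns True once the unit is active, False on timeout. Raises
    CalledProcessError if the start command itself fails.
    """
    run(["systemctl", "start", unit], check=True)
    deadline = clock() + timeout
    while True:
        if check(unit):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Hypothetical unit name, for illustration only.
    ok = start_and_wait("example.service", timeout=30.0)
    print("active" if ok else "timed out")
```

A real agent would map the result to the appropriate OCF return code
and would need the same treatment for stop.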

Pacemaker's native systemd integration has a lot of workarounds for
quirks in systemd behavior (and more every release). I'm not sure
moving/duplicating that logic to the RA is a good approach.