[ClusterLabs Developers] RA as a systemd wrapper -- the right way?

Mon Sep 26 12:39:09 EDT 2016

On 09/26/2016 09:10 AM, Adam Spiers wrote:
> Ken Gaillot <kgaillot at redhat.com> wrote:
>> On 09/22/2016 10:39 AM, Adam Spiers wrote:
>>> Ken Gaillot <kgaillot at redhat.com> wrote:
>>>> On 09/22/2016 08:49 AM, Adam Spiers wrote:
>>>>> Ken Gaillot <kgaillot at redhat.com> wrote:
>>>>>> On 09/21/2016 03:25 PM, Adam Spiers wrote:
>>>>>>> As a result I have been thinking about the idea of changing the
>>>>>>> start/stop/status actions of these RAs so that they wrap around
>>>>>>> service(8) (which would be even more portable across distros than
>>>>>>> systemctl).
>>>>>>>
>>>>>>> The primary difference with your approach is that we probably wouldn't
>>>>>>> need to make the RAs dynamically create any systemd configuration, since
>>>>>>> that would already be provided by the packages which install the OpenStack
>>>>>>> services.  But then AFAIK none of the OpenStack services use the
>>>>>>> multi-instance feature of systemd (foo@{one,two,three,etc}.service).
>>>>>>
>>>>>> The main complication I see is that pacemaker expects OCF agents to
>>>>>> return success only after an action is complete. For example, start
>>>>>> should not return until the service is fully active. I believe systemctl
>>>>>> does not behave this way, rather it initiates the action and returns
>>>>>> immediately.
>>>>>
>>>>> But that's trivial to work around: polling via "service foo status"
>>>>> after "service foo start" converts it back from an asynchronous
>>>>> operation to a synchronous one.
>>>>
>>>> Yes, that's exactly what pacemaker does now: start/stop, then every two
>>>> seconds, poll the status.
>>>>
>>>> However, I'm currently working on a project to change that, so that we
>>>> use DBus signalling to be notified when the job completes, rather than
>>>> (or in addition to) polling.
>>>>
>>>> The reason is twofold: the two-second wait can be an unnecessary
>>>> recovery delay in some cases; and (at least from the DBus API, not sure
>>>> about systemctl status) there's no reliable way to distinguish "service
>>>> is inactive because the start didn't work properly" from "service is
>>>> inactive because systemd has some slow-starting dependencies of its own
>>>> to start first".
>>>
>>> OK, that makes sense - thanks.
> 
> Although thinking about it more - why couldn't systemctl return
> different exit codes for these two cases, or add an "is-starting"
> subcommand, or similar?

That would be nice. I'm not sure what systemctl returns now, since we
use the DBus API, but I'm guessing it's equivalent.

systemd does have an "activating" state when the service is starting.
However, it does not enter that state while (After=) dependencies are
being started, only when the service itself is being started. It shows
"inactive" when waiting for dependencies to start, and also when the
service is cleanly stopped, and as far as I know, there's no reliable
way to distinguish those two cases.

>>>>>> Pacemaker's native systemd integration has a lot of workarounds for
>>>>>> quirks in systemd behavior (and more every release). I'm not sure
>>>>>> moving/duplicating that logic to the RA is a good approach.
>>>>>
>>>>> What other quirks are there?
>>>>
>>>> When pacemaker starts a systemd service, it creates a unit override in
>>>> /run/systemd/system/<agent>.service.d/50-pacemaker.conf, with these
>>>> overrides (and removes the file when stopping the resource):
>>>>
>>>> * It prefixes the description with "Cluster Controlled" (e.g. "Postfix
>>>> Mail Transport Agent" -> "Cluster Controlled Postfix Mail Transport
>>>> Agent"). This gives a clear indicator in systemd messages in the syslog
>>>> that it's a cluster resource.
>>>>
>>>> * "Before=pacemaker.service": This ensures that when someone shuts down
>>>> the system via systemd, systemd doesn't stop pacemaker before pacemaker
>>>> can stop the resource.
>>>>
>>>> * "Restart=no": This ensures that pacemaker stays in control of
>>>> responding to service failures.
>>>
>>> Yes, I was aware of that, and you're right that my approach of making
>>> the RA wrap service(8) or systemctl(8) would need to duplicate this
>>> functionality - *unless* the creation of the unit override could be
>>> moved out of Pacemaker's C code into a shell script which both
>>> Pacemaker and external RAs which want to adopt this wrapping technique
>>> could call.
>>>
>>>> Additionally:
>>>>
>>>> * Pacemaker uses intelligent timeout values (based on cluster
>>>> configuration) when making systemd calls.
>>>
>>> I guess I'd need more details to fully understand this, but couldn't
>>> those intelligently chosen timeout values be passed to the RA if
>>> necessary?  Although that does put a bit of a dampener on my hope of
>>> using service(8) to remain agnostic to whichever pid-1 system happened
>>> to be in use on the current machine.  Having said that, maybe everyone
>>> in the OpenStack (HA) community has already moved to systemd by now
>>> anyway.
>>
>> One pacemaker action (start/stop/whatever) may involve multiple
>> interactions with systemd. At each step, pacemaker knows the remaining
>> timeout for the whole action, so it can use an appropriate timeout with
>> each systemd action.
>>
>> There's no way for the RA to know how much time is remaining.
> 
> Stupid question - why not?  Couldn't Pacemaker tell it?

Theoretically, pacemaker could pass a "deadline" timestamp by which the
RA is expected to finish, but that's blurring the lines between agent
and cluster manager. Also, I think it's probably sufficient that
pacemaker times out the entire RA action, so hopefully this wouldn't be
a big issue.
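
To illustrate what a passed-in deadline would allow, here is a rough sketch of the per-step budgeting described above. It is not pacemaker's code: `Deadline` and `run_steps` are hypothetical names, and each step is assumed to be a callable that accepts a timeout in seconds.

```python
import time


class Deadline:
    """Tracks how much of an overall action timeout is left."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + total_seconds

    def remaining(self):
        """Seconds left before the overall action times out (never negative)."""
        return max(0.0, self._expires - self._clock())


def run_steps(steps, total_seconds, clock=time.monotonic):
    """Run each step with whatever is left of the overall timeout.

    Each step is a hypothetical callable taking a timeout in seconds.
    Raises TimeoutError if the budget is used up before a step begins.
    """
    deadline = Deadline(total_seconds, clock)
    for step in steps:
        left = deadline.remaining()
        if left <= 0:
            raise TimeoutError("overall action timeout exhausted")
        step(left)
```

The clock is injectable only so the sketch can be exercised without real waiting.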

>> But I guess it's not important, since pacemaker will time out the entire
>> RA action if necessary.
>>
>>>> * Pacemaker interprets/remaps systemd return status as needed. For
>>>> example, a stop followed by a status poll that returns "OK" means the
>>>> service is still running. Fairly obvious, but there are a lot of cases
>>>> that need to be handled.
>>>
>>> Other than (obviously) start followed by status, what other cases are
>>> there?
>>
>> It's just a matter of looking at all the possible return values of each
>> systemd call, and then mapping that to something the cluster can
>> interpret. Pacemaker uses the DBus API so the specifics will be
>> different compared to systemctl. It's just important to get right.
> 
> I'm struggling to understand the specifics here, or find the bit of
> Pacemaker code which corresponds to them.

It's basically the entirety of pacemaker's dbus/systemd handling.
Anytime a systemd call is made, you just have to be sure you fully
understand what all of the possible return values are and what they
mean. Systemd is internally consistent but not always intuitive --
especially from a cluster point of view, which similarly has its own
specific definitions that aren't always obvious.

The "inactive" state mentioned above is the latest example to bite us.
We had mistakenly assumed that any successfully initiated start action
would put the service into "activating" state, but slow dependencies can
delay that.
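
To make the remapping concrete, here is a rough sketch of how a polled state could be interpreted relative to the action that was just requested. The state names are systemd's ActiveState values. The function name and the result labels are made up for this example and do not correspond to pacemaker's actual code.

```python
def interpret_poll(action, active_state):
    """Interpret a polled systemd ActiveState in light of the requested action.

    Returns one of "done", "pending", "failed" or "ambiguous".
    """
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")

    if active_state == "failed":
        # Simplification for this sketch: treated as a failure for either action.
        return "failed"
    if active_state in ("activating", "deactivating", "reloading"):
        return "pending"

    if active_state == "active":
        # After a stop request, "active" means the service is still running.
        return "done" if action == "start" else "pending"
    if active_state == "inactive":
        # After a start request, "inactive" may mean the start did not work,
        # or that the unit is still waiting on its own dependencies.
        return "ambiguous" if action == "start" else "done"

    raise ValueError(f"unknown ActiveState: {active_state}")
```

The "ambiguous" result is the case that polling alone cannot resolve.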

>>> All of this stuff sounds like generic problems which could be solved
>>> once for all wrapper RAs via a simple shell library.  I'd happily
>>> maintain this in openstack-resource-agents, although TBH it would
>>> probably belong in resource-agents if anywhere.
>>>
>>>> All of these were added gradually over the past few years, so I'd expect
>>>> the list to grow over the next few years.
>>>
>>> Well, hopefully they could be grown in a way which also supported
>>> wrapper RAs :-)
>>>
>>> Alternatively, if you think that there's a better solution than this
>>> wrapper RA idea, I'm all ears.  The two main problems are essentially:
>>>
>>>   1. RAs duplicate a whole bunch of logic / config already provided
>>>      by vendor packages and systemd service units.
>>>
>>>   2. RAs have a "monitor" action which can do proper application-level
>>>      monitoring (e.g. HTTP pings), whereas apparently systemd has
>>>      nothing equivalent.
>>>
>>> So currently we are forced to choose between a) using systemd
>>> Pacemaker resources, and b) having proper monitoring rather than just
>>> naive pid-level monitoring, but having to duplicate a whole load of
>>> stuff which systemd already does nicely.
>>>
>>> If I'm missing something, or you can think of a better alternative
>>> then please tell me!
>>
>> I don't see a clear answer.
>>
>> I suppose a resource-agents interface could minimize the problems.
>> Something like ocf_start_via_systemd could create an override file,
>> start a service, and poll until it has a status. Similarly for stop.
> 
> Exactly.
> 
>> The main drawbacks I see are that I'm not sure you can solve the
>> problems with polling without the dbus interface
> 
> I still don't get why not - but that's most likely due to my
> ignorance of the details.  Any pointers gratefully received if you
> have time.

We hope to get around the "inactive" ambiguity by using DBus signalling
to receive notifications when a start job is complete, rather than poll
the status repeatedly. I don't know of any equivalent way to do that
with systemctl.

Also, the polling interval is inherently a start-up/recovery delay. I
don't think there's a way around that, either. For some users, even a
small interval of 2 seconds is undesirable (especially with a chain of
dependent systemd resources).
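
As a back-of-the-envelope illustration of how the interval adds up along a chain, the sketch below assumes the first status poll happens one full interval after the start request and that each resource in the chain is started only once the previous one has been seen as up. Both are assumptions for the example, and the function names are hypothetical.

```python
import math


def detection_time(ready_after, interval):
    """Time at which polling first notices a service that is ready after
    `ready_after` seconds, with polls at interval, 2*interval, and so on."""
    polls = max(1, math.ceil(ready_after / interval))
    return polls * interval


def chain_start_time(ready_times, interval):
    """Total time to bring up a chain of services started one after another."""
    return sum(detection_time(r, interval) for r in ready_times)
```

Under these assumptions, three chained services that each become ready after 0.5 seconds take 6 seconds to bring up with a 2-second interval, versus 1.5 seconds if completion were signalled immediately.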

>> and the override file is tailored to pacemaker (which
>> resource-agents stays independent of).
> 
> Not sure what you mean by this, or why it would be a drawback?

resource-agents conforms to the OCF standard. While pacemaker is by far
the most common use case for OCF resource agents, the point of the
standard is to be interoperable with other systems (such as rgmanager,
certain monitoring systems, or manual/scripted usage). So, the
resource-agents package avoids any strict dependency on being run via
pacemaker (that's the key difference between ocf:pacemaker: agents and
other providers).

The override file contains "Before=pacemaker.service", so that
introduces a bit of specific handling, but "Before=" indicates optional
ordering, so it's maybe acceptable.

The override also contains "Restart=no", which would be suitable for any
cluster manager (not just pacemaker), but maybe not for all possible
(manual/scripted) use cases. Again, maybe acceptable.
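
For reference, the drop-in discussed here could be rendered along these lines. The placement of Description= and Before= under [Unit] and Restart= under [Service] follows standard systemd unit syntax. `override_path` and `render_override` are hypothetical helper names, not part of pacemaker or resource-agents.

```python
def override_path(agent):
    """Location of the drop-in for a given agent name, as quoted in this thread."""
    return f"/run/systemd/system/{agent}.service.d/50-pacemaker.conf"


def render_override(description):
    """Build the drop-in text from the unit's original description."""
    return (
        "[Unit]\n"
        f"Description=Cluster Controlled {description}\n"
        "Before=pacemaker.service\n"
        "\n"
        "[Service]\n"
        "Restart=no\n"
    )
```

A wrapper would write this file before starting the service, run a systemd daemon-reload so the drop-in takes effect, and remove the file again on stop.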

One possibility would be for pacemaker to provide the wrapper library
instead of resource-agents. That would restrict it to agents that must
be used with pacemaker. But maybe that's preferable, especially if
agents that need this feature will be likely to need to set pacemaker
node attributes and have the dependency anyway.

>> If you want to give it a try, here are some test cases:
>>
>> * A service that takes a long time to start, with another resource
>> ordered after it (make sure the second resource doesn't start until the
>> first is fully up)
> 
> I assume you mean the ordering would come from a Pacemaker order
> constraint, not from systemd Before= or After= ?

Correct.

> In this case, the OCF RA "start" action would invoke the shared
> ocf_start_via_systemd() library function, which would initiate
> startup, and then poll until startup completed, at which point the RA
> action would complete and Pacemaker would continue by starting the
> resource ordered after it.  I'm not sure I see the problem here.  Of
> course there would be a Pacemaker start op timeout which would need to
> be long enough, but that's no different regardless of which resource
> agent is in use.

It's not a problem, it's a test case :)

It would verify that the wrapper is not returning until startup is complete.

>> * A service that has a Requires= dependency that takes a long time to
>> start and is not managed by the cluster
> 
> Here you mean that if service A has Requires=B, then B (not A) takes a
> long time to start and is not managed by the cluster - right?  Here
> there are two cases: a) service A also has After=B, and b) it does
> not.

Correct, I meant to say Requires= and After=.

> Again, I don't see a problem.  In case a), ocf_start_via_systemd()
> would initiate startup of A and then poll until A completed startup,
> and systemd would ensure that B started before A started.  In case b)
> then systemd would initiate startup of A and B in parallel.  Again,
> the polling would not return until A was fully started.  This could be
> before B was fully started, but that would be OK because if it wasn't,
> A would have had After=B.
> 
> The fact that I don't see any problems where you apparently do makes
> me deeply suspicious of my own understanding ;-)  Please tell me what
> I'm missing.

This test case would verify that the wrapper is not fooled by A's
"inactive" state while B is starting.

Paired with this should be another test case to verify that the wrapper
returns failure if A stays "inactive". (We could just let it time out in
this case, but especially with long timeouts, it would be better to
detect sooner that the service isn't going to come up.)
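
One possible shape for such a wrapper loop is sketched below. The grace period is purely an assumption of this sketch: it is a heuristic, since polling cannot truly tell a slow dependency from a start that will never happen. `get_state` is a hypothetical callable returning the unit's systemd ActiveState as a string.

```python
import time


def wait_for_start(get_state, timeout, inactive_grace, interval=2.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the unit is active, fails, or looks stuck.

    Returns "started", "failed" or "timeout".
    """
    start = clock()
    while True:
        state = get_state()
        elapsed = clock() - start
        if state == "active":
            return "started"
        if state == "failed":
            return "failed"
        if state == "inactive" and elapsed >= inactive_grace:
            # Still inactive after the grace period: assume it will not start.
            return "failed"
        if elapsed >= timeout:
            return "timeout"
        sleep(interval)
```

Choosing `inactive_grace` is the hard part: too short and a slow dependency B is misreported as a failure of A, too long and it is no better than waiting for the full timeout.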

>> * Use systemctl to shut down the host while the cluster is active, with
>> resources that take a while to stop
> 
> The overrides tell systemd that it has no business shutting down
> resources managed by Pacemaker.  So systemd only cares about shutting
> down Pacemaker itself (which would not complete until all
> Pacemaker-managed resources had been stopped by Pacemaker), and other
> services not managed by Pacemaker.

Correct, this test case verifies that the override is working.

> Now, here I *do* see a potential problem.  If service B is managed by
> Pacemaker, is configured with Requires=A and After=A, but service A is
> *not* managed by Pacemaker, we would need to ensure that on system
> shutdown, systemd would shut down Pacemaker (and hence B) *before* it
> (systemd) shuts down A, otherwise A could be stopped before B,
> effectively pulling the rug from underneath B's feet.
> 
> But isn't that an issue even if Pacemaker only uses systemd resources?
> I don't see how the currently used override files protect against this
> issue.  Have I just "discovered" a bug, or more likely, is there again
> a gap in my understanding?

Systemd handles the dependencies properly here:

- A must be stopped after B (B's After=A)
- B must be stopped after pacemaker (B's Before=pacemaker via override)
- therefore, stop pacemaker, then B (which will be a no-op because
pacemaker will already have stopped it), then A