[ClusterLabs] Should pacemaker pursue its own and corosync's instant resurrection if either dies? (Was: Is corosync supposed to be restarted if it dies?)

Sat Dec 2 08:30:33 EST 2017

On 30/11/17 11:00 +0300, Andrei Borzenkov wrote:
> On Thu, Nov 30, 2017 at 12:42 AM, Jan Pokorný <jpokorny at redhat.com> wrote:
>> On 29/11/17 22:00 +0100, Jan Pokorný wrote:
>>> On 28/11/17 22:35 +0300, Andrei Borzenkov wrote:
>>>> I'm not sure what is expected outcome, but pacemaker.service is still
>>>> restarted (due to Restart=on-failure).
>>> 
>>> Expected outcome is that pacemaker.service will become
>>> "inactive (dead)" after killing corosync (as a result of being
>>> "bound" by pacemaker).  Have you indeed issued "systemctl
>>> daemon-reload" after updating the pacemaker unit file?
>>> 
> 
> Of course. I even rebooted ... :)

I must admit I was experimenting just with simplistic units, which
do not necessarily generalize as well as I thought -- sorry.

> ha1:~ # systemctl cat pacemaker.service  | grep corosync
> After=corosync.service
> BindsTo=corosync.service
> # ExecStopPost=/bin/sh -c 'pidof crmd || killall -TERM corosync'

From the log you've provided:

> Nov 30 10:41:14 ha1 systemd[1]: corosync.service: Main process exited,
> code=killed, status=9/KILL
+
> Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Main process
> exited, code=exited, status=107/n/a

shows one ordering of events (the causal one here: corosync first,
then pacemaker), whereas later they appear in the inverse order
(perhaps because the finalizing sweep goes over unordered or
reverse-ordered tasks at that later stage?):

> Nov 30 10:41:14 ha1 systemd[1]: Stopped Pacemaker High Availability
> Cluster Manager.
> Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Unit entered failed state.
> Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Failed with result
> 'exit-code'.
+
> Nov 30 10:41:14 ha1 systemd[1]: corosync.service: Unit entered failed state.
> Nov 30 10:41:14 ha1 systemd[1]: corosync.service: Failed with result 'signal'.

This hasn't been happening in my testing, but I didn't have the events
so close to one another, which may be causing the race you are talking
about, leading to an unprevented restart:

> Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Service hold-off
> time over, scheduling restart.

Not sure why this is being triggered once more (could be a mere
formal part of the scheduled "restart" transition?):

> Nov 30 10:41:14 ha1 systemd[1]: Stopped Pacemaker High Availability
> Cluster Manager.

And this troublemaker is clearly because of the "restart" transition:

> Nov 30 10:41:14 ha1 systemd[1]: Starting Corosync Cluster Engine...
> 
> Do you mean you get different results? Do not forget that the only
> thing BindsTo does is to stop service if dependency failed; it does
> *not* affect decision whether to restart service in any way (at least
> directly).

In a race-free situation, a service stopped because of BindsTo (or at
least scheduled to be stopped, since 235?) is then not subject to
auto-restarting, from what I've observed, and the documentation agrees:

  Restart= [...] When the death of the process is a result of systemd
  operation (e.g. service stop or restart), the service will not be
  restarted

>>> (FTR, I tried with systemd 235).
>>> 
> 
> Well ... what we have here is race condition. We have two events -
> corosync.service and pacemaker.service *independent* failures
> and two (re-)actions - stop pacemaker.service in response to the
> former (due to BindsTo) and restart pacemaker.service in response to
> the latter (due to Restart=on-failure). The final result depends on
> the order in which systemd gets those events and schedules actions
> (and relative timing when those actions complete) and this is not
> deterministic.

I'm coming to a similar conclusion.

> Now 235 includes some changes to restart logic which refuses to do
> restart if other action (like stop) is currently being scheduled. I am
> not sure what happens if restart is scheduled first though (such
> "implementation details" tend to be not documented in systemd world).

References (I figure that you were active around that issue, so you
surely have much more first-hand knowledge, but for completeness):
https://github.com/systemd/systemd/commit/0f52f8e552f869269f02f0359c2d1019cc39f15a
https://github.com/systemd/systemd/issues/6504

Having had a more thorough look at the systemd code, that order of
"Failed with result" messages as per above looks like a lost case
with v235 as well.  What makes this race more likely seems to be
the fact that corosync goes through more internal cycles in systemd
than pacemaker does:
- either because of hard termination, which pacemaker avoids as it
  uses SendSIGKILL=no
- or due to the additional "cgroup empty" event being handled by
  systemd for corosync, whereas that is hardly the case for pacemaker
  because of KillMode=process

These might (or might not) be artificially reconciled with something
like "ExecStopPost=/usr/bin/sleep 1".  Admittedly, the only advantage
compared to the proposed RestartPreventExitStatus solution is that
one doesn't need to examine undocumented (see following paragraph),
possibly unstable implementation details of pacemakerd.
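
To make the two workarounds mentioned above concrete, here is a rough,
untested sketch of a drop-in for pacemaker.service.  The drop-in file
name is made up, attaching the sleep to pacemaker.service (rather than
corosync.service) is my assumption, and 107 is merely the exit status
seen in the log above; whether pacemakerd uses it exclusively for the
loss of corosync is exactly the undocumented detail in question.  Pick
one of the two options, not both:

```ini
# /etc/systemd/system/pacemaker.service.d/corosync-death.conf
# (hypothetical file name; run "systemctl daemon-reload" afterwards)

[Service]
# Option A: delay completion of pacemaker's stop so that systemd has a
# chance to process the corosync failure first (assumption: 1 second is
# enough; this only narrows the race, it does not close it).
ExecStopPost=/usr/bin/sleep 1

# Option B: never auto-restart when pacemakerd exits with this status
# (107 taken from the log in this thread, not from documentation).
#RestartPreventExitStatus=107
```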

And true for the final note, but there are quite some gray areas in
cluster behaviour as well, let's not hide that fact :-).

This gap also comes from the fact that in order to be qualified to
tell what's missing from the documentation coverage, you gotta read
the existing documentation inside out, understand all the presented
concepts, and importantly, keep that knowledge mentally strictly
separated from what you know from other sources/what your intuition
tells you.  Only then can you attempt documentation
_self-containment_.  And for documentation _completeness_, you need
either an incredible amount of trial and error, or to know the code,
or most likely a combination of both.

Documentation is hard.
That being said, any feedback for documentation of cluster components
is welcome, indeed.

And that's also something that touches the development practices:
it would be nice to have a justification of the Restart= directive in
pacemaker.service attached either directly as a comment or at least in
the respective commit message.  I am myself curious what use cases it
is supposed to help with, as these apparently are not addressed in
systemd-less systems.  Vaguely, it seems that together with
KillMode=process, the aim is to keep as much of the pacemaker runtime
preserved as possible in case of its internal failure, in order not to
take resources down when the problem is recoverable and the fencing
will not get triggered.  It raises "ethical" questions as to whether
a faulty run-time should still be believed...

> I have been doing systemd troubleshooting for a long time to know that
> even if you observe specific sequence of events, another system may
> exhibit completely different sequence.
> 
> Anyway, I will try to install system with 235 on the same platform to
> see how it behaves.

I also tried v219 and it was behaving the same for my simplistic test
case.  See above, things are not expected to change, but you'll see.

>>>> If intention is to unconditionally stop it when corosync dies,
>>>> pacemaker should probably exit with unique code and unit files have
>>>> RestartPreventExitStatus set to it.
>>> 
>>> That would be an elaborate way to reach the same.
>>> 
> 
> This is the *only* way to reach the same. You cannot both tell service
> manager to restart service and skip restart for some events without
> your service somehow indicating to service manager which events has to
> be skipped.

It should be possible, as this indication of the events to skip when
considering a restart boils down to the well-defined BindsTo+After
behaviour (see the quotation above) in case the target exits.
Sadly, some synchronization is missing to assuredly take this
indication into account.
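
For reference, the combination under discussion condenses to the
fragment below.  It is only a sketch of the directives quoted in this
thread, not the complete pacemaker.service as shipped:

```ini
[Unit]
# Ordering plus binding: when corosync.service goes away, systemd
# stops pacemaker.service as well.
After=corosync.service
BindsTo=corosync.service

[Service]
# Auto-restart on unclean exit; per the documentation quoted above,
# a stop performed by systemd itself should not trigger it.
Restart=on-failure
```

The race arises when pacemakerd's own failure exit and the
BindsTo-induced stop both reach systemd at about the same time.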

>>> But good point in questioning what's the "best intention" around these
>>> scenarios -- normally, fencing would happen, but as you note, the node
>>> had actually survived by being fast enough to put corosync back to
>>> life, and from there, whether it adds any value to have pacemaker
>>> restarted on non-clean terminations at all.  I don't know.
>>> 
>>> Would it make more sense to have FailureAction=reboot-immediate to
>>> at least in part emulate the fencing instead?
>> 
>> Although the restart may be also blazingly fast in some cases,
>> not making much difference except for taking all the previously
>> running resources forcibly down as an extra step, which may be
>> either good or bad.

I had the "pacemaker service enabled" scenario in mind, which may not
hold, indeed.  But monitor operations/probes should figure that out,
and the agents should be able to do some recovery autonomously where
suitable.

Perhaps we should test this directive in practice and add it,
commented out, with a description of when you may want to use it.
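
Such a commented-out entry might look roughly like the sketch below.
The wording of the comment is mine and nothing here has been tested;
also note that recent systemd documents FailureAction= in
systemd.unit(5), i.e. under [Unit], whereas older versions listed it
among the [Service] options:

```ini
[Unit]
# Uncomment to reboot the node immediately, without a clean shutdown,
# whenever this unit enters the failed state.  This in part emulates
# fencing, at the price of forcibly taking down all resources that
# were running on the node.
# FailureAction=reboot-immediate
```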

-- 
Poki