[ClusterLabs] Stray started resource leakages (Was: [Problem] The crmd fails to connect with pengine.)

Tue Jan 8 09:06:40 UTC 2019

On 02/01/19 15:43 +0100, Jan Pokorný wrote:
> On 28/12/18 05:51 +0900, renayama19661014 at ybb.ne.jp wrote:
>> As a result, Pacemaker will stop without stopping the resource.
> 
> This might have serious consequences in some scenarios, perhaps
> unless some watchdog-based solution (SBD?) was used as a fencing
> of choice since it would not get defused just as the resource
> wasn't stopped, I think...

Just very recently, I realized that pacemaker is likely not
sufficiently rigorous, in part for the sake of simple design
constraints, in part out of neglect thereof, to prevent any such
"stray started resource" leaks that verge on resource-level
split-brains, at least in theory.

Take, for example, an OCF/LSB resource (hence with just approximated
monitoring capabilities by design) that takes unusually long to start.
What if pacemaker-execd (lrmd) crashes midway through starting it,
leaving the original resource process reparented to PID1?
Pacemakerd will restart this child daemon anew, resources will get
probed, but because the OCF/LSB resource in question is not started
yet (e.g. it double-forks, it does a lengthy initialization in between
the forks, only near the finish line it will create a pid file that is
also the only indicator for the respective monitor operation),
pacemaker on this node indicates to the peers this particular resource
is _not_ running locally, making them free to run it if DC decides
so.  That is, unless the start operation overrides the "on-fail"
default, and only if this start-monitor pair would be evaluated as
a failed start at all (I don't know).  But what we are observing now
is an opportunity for resource-level split-brain to emerge; remember,
the resource on the original node, now under PID1's supervision, is
about to finish its initialization any moment now, and no further
probe/monitor is coming there (unless explicitly configured so)
to detect this disaster any time soon.
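
To make the blind window concrete, here is a minimal sketch (in
Python rather than the shell a real agent would use) of a monitor
that treats the pid file as its only indicator.  The function name
is hypothetical and this is not taken from any actual agent; only
the return values 0 and 7 are the standard OCF codes for "running"
and "not running".

```python
import os

OCF_SUCCESS = 0      # standard OCF code: resource is running
OCF_NOT_RUNNING = 7  # standard OCF code: resource is cleanly stopped


def monitor_pidfile_only(pidfile):
    """Hypothetical probe that trusts the pid file as the sole indicator."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # No usable pid file (yet): indistinguishable from "stopped",
        # even if a reparented start is still initializing.
        return OCF_NOT_RUNNING
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return OCF_NOT_RUNNING
    except PermissionError:
        return OCF_SUCCESS  # process exists but belongs to another user
    return OCF_SUCCESS
```

Between the first fork and the creation of the pid file, such a
probe reports "not running" for a resource that is in fact on its
way up, which is exactly the answer the peers would act upon.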

This theoretical observation makes the systemd class of resources the
only one that can universally and relatively safely survive an
isolated pacemaker-execd restart.  (I am putting nagios and upstart
aside for not having had a look at them, and, perhaps naively,
assuming that things like double-fencing are relatively harmless --
it's meant to be downright idempotent when the action is "off",
unless it collides with a parallel manual intervention, indeed.)
Even then, it might be advisable to have systemd sitting on the
ticking watchdog just in case, since when it internally "asserts",
no further actions are possible till the machine is restarted;
indeed, unless pacemaker can capture this circumstance and panic
on its own.

Alternatively, one needs to make sure the OCF/LSB agent's start
operation begins with creating what's usually called a lock file, so
that the probe after the restart in such a scenario will spot, from
the lock file combined with the missing pid file, that the resource
is still coming up, give it some time for the pid file to actually
appear, and if it does not appear in time, preferably trigger
panic/self-fencing, since any
getting-hold-of-a-process-by-procfs-scan is a broken approach
(there's no snapshot semantics imposed with POSIX), especially
when there can be containers running on that host.
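
The lock file idea could be sketched roughly as follows, again in
Python for brevity and only as an illustration of the decision
logic.  The function name, the grace/poll parameters and the
ESCALATE marker are all my assumptions; in particular ESCALATE is
not an OCF return code, it merely stands for "the caller should
panic or self-fence here".

```python
import os
import time

OCF_SUCCESS = 0        # standard OCF code: resource is running
OCF_NOT_RUNNING = 7    # standard OCF code: resource is cleanly stopped
ESCALATE = "escalate"  # hypothetical marker, NOT an OCF return code


def probe_with_lockfile(lockfile, pidfile, grace=5.0, poll=0.1):
    """Hypothetical probe that also honours a start-time lock file."""
    if os.path.exists(pidfile):
        return OCF_SUCCESS  # liveness check of the pid left out here
    if not os.path.exists(lockfile):
        return OCF_NOT_RUNNING  # no start was ever begun
    # Lock file without pid file: a start is (or was) in flight, so
    # give the pid file a bounded amount of time to show up.
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if os.path.exists(pidfile):
            return OCF_SUCCESS
        time.sleep(poll)
    return ESCALATE  # state unknown: caller should panic/self-fence
```

This assumes the start operation creates the lock file before doing
anything else and that stop removes it; a lock file left over from
before a reboot would need separate handling.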

The other alternative, given the current state of affairs and without
having the OCF/LSB resources in use properly scrutinized (the fact
that they start in a timely manner may be sufficient), is declaring
PCMK_fail_fast=yes in /etc/sysconfig/pacemaker or equivalent.

I do apologize beforehand for not having verified these scenarios
by hand, I wish I had the capacity for that.  Sadly, the failure
modes are far from being documented, which is best done alongside
creating and implementing the design (with a very desirable feedback
loop when running into particular corner cases), without the need
for reverse engineering (reverse grasping of the intentions, prone
to misunderstanding) afterwards.

Keep calm, things have always been this way :-)

-- 
Nazdar,
Jan (Poki)