[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Jan Pokorný jpokorny at redhat.com
Mon Feb 18 15:08:16 EST 2019


On 15/02/19 08:48 +0100, Jan Friesse wrote:
> Ulrich Windl napsal(a):
>> IMHO any process running at real-time priorities must make sure
>> that it consumes the CPU only for short moment that are really
>> critical to be performed in time.

Pardon me, Ulrich, but something is off about this, especially
if meant in general.

Even if the OS infrastructure were entirely happy with switching
scheduling parameters constantly and at a furious rate (I assume
there is quite a penalty for doing so, given the overhead of
reconfiguring the schedulers involved), the time-critical sections
do not appear to be easily separable (if at all) within the overall
code flow of a single-threaded program like corosync: practically
everything is time-critical in a sense (token and other timeouts
keep ticking), and offloading side/non-critical tasks to
asynchronous processing is likely not on the roadmap for corosync,
given the historical move away from multithreading (retained only
for logging, where extra precaution is needed to prevent priority
inversion, which will always be a threat whenever processes of
unequal priority interface, even transitively).
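
Just to make the implied cost concrete, "realtime only for the
critical moments" would mean bracketing every such section with a
pair of scheduler reconfigurations, roughly along these lines (a
hypothetical sketch, not corosync code; the priority values and the
section name are made up):

    #define _GNU_SOURCE
    #include <sched.h>

    /* placeholder for one of the supposedly "critical" sections */
    static void handle_token(void) { }

    int main(void)
    {
        struct sched_param rt    = { .sched_priority = 99 };
        struct sched_param other = { .sched_priority = 0 };

        for (;;) {
            /* two extra syscalls (plus scheduler re-queuing) per
             * section, on every single pass through the main loop */
            sched_setscheduler(0, SCHED_RR, &rt);
            handle_token();
            sched_setscheduler(0, SCHED_OTHER, &other);

            /* ... the "non-critical" remainder of the loop, except
             * that with token and other timeouts ticking, nearly
             * everything here is time-critical too ... */
            break;  /* keep the sketch from spinning */
        }
        return 0;
    }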

The way around multithreading is to have another worker process
with IPC of some sort, but that only adds more overhead and
complexity around the additionally managed queues (and possibly
priority inversion yet again).

BTW, regarding the "must make sure" part: barring self-supervision
of some sort (new complexity + overhead), that is an inherent
problem of fixed-priority scheduling assignment.  I've recently been
raising awareness of the (Linux-specific) *deadline scheduler* [1,2]
(a minimal sketch of opting into it follows the list below), which:

- has an even higher hierarchical priority than the SCHED_RR
  policy (making the latter possibly ineffective, which would
  not be very desirable, I guess)

- may better express not only the actual requirements, but also
  an "under normal circumstances, on reasonably scoped HW for the
  task" upper bound (speaking of hypothetical defaults now,
  possibly user-configurable and/or derived from the timeouts
  actually configured at the corosync level) on how much CPU
  run-time the process shall be allowed in absolute terms,
  possibly preventing said livelock scenarios (the task being
  throttled when the budget is exceeded, which would presumably
  speed up the loss of the token and the subsequent fencing)
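
For concreteness, a minimal sketch of what opting into
SCHED_DEADLINE looks like on Linux (hypothetical budget values, not
anything corosync does today; glibc has no wrapper yet, hence the
raw syscall):

    /* Hypothetical sketch: declaring a CPU budget via SCHED_DEADLINE
     * instead of holding a fixed SCHED_RR priority.  All values are
     * made up; real ones would derive from corosync's own timeouts. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    struct sched_attr {              /* no glibc wrapper/struct yet */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;      /* ns of CPU time per period */
        uint64_t sched_deadline;     /* ns, relative deadline */
        uint64_t sched_period;       /* ns */
    };

    int main(void)
    {
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            /* e.g. at most 5 ms of CPU every 50 ms; exceeding the
             * budget gets the task throttled instead of letting it
             * monopolize the CPU */
            .sched_runtime  =  5 * 1000 * 1000ULL,
            .sched_deadline = 50 * 1000 * 1000ULL,
            .sched_period   = 50 * 1000 * 1000ULL,
        };

        if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0) == -1) {
            perror("sched_setattr");
            return 1;
        }
        /* ... event loop would run here, throttled to the budget ... */
        return 0;
    }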

Note that in systemd deployments it would be customary for the
service launcher (the unit file executor) to expose this as yet
another user-customizable wrapping around the actual run, but
support for this particular scheduling policy is currently
missing[3].
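
For illustration, the existing wrapping looks roughly like the
following hypothetical drop-in (the directives are real systemd
ones, the values are made up); "deadline" is simply not among the
accepted policies, which is what [3] is about:

    # /etc/systemd/system/corosync.service.d/sched.conf (hypothetical)
    [Service]
    # accepted values today: other | batch | idle | fifo | rr
    CPUSchedulingPolicy=rr
    CPUSchedulingPriority=99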

>> Specifically having some code that performs poorly (for various
>> reasons) is absolutely _not_ a candidate to be run with real-time
>> priorities to fix the bad performance!

You've managed to flip what is (as far as I can tell, having no
contrary evidence) an isolated occurrence of evidently buggy
behaviour into a generalized statement about the performance of the
SW pieces involved.  If performance were really that bad, we would
constantly hear that there's not enough room left for the actual
clustered resources, but I am not aware of such reports.

By buggy behaviour I mean that the logs from https://clbin.com/9kOUM
and the past https://github.com/ClusterLabs/libqb/commit/2a06ffecd
bug fix seem to have something in common: high load as a surrounding
circumstance, and a missed event/job (on what is presumably a
socket, fd=15 in the log ... since it never gets handled even when
there is no other input event).  My guess is that another look is
needed at the _poll_and_add_to_jobs_ function (not sure why it shows
without the leading/trailing underscores in the provided gdb
backtrace [snipped]:
>>> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
>>> #0 0x00007f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
>>> #1 0x00007f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
>>> #2 0x00007f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
>>> #3 0x000055592d62ff78 in main ()
) and its use.
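
To illustrate the suspected failure mode (a hypothetical, minimal
sketch, not libqb code): a level-triggered fd that stays ready but
is never serviced turns the poll loop into a pure CPU burner with no
forward progress, which matches the 100% CPU / no-progress symptom:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/epoll.h>
    #include <sys/eventfd.h>

    int main(void)
    {
        int ep = epoll_create1(0);
        int efd = eventfd(1 /* already readable */, 0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };

        epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

        for (int i = 0; i < 5; i++) {  /* for (;;) in a real loop */
            struct epoll_event out;
            int n = epoll_wait(ep, &out, 1, 1000);
            /* the fd is reported ready on every pass, but if the
             * dispatch bookkeeping has lost track of it (as fd=15
             * seemingly was in the logs), nothing ever read()s it,
             * so the loop never blocks and never makes progress:
             * 100% CPU, zero work done */
            printf("pass %d: epoll_wait=%d, fd %d ready\n",
                   i, n, out.data.fd);
        }
        close(efd);
        close(ep);
        return 0;
    }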

>> So if corosync is using 100% CPU in real-time, this says something
>> about the code quality in corosync IMHO.

... or in any other library that's involved (primary suspect:
libqb), down to the kernel level.  And keep in mind that no piece of
nontrivial SW is bug-free, especially when the reproducer requires a
rather specific environment that nobody prioritizes, including those
tasked with quality assurance.

>> Also SCHED_RR is even more cooperative than SCHED_FIFO, and another
>> interesting topic is which of the 100 real-time priorities to
>> assign to which process.  (I've written some C code that allows to
>> select the scheduling mechanism and the priority via command-line
>> argument, so the user and not the program is responsible if the
>> system locks up. Maybe corosync should think about something
>> similar.)
> 
> And this is exactly why corosync option -p (-P) exists (in 3.x these
> were moved to corosync.conf as a sched_rr/priority).
> 
>> Personally I also think that a program that sends megabytes of XML
>> as realtime-priority task through the network is broken by design:
>> If you care about response time, minimize the data and processing
>> required before using real-time priorities.

This is already partially done (big XML chunks are compressed before
sending) on the pacemaker side.  The next reasonable step there
would be to move towards one of the nicely wrapped binary formats
(e.g. Protocol Buffers or FlatBuffers[4]), but that is a speculative
long-term direction, and the core XML data interchange will surely
be retained for a long time for compatibility reasons.  Other than
that, corosync does not interpret the transferred data, and
conversely, pacemaker daemons do not run with realtime priorities.

>>>> Edwin Török <edvin.torok at citrix.com> 14.02.19 18.34 Uhr
>>> [...]
>>> 
>>> This appears to be a priority inversion problem, if corosync runs
>>> as realtime then everything it needs (timers...) should be
>>> realtime as well, otherwise running as realtime guarantees we'll
>>> miss the watchdog deadline, instead of guaranteeing that we
>>> process the data before the deadline.

This may not be an immediate priority inversion problem per se, but
(seemingly) a rare bug (presumably in libqb, see the other similar
one above), accentuated by the fixed-priority (only very loosely
upper-bounded) realtime scheduling and by the fact that this all
somehow manages to collide with processes as vital as those required
for actual network packet delivery, IIUIC (which yields some
conclusions about putting VPNs etc. into the mix).

Not sure if this class of problems would, in general, be at least
partially self-solved by deadline (a word used twice in the excerpt
above, out of curiosity) scheduling with some reasonable parameters.

>>> [...]
>>> 
>>> Also would it be possible for corosync to avoid hogging the CPU in
>>> libqb?

... or possibly (I have no proof) for either side not to end up with
inconsistent event tracking, which may slow down any further
progress (if not prevent it entirely); see the similar libqb issue
referenced above.

>>> (Our hypothesis is that if softirqs are not processed then timers
>>> wouldn't work for processes on that CPU either)

Interesting.  Anyway, thanks for sharing your observations.

>>> [...]

[1] https://lwn.net/Articles/743740/
[2] https://lwn.net/Articles/743946/
[3] https://github.com/systemd/systemd/issues/10034
[4] https://bugs.clusterlabs.org/show_bug.cgi?id=5376#c3

-- 
Jan (Poki)