[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Feb 19 02:22:17 EST 2019


>>> Jan Pokorný <jpokorny at redhat.com> wrote on 18.02.2019 at 21:08 in
message <20190218200816.GD23696 at redhat.com>:
> On 15/02/19 08:48 +0100, Jan Friesse wrote:
>> Ulrich Windl wrote:
>>> IMHO any process running at real-time priorities must make sure
>>> that it consumes the CPU only for short moment that are really
>>> critical to be performed in time.
> 
> Pardon me, Ulrich, but something is off about this, especially
> if meant in general.
> 
> Even if the OS infrastructure were entirely happy with switching
> scheduling parameters constantly and at a furious rate (I assume
> there is quite a penalty in doing so, in the overhead of
> reconfiguring the schedulers involved), the time-critical sections do
> not appear to be easily separable (if at all) from the overall code
> flow of a single-threaded program like corosync, since everything is
> time-critical in a sense (the token and other timeouts keep ticking).
> Offloading side/non-critical tasks for asynchronous processing is
> likely not on the roadmap for corosync either, given the historical
> move away from multithreading (retained only for logging, where extra
> precaution is needed to prevent priority inversion, which will
> generally always be a threat when processes of unequal priority
> interface, even if only transitively).

That's what I'm talking about: if you let users pick the priority, every user
will pick the highest possible one, expecting to get better service. In fact
they don't, and with real-time scheduling it can even halt the whole system.
For corosync this means that RT priority should really only be used where it's
actually needed, not just to paper over poor performance.
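
To make concrete what I meant (a rough sketch only, not corosync code; whether
such mode switches are worth their overhead is exactly the point Jan raises
above): a process could request SCHED_RR right before a genuinely
time-critical section and drop back to SCHED_OTHER afterwards:

#include <sched.h>
#include <stdio.h>

/* Hypothetical time-critical work, e.g. handling a protocol token. */
static void critical_section(void) { /* ... */ }

static int set_policy(int policy, int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    return sched_setscheduler(0, policy, &sp);
}

/* Needs CAP_SYS_NICE (or root) to raise itself to SCHED_RR. */
int main(void)
{
    /* Elevate only for the part that really has a deadline... */
    if (set_policy(SCHED_RR, 1) != 0)
        perror("sched_setscheduler(SCHED_RR)");

    critical_section();

    /* ...and drop back to normal time-sharing for everything else. */
    if (set_policy(SCHED_OTHER, 0) != 0)
        perror("sched_setscheduler(SCHED_OTHER)");

    /* non-critical housekeeping, logging, etc. would follow here */
    return 0;
}

Jan's point about the reconfiguration overhead of course applies to exactly
these two calls.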

> 
> The step around multithreading is to have another worker process
> with IPC of some sorts, but with that, you only add more overhead
> and complexity around such additionally managed queues into the
> game (+ possibly priority inversion yet again).

Yes, a real-time system has to be designed with real-time in mind from the
start; you can't turn an arbitrary system into a real-time system just by
giving it real-time scheduling priorities.

> 
> BTW, regarding the "must make sure" part: barring self-supervision
> of any sort (new complexity + overhead), that is a problem inherent
> to fixed-priority scheduling assignment.  I've recently been raising
> awareness of the (Linux-specific) *deadline scheduler* [1,2], which:
> 
> - has even higher hierarchical priority compared to SCHED_RR
>   policy (making the latter possibly ineffective, which would
>   not be very desirable, I guess)
> 
> - may better express not only the actual requirements, but also an
>   "under normal circumstances, on reasonably scoped HW for the task"
>   upper bound (speaking of hypothetical defaults now, possibly user
>   configurable and/or derived from the timeouts actually configured
>   at the corosync level) on how much CPU run-time the process shall
>   be allowed in absolute terms, possibly preventing said livelock
>   scenarios (the process gets throttled when the budget is exceeded,
>   presumably speeding up the loss of the token and the subsequent
>   fencing)
> 
> Note that in systemd deployments, it would be customary for
> the service launcher (unit file executor) to actually expose
> this stuff as yet another user-customizable wrapping around
> the actual run, but support for this very scheduling policy
> is currently missing[3].
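
For readers who have not used it: switching a process to that policy goes
through the sched_setattr() syscall (glibc provides no wrapper). A minimal
sketch, with a completely made-up budget of 10 ms of CPU per 100 ms window
(not a suggestion for corosync's actual parameters):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Declared here because glibc provides no wrapper for sched_setattr(). */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* these three are in nanoseconds */
    uint64_t sched_deadline;
    uint64_t sched_period;
};

static int sched_setattr(pid_t pid, const struct sched_attr *attr,
                         unsigned int flags)
{
    return syscall(SYS_sched_setattr, pid, attr, flags);
}

int main(void)
{
    struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        /* made-up budget: at most 10 ms of CPU in every 100 ms window */
        .sched_runtime  =  10 * 1000 * 1000ULL,
        .sched_deadline = 100 * 1000 * 1000ULL,
        .sched_period   = 100 * 1000 * 1000ULL,
    };

    if (sched_setattr(0, &attr, 0) != 0) {
        perror("sched_setattr");
        return 1;
    }
    /* The event loop would run here; if it overruns its budget the
     * kernel throttles it instead of letting it monopolize the CPU. */
    return 0;
}

(If the call fails with EBUSY, that is the deadline admission control refusing
the requested bandwidth.)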
> 
>>> Specifically having some code that performs poorly (for various
>>> reasons) is absolutely _not_ a candidate to be run with real-time
>>> priorities to fix the bad performance!
> 
> You've managed to flip an (as far as I can tell) isolated occurrence
> of evidently buggy behaviour into a generalized statement about the
> performance of the involved pieces of SW.

Actually (independent of this issue), I have always had the impression that
corosync communicates too much (a lot of traffic while nothing is happening in
the cluster) and that it breaks easily under load. I also had the impression
that the developers tried to fix this by adding real-time priorities to the
parts that expose the problem, which is the wrong kind of fix IMHO...

> If it were that bad, we would constantly hear that there is not
> enough room left for the actual clustered resources, but I am not
> aware of such reports.

Depends on what "room" actually refers to: would corosync ever work reasonably
on a single-CPU system? Yes, that is purely hypothetical, but software that
deadlocks with only one CPU does actually exist...

> 
> By buggy behaviour I mean the following: the logs from
> https://clbin.com/9kOUM and the past bug fix
> https://github.com/ClusterLabs/libqb/commit/2a06ffecd seem to have
> something in common, like high load as a surrounding circumstance and
> a missed event/job (on what is presumably a socket, fd=15 in the log,
> since that one never gets handled even when there is no other input
> event).  Guess that another look is needed at the
> _poll_and_add_to_jobs_ function (not sure why it appears without the
> leading/trailing underscores in the provided gdb backtrace [snipped]:
>>>> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
>>>> #0 0x00007f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
>>>> #1 0x00007f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
>>>> #2 0x00007f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
>>>> #3 0x000055592d62ff78 in main ()
> ) and its use.
> 
>>> So if corosync is using 100% CPU in real-time, this says something
>>> about the code quality in corosync IMHO.
> 
> ... or in any other library that's involved (primary suspect: libqb),
> down to the kernel level.  And keep in mind that no piece of
> nontrivial SW is bug-free, especially if the reproducer requires a
> rather specific environment that is not prioritized by anyone,
> incl. those tasked with quality assurance.

Yes, but busy-waiting (using poll()) in a real-time task is always dangerous,
especially if you do not have control over the events that are supposed to
arrive.
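
One defensive measure, purely as a sketch of the idea (this is not what libqb
does, and handle_fd() is a made-up placeholder): an event loop running under a
realtime policy could notice that it keeps being woken up without making any
progress and voluntarily back off instead of spinning at 100% CPU:

#include <poll.h>
#include <time.h>

/* Placeholder for the real event handler; returns the number of
 * events it actually consumed. */
static int handle_fd(int fd) { (void)fd; return 0; }

void event_loop(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int idle_wakeups = 0;

    for (;;) {
        int rc = poll(&pfd, 1, 1000 /* ms */);

        if (rc <= 0) {          /* timeout or error: we blocked, no spinning */
            idle_wakeups = 0;
            continue;
        }
        if (handle_fd(fd) > 0) {
            idle_wakeups = 0;   /* real progress was made */
            continue;
        }
        /* poll() reported readiness but nothing was consumed; if this
         * keeps happening, a realtime task would otherwise starve
         * everything else, so yield the CPU for a moment. */
        if (++idle_wakeups > 100) {
            struct timespec ts = { 0, 10 * 1000 * 1000 };  /* 10 ms */
            nanosleep(&ts, NULL);
            idle_wakeups = 0;
        }
    }
}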


> 
>>> Also SCHED_RR is even more cooperative than SCHED_FIFO, and another
>>> interesting topic is which of the 100 real-time priorities to
>>> assign to which process.  (I've written some C code that allows
>>> selecting the scheduling mechanism and the priority via a
>>> command-line argument, so the user and not the program is
>>> responsible if the system locks up. Maybe corosync should think
>>> about something similar.)
>> 
>> And this is exactly why corosync option -p (-P) exists (in 3.x these
>> were moved to corosync.conf as a sched_rr/priority).
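
(For the archives: if I read it correctly, in 3.x that would be a
corosync.conf snippet roughly like the one below; I believe the keys sit in
the system {} section, but please verify against corosync.conf(5) before
copying anything.)

system {
        # do not switch corosync to SCHED_RR at all
        sched_rr: no
        # process priority used instead; see the man page for valid values
        priority: 0
}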
>> 
>>> Personally I also think that a program that sends megabytes of XML
>>> as a realtime-priority task through the network is broken by design:
>>> If you care about response time, minimize the data and processing
>>> required before using real-time priorities.
> 
> This is partially done already (compressing big XML chunks) before
> sending on the pacemaker side.  The next reasonable step there would
> be to move towards some of the nicely wrapped binary formats (e.g.
> Protocol Buffers or FlatBuffers[4]), but it is a speculative long-term
> direction, and core XML data interchange will surely be retained for
> a long long time for compatibility reasons.  Other than that, corosync
> doesn't interpret transferred data, and conversely, pacemaker daemons
> do not run with realtime priorities.

Maybe an overall picture of the tasks, priorities, responsibilities and
communication paths involved would help in understanding this.

> 
>>>>> Edwin Török <edvin.torok at citrix.com> wrote on 14.02.19 at 18:34:
>>>> [...]
>>>> 
>>>> This appears to be a priority inversion problem, if corosync runs
>>>> as realtime then everything it needs (timers...) should be
>>>> realtime as well, otherwise running as realtime guarantees we'll
>>>> miss the watchdog deadline, instead of guaranteeing that we
>>>> process the data before the deadline.
> 
> This may not be an immediate priority inversion problem per se, but
> (seemingly) a rare bug (presumably in libqb, see the other similar
> one above), accentuated by the fixed-priority (only very lightly
> upper-bounded) realtime scheduling and by the fact that this all
> somehow manages to collide with processes as vital as those required
> for actual network packet delivery, IIUIC (which yields some
> conclusions about putting VPNs etc. into the mix).
> 
> Not sure whether this class of problems in general would at least
> partially solve itself under deadline scheduling (a word used twice
> in the above excerpt, curiously enough) with some reasonable
> parameters.
> 
>>>> [...]
>>>> 
>>>> Also would it be possible for corosync to avoid hogging the CPU in
>>>> libqb?
> 
> ...or possibly (though I have no proof) for either side not to end up
> with inconsistent event tracking, which may slow any further progress
> down (if not prevent it entirely); see the similar libqb issue
> referenced above.
> 
>>>> (Our hypothesis is that if softirqs are not processed then timers
>>>> wouldn't work for processes on that CPU either)
> 
> Interesting.  Anyway, thanks for sharing your observations.


Regards,
Ulrich

> 
>>>> [...]
> 
> [1] https://lwn.net/Articles/743740/ 
> [2] https://lwn.net/Articles/743946/ 
> [3] https://github.com/systemd/systemd/issues/10034 
> [4] https://bugs.clusterlabs.org/show_bug.cgi?id=5376#c3 
> 
> -- 
> Jan (Poki)





