[ClusterLabs] [IMPORTANT] Fatal, yet rare issue verging on libqb's design flaw and/or it's use in corosync around daemon-forking

Mon Jan 29 09:22:53 EST 2018

[developers list subscribers, kindly jump to "Current libqb PR" part]

On 22/01/18 11:29 +0100, Jan Friesse wrote:
>> It was discovered that corosync exposes itself for a self-crash
>> under rare circumstance whereby corosync executable is run when there
>> is already a daemon instance around (does not apply to corosync serving
>> without any backgrounding, i.e. launched with "-f" switch).
>> 
>> Such a circumstance can be provoked unattendedly by the third party,
>> incl. "corosync -v" probe triggered internally by pcs (since 9e19af58
>> ~ 0.9.145), which is what makes the root cause analysis of such
>> inflicted crash somewhat difficult to guess & analyze (the other
>> reason may be rather runaway core dump if produced at all due to
>> fencing coming, based on the few observed cases).
>> 
>> The problem comes from the fact that corosync is arranged such that
>> the logging is set up very early, even before the main control flow
>> of the program starts.  And part of this early enabling is also
>> starting "blackbox" recording, which uses mmap'd file stored in
>> /dev/shm that, moreover, only varies on PID that is part of the file
>> name -- and when corosync performs the fork so as to detach itself
>> from the environment it started it, such PID is free to be reused.
>> And against all odds, when that happens with this fresh new corosync
>> process, it happily mangles the file underneath the former daemon one,
>> leading to crashes indicated by SIGBUS, rarely also SIGFPE.
>> 
>> * * *
>> 
>> There are two quick mitigation techniques that can be readily applied:
>> 
>> 1. make on-PATH corosync executable rather a "careful" wrapper:
>> 
>>    cp -a /sbin/corosync /sbin/corosync.orig
>>    > /sbin/corosync cat <<EOF
>>    #!/bin/sh
>>    test "\$1" != -v || { echo "$(/sbin/corosync.orig -v)"; exit 0; }
>>    exec /sbin/corosync.orig "\$@"
>>    EOF
>> 
>>    (when using SELinux, check the function and possibly fix the
>>    contexts on these files)
>> 
>> 2. extend the PID space so as to move its wrap-around (precondition
>>    for reproducing the issue) further to the future (hence make the
>>    critical moments spread less frequently, lowering the overall
>>    probability), for instance with Linux kernel:
>> 
>>    echo 4194303 > /proc/sys/kernel/pid_max

or with recent enough Linux kernel, thanks to the fact that lowest
300 PIDs won't get recycled for being reserved (mutually exclusive
to measure 2. above):

3. start corosync service in a separate PID namespace, e.g. it gets
   started with initscript (or alike script, which may happen also
   under systemd supervision):

   > /sbin/corosync-daemon cat <<EOF
   #!/bin/sh
   <&- >&- 2>&- unshare -fp --mount-proc sh -c \\
   "corosync \$*& while kill -0 -1 2>/dev/null;do wait&&sleep 1||exit \\\$?;done"&
   EOF
   chmod +x /sbin/corosync-daemon
   echo "prog=corosync-daemon" >> /etc/sysconfig/corosync

[Apparently, it would be easier to just wrap the non-forking corosync
(-f switch) in substantially easier way, but then, it's questionable
why this is not the default where possible (e.g. under systemd).]

>> * * *
>> 
>> To claim this problem is fixed, at least all three mentioned components
>> will have to do its part to limit the problem in the future:
>> 
>> - corosync (do something new after fork?)
> 
> Patch proposal:
> 
> https://github.com/corosync/corosync/pull/308

What I propose is a solution addressing mix-and-match scenarios
amongst components, with fix split between corosync:

  https://github.com/corosync/corosync/pull/309

and libqb:

  https://github.com/ClusterLabs/libqb/pull/293

For the former, it may be indeed appealing to add posibility to
toggle blackbox logging directly in corosync configuration, but
that's not the long-term continuity fix.

Note that corosync patched like this alone is sufficient to prevent
the crashes at hand fully, patched libqb then only adds smoothness
to the PID-rollover-while-blackbox-in-use scenario.

> Also problem is really very rare and reproducing it is quite hard.

Agreed, though it's not that hard to trigger the condition
purposefully, see the libqb PR containing also the contrived
reproducer.

>> - libqb (be more careful about the crashing condition?)

Current libqb PR shifts the responsibility to avoid the precondition
(clash on the PID recycled in the interim) to the libqb client that
initiates blackbox logging prior to the fork wherein the parent
immediately terminates.

There's hardly a way to arrange it other way around, since while there
are limited ways for libqb to hook on the forking in the main process
(cf. pthread_atfork(3)), they are not generic enough (cf. clone(2)),
and verifying whether PID has changed upon each log entry would add
unjustified overhead, amplified further by the fact glibc is not
caching getpid(2) value anymore:

  https://sourceware.org/bugzilla/show_bug.cgi?id=15392

Ironically in this context, libqb caches "logging source" PID on it's
own, hence is practically untouched.  On the other hand, that's why
that value goes likewise out-of-sync when PID effectively changes,
so one is currently also on her own to call qb_log_format_set() for
all the enabled log targets upon such a change.

Perhaps this asks for some qb_log_reflect_pid macro that would
wrap all these handlings to recover from out-of-sync upon fork.
Any feedback on this in particular?

>> - pcs (either find a different way to check "is-old-stack", or double
>>   check if the probe's PID doesn't happen to hit the one baked in
>>   existing files in /dev/shm?)

Original proposal turned to be overcomplicated, while all that's
needed to cover this particular exposure to crash is to follow the
least-privilege principle (that's very healthy in general and should
have been followed already; apparently, just about anyone can run
"corosync -v"):

  https://github.com/ClusterLabs/pcs/issues/158

>> so as to cover the-counterpart-not-up2date cases, and also will likely
>> lead to augmenting and/or overloading semantics of libqb's API.
>> All is being worked on, stay tuned.

See the references above.

-- 
Jan (Poki)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 819 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180129/d8e572a4/attachment-0002.sig>