[ClusterLabs] [IMPORTANT] Fatal, yet rare issue verging on libqb's design flaw and/or its use in corosync around daemon-forking
Jan Friesse
jfriesse at redhat.com
Mon Jan 22 05:29:45 EST 2018
> It was discovered that corosync exposes itself to a self-crash
> under a rare circumstance: the corosync executable is run while there
> is already a daemon instance around (this does not apply to corosync
> running without any backgrounding, i.e. launched with the "-f" switch).
>
> Such a circumstance can be provoked unattended by a third party,
> incl. the "corosync -v" probe triggered internally by pcs (since 9e19af58
> ~ 0.9.145), which makes the root cause of such an inflicted crash
> somewhat difficult to guess & analyze (the other reason may be that
> the core dump tends to get lost, if produced at all, because fencing
> kicks in, based on the few observed cases).
>
> The problem comes from the fact that corosync is arranged such that
> logging is set up very early, even before the main control flow
> of the program starts. Part of this early setup is also
> starting the "blackbox" recording, which uses an mmap'd file stored in
> /dev/shm whose name, moreover, varies only by the PID embedded in
> it -- and once corosync performs the fork so as to detach itself
> from the environment that started it, that PID is free to be reused.
> And, against all odds, when a fresh new corosync process gets that
> PID, it happily mangles the file underneath the running daemon,
> leading to crashes indicated by SIGBUS, rarely also SIGFPE.
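
To make the PID-in-file-name condition above easier to inspect, here is a
minimal sketch that lists the PIDs baked into blackbox file names and checks
whether a given PID would hit one of them. The file name pattern
"qb-corosync-<PID>-blackbox-header" is an assumption on my side, not taken
from this thread, so verify what your libqb version really creates in
/dev/shm before relying on it. The function names are made up for
illustration.

```shell
#!/bin/sh
# Sketch only, not an official tool.
# ASSUMPTION: blackbox files are named "qb-corosync-<PID>-blackbox-header";
# adjust the pattern to what is actually present under /dev/shm.

# blackbox_pids [DIR]: print each PID found in matching file names.
blackbox_pids() {
    dir=${1:-/dev/shm}
    for f in "$dir"/qb-corosync-*-blackbox-header; do
        [ -e "$f" ] || continue          # glob matched nothing
        name=${f##*/}                    # strip the directory part
        pid=${name#qb-corosync-}         # strip the assumed prefix
        pid=${pid%-blackbox-header}      # strip the assumed suffix
        case $pid in
            ''|*[!0-9]*) continue ;;     # not a plain number, skip
        esac
        echo "$pid"
    done
}

# pid_collides PID [DIR]: succeed if PID equals a PID baked into a file name.
pid_collides() {
    blackbox_pids "${2:-/dev/shm}" | grep -qx "$1"
}
```

For example, a probe could run `pid_collides "$$" && echo "PID $$ matches an existing blackbox file"`
before doing anything that initializes logging. This only detects the
collision; it does nothing to prevent it.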
>
> * * *
>
> There are two quick mitigation techniques that can be readily applied:
>
> 1. turn the on-PATH corosync executable into a "careful" wrapper:
>
> cp -a /sbin/corosync /sbin/corosync.orig
> cat > /sbin/corosync <<EOF
> #!/bin/sh
> test "\$1" != -v || { echo "$(/sbin/corosync.orig -v)"; exit 0; }
> exec /sbin/corosync.orig "\$@"
> EOF
>
> (when using SELinux, check that the wrapper works and fix the
> contexts on these files if needed)
>
> 2. extend the PID space so as to push its wrap-around (a precondition
> for reproducing the issue) further into the future (hence making the
> critical moments less frequent and lowering the overall
> probability), for instance with the Linux kernel:
>
> echo 4194303 > /proc/sys/kernel/pid_max
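
Before changing the limit, it may help to check what is currently in effect.
The following is a small sketch of such a check; the function name is
hypothetical, and the optional second argument exists only so the function
can be tried against an ordinary file instead of /proc.

```shell
#!/bin/sh
# Sketch only: tell whether the current pid_max is below a wanted value.

# pid_max_below WANTED [FILE]: succeed if the number stored in FILE
# (default /proc/sys/kernel/pid_max) is lower than WANTED.
pid_max_below() {
    wanted=$1
    file=${2:-/proc/sys/kernel/pid_max}
    current=$(cat "$file") || return 2   # value could not be read
    [ "$current" -lt "$wanted" ]
}
```

As root, something like `pid_max_below 4194303 && sysctl -w kernel.pid_max=4194303`
would then raise the limit only when needed. Note that a value this high is
accepted on 64-bit kernels only, and that the setting is lost on reboot
unless it is also put into the sysctl configuration.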
>
> * * *
>
> To claim this problem is fixed, all three mentioned components
> will at least have to do their part to limit the problem in the future:
>
> - corosync (do something new after fork?)
Patch proposal:
https://github.com/corosync/corosync/pull/308
Also, the problem is really very rare, and reproducing it is quite hard.
>
> - libqb (be more careful about the crashing condition?)
>
> - pcs (either find a different way to check "is-old-stack", or double
> check that the probe's PID doesn't happen to hit the one baked into
> existing files in /dev/shm?)
>
> so as to cover the cases where the counterpart is not up to date; this
> will also likely lead to augmenting and/or overloading the semantics
> of libqb's API.
> All is being worked on, stay tuned.