[ClusterLabs] Recurring troubles with the weak grip on processes (Was: Timeout stopping corosync-qdevice service)
jpokorny at redhat.com
Thu May 2 08:26:50 EDT 2019
On 30/04/19 20:39 +0300, Andrei Borzenkov wrote:
> 30.04.2019 9:51, Jan Friesse wrote:
>>> Now, corosync-qdevice gets SIGTERM as "signal to terminate", but it
>>> installs SIGTERM handler that does not exit and only closes some socket.
>>> Maybe this should trigger termination of main loop, but somehow it does
>> Yep, this is exactly how qdevice daemon shutdown works. Signal just
>> closes socket (should be signal safe) and poll in main loop does its job
>> so main loop is terminated.
> That is bug in corosync 2.4.4 which is still used in TW. stop is using
> pidof, I have two corosync-qdevice processes so corosync-qdevice never
> gets signal in the first place.
> ++ pidof corosync-qdevice
> + kill -TERM '1812 1811'
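The trace above suggests that both PIDs reached kill as one quoted
argument, which kill rejects, so neither process is signalled. Below is
a minimal sketch of that failure mode and of a per-PID alternative. The
helper name signal_each and the sleep stand-ins are assumptions made for
illustration; this is not the actual corosync-qdevice init script.

```shell
#!/bin/sh
# Hypothetical helper: send signal $1 to every PID in the
# space-separated list $2. Returns non-zero if any kill fails.
signal_each() {
    _sig=$1
    _list=$2
    _rc=0
    for _pid in $_list; do        # deliberate word splitting: one PID per pass
        kill "-$_sig" "$_pid" 2>/dev/null || _rc=1
    done
    return $_rc
}

# Two stand-in processes instead of real corosync-qdevice daemons.
sleep 60 & p1=$!
sleep 60 & p2=$!
pids="$p1 $p2"

# Quoted: kill receives the single argument "<p1> <p2>" and rejects it.
if kill -TERM "$pids" 2>/dev/null; then quoted=ok; else quoted=failed; fi

# Per-PID: each process gets its own kill call.
if signal_each TERM "$pids"; then split=ok; else split=failed; fi

wait "$p1" "$p2" 2>/dev/null || true
echo "quoted: $quoted, per-PID: $split"
```

Splitting the list only fixes delivery of the signal. It still trusts
the name lookup to return exactly the intended processes.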
Needless to say, half of the cluster stack, especially the agents,
still makes decisions based on overly naive assumptions that rest on
an unreliable grip on processes (and on their being singletons, which
may not hold, as demonstrated above) identified by name or PID. Such
identifiers may clash even purely by accident: the typical default
PID space offers just 2^15 (32768) slots, so wraparound can come
quickly, and someone may invoke a pacemaker daemon as a one-off just
to fetch the metadata it provides that way. Containers only make the
situation worse from the host's perspective.
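One common way to tighten the grip somewhat is to remember a process's
start time along with its PID and to compare both before acting. The
sketch below assumes Linux with /proc mounted. The helper names
proc_starttime and same_process are hypothetical and not taken from any
cluster component.

```shell
#!/bin/sh
# Hypothetical helper: print the start time of PID $1, which is
# field 22 of /proc/PID/stat (clock ticks since boot).
proc_starttime() {
    _stat=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
    # Field 2 (comm) may contain spaces or parentheses, so drop
    # everything up to and including the last ") ".
    _rest=${_stat##*") "}
    set -- $_rest                 # $1 is now field 3, so field 22 is ${20}
    echo "${20}"
}

# Hypothetical helper: succeed only if PID $1 still exists and still
# has the start time $2 recorded earlier.
same_process() {
    _now=$(proc_starttime "$1") || return 1
    [ "$_now" = "$2" ]
}

# Stand-in process for the demonstration.
sleep 60 & pid=$!
start=$(proc_starttime "$pid")

if same_process "$pid" "$start"; then before=same; else before=different; fi

kill -TERM "$pid"
wait "$pid" 2>/dev/null || true

if same_process "$pid" "$start"; then after=same; else after=different; fi
echo "before: $before, after: $after"
```

This narrows the window for acting on a recycled PID but does not close
it, because the PID can still be reused between the check and the
signal.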
Luckily, we've fixed some of these troublemakers in pacemaker with
the recent security updates, and there are some interesting synergies
on the horizon, see "pidfd" from the newest developments