[ClusterLabs] corosync-qdevice[3772]: Heuristics worker waitpid failed (10): No child processes

Jan Friesse jfriesse at redhat.com
Mon May 6 03:53:26 EDT 2019


Andrei,


> While testing corosync-qdevice I repeatedly got the above message. The
> reason seems to be startup sequence in corosync-qdevice. Consider:
> 
> 
> ● corosync-qdevice.service - Corosync Qdevice daemon
>     Loaded: loaded (/etc/systemd/system/corosync-qdevice.service;
> disabled; vendor preset: disabled)
>     Active: active (running) since Sun 2019-05-05 08:22:03 MSK; 2s ago
>       Docs: man:corosync-qdevice
>    Process: 3770 ExecStart=/usr/sbin/corosync-qdevice
> $COROSYNC_QDEVICE_OPTIONS (code=exited, status=0/SUCCESS)
>   Main PID: 3772 (corosync-qdevic)
>      Tasks: 2 (limit: 553)
>     Memory: 2.1M
>     CGroup: /system.slice/corosync-qdevice.service
>             ├─3771 /usr/sbin/corosync-qdevice
>             └─3772 /usr/sbin/corosync-qdevice
> 
> ...
> May 05 08:11:41 ha2 corosync-qdevice[3772]: Heuristics worker waitpid
> failed (10): No child processes
> May 05 08:11:41 ha2 systemd[1]: Stopping Corosync Qdevice daemon...
> 
> Startup sequence of corosync-qdevice is
> 
> 1. PID 3770 forks off heuristics worker (PID 3771) in
> qdevice_heuristics_init(). Parent of PID 3771 is PID 3770.
> 2. PID 3770 calls utils_tty_detach() to daemonize. PID 3770 forks off
> child (PID 3772) and exits. At this point both PID 3771 and PID 3772 are
> reparented to PID 1, so 3772 can NOT receive status of 3771.
>
> Backgrounding is default behavior. In case of systemd it can trivially
> be turned off and service defined as simple. As there is no consumer of

Yep, this happens because during init qdevice first forks the heuristics
process and then daemonizes itself. During shutdown it waits for the
heuristics process to end, but because that process is not its child,
the wait fails. I know about this problem, but it's quite low priority
for me simply because it's harmless and fully fixed in 3.x if systemd is
used (-f is the default there and the unit type is notify). It's also
not super easy to fix properly (even though it would be quite easy to
mask this error, which I will probably do because the wait is
ineffective anyway).
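
For illustration only, here is a minimal Python sketch of the ordering described above. It is not the qdevice code, and the function name waitpid_on_sibling is made up for this example. A process forks a worker, then forks a second child that plays the daemon, and exits. The daemon's waitpid() on the worker should then fail with ECHILD, which is errno 10 on Linux and matches the "(10): No child processes" in the log.

```python
import errno
import os


def waitpid_on_sibling():
    """Fork a worker first, then the 'daemon'; return the errno the daemon sees."""
    r, w = os.pipe()  # carries the daemon's result back to the caller
    first = os.fork()
    if first == 0:
        # Plays the role of PID 3770 (the process systemd started).
        os.close(r)
        worker = os.fork()
        if worker == 0:
            # Plays the role of PID 3771 (heuristics worker).
            os._exit(0)
        daemon = os.fork()
        if daemon == 0:
            # Plays the role of PID 3772. The worker is a sibling here,
            # not a child, so waitpid() cannot collect its status.
            try:
                os.waitpid(worker, 0)
                code = 0
            except ChildProcessError as exc:
                code = exc.errno
            os.write(w, str(code).encode())
            os._exit(0)
        # The original process exits right away, as daemonization does.
        os._exit(0)
    os.close(w)
    os.waitpid(first, 0)
    with os.fdopen(r) as result:
        # Reads until every copy of the write end has been closed.
        return int(result.read())


if __name__ == "__main__":
    code = waitpid_on_sibling()
    print(code, errno.errorcode.get(code), os.strerror(code))
```

Forking the worker only after daemonizing, or not daemonizing at all (-f), keeps the worker a child of the main process, so the wait can succeed.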

Using unit type simple would also be possible; sadly, various parties
are unhappy with the simple type because it doesn't report a failure
when the service is started and then fails to init (usually because of
the NSS DB, the corosync connection, ...).
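
To show why the notify type can report such failures, here is a rough Python sketch of the generic systemd readiness protocol. It is not taken from qdevice, and notify_ready is a hypothetical helper name. With Type=notify, systemd passes a socket path in the NOTIFY_SOCKET environment variable and treats the unit as started only once the service sends "READY=1" to it. A service that fails during init never sends the message, so the start is reported as failed.

```python
import os
import socket


def notify_ready(message=b"READY=1"):
    """Send a readiness datagram; return False if NOTIFY_SOCKET is not set."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        # Not started by systemd with Type=notify; nothing to report.
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, path)
    return True
```

A daemon would call this only after its initialization (NSS DB, corosync connection, ...) has succeeded.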

Regards,
   Honza

> corosync-qdevice it does not matter - nothing needs to wait for it. Here
> is example service which seems to work for me:


> 
> [Unit]
> Description=Corosync Qdevice daemon
> Documentation=man:corosync-qdevice
> ConditionKernelCommandLine=!nocluster
> Wants=corosync.service
> After=corosync.service
> 
> [Service]
> EnvironmentFile=-/etc/sysconfig/corosync-qdevice
> ExecStart=/usr/sbin/corosync-qdevice -f $COROSYNC_QDEVICE_OPTIONS
> Type=simple
> RuntimeDirectory=corosync-qdevice
> RuntimeDirectoryMode=0770
> KillMode=mixed
> 
> [Install]
> WantedBy=multi-user.target
> 
> 
> 
> with result
> 
> ● corosync-qdevice.service - Corosync Qdevice daemon
>     Loaded: loaded (/etc/systemd/system/corosync-qdevice.service;
> disabled; vendor preset: disabled)
>     Active: active (running) since Sun 2019-05-05 08:28:51 MSK; 13s ago
>       Docs: man:corosync-qdevice
>   Main PID: 4075 (corosync-qdevic)
>      Tasks: 2 (limit: 553)
>     Memory: 2.0M
>     CGroup: /system.slice/corosync-qdevice.service
>             ├─4075 /usr/sbin/corosync-qdevice -f
>             └─4076 /usr/sbin/corosync-qdevice -f
> 
> and after stop
> 
> ● corosync-qdevice.service - Corosync Qdevice daemon
>     Loaded: loaded (/etc/systemd/system/corosync-qdevice.service;
> disabled; vendor preset: disabled)
>     Active: inactive (dead)
>       Docs: man:corosync-qdevice
> 
> May 05 08:27:04 ha2 systemd[1]: corosync-qdevice.service: Succeeded.
> May 05 08:27:51 ha2 systemd[1]: Started Corosync Qdevice daemon.
> May 05 08:28:14 ha2 systemd[1]: Stopping Corosync Qdevice daemon...
> May 05 08:28:14 ha2 systemd[1]: corosync-qdevice.service: Succeeded.
> May 05 08:28:14 ha2 systemd[1]: Stopped Corosync Qdevice daemon.
> May 05 08:28:51 ha2 systemd[1]: Started Corosync Qdevice daemon.
> May 05 08:29:19 ha2 systemd[1]: Stopping Corosync Qdevice daemon...
> May 05 08:29:19 ha2 systemd[1]: corosync-qdevice.service: Succeeded.
> May 05 08:29:19 ha2 systemd[1]: Stopped Corosync Qdevice daemon.
> 
> 


