[ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

Mon Sep 9 11:23:02 EDT 2019

On Mon, 2019-09-09 at 14:21 +0200, wferi at niif.hu wrote:
> Andrei Borzenkov <arvidjaar at gmail.com> writes:
> 
> > 04.09.2019 0:27, wferi at niif.hu wrote:
> > 
> > > Jeevan Patnaik <g1patnaik at gmail.com> writes:
> > > 
> > > > [16187] node1 corosyncwarning [MAIN  ] Corosync main process was not
> > > > scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
> > > > timeout increase.
> > > > [...]
> > > > 2. How to fix this? We have not much load on the nodes, the corosync
> > > > is already running with RT priority.
> > > 
> > > Does your corosync daemon use a watchdog device?  (See in the startup
> > > logs.)  Watchdog interaction can be *slow*.
> > 
> > Can you elaborate? This is the first time I've seen that corosync has
> > anything to do with a watchdog. How exactly does corosync interact with
> > the watchdog? Where in the corosync configuration is the watchdog device
> > defined?
> 
> Inside the resources directive you can specify a watchdog_device, 

Side comment: corosync's built-in watchdog handling is an older
alternative to sbd, the watchdog manager that pacemaker uses. You'd use
one or the other.

If you're running pacemaker on top of corosync, you'd probably want sbd
since pacemaker can use it for more situations than just cluster
membership loss.

> which
> Corosync will "pet" from its main loop.  From corosync.conf(5):
> 
> > In a cluster with properly configured power fencing a watchdog
> > provides no additional value.  On the other hand, slow watchdog
> > communication may incur multi-second delays in the Corosync main loop,
> > potentially breaking down membership.  IPMI watchdogs are particularly
> > notorious in this regard: read about kipmid_max_busy_us in IPMI.txt in
> > the Linux kernel documentation.
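
For illustration, here is a minimal corosync.conf sketch of the two knobs
mentioned in this thread: the token timeout that the warning suggests
raising, and the watchdog_device option in the resources directive. The
values are assumptions chosen only as examples, not recommendations, and
watchdog_device only applies if corosync was built with watchdog support,
so check corosync.conf(5) for your version before using any of this.

```
# corosync.conf fragment (illustrative values only)
totem {
    # Token timeout in milliseconds. 3000 is an arbitrary example;
    # pick a value that suits your cluster.
    token: 3000
}

resources {
    # Turn off corosync's built-in watchdog handling, for example
    # when sbd manages the watchdog instead. To have corosync use a
    # watchdog, give a device path such as /dev/watchdog here.
    watchdog_device: off
}
```

Raising the token timeout only hides a slow main loop, so if the startup
logs show corosync using an IPMI watchdog, it is worth checking that
first.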
-- 
Ken Gaillot <kgaillot at redhat.com>