[ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

Mon Sep 9 08:21:59 EDT 2019

Andrei Borzenkov <arvidjaar at gmail.com> writes:

> 04.09.2019 0:27, wferi at niif.hu wrote:
>
>> Jeevan Patnaik <g1patnaik at gmail.com> writes:
>> 
>>> [16187] node1 corosyncwarning [MAIN  ] Corosync main process was not
>>> scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token
>>> timeout increase.
>>> [...]
>>> 2. How to fix this? We have not much load on the nodes, the corosync is
>>> already running with RT priority.
>> 
>> Does your corosync daemon use a watchdog device?  (See in the startup
>> logs.)  Watchdog interaction can be *slow*.
>
> Can you elaborate? This is the first time I've seen that corosync has
> anything to do with a watchdog. How exactly does corosync interact with
> the watchdog? Where in the corosync configuration is the watchdog device
> defined?

Inside the resources directive you can specify a watchdog_device, which
Corosync will "pet" from its main loop.  From corosync.conf(5):

| In a cluster with properly configured power fencing a watchdog
| provides no additional value.  On the other hand, slow watchdog
| communication may incur multi-second delays in the Corosync main loop,
| potentially breaking down membership.  IPMI watchdogs are particularly
| notorious in this regard: read about kipmid_max_busy_us in IPMI.txt in
| the Linux kernel documentation.
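
As a rough sketch of what that looks like in corosync.conf (the device
path is only an illustration, and this assumes a Corosync build with
watchdog support; check the corosync.conf(5) of your version for the
accepted values):

```
resources {
	# Illustrative path only; with this set, Corosync pets the
	# device from its main loop.
	watchdog_device: /dev/watchdog
}
```

If I read the man page right, leaving watchdog_device unset or setting
it to "off" means no watchdog is used, which sidesteps the delay when
power fencing is already in place.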
-- 
Regards,
Feri