[ClusterLabs] Corosync main process was not scheduled for 115935.2266 ms (threshold is 800.0000 ms). Consider token timeout increase.

Wed Feb 17 05:30:55 EST 2016

Hi,

I am seeing messages like this in my logs:

Jan 29 07:00:41 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
diskManager_monitor_30000:18807:stderr [ Failed to get properties:
Connection timed out ]
Jan 29 07:00:41 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
pmdh_monitor_30000:18817:stderr [ Failed to get properties: Connection
timed out ]
Jan 29 07:00:41 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
sddh_monitor_30000:18818:stderr [ Failed to get properties: Connection
timed out ]
Jan 29 07:00:41 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
sm0_monitor_30000:18821:stderr [ Failed to get properties: Connection timed
out ]
Jan 29 07:00:43 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main process
was not scheduled for 12483.7363 ms (threshold is 800.0000 ms). Consider
token timeout increase.
Jan 29 07:00:44 B5-2U-205-LS crmd[3015]: notice: process_lrm_event:
Operation sm0dh_monitor_30000: not running (node=node-0, call=59, rc=7,
cib-update=261, confirmed=false)
Jan 29 07:00:44 B5-2U-205-LS crmd[3015]: notice: process_lrm_event:
node-0-sm0dh_monitor_30000:59 [ Failed to get properties: Connection timed
out\n ]
Jan 29 07:01:02 B5-2U-205-LS corosync[2742]: [TOTEM ] Process pause
detected for 17843 ms, flushing membership messages.
Jan 29 07:01:04 B5-2U-205-LS indexServer(indexServer)[18891]: WARNING: RA:
[monitor] : got rc=1
Jan 29 07:01:04 B5-2U-205-LS diskHelper(dmdh)[18892]: WARNING: RA:
[monitor] : got rc=1
Jan 29 07:01:19 B5-2U-205-LS adminServer(adminServer)[18911]: WARNING: RA:
[monitor] : got rc=1
Jan 29 07:01:36 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
indexServer_monitor_30000:18828:stderr [ Failed to get properties:
Connection timed out ]
Jan 29 07:01:41 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main process
was not scheduled for 55969.9180 ms (threshold is 800.0000 ms). Consider
token timeout increase.
Jan 29 07:02:01 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
dmdh_monitor_30000:18830:stderr [ Failed to get properties: Connection
timed out ]
Jan 29 07:03:39 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main process
was not scheduled for 115935.2266 ms (threshold is 800.0000 ms). Consider
token timeout increase.
Jan 29 07:03:47 B5-2U-205-LS
notificationService(notificationService)[18959]: WARNING: RA: [monitor] :
got rc=1
Jan 29 07:03:47 B5-2U-205-LS storageManager(sm0)[18958]: WARNING: RA:
[monitor] : got rc=1
Jan 29 07:03:47 B5-2U-205-LS diskManager(diskManager)[18960]: WARNING: RA:
[monitor] : got rc=1
Jan 29 07:03:58 B5-2U-205-LS diskHelper(pmdh)[18964]: WARNING: RA:
[monitor] : got rc=1
Jan 29 07:04:00 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
adminServer_monitor_30000:18853:stderr [ Failed to get properties:
Connection timed out ]
Jan 29 07:04:04 B5-2U-205-LS diskHelper(sm0dh)[18968]: WARNING: RA:
[monitor] : got rc=1
Jan 29 07:04:16 B5-2U-205-LS diskHelper(sddh)[18987]: WARNING: RA:
[monitor] : got rc=1
Jan 29 07:04:31 B5-2U-205-LS corosync[2742]: [TOTEM ] Process pause
detected for 109635 ms, flushing membership messages.

What is happening to the cluster here?
Why does Corosync say "Corosync main process was not scheduled for ..."?
Why does lrmd say "... _monitor_30000:18828:stderr [ Failed to get properties:
Connection timed out ]"?

It is worth mentioning that the system was under heavy I/O load.
Also, I am not sure whether this has something to do
with load-threshold="400%".

Thank you,
Kostia