[ClusterLabs] Corosync main process was not scheduled for 115935.2266 ms (threshold is 800.0000 ms). Consider token timeout increase.

Wed Feb 17 11:29:42 UTC 2016

I attached a snippet from the corosync.log file.
The last successful cluster recheck happens at "Jan 29 06:56:05" (line #38);
after that, failures start happening.
By that time the cluster had been running under the same (really high) IO
load for about 16 hours without any problems (and no warnings in the log).
Also, based on the timestamps in the log, I feel like the cluster doesn't
react fast enough to events. Am I right here?
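
To make the timestamp comparison easier, here is a minimal sketch of how the
pause durations could be pulled out of the log and compared with the
threshold. It assumes the "[MAIN  ]" message format shown in the quoted log
lines below; the function name scheduling_pauses is only illustrative, and
the sample text is copied from those lines.

```python
import re

# Matches the corosync "[MAIN  ]" scheduling warning. \s+ is used between
# words so the pattern still matches when the mail client wrapped the line.
PAUSE_RE = re.compile(
    r"not\s+scheduled\s+for\s+([\d.]+)\s+ms\s+"
    r"\(threshold\s+is\s+([\d.]+)\s+ms\)"
)


def scheduling_pauses(log_text):
    """Return a list of (pause_ms, threshold_ms) pairs found in log_text."""
    # Drop leading "> " quote markers so wrapped, quoted lines join cleanly.
    cleaned = re.sub(r"^>\s?", "", log_text, flags=re.MULTILINE)
    return [(float(p), float(t)) for p, t in PAUSE_RE.findall(cleaned)]


# Sample input: two warnings copied from the quoted log, including the wraps.
sample = """\
> Jan 29 07:00:43 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main
> process was not scheduled for 12483.7363 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Jan 29 07:01:41 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main
> process was not scheduled for 55969.9180 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
"""

for pause_ms, threshold_ms in scheduling_pauses(sample):
    # Report each pause in seconds and as a multiple of the threshold.
    ratio = pause_ms / threshold_ms
    print(f"{pause_ms / 1000:.1f} s pause, {ratio:.0f}x the threshold")
```

Running the same function over the whole attached file would show when the
pauses start and how quickly they grow.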

Thank you,
Kostia

On Wed, Feb 17, 2016 at 12:30 PM, Kostiantyn Ponomarenko <
konstantin.ponomarenko at gmail.com> wrote:

> Hi,
>
> I am seeing messages like this in my logs:
>
> Jan 29 07:00:41 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
> diskManager_monitor_30000:18807:stderr [ Failed to get properties:
> Connection timed out ]
> Jan 29 07:00:41 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
> pmdh_monitor_30000:18817:stderr [ Failed to get properties: Connection
> timed out ]
> Jan 29 07:00:41 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
> sddh_monitor_30000:18818:stderr [ Failed to get properties: Connection
> timed out ]
> Jan 29 07:00:41 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
> sm0_monitor_30000:18821:stderr [ Failed to get properties: Connection timed
> out ]
> Jan 29 07:00:43 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main
> process was not scheduled for 12483.7363 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Jan 29 07:00:44 B5-2U-205-LS crmd[3015]: notice: process_lrm_event:
> Operation sm0dh_monitor_30000: not running (node=node-0, call=59, rc=7,
> cib-update=261, confirmed=false)
> Jan 29 07:00:44 B5-2U-205-LS crmd[3015]: notice: process_lrm_event:
> node-0-sm0dh_monitor_30000:59 [ Failed to get properties: Connection timed
> out\n ]
> Jan 29 07:01:02 B5-2U-205-LS corosync[2742]: [TOTEM ] Process pause
> detected for 17843 ms, flushing membership messages.
> Jan 29 07:01:04 B5-2U-205-LS indexServer(indexServer)[18891]: WARNING: RA:
> [monitor] : got rc=1
> Jan 29 07:01:04 B5-2U-205-LS diskHelper(dmdh)[18892]: WARNING: RA:
> [monitor] : got rc=1
> Jan 29 07:01:19 B5-2U-205-LS adminServer(adminServer)[18911]: WARNING: RA:
> [monitor] : got rc=1
> Jan 29 07:01:36 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
> indexServer_monitor_30000:18828:stderr [ Failed to get properties:
> Connection timed out ]
> Jan 29 07:01:41 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main
> process was not scheduled for 55969.9180 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Jan 29 07:02:01 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
> dmdh_monitor_30000:18830:stderr [ Failed to get properties: Connection
> timed out ]
> Jan 29 07:03:39 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main
> process was not scheduled for 115935.2266 ms (threshold is 800.0000 ms).
> Consider token timeout increase.
> Jan 29 07:03:47 B5-2U-205-LS
> notificationService(notificationService)[18959]: WARNING: RA: [monitor] :
> got rc=1
> Jan 29 07:03:47 B5-2U-205-LS storageManager(sm0)[18958]: WARNING: RA:
> [monitor] : got rc=1
> Jan 29 07:03:47 B5-2U-205-LS diskManager(diskManager)[18960]: WARNING: RA:
> [monitor] : got rc=1
> Jan 29 07:03:58 B5-2U-205-LS diskHelper(pmdh)[18964]: WARNING: RA:
> [monitor] : got rc=1
> Jan 29 07:04:00 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
> adminServer_monitor_30000:18853:stderr [ Failed to get properties:
> Connection timed out ]
> Jan 29 07:04:04 B5-2U-205-LS diskHelper(sm0dh)[18968]: WARNING: RA:
> [monitor] : got rc=1
> Jan 29 07:04:16 B5-2U-205-LS diskHelper(sddh)[18987]: WARNING: RA:
> [monitor] : got rc=1
> Jan 29 07:04:31 B5-2U-205-LS corosync[2742]: [TOTEM ] Process pause
> detected for 109635 ms, flushing membership messages.
>
> What is happening to the cluster here?
> Why does Corosync say "Corosync main process was not scheduled for ..."?
> Why does lrmd say "... _monitor_30000:18828:stderr [ Failed to get
> properties: Connection timed out ]"?
>
> It is worth mentioning that the system was under heavy IO load.
> Also, I am not sure whether it has something to do
> with load-threshold="400%".
>
> Thank you,
> Kostia
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster_under_load.log
Type: text/x-log
Size: 209171 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20160217/d632d367/attachment-0004.bin>