[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck
kwenning at redhat.com
Mon Sep 11 10:18:08 EDT 2017
On 09/11/2017 12:32 PM, Jan Friesse wrote:
>> wferi at niif.hu (Ferenc Wágner) writes:
>>> Jan Friesse <jfriesse at redhat.com> writes:
>>>> wferi at niif.hu writes:
>>>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>>>> ramping up):
>>>>> vhbl08 corosync: [TOTEM ] A processor failed, forming new
>>>>> vhbl03 corosync: [TOTEM ] A processor failed, forming new
>>>>> vhbl07 corosync: [MAIN ] Corosync main process was not
>>>>> scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider
>>>>> token timeout increase.
>>>> ^^^ This is the main problem you have to solve. It usually means that
>>>> the machine is too overloaded. It happens quite often when corosync
>>>> is running inside a VM whose host machine is unable to schedule the VM
>>>> regularly.
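For illustration, the check behind the "not scheduled" warning can be sketched as a comparison between the gap since the main loop last ran and a threshold derived from the token timeout. The 2400 ms threshold in the log is consistent with 80% of a 3000 ms token timeout, but both the 80% factor and the 3000 ms value are assumptions here (the thread states neither), and the function names are hypothetical, not Corosync's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed rule: the warning threshold is 80% of the token timeout.
 * Integer arithmetic (x * 4 / 5) avoids floating-point rounding. */
int64_t sched_threshold_ms(int64_t token_timeout_ms)
{
    assert(token_timeout_ms > 0);
    return token_timeout_ms * 4 / 5;
}

/* True when the main loop went unscheduled for longer than the threshold.
 * Both timestamps are expected to come from a monotonic clock. */
bool sched_gap_exceeded(int64_t last_run_ms, int64_t now_ms,
                        int64_t token_timeout_ms)
{
    return (now_ms - last_run_ms) > sched_threshold_ms(token_timeout_ms);
}
```

Under these assumptions, a gap of about 4317 ms against a 3000 ms token timeout exceeds the 2400 ms threshold, which matches the warning quoted above.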
>>> After some extensive tracing, I think the problem lies elsewhere: my
>>> IPMI watchdog device is slow beyond imagination.
Just for my understanding: You are using watchdog-handling in corosync?
>> Confirmed: setting watchdog_device: off cluster wide got rid of the
>> above warnings.
> Yep, good you found the issue. This is perfectly possible if ioctl
>>> Its ioctl operations can take seconds, starving all other functions.
>>> At least, it seems to block the main thread of Corosync. Is this a
>>> plausible scenario? Corosync has two threads, what are their roles?
> First (main) thread is basically doing almost everything. There is a
> main loop (epoll) I've described in previous mail.
> Second thread is created by libqb and it's used only for logging. This
> is to prevent blocking of corosync when syslog/file log write blocks
> for some reason. It means some messages may be lost but it's still
> better than blocking.
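The "drop rather than block" trade-off described above can be sketched with a small bounded queue: the main thread pushes messages and never waits for free space, while a separate logging thread would pop them and perform the slow write outside the lock. This is only an illustration of the idea, not libqb's actual implementation; all names and sizes are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define LOGQ_SLOTS   4      /* hypothetical capacity */
#define LOGQ_MSG_MAX 128    /* hypothetical maximum message length */

struct logq {
    pthread_mutex_t lock;
    char msgs[LOGQ_SLOTS][LOGQ_MSG_MAX];
    unsigned head;          /* index of the oldest message */
    unsigned count;         /* number of queued messages */
    unsigned long dropped;  /* messages lost because the queue was full */
};

void logq_init(struct logq *q)
{
    assert(q != NULL);
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
}

/* Called by the main thread. Never waits for space: when the queue is
 * full the message is counted as dropped and false is returned. */
bool logq_push(struct logq *q, const char *msg)
{
    bool ok = false;

    pthread_mutex_lock(&q->lock);
    if (q->count < LOGQ_SLOTS) {
        unsigned tail = (q->head + q->count) % LOGQ_SLOTS;
        snprintf(q->msgs[tail], LOGQ_MSG_MAX, "%s", msg);
        q->count++;
        ok = true;
    } else {
        q->dropped++;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Called by the logging thread. Copies the oldest message out so that the
 * slow syslog/file write can happen after the lock has been released. */
bool logq_pop(struct logq *q, char *out, size_t out_len)
{
    bool ok = false;

    pthread_mutex_lock(&q->lock);
    if (q->count > 0) {
        snprintf(out, out_len, "%s", q->msgs[q->head]);
        q->head = (q->head + 1) % LOGQ_SLOTS;
        q->count--;
        ok = true;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}
```

The lock is only held for a short copy, so a stalled log target costs the main thread at most a dropped message.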
> Back to the problem you have. It's definitely a HW issue but I'm thinking
> how to solve it in software. Right now, I can see two ways:
> 1. Set dog FD to be non blocking right at the end of setup_watchdog -
> This is preferred but I'm not sure if it's really going to work.
> 2. Create thread which makes sure to tickle wd regularly.
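The two options above could look roughly like the sketch below. It is not Corosync code: every function and struct name is hypothetical. For option 1, whether a watchdog driver's ioctl honors O_NONBLOCK at all is driver-specific, which fits the doubt expressed above. For option 2, the sketch adds one assumption of its own: the thread only tickles the device while the main loop keeps reporting progress, so that a hung main loop still lets the watchdog fire.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

/* Option 1: switch an already-open fd to non-blocking mode. */
int wd_set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Option 2: state shared by the main loop and the tickler thread. */
struct wd_tickler {
    int fd;                           /* watchdog fd passed to tickle() */
    int interval_ms;                  /* pause between tickle attempts */
    int max_stall_ms;                 /* stop tickling after this much silence */
    void (*tickle)(int fd);           /* wd_tickle_ioctl or a stand-in */
    _Atomic int64_t last_progress_ms; /* updated by the main loop */
    atomic_bool stop;
    pthread_t thread;
};

static int64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Real tickle. On a slow device this may take seconds, but only the
 * tickler thread waits for it. */
void wd_tickle_ioctl(int fd)
{
    int dummy = 0;
    (void)ioctl(fd, WDIOC_KEEPALIVE, &dummy);
}

/* Dry-run stand-in: counts tickles instead of touching hardware. */
atomic_int wd_dry_run_count;
void wd_tickle_dry_run(int fd)
{
    (void)fd;
    atomic_fetch_add(&wd_dry_run_count, 1);
}

/* The main loop calls this once per iteration. */
void wd_tickler_progress(struct wd_tickler *t)
{
    atomic_store(&t->last_progress_ms, now_ms());
}

static void *wd_tickler_main(void *arg)
{
    struct wd_tickler *t = arg;
    struct timespec nap = { t->interval_ms / 1000,
                            (t->interval_ms % 1000) * 1000000L };

    while (!atomic_load(&t->stop)) {
        /* Tickle only while the main loop has reported progress recently. */
        if (now_ms() - atomic_load(&t->last_progress_ms) <= t->max_stall_ms)
            t->tickle(t->fd);
        nanosleep(&nap, NULL);
    }
    return NULL;
}

int wd_tickler_start(struct wd_tickler *t)
{
    assert(t != NULL && t->tickle != NULL && t->interval_ms > 0);
    atomic_store(&t->stop, false);
    wd_tickler_progress(t);
    return pthread_create(&t->thread, NULL, wd_tickler_main, t);
}

void wd_tickler_stop(struct wd_tickler *t)
{
    atomic_store(&t->stop, true);
    pthread_join(t->thread, NULL);
}
```

The main loop would call wd_tickler_progress() on every iteration. A thread that tickles unconditionally would keep the watchdog fed even when the main thread is stuck, which defeats the purpose of having one.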