[Pacemaker] Pacemaker/corosync freeze

Jan Friesse jfriesse at redhat.com
Fri Mar 14 08:39:34 UTC 2014


Attila Megyeri wrote:
> Hi Honza,
> 
> What I also found in the log related to the freeze at 12:22:26:
> 
> 
> Corosync main process was not scheduled for XXXX... Can it be the general cause of the issue?
> 

I don't think it will cause the issue you are hitting, BUT keep in mind
that if corosync is not scheduled for a long time, the node will probably
be fenced by another node. So increasing the token timeout is vital.
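For example, in corosync.conf (the numbers are only illustrative; the
4000 ms threshold in your log appears to be 80% of the configured token
timeout, so pick a value comfortably above the longest pause you see):

    totem {
        version: 2
        # Token timeout in milliseconds (corosync 2.x default: 1000).
        # The log shows a 6327 ms scheduling pause against a 4000 ms
        # threshold, so e.g. 10000 leaves decent headroom.
        token: 10000
    }

The new value has to reach all nodes, e.g. via corosync-cfgtool -R (if
your build supports runtime reload) or a restart.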

Honza

> 
> 
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:58597->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:47943->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: [10.9.1.5]:59647->[10.9.1.3]:161
> 
> 
> Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was not scheduled for 6327.5918 ms (threshold is 4000.0000 ms). Consider token timeout increase.
> 
> 
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the OPERATIONAL state.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming new configuration.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 2(The token was lost in the OPERATIONAL state.).
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token because I am the rep.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high seq received 6a8c
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for ring 7dc
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:
> 
> ....
> 
> 
> Regards,
> Attila
> 
>> -----Original Message-----
>> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
>> Sent: Thursday, March 13, 2014 2:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>>> -----Original Message-----
>>> From: Attila Megyeri [mailto:amegyeri at minerva-soft.com]
>>> Sent: Thursday, March 13, 2014 1:45 PM
>>> To: The Pacemaker cluster resource manager; Andrew Beekhof
>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>
>>> Hello,
>>>
>>>> -----Original Message-----
>>>> From: Jan Friesse [mailto:jfriesse at redhat.com]
>>>> Sent: Thursday, March 13, 2014 10:03 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>> ...
>>>>
>>>>>>>>
>>>>>>>> Also, can you please try setting debug: on in corosync.conf and
>>>>>>>> then paste the full corosync.log?
>>>>>>>
>>>>>>> I set debug to on and did a few restarts, but could not reproduce
>>>>>>> the issue yet - I will post the logs as soon as I manage to
>>>>>>> reproduce it.
>>>>>>>
>>>>>>
>>>>>> Perfect.
>>>>>>
>>>>>> Another option you can try to set is netmtu (1200 is usually safe).
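>>>>>> For example, in the totem section of corosync.conf (merge this with
>>>>>> your existing totem settings rather than adding a second section):
>>>>>>
>>>>>>     totem {
>>>>>>         netmtu: 1200
>>>>>>     }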
>>>>>
>>>>> Finally I was able to reproduce the issue.
>>>>> I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
>>>>> (not when the node came back up).
>>>>>
>>>>> The corosync log with debug on is available at:
>>>>> http://pastebin.com/kTpDqqtm
>>>>>
>>>>>
>>>>> To be honest, I had to wait much longer to reproduce it this time,
>>>>> even though there was no change in the corosync configuration - just
>>>>> possibly some system updates. But anyway, the issue is unfortunately
>>>>> still there.
>>>>> Previously, when this issue occurred, CPU was at 100% on all nodes -
>>>>> this time only on ctmgr, which was the DC...
>>>>>
>>>>> I hope you can find some useful details in the log.
>>>>>
>>>>
>>>> Attila,
>>>> what seems interesting is:
>>>>
>>>> Configuration ERRORs found during PE processing.  Please run
>>>> "crm_verify -L" to identify issues.
>>>>
>>>> I'm not sure how much of a problem this is, but I'm really not a
>>>> pacemaker expert.
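>>>> (crm_verify -L checks the live CIB, and -V can be repeated for more
>>>> verbose output, so something like this on the DC should show the
>>>> details:
>>>>
>>>>     crm_verify -L -VV
>>>> )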
>>>
>>> Perhaps Andrew could comment on that. Any idea?
>>>
>>>
>>>>
>>>> Anyway, I have a theory about what may be happening, and it looks
>>>> related to IPC (and probably not related to the network). But to make
>>>> sure we are not trying to fix an already-fixed bug, can you please
>>>> build (rough build steps below):
>>>> - a new libqb (0.17.0) - there are plenty of IPC fixes in it
>>>> - corosync 2.3.3 (also plenty of IPC fixes)
>>>> - and maybe also a newer pacemaker
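>>>> (All three use the usual autotools flow; roughly, from a release
>>>> tarball or a git checkout - adjust the prefix to match where your
>>>> distro packages install things:
>>>>
>>>>     ./autogen.sh          # only needed for a git checkout
>>>>     ./configure --prefix=/usr
>>>>     make
>>>>     sudo make install && sudo ldconfig
>>>> )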
>>>>
>>>
>>> I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
>>> from the Ubuntu package.
>>> I am currently building libqb 0.17.0 and will update you on the
>>> results.
>>>
>>> In the meantime we had another freeze, which did not seem to be
>>> related to any restarts, but brought all corosync processes to 100%.
>>> Please check out the corosync.log, perhaps it is a different cause:
>>> http://pastebin.com/WMwzv0Rr
>>>
>>>
>>> In the meantime I will install the new libqb and send logs if we have
>>> further issues.
>>>
>>> Thank you very much for your help!
>>>
>>> Regards,
>>> Attila
>>>
>>
>> One more question:
>>
>> If I install libqb 0.17.0 from source, do I need to rebuild corosync as
>> well, or will it be fine if corosync was built against libqb 0.16.0?
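>> (I suppose I can check which libqb corosync actually loads with
>> something like this - the binary path may differ on other systems:
>>
>>     ldd /usr/sbin/corosync | grep libqb
>>
>> Both versions seem to ship the same soname, libqb.so.0, so hopefully
>> the dynamic linker just picks up the new library without a rebuild,
>> as long as it is installed where ldconfig finds it first.)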
>>
>> BTW, in the meantime I installed the new libqb on 3 of the 7 hosts, so I can
>> see if it makes a difference. If I see crashes on the outdated ones, but not on
>> the new ones, we are fine. :)
>>
>> Thanks,
>>
>> Attila
>>
>>
>>
>>
>>
>>
>>
>>>
>>>
>>>> I know you were not very happy about using hand-compiled sources, but
>>>> please at least give them a try.
>>>>
>>>> Thanks,
>>>>   Honza
>>>>
>>>>> Thanks,
>>>>> Attila
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>   Honza
>>>>>>
>>>>>>>
>>>>>>> There are also a few things that might or might not be related:
>>>>>>>
>>>>>>> 1) Whenever I want to edit the configuration with "crm configure
>>>>>>> edit",
>>>>
>>>> ...
>>>>
>>>
>>
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 