[ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

Thu Sep 8 07:08:19 UTC 2016

If the DC (crm daemon) is frozen while corosync keeps running without problems, the DC will not time out. The frozen DC will stay in place indefinitely.
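
To illustrate what a local check for this situation might look like (for example inside a watcher, as Klaus suggests further down in the thread), here is a minimal Python sketch. It only covers the SIGSTOP case from my test: on Linux, a process stopped by a signal shows state "T" in /proc/<pid>/stat. It does not detect CPU starvation or a hang in I/O. The function names are hypothetical and are not part of Pacemaker or sbd, and how to find the crmd PID is left to the caller.

```python
import os


def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so split only after the last ')'.
    fields_after_comm = stat_line[stat_line.rfind(")") + 1:].split()
    return fields_after_comm[0]


def is_sigstopped(pid: int) -> bool:
    """True if the process is currently stopped by a signal (state 'T').

    Assumption: Linux with /proc mounted. Raises OSError if the PID
    does not exist.
    """
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_proc_state(stat_file.read()) == "T"


if __name__ == "__main__":
    # Hypothetical usage: check this script's own PID; a real check
    # would be given the PID of the crmd process instead.
    print(is_sigstopped(os.getpid()))
```

A check like this could only report the condition; what the cluster should do about a stopped DC (for example fencing the node) is a separate question.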

Regards,
Shermal Fernando

-----Original Message-----
From: Ulrich Windl [mailto:Ulrich.Windl at rz.uni-regensburg.de] 
Sent: Thursday, September 08, 2016 12:18 PM
To: users at clusterlabs.org
Subject: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

>>> Shermal Fernando <shermalfe at millenniumit.com> wrote on 08.09.2016 at 06:41
>>> in message
<8CE6E8D87F896546B9C65ED80D30A4336578CB4A at LG-SPMB-MBX02.lseg.stockex.local>:
> The whole cluster will fail if the DC (crm daemon) is frozen due to 
> CPU starvation or a hang during an I/O operation.
> Please share some thoughts on this issue.

What do you mean by "the whole cluster will fail"? If the DC times out, some recovery will take place.

> 
> Regards,
> Shermal Fernando
> 
> -----Original Message-----
> From: Klaus Wenninger [mailto:kwenning at redhat.com]
> Sent: Monday, September 05, 2016 6:42 PM
> To: users at clusterlabs.org; developers at clusterlabs.org
> Subject: Re: [ClusterLabs] When the DC crmd is frozen, cluster 
> decisions are delayed infinitely
> 
> On 09/03/2016 08:42 PM, Shermal Fernando wrote:
>>
>> Hi,
>>
>> Currently our system has 99.96% uptime, but our goal is to increase 
>> it beyond 99.999%. We are now studying the 
>> reliability, performance and features of Pacemaker to replace the existing 
>> clustering solution.
>>
>> While testing Pacemaker, I have encountered a problem. If the DC (crm
>> daemon) is frozen by sending it the SIGSTOP signal, the crmds on the other 
>> machines never start an election to choose a new DC. Therefore 
>> failovers, resource restarts and other cluster decisions are 
>> delayed until the DC is unfrozen.
>>
>> Is this the default behavior of pacemaker or is it due to a 
>> misconfiguration? Is there any way to avoid this single point of failure?
>>
>> For the testing, we use Pacemaker 1.1.12 with Corosync 2.3.3 on the SLES
>> 12 SP1 operating system.
>>
> 
> Guess I can reproduce that with pacemaker 1.1.15 & corosync 2.3.6.
> I'm having sbd with pacemaker-watcher running as well on the nodes.
> As the node health is not updated and the CIB can be read, sbd is happy, 
> as is to be expected.
> Maybe we could at least add something into sbd-pacemaker-watcher to 
> detect the issue ... thinking ...
> 
> Regards,
> Klaus
> 
>> Regards,
>>
>> Shermal Fernando
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org 
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org Getting started: 
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org