[ClusterLabs] When the DC crmd is frozen, cluster decisions are delayed infinitely

Shermal Fernando shermalfe at millenniumit.com
Thu Sep 8 00:41:12 EDT 2016


The whole cluster will fail if the DC (crm daemon) freezes because of CPU starvation or hangs while trying to perform an I/O operation.
Please share some thoughts on this issue.

Regards,
Shermal Fernando

-----Original Message-----
From: Klaus Wenninger [mailto:kwenning at redhat.com] 
Sent: Monday, September 05, 2016 6:42 PM
To: users at clusterlabs.org; developers at clusterlabs.org
Subject: Re: [ClusterLabs] When the DC crmd is frozen, cluster decisions are delayed infinitely

On 09/03/2016 08:42 PM, Shermal Fernando wrote:
>
> Hi,
>
>  
>
> Currently our system has 99.96% uptime, but our goal is to increase
> it beyond 99.999%. We are now studying the
> reliability/performance/features of pacemaker to replace the existing
> clustering solution.
>
>  
>
> While testing pacemaker, I have encountered a problem. If the DC (crm
> daemon) is frozen by sending it the SIGSTOP signal, the crmds on the other
> machines never start an election to choose a new DC. Therefore fail-overs,
> resource restarts and other cluster decisions are delayed until
> the DC is unfrozen.
>
> Is this the default behavior of pacemaker or is it due to a 
> misconfiguration? Is there any way to avoid this single point of failure?
>
>  
>
> For the testing, we use Pacemaker 1.1.12 with Corosync 2.3.3 on the SLES
> 12 SP1 operating system.
>

Guess I can reproduce that with pacemaker 1.1.15 & corosync 2.3.6.
I'm having sbd with pacemaker-watcher running as well on the nodes.
Since the node health is not updated and the CIB can still be read, sbd is happy, as is to be expected.
Maybe we could at least add something into sbd-pacemaker-watcher to detect the issue ... thinking ...
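
For illustration only, here is a minimal Python sketch of how a SIGSTOP-frozen process could be spotted from outside on Linux, by reading the state field of /proc/<pid>/stat. This is an assumption about one possible check, not how sbd or its pacemaker watcher works; the function names parse_proc_state and is_stopped are made up for the example, and a "sleep" child stands in for crmd.

```python
import os
import signal
import subprocess
import time


def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or parentheses, so split after the LAST ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_stopped(pid: int) -> bool:
    """True if the process is in state 'T' (stopped by a signal such as SIGSTOP)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_state(f.read()) == "T"


if __name__ == "__main__":
    # Hypothetical stand-in for crmd: a harmless child we freeze and thaw.
    child = subprocess.Popen(["sleep", "30"])
    try:
        os.kill(child.pid, signal.SIGSTOP)   # freeze, as in the test described above
        time.sleep(0.2)                      # give the kernel a moment to update the state
        print("stopped after SIGSTOP:", is_stopped(child.pid))
        os.kill(child.pid, signal.SIGCONT)   # unfreeze
        time.sleep(0.2)
        print("stopped after SIGCONT:", is_stopped(child.pid))
    finally:
        child.kill()
        child.wait()
```

Note that this only catches the SIGSTOP case. A daemon that is starved of CPU or stuck in an I/O call would show a different state (for example 'R' or 'D'), so a real check would more likely need some kind of liveness probe against the daemon itself.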

Regards,
Klaus

> Regards,
>
> Shermal Fernando
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org





