[ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Thu Sep 8 10:26:40 UTC 2016
On Thu, 8 Sep 2016 09:51:27 +0000
Shermal Fernando <shermalfe at millenniumit.com> wrote:
> Hi Jehan-Guillaume,
>
> Sorry for disturbing you. It is really important for us to pass this test
> of Pacemaker's resiliency and robustness. To my understanding, it is
> pacemakerd that feeds the watchdog. If only the crmd is hung, fencing will
> not work. Am I correct here?
I guess so.
I am talking about a scenario where the server is under high load (fork bomb,
swap storm, ...), not one where only crmd is hung for some reason.
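
To illustrate the difference, here is a toy Python model of a watchdog
timer. It is only a sketch under my own assumptions: the names ToyWatchdog
and first_reset_time and the timings are made up for illustration, and this
is not how Pacemaker or any real watchdog driver is implemented. It only
shows that the timer fires when the feeding process stops being scheduled,
and stays quiet as long as feeds keep arriving.

```python
class ToyWatchdog:
    """Toy model of a watchdog timer (hypothetical, not Pacemaker code)."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.last_fed = now

    def feed(self, now):
        # Feeding resets the countdown.
        self.last_fed = now

    def expired(self, now):
        # The timer fires once no feed has arrived for timeout_s seconds.
        return now - self.last_fed >= self.timeout_s


def first_reset_time(feed_times, timeout_s, horizon_s, step_s=1.0):
    """Return the first time the watchdog would reset the node, or None."""
    wd = ToyWatchdog(timeout_s)
    pending = sorted(feed_times)
    t = 0.0
    while t <= horizon_s:
        # Apply every feed that happened up to the current time.
        while pending and pending[0] <= t:
            wd.feed(pending.pop(0))
        if wd.expired(t):
            return t
        t += step_s
    return None


# Healthy feeder: fed every 2 s, timeout 5 s, so the timer never fires.
healthy = first_reset_time(range(0, 30, 2), timeout_s=5, horizon_s=30)

# Starved feeder: the last feed is at t=8 s, then the node is too loaded.
starved = first_reset_time([0, 2, 4, 6, 8], timeout_s=5, horizon_s=30)
```

In this model, a hang that leaves the feeding process running looks like
the "healthy" case from the watchdog's point of view, which is why a frozen
crmd alone would not trigger it.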
> -----Original Message-----
> From: Jehan-Guillaume de Rorthais [mailto:jgdr at dalibo.com]
> Sent: Thursday, September 08, 2016 3:12 PM
> To: Shermal Fernando
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, cluster
> decisions are delayed infinitely
>
> On Thu, 8 Sep 2016 08:58:15 +0000
> Shermal Fernando <shermalfe at millenniumit.com> wrote:
>
> > Hi Jehan-Guillaume,
> >
> > Does this mean the watchdog will self-terminate the machine when the crm
> > daemon is frozen?
>
> This means that if the machine is under such a load that Pacemaker is not
> able to feed the watchdog, the watchdog will fence the machine itself.
>
> > -----Original Message-----
> > From: Jehan-Guillaume de Rorthais [mailto:jgdr at dalibo.com]
> > Sent: Thursday, September 08, 2016 12:52 PM
> > To: Digimer
> > Cc: Cluster Labs - All topics related to open-source clustering
> > welcomed
> > Subject: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen,
> > cluster decisions are delayed infinitely
> >
> > On Thu, 8 Sep 2016 15:55:50 +0900
> > Digimer <lists at alteeve.ca> wrote:
> >
> > > On 08/09/16 03:47 PM, Ulrich Windl wrote:
> > > > >>>> On 08.09.2016 at 06:41, Shermal Fernando
> > > > >>>> <shermalfe at millenniumit.com> wrote in message
> > > > > <8CE6E8D87F896546B9C65ED80D30A4336578CB4A at LG-SPMB-MBX02.lseg.stockex.local>:
> > > >> The whole cluster will fail if the DC (crm daemon) is frozen due
> > > > >> to CPU starvation or hanging while trying to perform an I/O operation.
> > > >> Please share some thoughts on this issue.
> > > >
> > > > What is "the whole cluster will fail"? If the DC times out, some
> > > > recovery will take place.
> > >
> > > Yup. The starved node should be declared lost by corosync, the
> > > remaining nodes re-form, and if they're still quorate, the hung node
> > > should be fenced. Recovery occurs and life goes on.
> >
> > +1
> >
> > And fencing might come either from outside or from the server
> > itself, using a watchdog.