[Pacemaker] Pacemaker Corosync Issue

Thu Oct 16 05:25:46 EDT 2014

On 16 Oct 2014, at 7:56 pm, Sahil Aggarwal <sahilaggarwalg at gmail.com> wrote:

> Sorry, I didn't get your point, so I am restating the problem:
> 
> Two-node cluster: Node A and Node B.
> 
> Service X is running on Node A; Node B is the DC.
> 
> We are using the corosync stack with Pacemaker.
> failure-timeout is 10 sec.
> target-role is Started.
> 
> Events happen like this:
> 	• Node A reports to Node B that Service X is down
> 	• Node B logs "Ignoring expired failure" for Service X
> 	• After this, Service X is never restarted by the cluster.
> 
> 
> Now questions are:
> 
> 	• Why is Node B (DC) ignoring the expired failure?

Because you told it to
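
To make that concrete, here is a minimal sketch of the expiry check in Python. It is an illustration, not Pacemaker's code: the function name and arguments are hypothetical, and it assumes the check simply compares the failure's recorded timestamp plus failure-timeout against the current time on the DC. With failure-timeout=10, any failure older than 10 seconds is treated as if it had not occurred.

```python
def failure_expired(last_failure_epoch: int, failure_timeout_s: int, now_epoch: int) -> bool:
    """Return True when a recorded failure is older than failure-timeout.

    Hypothetical helper: all three arguments are plain epoch seconds.
    """
    if failure_timeout_s <= 0:
        # Assumption: no timeout configured means failures never expire.
        return False
    return now_epoch > last_failure_epoch + failure_timeout_s


# A failure recorded at t=1000 with a 10 s timeout:
print(failure_expired(1000, 10, 1005))  # False: only 5 s old
print(failure_expired(1000, 10, 1011))  # True: 11 s old, so it is ignored
```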

> 	• Even if the DC ignored it this time, Service X is still down, so Node A should monitor the service and send the failure status to Node B again, at which point Node B should restart the service. Why is this not happening?
> 
> 
> For failure-timeout, my understanding is:
> 
> 	• Node A sends the failure event for Service X to Node B (DC) at time T, the failcount of Service X on Node A reaches INFINITY, and Node A is the only node where Service X can run.
> 	• At T + failure-timeout seconds, Node B (DC) resets the failcount of Service X on Node A to zero and restarts Service X on Node A.
> 
> 
> According to you, Node B will ignore the Service X failure on Node A after failure-timeout seconds. From which point does Node B start counting those seconds?
> 
> 
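As far as I understand it, the seconds are counted from the timestamp recorded with the failed operation, not from when the DC received the event, and expiry is only re-evaluated when the policy engine runs (on a cluster event or every cluster-recheck-interval). The sketch below is an assumption-laden illustration in Python, not Pacemaker code: the names are hypothetical, and it assumes the timestamp comes from Node A's clock while the comparison uses the DC's clock. Under that assumption, the 10 min time difference reported between the nodes would dwarf a 10 s failure-timeout.

```python
FAILURE_TIMEOUT_S = 10   # failure-timeout reported in this thread
CLOCK_SKEW_S = 10 * 60   # reported time difference between the two nodes


def age_seen_by_dc(failure_stamp: int, dc_now: int) -> int:
    """Age of the failure in seconds, as computed on the DC (hypothetical helper)."""
    return dc_now - failure_stamp


def looks_expired(failure_stamp: int, dc_now: int,
                  timeout_s: int = FAILURE_TIMEOUT_S) -> bool:
    """True when the DC would consider the failure older than failure-timeout."""
    return timeout_s > 0 and age_seen_by_dc(failure_stamp, dc_now) > timeout_s


# Node A stamps the failure with its own clock; one real second later the DC
# evaluates it. Which node is ahead is not stated in the thread, so both
# directions are shown.
stamp_on_a = 1_413_450_000              # arbitrary example epoch value
dc_ahead = stamp_on_a + CLOCK_SKEW_S + 1
dc_behind = stamp_on_a - CLOCK_SKEW_S + 1

print(looks_expired(stamp_on_a, dc_ahead))   # True: looks 601 s old at once
print(looks_expired(stamp_on_a, dc_behind))  # False: looks -599 s old
```

If the DC's clock is the one that is ahead, a fresh failure can look expired immediately. Keeping the nodes' clocks synchronised (for example with NTP) removes this variable.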
> 
> On Thu, Oct 16, 2014 at 1:07 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
> On 16 Oct 2014, at 6:33 pm, Sahil Aggarwal <sahilaggarwalg at gmail.com> wrote:
> 
> > Hello,
> >
> > Yes, that log might be due to that reason, but it should not ignore the failure, because it is then taking no action for that resource, i.e. it is not starting the resource.
> 
> it doesn't know that at the time
> 
> >
> > And a second thing:
> >
> > Generally the "ignoring expired failure" log appears as
> >  notice: unpack_rsc_op: Ignoring expired failure Server_last_failure_0
> >
> > but in the case where the service is ignored, the log appears as
> >  notice: unpack_rsc_op: Ignoring expired failure (calculated) Server_last_failure_0
> >
> > This might be a different case.
> 
> possibly in the old code, but the latest has them combined
> 
> >
> > Please suggest.
> >
> >
> >
> > On Thu, Oct 16, 2014 at 2:38 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> > You don't think that might be a little short?
> > Any failure that happened more than 10s ago is going to be ignored, leading to the pengine message you saw.
> >
> > On 16 Oct 2014, at 12:21 am, Sahil Aggarwal <sahilaggarwalg at gmail.com> wrote:
> >
> > > The failure timeout for the resource is 10s.
> > >
> > > On Wed, Oct 15, 2014 at 2:51 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> > >
> > > On 15 Oct 2014, at 4:23 am, Sahil Aggarwal <sahilaggarwalg at gmail.com> wrote:
> > >
> > > >
> > > > Hello Team Pacemaker,
> > > >
> > > > I am facing a persistent issue with Pacemaker: it does not restart the service even when it knows that the service is down. It logs a message saying "Ignoring expired failure" for the service.
> > >
> > > What is the failure timeout set to?
> > >
> > > > The Pacemaker and Corosync versions are given below. The OS is CentOS 6.2.
> > > >
> > > > corosync-1.4.1-4.el6_2.2.x86_64 pacemaker-1.1.9-2.el6.x86_64
> > > >
> > > > The log that pengine provides is:
> > > >
> > > >  pengine[45232]:   notice: unpack_rsc_op: Ignoring expired failure (calculated) Server_last_failure_0 (rc=7, magic=0:7;14:5699:0:459093cc-f3a1-483b-b853-53a1d9791361)
> > > >
> > > > Some more info:
> > > >
> > > > 1. This is a two-node cluster. There is a time difference of 10 min between the two nodes.
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Sahil
> > > > Mobile - 09467607999
> > > > fbAddress-www.facebook.com/SahilAggarwalg
> > >
> >
> 
