[Pacemaker] Moving Resources Due to Failure

Mon Apr 16 01:41:27 EDT 2012

Arnold,

Thanks for the explanation.

Regards,
Raffi

> -----Original Message-----
> From: Arnold Krille [mailto:arnold at arnoldarts.de]
> Sent: Saturday, April 14, 2012 3:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Moving Resources Due to Failure
> 
> On Saturday 14 April 2012 13:24:29 S, MOHAMED ** CTR ** wrote:
> > The Pacemaker_Explained.pdf document says that
> > "setting of migration-threshold=2 and failure-timeout=60s would cause the
> > resource to move to a new node after 2 failures, and allow it to move back
> > (depending on the stickiness and constraint scores) after one minute."
> >
> > Can you please help me understand what will happen in the following
> > scenarios in a 2-node active/passive configuration?
> > 1 - If one resource fails twice within 60s, it will move to the other node.
> > This is clear to understand.
> 
> Yep.
> 
> > 2 - If one resource fails once and there is no further failure within 60s,
> > will Pacemaker reset the failcount of that resource, so that failures are
> > counted afresh? In other words, does the failcount get reset if the
> > migration-threshold is not reached within the failure-timeout period?
> 
> The failcount will get set to zero once the failure-timeout has passed. So
> in your example, the resource can again fail without moving once 60 seconds
> have passed since the last failure.
> Note that a failure means the monitor action didn't finish or didn't return
> "OCF_SUCCESS" (the return code for a running resource) when it was supposed
> to do so. The cluster then stops the resource, increments the failcount and
> then starts the resource again, on the same node if possible, or on a
> different node.
> When that failing resource is in a group, all the dependent resources in
> that group will be stopped and restarted too.
> When the failed resource fails to execute the stop-action, this is a big
> fault crying for fencing of that whole node to get the resource back into a
> sane and known state.
> When the resource fails to start, that by default counts as INFINITY
> (1000000) failures and prevents the resource from starting on that node
> until you as admin clean it up. Or the whole node is fenced due to some
> other circumstance...
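
The failure handling described above can be sketched as a small Python model.
This is only an illustration of the rules in this thread, not Pacemaker code:
the class FailTracker, its method names and the returned action strings are
all hypothetical. It also assumes, following the description above, that an
INFINITY failcount from a failed start stays until a manual cleanup; whether
failure-timeout expires such a count in a real cluster is not modeled. A real
cluster may also notice an expired failcount later than the timeout itself,
since it only re-evaluates on cluster events or at its recheck interval.

```python
from dataclasses import dataclass
from typing import Optional

INFINITY = 1_000_000  # value Pacemaker uses for the score "INFINITY"


@dataclass
class FailTracker:
    """Hypothetical model of one resource's failcount on one node."""
    migration_threshold: int = 2
    failure_timeout: float = 60.0  # seconds
    failcount: int = 0
    last_failure: Optional[float] = None

    def _expire(self, now: float) -> None:
        # Assumption: an INFINITY count (failed start) needs a manual cleanup.
        if self.last_failure is None or self.failcount >= INFINITY:
            return
        if now - self.last_failure >= self.failure_timeout:
            self.failcount = 0
            self.last_failure = None

    def record_failure(self, now: float, operation: str = "monitor") -> str:
        """Register a failed operation and return the modeled recovery action."""
        if operation == "stop":
            # A failed stop leaves the resource in an unknown state.
            return "fence-node"
        self._expire(now)
        if operation == "start":
            self.failcount = INFINITY
        else:
            self.failcount = min(self.failcount + 1, INFINITY)
        self.last_failure = now
        if self.failcount >= self.migration_threshold:
            return "move"
        return "restart-in-place"

    def cleanup(self) -> None:
        """Model of the admin clearing the failcount by hand."""
        self.failcount = 0
        self.last_failure = None


if __name__ == "__main__":
    rsc = FailTracker()
    print(rsc.record_failure(0.0))   # restart-in-place
    print(rsc.record_failure(30.0))  # move (second failure within 60s)
```

With the failures 61 seconds apart instead of 30, the first one has expired by
the time the second arrives, so the model answers "restart-in-place" both
times, which is scenario 2 from the question.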
> 
> Have fun,
> 
> Arnold