[Pacemaker] Question on ILO stonith resource config and restarting

Dejan Muhamedagic dejanmm at fastmail.fm
Wed Nov 5 10:51:56 EST 2008


On Wed, Nov 05, 2008 at 09:03:48AM -0500, Aaron Bush wrote:
> > Note that handling of clones is done on a different level, i.e.
> > by the CRM, which decides where to run resources. The idea of
> > cloned stonith resources was to have "more" assurance that one of
> > the nodes running the stonith resource can shoot the offending
> > node. Obviously, this may make sense only for clusters with more
> > than two nodes. On the other hand, if your stonith devices are
> > reliable and regularly monitored, I don't see any need for
> > shooting a node from more than one node. So, with lights-out
> > devices which are capable of managing only their own host (iLO,
> > IBM RSA, DRAC), I'd suggest having a normal (non-cloned) stonith
> > resource with a -INF constraint to prevent it from running on the
> > node it can shoot. This kind of power management setup seems to
> > be very popular and probably prevails today.
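
For your pair, a minimal sketch could look like this (crm shell
syntax; the external/riloe plugin and the parameter names/values
below are only illustrative -- run "stonith -t <plugin> -n" to see
what your plugin actually expects):

  primitive st-lb02 stonith:external/riloe \
          params hostlist="wwwlb02.microcenter.com" \
                 ilo_hostname="lb02-ilo" ilo_user="Administrator" \
                 ilo_password="secret"
  location st-lb02-not-on-lb02 st-lb02 -inf: wwwlb02.microcenter.com

The location constraint keeps the resource which shoots wwwlb02 off
wwwlb02 itself, so it ends up running on the other node. A mirror
pair (st-lb01 plus its own constraint) covers the other node.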
> > 
> > On larger clusters with stonith devices which may shoot a set of
> > nodes, a single cloned resource should suffice.
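
For completeness, a sketch of that multi-node case: one stonith
resource for a device which can power-cycle several nodes (an APC
power switch, say), cloned so that more than one node can use it.
The apcmastersnmp plugin and its parameters are again illustrative
only:

  primitive st-pdu stonith:apcmastersnmp \
          params ipaddr="10.0.0.5" port="161" community="private"
  clone cl-st-pdu st-pdu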
> > 
> > Does this help? A bit at least?
> 
> Dejan,
> 
> This does help me understand that a cloned stonith resource in a
> simple two-node cluster is probably not necessary.  I will back up
> my config today, try a non-cloned resource to see what the behavior
> is, and report back to the list.
> 
> 
> What I am really trying to ensure is that, if the STONITH resource
> fails to start or monitor, the cluster will keep retrying it.  I
> want to avoid the situation where a node is online again after a
> brief network outage and is capable of running resources, but is
> not able to shoot its partner.  I wasn't sure if this was actually
> a bug or more of a configuration/operational/understanding issue
> on my part.
> 
> To add more information on the issue: following some of Tak's
> comments, I took a look at the fail-count for the resource, and it
> is at INFINITY (after the test failures):
> 
> # crm_failcount -G -r cl_stonith_lb02:0
>  name=fail-count-cl_stonith_lb02:0 value=INFINITY

Yes, that would probably prevent the resource from starting on
that node.
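
If the failures you injected were start failures, that would also
explain the INFINITY value: by default a failed start is treated as
fatal on that node (the start-failure-is-fatal cluster property,
assuming a reasonably recent Pacemaker). If you really want the
cluster to retry starts instead, a sketch:

# crm configure property start-failure-is-fatal="false"

With that set, migration-threshold on the resource decides how many
failures are tolerated before the node is ruled out.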

> I then cleared the failcount and made sure it took...
> 
> # crm_failcount -D -r cl_stonith_lb02:0
> # crm_failcount -G -r cl_stonith_lb02:0
>  name=fail-count-cl_stonith_lb02:0 value=0
> 
> And did a cleanup for both nodes:
> 
> # crm_resource -C -r cl_stonith_lb02:0 -H wwwlb01.microcenter.com
> # crm_resource -C -r cl_stonith_lb02:0 -H wwwlb02.microcenter.com
> 
> The stonith resource did restart and appears to be back to normal.
> Just wondering whether this is the correct process to follow in the
> future, whether the retry interval after a failure can be adjusted
> in the CIB, or whether this is a bug?

Can't say without looking at the logs. I think that you should
start a new thread with these concerns. There are also quite a few
discussions on the topic in the list archives. Note that in this
respect stonith resources are treated like any other resource.
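
As for adjusting things in the CIB: because stonith resources are
handled like ordinary resources here, the usual failure-handling
meta attributes apply to them as well. A sketch, assuming your
crm_resource supports --meta, that failure-timeout/migration-threshold
are available in your Pacemaker version, and using your resource
name as it would be after un-cloning (10min and 3 are arbitrary
values):

# crm_resource --meta -r cl_stonith_lb02 -p failure-timeout -v 10min
# crm_resource --meta -r cl_stonith_lb02 -p migration-threshold -v 3

failure-timeout lets old failures expire so that the cluster retries
on its own; migration-threshold caps how many failures are tolerated
before the resource is banned from that node. But again, whether
that is really what you hit should be visible in the logs.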

Thanks,

Dejan

> Thanks for all the help,
> -ab
> 
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at clusterlabs.org
> http://list.clusterlabs.org/mailman/listinfo/pacemaker



