[Pacemaker] Question on ILO stonith resource config and restarting

Aaron Bush abush at microcenter.com
Wed Nov 5 09:03:48 EST 2008


> Note that handling of clones is done on a different level, i.e.
> by the CRM, which decides where to run resources. The idea of
> cloned stonith resources was to have "more" assurance that one of
> the nodes running the stonith resource can shoot the offending
> node. Obviously, this makes sense only for clusters with more
> than two nodes. On the other hand, if your stonith devices are
> reliable and regularly monitored, I don't see any need for
> shooting a node from more than one node. So, with lights-out
> devices which can manage only their own host (iLO, IBM
> RSA, DRAC), I'd suggest a normal (non-cloned) stonith
> resource with a -INF constraint to prevent it from running on the
> node it can shoot. This kind of power management setup seems to
> be very popular and probably prevails today.
> 
> On larger clusters with stonith devices which may shoot a set of
> nodes, a single cloned resource should suffice.
> 
> Does this help? A bit at least?

Dejan,

This does help me understand that a cloned stonith resource in a simple
two-node cluster is probably not necessary.  I will back up my
configuration today, try a non-cloned resource along the lines of the
sketch below, and report the behavior back to the list.


What I am really trying to ensure is that when the STONITH resource
fails to start or to monitor, the cluster will keep retrying it.  I want
to avoid the situation where a node is online again after a brief
network outage and is capable of running resources but is not able to
shoot its partner.  I wasn't sure whether this was actually a bug or a
configuration/operational/understanding issue on my part.

To add more information on the issue: following some of Tak's comments,
I took a look at the fail-count for the resource, and it is at INFINITY
(after the test failures):

# crm_failcount -G -r cl_stonith_lb02:0
 name=fail-count-cl_stonith_lb02:0 value=INFINITY

I then cleared the fail-count and made sure the change took...

# crm_failcount -D -r cl_stonith_lb02:0
# crm_failcount -G -r cl_stonith_lb02:0
 name=fail-count-cl_stonith_lb02:0 value=0

Then I did a cleanup on both nodes:

# crm_resource -C -r cl_stonith_lb02:0 -H wwwlb01.microcenter.com
# crm_resource -C -r cl_stonith_lb02:0 -H wwwlb02.microcenter.com

The stonith resource did restart and appears to be back to normal.  Is
this the correct process to follow in the future, can the retry interval
after a failure be adjusted in the CIB, or is this a bug?
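
If it is configurable, my guess from the docs is that a failure-timeout
meta attribute (Pacemaker 1.0) would let the fail-count expire so the PE
retries on its own, and that start-failure-is-fatal is why a failed
start pins the fail-count at INFINITY.  An untested sketch, again
assuming 1.0-style CIB XML and made-up ids:

    <!-- let recorded failures expire after 5 minutes so the
         resource becomes eligible to be retried automatically -->
    <meta_attributes id="st_lb02-meta">
      <nvpair id="st_lb02-failure-timeout" name="failure-timeout"
              value="300s"/>
    </meta_attributes>

    <!-- by default a start failure sets fail-count to INFINITY;
         relaxing this makes start failures count like any other -->
    <cluster_property_set id="cib-bootstrap-options">
      <nvpair id="opt-sfif" name="start-failure-is-fatal"
              value="false"/>
    </cluster_property_set>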

Thanks for all the help,
-ab



