[Pacemaker] Exec Failure issues.

Fri Oct 21 07:35:30 EDT 2011

On 2011-10-19 11:56, James Horsfall (CTR) wrote:
> The allow-migrate was just something I was trying, and as you point out
> it produced that "unimplemented feature" error. Without the allow-migrate
> option I just get an "exec timeout error", which is doubly frustrating:
> it's simply failing to unload the IP addresses from the failed
> node. It knows it's supposed to migrate,

Correct.

> it starts to

Semi-correct. It only attempts to shut down the resource on one node. It
never tries to start on the other.

> and then goes nuts

Far from it.

> into the error timeout "unmanaged" state.

Working perfectly as designed. If a resource fails to stop, the cluster
has to assume that the resource is still active on that node and still
has access to shared resources. It is thus unsafe to recover the
resource on another node, and the cluster freezes that resource.

> At that point I have to
> essentially restart the nodes to clear the error or clear the fail
> counts from the "IPS" element just to watch it explode all over again.
> 
> The setup of stonith seems extreme by its description.

STONITH serves three purposes:

1. remove a node from the cluster that has stopped responding;
2. remove a node from the cluster that is not relinquishing resources
   despite being told to;
3. lock access to shared resources while said removal is pending.

You're hitting #2, and you don't have STONITH configured, which you should.

> Will this
> properly bring nodes back online after cables are re-plugged?

Of course.

> I'm not
> sharing data I'm just setting up a network pass through server or just a
> router in a sense.

Note, what follows is not safe for general use. If anyone pulls this
from the list archives and implements it in their own cluster, then
don't blame me for any unexpected results including a meteor hit, unless
you've asked at http://www.hastexo.com/help and we've actually given you
a green light that this is OK to use.

See
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-operations.html.
What you can do -- but let me reiterate, my recommendation is strongly
against this, you really should set up STONITH instead -- is to add
"on-fail=ignore" to the definition of your "stop" operation for those
resources. The cluster will then pretend that a failed stop never
happened and start the resource elsewhere, which will, of course,
possibly lead to duplicate IP addresses flying around your cluster.
Have I mentioned that I recommend against this? Well, I recommend
against this.
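
For illustration only, here is roughly what such a "stop" operation
could look like in crm shell syntax. This is a sketch under
assumptions, not your actual configuration: the resource name
"ip-passthrough", the address, the netmask and the timeouts are all
made up, and only the on-fail="ignore" part reflects the setting
discussed above.

```
# Hypothetical resource name, address and timeouts; adjust to your setup.
# on-fail="ignore" on the stop op is the unsafe setting described above.
primitive ip-passthrough ocf:heartbeat:IPaddr2 \
    params ip="192.0.2.10" cidr_netmask="24" \
    op monitor interval="10s" timeout="20s" \
    op stop interval="0" timeout="20s" on-fail="ignore"
```

Without an explicit on-fail, a failed stop leads to fencing when STONITH
is enabled and to the blocked resource you are seeing when it is not.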

Note also that I've never seen an exec timeout error in the stop op for
IPaddr2 in the wild, except for really trivial setup errors, and I've
deployed or reviewed scores of clusters using it. So another possibility
is that your testing procedure is so far off anything that would ever
happen in the real world that you're simply creating an uncaught error
condition.

I'd have to look at the logs to be sure, though; the procedure for
submitting those is explained at the "help" URL I mentioned above.

Hope this helps,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now