[ClusterLabs] Antw: Re: Why Do All The Services Go Down When Just One Fails?

Andrei Borzenkov arvidjaar at gmail.com
Wed Feb 20 23:50:58 EST 2019


20.02.2019 21:51, Eric Robinson wrote:
> 
> The following should display correctly in a fixed font like Consolas. The setup below is supposed to be possible, and is even referenced in the ClusterLabs documentation.
> 
> +--------------+
> |   mysql001   +--+
> +--------------+  |
> +--------------+  |
> |   mysql002   +--+
> +--------------+  |
> +--------------+  |   +-------------+   +------------+   +----------+
> |   mysql003   +----->+ floating ip +-->+ filesystem +-->+ blockdev |
> +--------------+  |   +-------------+   +------------+   +----------+
> +--------------+  |
> |   mysql004   +--+
> +--------------+  |
> +--------------+  |
> |   mysql005   +--+
> +--------------+
> 
> In the layout above, the MySQL instances depend on the same underlying service stack, but they do not depend on each other. Therefore, as I understand it, the failure of one MySQL instance should not cause the failure of the other MySQL instances if on-fail=ignore or on-fail=stop is set. At least, that's the way it seems to me, but based on the thread, I guess it does not behave that way.
> 
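
For reference, the dependency chain in a layout like that is usually
expressed with ordering and colocation constraints, roughly as in the
sketch below (pcs syntax; the resource names are only placeholders
matching the diagram):

  # stack: blockdev -> filesystem -> floating ip (placeholder names)
  pcs constraint order start blockdev then start filesystem
  pcs constraint colocation add filesystem with blockdev INFINITY
  pcs constraint order start filesystem then start floating_ip
  pcs constraint colocation add floating_ip with filesystem INFINITY

  # each mysql instance depends on the stack, not on its siblings
  pcs constraint order start floating_ip then start mysql001
  pcs constraint colocation add mysql001 with floating_ip INFINITY
  # ...repeat for mysql002 .. mysql005

With constraints like these there is indeed no constraint between the
mysql resources themselves; they only share the floating ip / filesystem /
blockdev chain.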

For the monitor operation it does work the way you expect if you set
on-fail=block: the failed resource is left "as is". The only case where it
does not seem to work is the stop operation; even with an explicit
on-fail=block, pacemaker still attempts to initiate follow-up actions. I
still consider this a bug.
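
As a sketch, on-fail is an operation option, so with pcs it would be set
on the monitor operation itself, something like this (placeholder resource
name and interval):

  # leave a resource that fails its monitor "as is" instead of recovering it
  pcs resource update mysql001 op monitor interval=30s on-fail=block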

If this is not a bug, it needs a clear explanation in the documentation.

But please understand that, assuming on-fail=block works, you effectively
reduce your cluster to a controlled start of resources during boot. As we
have seen, stopping the IP resource is blocked, which means pacemaker
cannot perform resource-level recovery at all. For the mysql resources you
explicitly ignore any monitor result or any failure to stop them. And not
having stonith also prevents pacemaker from handling node failure. What is
left is, at most, a restart of resources on another node during a graceful
shutdown.
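
In practice recovery then becomes a manual job: once a resource is blocked,
an administrator has to investigate and clear the failure by hand, along
these lines (placeholder resource name):

  # inspect the current state and failures
  crm_mon -1
  # tell pacemaker to forget the failure and manage the resource again
  pcs resource cleanup mysql001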

Which begs the question: what do you need such a "cluster" for at all?


