[ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

Fri Apr 22 13:58:19 EDT 2016

On 04/22/2016 08:57 AM, Klaus Wenninger wrote:
> On 04/22/2016 03:29 PM, John Gogu wrote:
>> Hello community,
>> I am facing the following situation with a two-node Pacemaker DB cluster
>> (3 resources configured in the cluster: 1 MySQL DB resource, 1
>> Apache resource, 1 IP resource):
>> every 61 seconds a MySQL monitoring action is started, with a
>> 1200 sec timeout.
> You can increase the timeout for monitoring.
>>
>> In some situations, due to high load on the machines, the monitoring action
>> runs into a timeout, and the cluster performs a failover even though
>> the DB is up and running. Do you have a hint on how monitoring actions
>> can be prioritized automatically?
>>
> Consider that monitoring, at least as part of the action, should check
> whether what your service actually provides is working according to some
> functional and nonfunctional constraints, so as to simulate the experience
> of the consumer of your services. So you probably don't want it to run
> prioritized.
> If you have relaxed the timing requirements of your monitoring to
> something that is acceptable in terms of the definition of the service
> you are providing and you are still running into trouble, the service
> quality you are providing wouldn't be that spiffing either...

Also, you can provide multiple levels of monitoring:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_multiple_monitor_operations

For example, you could provide a very simple check that just makes sure
MySQL is responding on its port, and run that frequently with a low
timeout. And your existing thorough monitor could be run less frequently
with a high timeout.
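
As a rough illustration of the "very simple check" idea, here is a minimal
Python sketch of a probe that only tests whether something accepts TCP
connections on a port, bounded by a short timeout. The function name
port_is_open, the host 127.0.0.1, the 2-second timeout, and the use of
port 3306 (MySQL's default) are assumptions for illustration; this is not
taken from any existing resource agent, and a real agent would still have
to map the result to the appropriate OCF return codes.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves that something is listening on the port; it says
    nothing about whether queries actually work.
    """
    try:
        # create_connection raises an OSError subclass on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage for a local MySQL instance on its default port:
#   healthy = port_is_open("127.0.0.1", 3306, timeout=2.0)
```

Because the probe gives up after the timeout instead of waiting on a full
query, it stays cheap even on a loaded node, while the thorough check keeps
its longer interval and timeout.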

FYI there was a bug related to multiple monitors, apparently introduced
in 1.1.10, such that a higher-level monitor failure might not trigger a
resource failure. It was recently fixed in the upstream master branch,
and the fix will be in the soon-to-be-released 1.1.15-rc1.

>> Thank you and best regards,
>> John