[Pacemaker] Resource-Monitoring with an "On Fail"-Action

Wed Mar 17 05:57:16 EDT 2010

Hi Dominik

The problem is that the cluster does not run the monitor action every
20s. The last time it ran the action was at 09:21, and it is now
10:37:

 MySQL_MonitorAgent_Resource: migration-threshold=3
    + (479) stop: last-rc-change='Wed Mar 17 09:21:28 2010' last-run='Wed Mar 17 09:21:28 2010' exec-time=3010ms queue-time=0ms rc=0 (ok)
    + (480) start: last-rc-change='Wed Mar 17 09:21:31 2010' last-run='Wed Mar 17 09:21:31 2010' exec-time=3010ms queue-time=0ms rc=0 (ok)
    + (481) monitor: interval=10000ms last-rc-change='Wed Mar 17 09:21:34 2010' last-run='Wed Mar 17 09:21:34 2010' exec-time=20ms queue-time=0ms rc=0 (ok)

If I restart the whole cluster, the new return code (exit 99 or
exit 4) is seen by the cluster monitor.
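
To check by hand what the init script's status action returns, something like the following might help. It is only a sketch and assumes the standard LSB meanings for "status" exit codes (0 = running, 3 = not running, 4 = unknown). The helper name lsb_status_meaning is made up for illustration and is not part of Pacemaker or the init script.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker): translate the exit code of
# an LSB init script's "status" action into a short description.
lsb_status_meaning() {
    case "$1" in
        0) echo "running" ;;
        1) echo "dead, pid file exists" ;;
        2) echo "dead, lock file exists" ;;
        3) echo "not running" ;;
        4) echo "status unknown" ;;
        *) echo "reserved or non-standard" ;;   # e.g. 99
    esac
}

# Usage on the cluster node (script path taken from this thread):
#   /etc/init.d/mysql-monitor-agent status
#   lsb_status_meaning $?
```

With "exit 99" in the status branch, the helper would report a reserved or non-standard code, so what the cluster does with it is not defined by the LSB status codes.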

2010/3/17 Dominik Klein <dk at in-telegence.net>:
> Hi Tom
>
> have a look at the logs and see whether the monitor op really returns
> 99. (grep for the resource-id). If so, I'm not sure what the cluster
> does with rc=99. As far as I know, rc=4 would be status=failed (unknown
> actually).
>
> Regards
> Dominik
>
> Tom Tux wrote:
>> Thanks for your hint.
>>
>> I've configured an lsb-resource like this (with migration-threshold):
>>
>> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
>>         meta target-role="Started" migration-threshold="3" \
>>         op monitor interval="10s" timeout="20s" on-fail="restart"
>>
>> I have now modified the init script "/etc/init.d/mysql-monitor-agent"
>> to exit with a return code other than "0" (for example, exit 99) when
>> the monitor operation queries the status. But the cluster does not
>> recognise the failed monitor action. Why does it behave this way? As
>> far as the cluster is concerned, everything seems ok.
>>
>> node1:/ # showcores.sh MySQL_MonitorAgent_Resource
>> Resource                     Score     Node   Stickiness  #Fail  Migration-Threshold
>> MySQL_MonitorAgent_Resource  -1000000  node1  100         0      3
>> MySQL_MonitorAgent_Resource  100       node2  100         0      3
>>
>> I also saw that the "last-run" entry (crm_mon -fort1) for this
>> resource is not up to date. It seems to me that the monitor action
>> does not occur every 10 seconds. Why? Any hints about this behaviour?
>>
>> Thanks a lot.
>> Tom
>>
>>
>> 2010/3/16 Dominik Klein <dk at in-telegence.net>:
>>> Tom Tux wrote:
>>>> Hi
>>>>
>>>> I have a question about resource monitoring:
>>>> I'm monitoring an ip-resource every 20 seconds. I have configured the
>>>> "On Fail" action as "restart". This works fine: if the
>>>> "monitor" operation fails, the resource is restarted.
>>>>
>>>> But how can I configure this resource to migrate to the other node if
>>>> it still fails after 10 restarts? Is this possible? How does
>>>> the "failcount" interact with this scenario?
>>>>
>>>> In the documentation I read that the resource's "fail_count"
>>>> increases every time the resource restarts. But I can't see this
>>>> fail_count.
>>> Look at the meta attribute "migration-threshold".
>>>
>>> Regards
>>> Dominik