[Pacemaker] [SOLVED] Resource-Monitoring with an "On Fail"-Action

Tom Tux tomtux80 at gmail.com
Wed Mar 31 07:16:33 EDT 2010


Hi

From Novell support I received a PTF (Program Temporary Fix) which should
fix this issue.

I think the monitoring is working now, but I'm confused by the output of
the command "crm_mon -t1", which shows the "last-rc-change" and the
"last-run" of the monitor operation. I have defined the monitor operation
for a certain resource to run every 10 seconds, but the "last-run" field in
the "crm_mon -t1" output doesn't change its value. It only changes when the
operation does not return "0" and the failcount is increased. Is this
behaviour correct?
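
For completeness, here is how I am checking the operation history and the
failcount at the moment (a rough sketch: the resource and node names are
the ones from my config, and the log path assumes the default syslog setup
on SLES11):

crm_mon -fort1
crm resource failcount MySQL_MonitorAgent_Resource show node1
grep MySQL_MonitorAgent_Resource /var/log/messages | tail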

Thanks a lot for your help.
Kind regards,
Tom


2010/3/19 Tom Tux <tomtux80 at gmail.com>:
> Hi
>
> Thanks a lot for your help.
>
> So now it's Novell's turn.....:-)
>
> Regards,
> Tom
>
>
> 2010/3/18 Dejan Muhamedagic <dejanmm at fastmail.fm>:
>> Hi,
>>
>> On Thu, Mar 18, 2010 at 02:15:07PM +0100, Tom Tux wrote:
>>> Hi Dejan
>>>
>>> hb_report -V says:
>>> cluster-glue: 1.0.2 (b75bd738dc09263a578accc69342de2cb2eb8db6)
>>
>> Yes, unfortunately that one is buggy.
>>
>>> I've opened a case with Novell. They will fix this problem by updating
>>> to the newest cluster-glue release.
>>>
>>> Could it be that I have another configuration issue in my cluster
>>> config? I think with the following settings, the resource should be
>>> monitored:
>>>
>>> ...
>>> ...
>>> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
>>>         meta migration-threshold="3" \
>>>         op monitor interval="10s" timeout="20s" on-fail="restart"
>>> op_defaults $id="op_defaults-options" \
>>>         on-fail="restart" \
>>>         enabled="true"
>>> property $id="cib-bootstrap-options" \
>>>         expected-quorum-votes="2" \
>>>         dc-version="1.0.6-c48e3360eb18c53fd68bb7e7dbe39279ccbc0354" \
>>>         cluster-infrastructure="openais" \
>>>         stonith-enabled="true" \
>>>         no-quorum-policy="ignore" \
>>>         stonith-action="reboot" \
>>>         last-lrm-refresh="1268838090"
>>> ...
>>> ...
>>>
>>>
>>> And when I look at the last-run time with "crm_mon -fort1", it shows me:
>>>    MySQL_Server_Resource: migration-threshold=3
>>>     + (32) stop: last-rc-change='Wed Mar 17 10:49:55 2010'
>>> last-run='Wed Mar 17 10:49:55 2010' exec-time=5060ms queue-time=0ms
>>> rc=0 (ok)
>>>     + (40) start: last-rc-change='Wed Mar 17 11:09:06 2010'
>>> last-run='Wed Mar 17 11:09:06 2010' exec-time=4080ms queue-time=0ms
>>> rc=0 (ok)
>>>     + (41) monitor: interval=20000ms last-rc-change='Wed Mar 17
>>> 11:09:10 2010' last-run='Wed Mar 17 11:09:10 2010' exec-time=20ms
>>> queue-time=0ms rc=0 (ok)
>>>
>>> And the results above are from yesterday...
>>
>> The configuration looks fine to me.
>>
>> Cheers,
>>
>> Dejan
>>
>>> Thanks for your help.
>>> Kind regards,
>>> Tom
>>>
>>>
>>>
>>> 2010/3/18 Dejan Muhamedagic <dejanmm at fastmail.fm>:
>>> > Hi,
>>> >
>>> > On Wed, Mar 17, 2010 at 12:38:47PM +0100, Tom Tux wrote:
>>> >> Hi Dejan
>>> >>
>>> >> Thanks for your answer.
>>> >>
>>> >> I'm running this cluster with the packages from the HAE (High
>>> >> Availability Extension) repository for SLES11. Given that, is it
>>> >> possible to upgrade cluster-glue from source?
>>> >
>>> > Yes, though I don't think that any SLE11 version has this bug.
>>> > When was your version released? What does hb_report -V say?
>>> >
>>> >> I think the better way is to wait for updates in the HAE repository
>>> >> from Novell. Or do you have experience upgrading cluster-glue from
>>> >> source (even if it was installed with zypper/rpm)?
>>> >>
>>> >> Do you know when the HAE repository will be updated?
>>> >
>>> > Can't say. Best would be if you talk to Novell about the issue.
>>> >
>>> > Cheers,
>>> >
>>> > Dejan
>>> >
>>> >> Thanks a lot.
>>> >> Tom
>>> >>
>>> >>
>>> >> 2010/3/17 Dejan Muhamedagic <dejanmm at fastmail.fm>:
>>> >> > Hi,
>>> >> >
>>> >> > On Wed, Mar 17, 2010 at 10:57:16AM +0100, Tom Tux wrote:
>>> >> >> Hi Dominik
>>> >> >>
>>> >> >> The problem is that the cluster does not run the monitor action every
>>> >> >> 20s. The last time it ran the action was at 09:21, and it is now 10:37:
>>> >> >
>>> >> > There was a serious bug in some cluster-glue packages. What
>>> >> > you're experiencing sounds like that. I can't say which
>>> >> > packages (probably sth like 1.0.1, they were never released). At
>>> >> > any rate, I'd suggest upgrading to cluster-glue 1.0.3.
>>> >> >
>>> >> > Thanks,
>>> >> >
>>> >> > Dejan
>>> >> >
>>> >> >>  MySQL_MonitorAgent_Resource: migration-threshold=3
>>> >> >>     + (479) stop: last-rc-change='Wed Mar 17 09:21:28 2010'
>>> >> >> last-run='Wed Mar 17 09:21:28 2010' exec-time=3010ms queue-time=0ms
>>> >> >> rc=0 (ok)
>>> >> >>     + (480) start: last-rc-change='Wed Mar 17 09:21:31 2010'
>>> >> >> last-run='Wed Mar 17 09:21:31 2010' exec-time=3010ms queue-time=0ms
>>> >> >> rc=0 (ok)
>>> >> >>     + (481) monitor: interval=10000ms last-rc-change='Wed Mar 17
>>> >> >> 09:21:34 2010' last-run='Wed Mar 17 09:21:34 2010' exec-time=20ms
>>> >> >> queue-time=0ms rc=0 (ok)
>>> >> >>
>>> >> >> If I restart the whole cluster, then the new return code (exit 99 or
>>> >> >> exit 4) is seen by the cluster monitor.
>>> >> >>
>>> >> >>
>>> >> >> 2010/3/17 Dominik Klein <dk at in-telegence.net>:
>>> >> >> > Hi Tom
>>> >> >> >
>>> >> >> > have a look at the logs and see whether the monitor op really returns
>>> >> >> > 99 (grep for the resource id). If so, I'm not sure what the cluster
>>> >> >> > does with rc=99. As far as I know, rc=4 would be status=failed (unknown,
>>> >> >> > actually).
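>>> >> >> >
>>> >> >> > For what it's worth, the LSB spec defines the status exit codes
>>> >> >> > (0 = running, 1 = dead but pid file exists, 3 = not running), so for a
>>> >> >> > test I'd let the status action return one of those rather than 99.
>>> >> >> > A rough sketch of such a status branch (the pidfile path is only a
>>> >> >> > guess, adjust it to your script):
>>> >> >> >
>>> >> >> > status)
>>> >> >> >     if [ -f /var/run/mysql-monitor-agent.pid ] && \
>>> >> >> >        kill -0 "$(cat /var/run/mysql-monitor-agent.pid)" 2>/dev/null; then
>>> >> >> >         echo "running"
>>> >> >> >         exit 0    # LSB: service is running
>>> >> >> >     else
>>> >> >> >         echo "not running"
>>> >> >> >         exit 3    # LSB: service is not running
>>> >> >> >     fi
>>> >> >> >     ;;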
>>> >> >> >
>>> >> >> > Regards
>>> >> >> > Dominik
>>> >> >> >
>>> >> >> > Tom Tux wrote:
>>> >> >> >> Thanks for your hint.
>>> >> >> >>
>>> >> >> >> I've configured an lsb-resource like this (with migration-threshold):
>>> >> >> >>
>>> >> >> >> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
>>> >> >> >>         meta target-role="Started" migration-threshold="3" \
>>> >> >> >>         op monitor interval="10s" timeout="20s" on-fail="restart"
>>> >> >> >>
>>> >> >> >> I have now modified the init script "/etc/init.d/mysql-monitor-agent"
>>> >> >> >> to exit with a return code other than "0" (for example, exit 99) when
>>> >> >> >> the monitor operation queries the status. But the cluster does not
>>> >> >> >> recognise a failed monitor action. Why this behaviour? For the cluster,
>>> >> >> >> everything seems ok.
>>> >> >> >>
>>> >> >> >> node1:/ # showcores.sh MySQL_MonitorAgent_Resource
>>> >> >> >> Resource                     Score     Node   Stickiness  #Fail  Migration-Threshold
>>> >> >> >> MySQL_MonitorAgent_Resource  -1000000  node1  100         0      3
>>> >> >> >> MySQL_MonitorAgent_Resource  100       node2  100         0      3
>>> >> >> >>
>>> >> >> >> I also saw that the "last-run" entry (crm_mon -fort1) for this
>>> >> >> >> resource is not up to date. It seems to me that the monitor action does
>>> >> >> >> not occur every 10 seconds. Why? Any hints on this behaviour?
>>> >> >> >>
>>> >> >> >> Thanks a lot.
>>> >> >> >> Tom
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> 2010/3/16 Dominik Klein <dk at in-telegence.net>:
>>> >> >> >>> Tom Tux wrote:
>>> >> >> >>>> Hi
>>> >> >> >>>>
>>> >> >> >>>> I have a question about resource monitoring:
>>> >> >> >>>> I'm monitoring an IP resource every 20 seconds and have configured the
>>> >> >> >>>> "on-fail" action as "restart". This works fine: if the "monitor"
>>> >> >> >>>> operation fails, the resource is restarted.
>>> >> >> >>>>
>>> >> >> >>>> But how can I define this resource to migrate to the other node if it
>>> >> >> >>>> still fails after 10 restarts? Is this possible? How does the
>>> >> >> >>>> "failcount" interact with this scenario?
>>> >> >> >>>>
>>> >> >> >>>> In the documentation I read that the resource "fail_count" increases
>>> >> >> >>>> every time the resource restarts. But I can't see this fail_count.
>>> >> >> >>> Look at the meta attribute "migration-threshold".
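>>> >> >> >>> Something along these lines should do it (an untested sketch; the
>>> >> >> >>> address and the values are only examples, and failure-timeout is
>>> >> >> >>> optional, it lets the failcount expire again after a while):
>>> >> >> >>>
>>> >> >> >>> primitive IP_Resource ocf:heartbeat:IPaddr2 \
>>> >> >> >>>         params ip="192.168.100.10" \
>>> >> >> >>>         meta migration-threshold="10" failure-timeout="600s" \
>>> >> >> >>>         op monitor interval="20s" timeout="20s" on-fail="restart"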
>>> >> >> >>>
>>> >> >> >>> Regards
>>> >> >> >>> Dominik
>>> >> >> >
>>> >> >> >
>



