[Pacemaker] Resource-Monitoring with an "On Fail"-Action

Thu Mar 18 07:59:19 EDT 2010

Hi,

On Wed, Mar 17, 2010 at 12:38:47PM +0100, Tom Tux wrote:
> Hi Dejan
> 
> Thanks for your answer.
> 
> I'm using this cluster with the packages from the HAE
> (High Availability Extension) repository for SLES11. Is it
> therefore possible to upgrade cluster-glue from source?

Yes, though I don't think that any SLE11 version has this bug.
When was your version released? What does hb_report -V say?
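
In case it helps, here is a minimal sketch for comparing the installed
version against 1.0.3, the release suggested earlier in this thread. It
assumes an rpm-based install where the package is named cluster-glue and
a sort that supports -V (GNU coreutils). version_ge is a hypothetical
helper name, not part of any cluster tool.

```shell
# version_ge A B: succeed if version A is greater than or equal to version B.
# Relies on GNU sort -V (version sort); this is an assumption about the host.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (not run here; assumes the package is called cluster-glue):
#   installed=$(rpm -q --qf '%{VERSION}' cluster-glue)
#   version_ge "$installed" 1.0.3 || echo "cluster-glue $installed is older than 1.0.3"
```

The comparison only looks at the upstream version string, so a vendor
package with backported fixes may still report an older number.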

> I think the better
> way is to wait for updates in the HAE repository from Novell. Or do
> you have experience upgrading cluster-glue from source (even if
> it was installed with zypper/rpm)?
> 
> Do you know when the HAE repository will be updated?

Can't say. Best would be if you talk to Novell about the issue.
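
On the init-script test quoted below: as far as I know, the LSB spec
defines the exit codes of the status action as 0 (running), 1 (dead but
pid file exists), 3 (not running) and 4 (status unknown), so those are
more meaningful values to test with than 99. A rough sketch of such a
status check follows. The function name lsb_status and the pid-file
argument are hypothetical, not taken from your script.

```shell
# lsb_status PIDFILE: return an LSB-style status code for a daemon.
#   0 = running, 1 = dead but pid file exists, 3 = not running.
# Assumes the caller is allowed to signal the process (init scripts
# normally run as root); otherwise kill -0 fails and 1 is returned.
lsb_status() {
    pidfile=$1
    if [ ! -f "$pidfile" ]; then
        return 3
    fi
    pid=$(cat "$pidfile")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return 0
    fi
    return 1
}

# Example usage inside an init script's "status" branch (path is made up):
#   lsb_status /var/run/example-agent.pid; exit $?
```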

Cheers,

Dejan

> Thanks a lot.
> Tom
> 
> 
> 2010/3/17 Dejan Muhamedagic <dejanmm at fastmail.fm>:
> > Hi,
> >
> > On Wed, Mar 17, 2010 at 10:57:16AM +0100, Tom Tux wrote:
> >> Hi Dominik
> >>
> >> The problem is that the cluster does not run the monitor action every
> >> 20s. The last time it ran the action was at 09:21, and it is now
> >> 10:37:
> >
> > There was a serious bug in some cluster-glue packages. What
> > you're experiencing sounds like that. I can't say which
> > packages (probably something like 1.0.1; they were never released).
> > At any rate, I'd suggest upgrading to cluster-glue 1.0.3.
> >
> > Thanks,
> >
> > Dejan
> >
> >>  MySQL_MonitorAgent_Resource: migration-threshold=3
> >>     + (479) stop: last-rc-change='Wed Mar 17 09:21:28 2010' last-run='Wed Mar 17 09:21:28 2010' exec-time=3010ms queue-time=0ms rc=0 (ok)
> >>     + (480) start: last-rc-change='Wed Mar 17 09:21:31 2010' last-run='Wed Mar 17 09:21:31 2010' exec-time=3010ms queue-time=0ms rc=0 (ok)
> >>     + (481) monitor: interval=10000ms last-rc-change='Wed Mar 17 09:21:34 2010' last-run='Wed Mar 17 09:21:34 2010' exec-time=20ms queue-time=0ms rc=0 (ok)
> >>
> >> If I restart the whole cluster, then the new return code (exit 99 or
> >> exit 4) is seen by the cluster monitor.
> >>
> >>
> >> 2010/3/17 Dominik Klein <dk at in-telegence.net>:
> >> > Hi Tom
> >> >
> >> > have a look at the logs and see whether the monitor op really returns
> >> > 99. (grep for the resource-id). If so, I'm not sure what the cluster
> >> > does with rc=99. As far as I know, rc=4 would be status=failed (unknown
> >> > actually).
> >> >
> >> > Regards
> >> > Dominik
> >> >
> >> > Tom Tux wrote:
> >> >> Thanks for your hint.
> >> >>
> >> >> I've configured an lsb-resource like this (with migration-threshold):
> >> >>
> >> >> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
> >> >>         meta target-role="Started" migration-threshold="3" \
> >> >>         op monitor interval="10s" timeout="20s" on-fail="restart"
> >> >>
> >> >> I have now modified the init script "/etc/init.d/mysql-monitor-agent"
> >> >> to exit with a return code other than "0" (for example, exit 99) when
> >> >> the monitor operation queries the status. But the cluster does not
> >> >> recognise the failed monitor action. Why is that? As far as the
> >> >> cluster is concerned, everything seems OK.
> >> >>
> >> >> node1:/ # showcores.sh MySQL_MonitorAgent_Resource
> >> >> Resource                     Score     Node   Stickiness  #Fail  Migration-Threshold
> >> >> MySQL_MonitorAgent_Resource  -1000000  node1  100         0      3
> >> >> MySQL_MonitorAgent_Resource  100       node2  100         0      3
> >> >>
> >> >> I also saw that the "last-run" entry (crm_mon -fort1) for this
> >> >> resource is not up to date. It seems to me that the monitor action
> >> >> does not run every 10 seconds. Why? Any hints on this behaviour?
> >> >>
> >> >> Thanks a lot.
> >> >> Tom
> >> >>
> >> >>
> >> >> 2010/3/16 Dominik Klein <dk at in-telegence.net>:
> >> >>> Tom Tux wrote:
> >> >>>> Hi
> >> >>>>
> >> >>>> I have a question about resource monitoring:
> >> >>>> I'm monitoring an IP resource every 20 seconds. I have configured the
> >> >>>> "On Fail" action as "restart". This works fine: if the
> >> >>>> "monitor" operation fails, the resource is restarted.
> >> >>>>
> >> >>>> But how can I configure this resource to migrate to the other node if
> >> >>>> it still fails after 10 restarts? Is this possible? How does
> >> >>>> the "failcount" interact with this scenario?
> >> >>>>
> >> >>>> In the documentation I read that the resource's "fail_count"
> >> >>>> increases every time the resource restarts. But I can't see this
> >> >>>> fail_count.
> >> >>> Look at the meta attribute "migration-threshold".
> >> >>>
> >> >>> Regards
> >> >>> Dominik
> >> >
> >> >
> >> > _______________________________________________
> >> > Pacemaker mailing list
> >> > Pacemaker at oss.clusterlabs.org
> >> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker