[ClusterLabs] Pacemaker shows false status of a resource and doesn't react on OCF_NOT_RUNNING rc.

Tue Jan 19 13:20:07 EST 2016

I pasted the wrong entry from the output of
journalctl --since="2016-01-19" --until="2016-01-20".
The correct one is:

Jan 19 23:42:24 A2-2U12-302-LS ntpd[2204]: 0.0.0.0 c61c 0c clock_step -43194.111405 s
Jan 19 11:42:29 A2-2U12-302-LS ntpd[2204]: 0.0.0.0 c614 04 freq_mode
Jan 19 11:42:29 A2-2U12-302-LS systemd[1]: Time has been changed
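
For scale, the step of -43194.111405 s is just under 12 hours, which is why
the timestamps jump from 23:42 back to 11:42. A minimal Python sketch of that
arithmetic follows. Only the timestamp and the offset come from the log
above; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Offset reported by ntpd in the clock_step line, in seconds.
step = timedelta(seconds=-43194.111405)

# Wall-clock time of the clock_step log entry, before the step was applied.
before = datetime(2016, 1, 19, 23, 42, 24)

# Wall-clock time immediately after the step.
after = before + step

print("step magnitude:", abs(step))
print("clock after step:", after)
```

The computed time falls within the same second as the 11:42:29 entries that
follow the clock_step line.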

Yes, the very first monitor operation was successful.
But I still have a question.
The first monitor op ran at [Jan 19 23:42:16], and Pacemaker
registered the failure at [Jan 19 12:57:53].
So the first one still remains older, and I don't understand why
Pacemaker considers it failed.
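
As a toy illustration of the ordering problem suspected in the quoted reply
below, here is a small Python sketch. It is not Pacemaker's actual logic: the
"trust the latest timestamp" rule and all names are hypothetical, and only the
two timestamps come from the chronology in this thread.

```python
from datetime import datetime

# (wall-clock timestamp, result), listed in the order the events really happened.
events = [
    (datetime(2016, 1, 19, 23, 42, 16), "monitor: ok"),           # before the clock step
    (datetime(2016, 1, 19, 11, 42, 53), "monitor: not running"),  # after the clock step
]

# Hypothetical rule: believe whichever result carries the latest timestamp.
latest = max(events, key=lambda event: event[0])

print("result treated as current:", latest[1])
```

Under that assumed rule, the pre-step success outranks the later failure
because its wall-clock timestamp is about 12 hours ahead.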

Thank you,
Kostia

On Tue, Jan 19, 2016 at 8:02 PM, Ken Gaillot <kgaillot at redhat.com> wrote:

> On 01/19/2016 10:30 AM, Kostiantyn Ponomarenko wrote:
> > The resource that wasn't running, but was reported as running, is
> > "adminServer".
> >
> > Here is a brief chronological description:
> >
> > [Jan 19 23:42:16] The first time Pacemaker triggers its monitor function
> > at line #1107. (those lines are from its Resource Agent)
> > [Jan 19 23:42:16] Then Pacemaker starts the resource - line #1191.
> > [Jan 19 11:42:53] The first failure is reported by monitor operation at
> > line #1543.
> > [Jan 19 11:42:53] The fail-count is set, but I don't see any attempt from
> > Pacemaker to "start" the resource - the start function is not called
> > (from the logs) - line #1553.
> > [Jan 19 12:27:56] Then adminServer's monitor operation keeps returning
> > $OCF_NOT_RUNNING - starts at line #1860.
> > [Jan 19 12:57:53] Then the expired failcount is cleared at line #1969.
> > [Jan 19 12:57:53] Another call of the monitor function happens at line
> > #2038.
> > [Jan 19 12:57:53] I assume that the line #2046 means "not running" (?).
> > [Jan 19 12:57:53] The "stop" function is called - line #2150
> > [Jan 19 12:57:53] The "start" function is called and the resource is
> > successfully started - line #2164
> >
> >
> > The time change occurred while the cluster was starting. I see this from
> > "journalctl --since="2016-01-19" --until="2016-01-20"":
> >
> > Jan 19 23:10:39 A2-2U12-302-LS ntpd[2210]: 0.0.0.0 c61c 0c clock_step -43193.793349 s
> > Jan 19 11:10:45 A2-2U12-302-LS ntpd[2210]: 0.0.0.0 c614 04 freq_mode
> > Jan 19 11:10:45 A2-2U12-302-LS systemd[1]: Time has been changed
> >
> > I am attaching corosync.log.
>
> The time change is interesting. I suspect what's happening is that
> pacemaker considers the failed monitor "older" than the original
> successful one, and so ignores it.
>
> In general, we don't support large clock shifts in a running cluster (a
> topic that has come up before on this list). But if you can reproduce
> the behavior with 1.1.14, feel free to open a bug report. It might be
> worth revisiting to see if there is anything we can do about it.
>
> > Thank you,
> > Kostia
> >
> > On Tue, Jan 19, 2016 at 5:17 PM, Bogdan Dobrelya
> > <bdobrelia at mirantis.com> wrote:
> >
> >> On 19.01.2016 16:13, Ken Gaillot wrote:
> >>> On 01/19/2016 06:49 AM, Kostiantyn Ponomarenko wrote:
> >>>> One of the resources in my cluster is not actually running, but "crm_mon"
> >>>> shows it with the "Started" status.
> >>>> Its resource agent's monitor function returns "$OCF_NOT_RUNNING", but
> >>>> Pacemaker doesn't react to this at all - crm_mon shows the resource as
> >>>> Started.
> >>>> I couldn't find an explanation for this behavior, so I suppose it is a
> >>>> bug, is it?
> >>>
> >>> That is unexpected. Can you post the configuration and logs from around
> >>> the time of the issue?
> >>>
> >>
> >> Oh, sorry, I forgot to mention the related thread [0]. That is exactly
> >> the case I reported there. It looks the same, so I thought you had just
> >> updated my thread :)
> >>
> >> These may be merged perhaps.
> >>
> >> [0] http://clusterlabs.org/pipermail/users/2016-January/002035.html
> >>
> >>>
> >>> _______________________________________________
> >>> Users mailing list: Users at clusterlabs.org
> >>> http://clusterlabs.org/mailman/listinfo/users
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
>