[ClusterLabs] Pacemaker shows false status of a resource and doesn't react on OCF_NOT_RUNNING rc.

Kostiantyn Ponomarenko konstantin.ponomarenko at gmail.com
Tue Jan 19 11:30:46 EST 2016


The resource that wasn't running, but was reported as running, is
"adminServer".

Here is a brief chronological description:

[Jan 19 23:42:16] Pacemaker calls the monitor function for the first time,
at line #1107 (these log lines come from its resource agent).
[Jan 19 23:42:16] Then Pacemaker starts the resource - line #1191.
[Jan 19 11:42:53] The first failure is reported by monitor operation at
line #1543.
[Jan 19 11:42:53] The fail-count is set, but I don't see any attempt by
Pacemaker to start the resource - according to the logs, the start function
is never called - line #1553.
[Jan 19 12:27:56] From then on, adminServer's monitor operation keeps
returning $OCF_NOT_RUNNING, starting at line #1860.
[Jan 19 12:57:53] Then the expired failcount is cleared at line #1969.
[Jan 19 12:57:53] Another call of the monitor function happens at line
#2038.
[Jan 19 12:57:53] I assume that line #2046 means "not running" (?).
[Jan 19 12:57:53] The "stop" function is called - line #2150
[Jan 19 12:57:53] The "start" function is called and the resource is
successfully started - line #2164
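
To help interpret monitor results such as the one at line #2046, here is a small Python sketch that maps numeric return codes to the standard OCF exit-code names (0 = OCF_SUCCESS, 7 = OCF_NOT_RUNNING, and so on). The helper names `describe_ocf_rc` and `rc_from_log_line` are hypothetical, and the assumption that the log line carries the code as `rc=<n>` should be checked against the attached corosync.log.

```python
import re

# Standard OCF resource agent exit codes.
OCF_RC = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}


def describe_ocf_rc(rc: int) -> str:
    """Return the symbolic OCF name for a numeric exit code (hypothetical helper)."""
    return OCF_RC.get(rc, f"unknown ({rc})")


def rc_from_log_line(line: str):
    """Extract the first 'rc=<n>' value from a log line.

    Assumption: the log line contains the code in 'rc=<n>' form.
    Returns None when no such field is present.
    """
    match = re.search(r"\brc=(\d+)", line)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical sample line, not copied from the attached log.
    sample = "adminServer monitor: call=5, rc=7, confirmed=true"
    print(describe_ocf_rc(rc_from_log_line(sample)))
```

If line #2046 carries `rc=7`, it is indeed a "not running" result.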


The time change occurred while the cluster was starting; I can see this in
the output of journalctl --since="2016-01-19" --until="2016-01-20":

Jan 19 23:10:39 A2-2U12-302-LS ntpd[2210]: 0.0.0.0 c61c 0c clock_step -43193.793349 s
Jan 19 11:10:45 A2-2U12-302-LS ntpd[2210]: 0.0.0.0 c614 04 freq_mode
Jan 19 11:10:45 A2-2U12-302-LS systemd[1]: Time has been changed
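
As a quick arithmetic check, the ntpd step of -43193.793349 s is almost exactly 12 hours backwards, which is consistent with the timestamps in the timeline jumping from 23:xx back to 11:xx. This minimal Python sketch uses only the values from the journal excerpt above; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Values copied from the journalctl excerpt.
step_logged_at = datetime(2016, 1, 19, 23, 10, 39)  # time of the clock_step entry
clock_offset = timedelta(seconds=-43193.793349)     # offset reported by ntpd

# Wall-clock time right after ntpd applied the step.
clock_after_step = step_logged_at + clock_offset
step_hours = -clock_offset.total_seconds() / 3600

print(clock_after_step.strftime("%H:%M:%S"))  # matches the 11:10:45 journal entries
print(f"{step_hours:.2f} h backwards")
```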

I am attaching corosync.log (corosync.log_resfailure) and the configuration
(crm_configure_show.out).

Thank you,
Kostia

On Tue, Jan 19, 2016 at 5:17 PM, Bogdan Dobrelya <bdobrelia at mirantis.com>
wrote:

> On 19.01.2016 16:13, Ken Gaillot wrote:
> > On 01/19/2016 06:49 AM, Kostiantyn Ponomarenko wrote:
> > >> One of the resources in my cluster is not actually running, but "crm_mon"
> > >> shows it with the "Started" status.
> > >> Its resource agent's monitor function returns "$OCF_NOT_RUNNING", but
> > >> Pacemaker doesn't react to this at all - crm_mon shows the resource as
> > >> Started.
> > >> I couldn't find an explanation for this behavior, so I suspect it is a
> > >> bug. Is it?
> >
> > That is unexpected. Can you post the configuration and logs from around
> > the time of the issue?
> >
>
> Oh, sorry, I forgot to mention the related thread [0]. That is exactly
> the case I reported there. It looks the same, so I thought you had just
> updated my thread :)
>
> Perhaps these threads could be merged.
>
> [0] http://clusterlabs.org/pipermail/users/2016-January/002035.html
>
> >
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
>
> --
> Best regards,
> Bogdan Dobrelya,
> Irc #bogdando
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log_resfailure
Type: application/octet-stream
Size: 327955 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20160119/9c2cd9b6/attachment-0006.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crm_configure_show.out
Type: application/octet-stream
Size: 2860 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20160119/9c2cd9b6/attachment-0007.obj>

