[ClusterLabs] Pacemaker shows false status of a resource and doesn't react on OCF_NOT_RUNNING rc.

Tue Jan 19 13:04:29 EST 2016

On 01/19/2016 11:02 AM, Kostiantyn Ponomarenko wrote:
> Just in case, this is the monitor function from the resource agent:
> ra_monitor() {
> #   ocf_log info "$RA: [monitor]"
>     systemctl status ${service}
>     rc=$?
>     if [ "$rc" -eq "0" ]; then
>         return $OCF_SUCCESS
>     fi
> 
>     ocf_log warn "$RA: [monitor] : got rc=$rc"
>     return $OCF_NOT_RUNNING
> }

Out of curiosity, why are you wrapping systemctl with OCF when pacemaker
supports systemd resources natively? The native support works around a
number of quirks in systemd behavior. (In fact a recent commit to the
master branch handles yet another one.)
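
If the OCF wrapper has to stay for some reason, here is a rough sketch of a
monitor that does not treat every non-zero status as "not running". It is
only an illustration, not the agent from this thread: the unit name
"adminServer.service" is a guess, the default return-code values are the
usual OCF ones (normally supplied by ocf-shellfuncs), and mapping a failed
unit to OCF_ERR_GENERIC is my assumption about what is wanted.

```shell
#!/bin/sh
# Usual OCF return codes; a real agent gets these from ocf-shellfuncs.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

# Hypothetical unit name; the value of ${service} in the real agent is not shown.
service="${service:-adminServer.service}"

ra_monitor() {
    # is-active exits 0 only when the unit is active; --quiet suppresses output.
    if systemctl is-active --quiet "$service"; then
        return "$OCF_SUCCESS"
    fi
    # Assumption: a unit in the "failed" state is reported as an error,
    # not as a clean stop.
    if systemctl is-failed --quiet "$service"; then
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_NOT_RUNNING"
}
```

Quoting "$service" and using is-active instead of status also keeps the
status text out of the agent's output.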

> Thank you,
> Kostia
> 
> On Tue, Jan 19, 2016 at 6:30 PM, Kostiantyn Ponomarenko <
> konstantin.ponomarenko at gmail.com> wrote:
> 
>> The resource that wasn't running, but was reported as running, is
>> "adminServer".
>>
>> Here is a brief chronological description:
>>
>> [Jan 19 23:42:16] The first time Pacemaker triggers its monitor function
>> at line #1107. (those lines are from its Resource Agent)
>> [Jan 19 23:42:16] Then Pacemaker starts the resource - line #1191.
>> [Jan 19 11:42:53] The first failure is reported by monitor operation at
>> line #1543.
>> [Jan 19 11:42:53] The fail-count is set, but I don't see any attempt from
>> Pacemaker to "start" the resource - the start function is not called (from
>> the logs) - line #1553.
>> [Jan 19 12:27:56] Then adminServer's monitor operation keeps returning
>> $OCF_NOT_RUNNING - starts at line #1860.
>> [Jan 19 12:57:53] Then the expired failcount is cleared at line #1969.
>> [Jan 19 12:57:53] Another call of the monitor function happens at line
>> #2038.
>> [Jan 19 12:57:53] I assume that the line #2046 means "not running" (?).
>> [Jan 19 12:57:53] The "stop" function is called - line #2150
>> [Jan 19 12:57:53] The "start" function is called and the resource is
>> successfully started - line #2164
>>
>>
>> The time change occurred while the cluster was starting. I see this from
>> "journalctl --since="2016-01-19" --until="2016-01-20"":
>>
>> Jan 19 23:10:39 A2-2U12-302-LS ntpd[2210]: 0.0.0.0 c61c 0c clock_step
>> -43193.793349 s
>> Jan 19 11:10:45 A2-2U12-302-LS ntpd[2210]: 0.0.0.0 c614 04 freq_mode
>> Jan 19 11:10:45 A2-2U12-302-LS systemd[1]: Time has been changed
>>
>> I am attaching corosync.log.
>>
>> Thank you,
>> Kostia
>>
>> On Tue, Jan 19, 2016 at 5:17 PM, Bogdan Dobrelya <bdobrelia at mirantis.com>
>> wrote:
>>
>>> On 19.01.2016 16:13, Ken Gaillot wrote:
>>>> On 01/19/2016 06:49 AM, Kostiantyn Ponomarenko wrote:
>>>>> One of the resources in my cluster is not actually running, but "crm_mon"
>>>>> shows it with the "Started" status.
>>>>> Its resource agent's monitor function returns "$OCF_NOT_RUNNING", but
>>>>> Pacemaker doesn't react to this at all - crm_mon shows the resource as
>>>>> Started.
>>>>> I couldn't find an explanation for this behavior, so I suppose it is a
>>>>> bug, is it?
>>>>
>>>> That is unexpected. Can you post the configuration and logs from around
>>>> the time of the issue?
>>>>
>>>
>>> Oh, sorry, I forgot to mention the related thread [0]. That is exactly
>>> the case I reported there. It looks the same, so I thought you've just updated
>>> my thread :)
>>>
>>> These may be merged perhaps.
>>>
>>> [0] http://clusterlabs.org/pipermail/users/2016-January/002035.html