[ClusterLabs Developers] CRM trying to demote a stopped resource

Wed Aug 5 11:06:38 EDT 2015

On 08/05/2015 08:40 AM, Jehan-Guillaume de Rorthais wrote:
> On Wed, 5 Aug 2015 16:37:39 +0300
> Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> 
>> On Wed, Aug 5, 2015 at 4:04 PM, Jehan-Guillaume de Rorthais
>> <jgdr at dalibo.com> wrote:
>>> hi guys,
>>>
>>> We are still on our new postgresql resource agent.
>>>
>>> We have more or less made up our minds about the promotion issue (see the ml
>>> thread "problem with master score limited to 1000000") and found an
>>> acceptable algorithm.
>>>
>>> Now that we are testing this RA, I found a strange behavior of the CRM with a
>>> simple failure scenario: the master resource is stopped.
>>>
>>> When I gracefully stop the master,
>>
>> You mean - stop postgres outside of pacemaker?
> 
> Yes, to simulate a resource failure.
> 
>>>                                                   the CRM tries to recover
>>> the resource with :
>>>
>>> * demote it
>>> * stop it
>>> * start it
>>> * promote it
>>>
>>> Sounds logical, but it fails at the first step because the master is actually
>>> stopped. According to the "ra-dev-guide", the RA should return
>>> OCF_ERR_GENERIC if the resource is stopped on demote. See:
>>>
>>>   http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
>>>
>>> When I teach my RA to follow this, the CRM keeps retrying the same transition
>>> again and again until the failcount reaches the migration-threshold. Then it
>>> stops trying to recover it and moves the resource to another node.
>>>
>>> Same result if the RA returns OCF_NOT_RUNNING from the demote action
>>> instead of OCF_ERR_GENERIC.
>>>
>>> I could try to obey the CRM and start the resource as a slave and
>>> return OCF_SUCCESS, but that sounds ridiculous, as it would be stopped at the
>>> very next step, then started again one step later...
>>>
>>> Did I miss something? Is this behavior normal? Any advice on how to fix this?

What version of pacemaker are you using? I spoke to another developer,
and he thinks the behavior may have changed (so that if demote fails, it
proceeds to stop anyway) in 1.1.12 or 1.1.13. For older versions, the
behavior you describe is expected.
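
For reference, the demote return-code logic discussed above can be sketched
roughly as follows. This is only an illustration in Python (a real RA would
typically be a shell script). The demote() function and its monitor and
do_demote callables are hypothetical names, not part of any Pacemaker API.
The numeric values are the standard OCF return codes.

```python
# Standard OCF return codes (only the ones relevant here).
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8


def demote(monitor, do_demote):
    """Hypothetical demote action of a master/slave resource agent.

    monitor   -- callable returning the current state as an OCF code
    do_demote -- callable performing the demotion, True on success
    """
    state = monitor()

    if state == OCF_RUNNING_MASTER:
        # Normal case: the resource runs as master, so demote it.
        return OCF_SUCCESS if do_demote() else OCF_ERR_GENERIC

    if state == OCF_SUCCESS:
        # Already running as a slave: nothing to do.
        return OCF_SUCCESS

    # Stopped (OCF_NOT_RUNNING) or failed: the dev guide says to report
    # an error rather than pretend the demote succeeded.
    return OCF_ERR_GENERIC
```

With the master stopped outside of pacemaker, monitor() would report
OCF_NOT_RUNNING, so this demote() returns OCF_ERR_GENERIC, which is the case
that triggers the repeated transition described in the original message.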