[ClusterLabs Developers] CRM trying to demote a stopped resource

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Wed Aug 5 09:04:56 EDT 2015


Hi guys,

We are still on our new postgresql resource agent.

We have more or less made up our minds on the promotion issue (see the ML
thread "problem with master score limited to 1000000") and found an acceptable
algorithm.

While testing this RA, I found a strange behavior of the CRM in a simple
failure scenario: the master resource is stopped.

When I gracefully stop the master, the CRM tries to recover the resource
with the following steps:

* demote it
* stop it
* start it
* promote it

Sounds logical, but it fails at the first step because the master is actually
stopped. According to the "ra-dev-guide", the RA should return OCF_ERR_GENERIC
if the resource is stopped on demote. See:

  http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
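To make the rule concrete, here is a minimal Python sketch of the demote
decision as I understand the guide describes it. The numeric values are the
standard OCF return codes. The function name and the idea of passing in the
monitor result are assumptions made for illustration only; this is not the
code of our RA, which is not shown here.

```python
# Standard OCF return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8


def demote(monitor_rc):
    """Decide the demote return code from a prior monitor result.

    Hypothetical helper: a real RA would run its own monitor action and
    actually demote the instance instead of only returning a code.
    """
    if monitor_rc == OCF_RUNNING_MASTER:
        # Running as master: this is where the real demotion would happen.
        return OCF_SUCCESS
    if monitor_rc == OCF_SUCCESS:
        # Already running as slave: nothing to do.
        return OCF_SUCCESS
    # Stopped (OCF_NOT_RUNNING) or any other state: a demote request is
    # unexpected here, so report a generic error to the cluster manager.
    return OCF_ERR_GENERIC


if __name__ == "__main__":
    for rc in (OCF_RUNNING_MASTER, OCF_SUCCESS, OCF_NOT_RUNNING):
        print(rc, "->", demote(rc))
```

The last branch is the one that matters in this scenario: the resource is
stopped, so demote reports an error.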

When I teach my RA to follow this, the CRM keeps trying the same transition
again and again until the failcount reaches the migration-threshold. Then it
stops trying to recover it and moves the resource to another node.

Same result if the RA returns OCF_NOT_RUNNING from the demote action instead of
OCF_ERR_GENERIC.
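The loop I observe can be summed up with the toy model below. It is only a
sketch of the behavior described in this mail, not how Pacemaker is
implemented: the function names, the plan list and the threshold value of 3
are all assumptions made for the example.

```python
# The recovery transition described above, in order.
RECOVERY_PLAN = ["demote", "stop", "start", "promote"]


def attempt(plan, running):
    """Run one transition; return the first failing action, or None."""
    for action in plan:
        if action == "demote" and not running:
            # Demoting a stopped resource fails, whichever error code
            # the RA picks (OCF_ERR_GENERIC or OCF_NOT_RUNNING).
            return action
        if action == "stop":
            running = False
        elif action == "start":
            running = True
    return None


def recover(migration_threshold):
    """Retry the same transition until failcount reaches the threshold."""
    failcount = 0
    while failcount < migration_threshold:
        # The master is stopped at the beginning of every attempt.
        if attempt(RECOVERY_PLAN, running=False) is None:
            return "recovered", failcount
        failcount += 1
    return "moved to another node", failcount


if __name__ == "__main__":
    print(recover(migration_threshold=3))
```

In this model the transition never gets past its first step, so the
failcount simply climbs to the threshold and the resource is moved.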

I could try to obey the CRM, start the resource as a slave and return
OCF_SUCCESS, but that sounds ridiculous as it will be stopped at the
very next step, then started again one step later...

Did I miss something? Is this behavior normal? Any advice on how to fix this?

Regards,
-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com



