[ClusterLabs] how to always promote current slave on master death

Fri Apr 24 18:08:54 UTC 2015

----- Original Message -----
> Hello,
> 
> I'm writing an OCF script for a m/s resource and I am having a bit of trouble
> achieving what I desire.
> 
> When the master dies (e.g. I am killing it from the command line to test) I
> want the current slave to always be promoted. First off -- I am assuming
> that this can be achieved in Pacemaker, please correct me if I am wrong.
> 
> In order to force this I have tried to alter the crm_master level inside the
> OCF script during the demote() action (e.g. crm_master -l reboot -v 0).

Your agent only needs to manage your service, not the clustering itself. So when you get the "demote" action, put your service into slave mode locally as appropriate for that particular service -- don't worry about Pacemaker's conception of master/slave. Pacemaker will take care of detecting the master failure and calling promote on the slave, assuming your agent supports the appropriate actions and returns the correct status codes, as described here:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_clone_resource_agent_requirements

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_requirements_for_multi_state_resource_agents

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf
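
As a rough sketch of what "manage your service and return the correct status codes" can look like, here are a monitor and a demote action. The return code values are the standard OCF ones. Everything else is an assumption for illustration: the role file and the epttd_get_role / epttd_set_role helpers are hypothetical stand-ins for however your daemon really reports and changes its role.

```shell
#!/bin/sh
# OCF return codes. A real agent gets these by sourcing ocf-shellfuncs;
# the defaults here only keep the sketch self-contained.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"
: "${OCF_RUNNING_MASTER:=8}"

# HYPOTHETICAL stand-in for the service: its role is kept in a file.
# Replace these two helpers with real calls into your daemon.
EPTTD_ROLE_FILE="${EPTTD_ROLE_FILE:-/var/run/epttd.role}"

epttd_get_role() {
    # Prints the stored role; prints "stopped" if there is no role file.
    cat "$EPTTD_ROLE_FILE" 2>/dev/null || echo stopped
}

epttd_set_role() {
    echo "$1" > "$EPTTD_ROLE_FILE"
}

# monitor: report the local state only, using the multi-state return codes.
epttd_monitor() {
    case "$(epttd_get_role)" in
        master)  return "$OCF_RUNNING_MASTER" ;;  # running as master
        slave)   return "$OCF_SUCCESS" ;;         # running as slave
        stopped) return "$OCF_NOT_RUNNING" ;;     # cleanly stopped
        *)       return "$OCF_ERR_GENERIC" ;;     # anything else is a failure
    esac
}

# demote: switch the local service to slave mode, nothing more.
epttd_demote() {
    epttd_set_role slave || return "$OCF_ERR_GENERIC"
    return "$OCF_SUCCESS"
}
```

Note that neither function decides which node should be master; they only act on, and report, the local instance.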

You also have to configure the resource options (clone-max, clone-node-max, master-max, master-node-max, globally-unique, interleave, etc.) appropriately for your situation, configure two monitor operations on the resource (one for slave and one for master), and make sure you don't have any constraints that prevent a node from taking the master role.
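
For illustration, a configuration along those lines might be created as below. This is only a sketch using pcs 0.9.x syntax (the Pacemaker 1.1 era); newer pcs releases use different commands. The provider name "custom", the id "ms_epttd", the 3s slave interval, and the two-node option values are my assumptions, not something from your post; only the 2s master interval matches the epttd_monitor_2000 operation in your logs.

```shell
#!/bin/sh
# Sketch for pcs 0.9.x. ASSUMPTIONS: provider "custom", id "ms_epttd",
# the 3s slave interval, and a two-node cluster are placeholders.
configure_epttd() {
    # Two monitors with different intervals, one per role.
    pcs resource create epttd ocf:custom:epttd \
        op monitor interval=2s role=Master \
        op monitor interval=3s role=Slave || return 1

    # Wrap the primitive in a master/slave resource: one master,
    # one instance per node, two instances in total.
    pcs resource master ms_epttd epttd \
        master-max=1 master-node-max=1 \
        clone-max=2 clone-node-max=1
}
```

Calling configure_epttd once on any cluster node would apply it. Afterwards, "pcs constraint" lists the configured constraints, so you can check that none of them keeps a node out of the master role.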

If everything is properly configured, then when you kill the master, the next recurring master-role monitor operation will detect the failure (by getting the appropriate return code from your agent's monitor action), and Pacemaker will then call your agent on another node with the promote action.
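
When checking which return code the monitor actually produced, the rc= field in the crmd log lines is the number to look at. A small helper like the following can translate it; the function name ocf_rc_name is made up for this example, while the numbers and constant names are the standard OCF ones.

```shell
#!/bin/sh
# Map an OCF return code (the rc= field in the crmd log lines) to its
# standard OCF constant name. The helper name itself is hypothetical.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown ($1)" ;;
    esac
}
```

For example, "ocf_rc_name 1" prints OCF_ERR_GENERIC (logged as "unknown error"), and "ocf_rc_name 8" prints OCF_RUNNING_MASTER (logged as "master").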

> However, it doesn't seem to have any effect; the failed master resource is
> still promoted:
> 
> Apr 23 23:25:06 server-dmz-b attrd[1668]: notice: attrd_perform_update: Sent
> update 5034: master-epttd=0
> Apr 23 23:25:06 server-dmz-b crmd[1670]: notice: process_lrm_event: Operation
> epttd_monitor_2000: unknown error (node=server-dmz-b, call=100, rc=1,
> cib-update=4667, confirmed=false)
> Apr 23 23:25:06 server-dmz-b attrd[1668]: notice: attrd_cs_dispatch: Update
> relayed from server-dmz-a
> Apr 23 23:25:06 server-dmz-b attrd[1668]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-epttd (2)
> Apr 23 23:25:06 server-dmz-b attrd[1668]: notice: attrd_perform_update: Sent
> update 5036: fail-count-epttd=2
> Apr 23 23:25:06 server-dmz-b attrd[1668]: notice: attrd_cs_dispatch: Update
> relayed from server-dmz-a
> Apr 23 23:25:06 server-dmz-b attrd[1668]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-epttd (1429831506)
> Apr 23 23:25:06 server-dmz-b attrd[1668]: notice: attrd_perform_update: Sent
> update 5038: last-failure-epttd=1429831506
> Apr 23 23:25:06 server-dmz-b epttd(epttd)[28692]: INFO: not running
> Apr 23 23:25:06 server-dmz-b crmd[1670]: notice: process_lrm_event: Operation
> epttd_demote_0: ok (node=server-dmz-b, call=102, rc=0, cib-update=4669,
> confirmed=true)
> Apr 23 23:25:07 server-dmz-b attrd[1668]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: master-epttd (<null>)
> Apr 23 23:25:07 server-dmz-b attrd[1668]: notice: attrd_perform_update: Sent
> delete 5040: node=server-dmz-b, attr=master-epttd, id=<n/a>, set=(null),
> section=status
> Apr 23 23:25:07 server-dmz-b attrd[1668]: notice: attrd_perform_update: Sent
> delete 5042: node=server-dmz-b, attr=master-epttd, id=<n/a>, set=(null),
> section=status
> Apr 23 23:25:07 server-dmz-b epttd(epttd)[28717]: INFO: is not running.
> Apr 23 23:25:07 server-dmz-b crmd[1670]: notice: process_lrm_event: Operation
> epttd_stop_0: ok (node=server-dmz-b, call=103, rc=0, cib-update=4670,
> confirmed=true)
> Apr 23 23:25:07 server-dmz-b epttd(epttd)[28742]: INFO: not running
> Apr 23 23:25:07 server-dmz-b epttd(epttd)[28742]: INFO: not running
> Apr 23 23:25:08 server-dmz-b attrd[1668]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: master-epttd (100)
> Apr 23 23:25:08 server-dmz-b attrd[1668]: notice: attrd_perform_update: Sent
> update 5044: master-epttd=100
> Apr 23 23:25:08 server-dmz-b crmd[1670]: notice: process_lrm_event: Operation
> epttd_start_0: ok (node=server-dmz-b, call=104, rc=0, cib-update=4671,
> confirmed=true)
> Apr 23 23:25:08 server-dmz-b crmd[1670]: notice: process_lrm_event: Operation
> epttd_promote_0: ok (node=server-dmz-b, call=105, rc=0, cib-update=4672,
> confirmed=true)
> Apr 23 23:25:08 server-dmz-b crmd[1670]: notice: process_lrm_event: Operation
> epttd_monitor_2000: master (node=server-dmz-b, call=106, rc=8,
> cib-update=4673, confirmed=false)
> 
> 
> Any advice would be greatly appreciated,
> -Brett Moser

-- Ken Gaillot <kgaillot at redhat.com>