[ClusterLabs] Master/slave failover does not work as expected

Mon Aug 12 18:51:15 EDT 2019

On Mon, 2019-08-12 at 23:09 +0300, Andrei Borzenkov wrote:
> On Mon, Aug 12, 2019 at 4:12 PM Michael Powell <
> Michael.Powell at harmonicinc.com> wrote:
> > At 07:44:49, the ss agent discovers that the master instance has
> > failed on node mgraid…-0 as a result of a failed ssadm request in
> > response to an ss_monitor() operation.  It issues a crm_master -Q
> > -D command with the intent of demoting the master and promoting the
> > slave, on the other node, to master.  The ss_demote() function
> > finds that the application is no longer running and returns
> > OCF_NOT_RUNNING (7).  In the older product, this was sufficient to
> > promote the other instance to master, but in the current product,
> > that does not happen.  Currently, the failed application is
> > restarted, as expected, and is promoted to master, but this takes
> > tens of seconds.
>
> Did you try to disable resource stickiness for this ms?

Stickiness shouldn't affect where the master role is placed, just
whether the resource instances should stay on their current nodes
(independently of whether their role is staying the same or changing).

Are there any constraints that apply to the master role?

Another possibility is that you are mixing crm_master with and without
--lifetime=reboot (which controls whether the master attribute is
transient or permanent). Transient should really be the default but
isn't for historical reasons. It's a good idea to always use
--lifetime=reboot. You could double-check with "cibadmin -Q|grep master-"
and see if there is more than one entry per node.
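
A rough sketch of that check, in case it helps. The helper name
count_master_lines is made up for illustration, and it only counts
matching lines in whatever CIB XML it is fed on stdin; you still have to
look at the matches themselves to tell which node each one belongs to.
The commented crm_master line simply adds --lifetime=reboot to the call
described above and is meant to run inside the agent, not interactively.

```shell
#!/bin/sh
# Hypothetical helper: count the lines in CIB XML (read from stdin)
# that mention a "master-" attribute.
count_master_lines() {
    # grep -c prints the number of matching lines. It exits non-zero
    # when that number is 0, so mask the status to keep callers simple.
    grep -c 'master-' || true
}

# Intended use on a cluster node (assumes cibadmin is on PATH):
#   cibadmin -Q | count_master_lines
#
# Inside the resource agent, clearing the score as a transient attribute
# would look something like:
#   crm_master --lifetime=reboot -Q -D
```

With two nodes and one master attribute each you would expect a count
of 2. A higher count suggests that both a permanent and a transient copy
of the attribute exist for at least one node.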
-- 
Ken Gaillot <kgaillot at redhat.com>