[ClusterLabs] Floating IP failing over but not failing back with active/active LDAP (dirsrv)

Ken Gaillot kgaillot at redhat.com
Thu Mar 10 11:02:55 EST 2016


On 03/10/2016 09:38 AM, Bernie Jones wrote:
> Hi Ken,
> Thanks for your response. I've now corrected the constraint order, but the
> behaviour is still the same: the IP does not fail over (after the first
> time) unless I issue a pcs resource cleanup command on dirsrv-daemon.
> 
> Also, I'm not sure why you advise against using is-managed=false in
> production. We are trying to use pacemaker purely to fail over on detection
> of a failure and not to control starting or stopping of the instances. It is
> essential that in normal operation we have both instances up as we are using
> MMR.
> 
> Thanks,
> Bernie

I think you misunderstand is-managed. It is meant to let you perform
maintenance on a service without pacemaker fencing the node when the
service is stopped or restarted. Failover won't work with is-managed=false,
because failover involves stopping and starting the service.
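
For example, to put the existing clone back under cluster control and clear
the old failure records, something like this should do it (a sketch; pcs
syntax can vary a little between versions):

  # allow pacemaker to stop/start dirsrv again
  pcs resource meta dirsrv-daemon is-managed=true
  # forget the stale 'not running' failures so the node is eligible again
  pcs resource cleanup dirsrv-daemon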

Your goal is already accomplished by using a clone with master-max=2.
With the clone, pacemaker will run the service on both nodes, and with
master-max=2, it will be master/master.
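
If you were recreating it from scratch, a managed version would look roughly
like this (a sketch reusing the names, operation values and clone meta
attributes from your commands, just without is-managed=false):

  pcs resource create dirsrv-daemon ocf:heartbeat:dirsrv \
      op monitor interval="10" timeout="5" \
      op start interval="0" timeout="5" op stop interval="0" timeout="5"
  # clone it so pacemaker runs (and can recover) dirsrv on both nodes
  pcs resource clone dirsrv-daemon meta globally-unique="false" \
      interleave="true" target-role="Started" master-max="2"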

> -----Original Message-----
> From: Ken Gaillot [mailto:kgaillot at redhat.com] 
> Sent: 10 March 2016 15:01
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Floating IP failing over but not failing back
> with active/active LDAP (dirsrv)
> 
> On 03/10/2016 08:48 AM, Bernie Jones wrote:
>> A bit more info..
>>
>>  
>>
>> If, after I restart the failed dirsrv instance, I then perform a "pcs
>> resource cleanup dirsrv-daemon" to clear the FAIL messages then the
>> failover will work OK.
>>
>> So it's as if the cleanup is changing the status in some way..
>>
>>  
>>
>> From: Bernie Jones [mailto:bernie at securityconsulting.ltd.uk] 
>> Sent: 10 March 2016 08:47
>> To: 'Cluster Labs - All topics related to open-source clustering welcomed'
>> Subject: [ClusterLabs] Floating IP failing over but not failing back with
>> active/active LDAP (dirsrv)
>>
>>  
>>
>> Hi all, could you advise please?
>>
>>  
>>
>> I'm trying to configure a floating IP with an active/active deployment of
>> 389 directory server. I don't want pacemaker to manage LDAP but just to
>> monitor and switch the IP as required to provide resilience. I've seen
>> some other similar threads and based my solution on those.
>>
>>  
>>
>> I've amended the OCF resource agent for slapd to work with 389 DS and this
>> tests out OK (dirsrv).
>>
>>  
>>
>> I've then created my resources as below:
>>
>>  
>>
>> pcs resource create dirsrv-ip ocf:heartbeat:IPaddr2 ip="192.168.26.100"
>> cidr_netmask="32" op monitor timeout="20s" interval="5s" op start
>> interval="0" timeout="20" op stop interval="0" timeout="20"
>>
>> pcs resource create dirsrv-daemon ocf:heartbeat:dirsrv op monitor
>> interval="10" timeout="5" op start interval="0" timeout="5" op stop
>> interval="0" timeout="5" meta "is-managed=false"
> 
> is-managed=false means the cluster will not try to start or stop the
> service. It should never be used in regular production, only when doing
> maintenance on the service.
> 
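(The usual pattern is to flip it only around a maintenance window, roughly:

  pcs resource unmanage dirsrv-daemon   # pacemaker won't stop/start it now
  # ...restart/patch dirsrv by hand...
  pcs resource manage dirsrv-daemon     # hand control back to pacemaker

and leave the resource managed the rest of the time.)
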
>> pcs resource clone dirsrv-daemon meta globally-unique="false"
>> interleave="true" target-role="Started" "master-max=2"
>>
>> pcs constraint colocation add dirsrv-daemon-clone with dirsrv-ip
>> score=INFINITY
> 
> This constraint means that dirsrv is only allowed to run where dirsrv-ip
> is. I suspect you want the reverse, dirsrv-ip with dirsrv-daemon-clone,
> which means keep the IP with a working dirsrv instance.
> 
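Using the resource names from your config, the reversed constraint would be
something along the lines of:

  # keep the floating IP on a node with a working dirsrv instance
  pcs constraint colocation add dirsrv-ip with dirsrv-daemon-clone score=INFINITY
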
>> pcs property set no-quorum-policy=ignore
> 
> If you're using corosync 2, you generally don't need or want this.
> Instead, ensure corosync.conf has two_node: 1 (which will be done
> automatically if you used pcs cluster setup).
> 
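For illustration, on a corosync 2 stack the relevant corosync.conf section
looks roughly like this (pcs cluster setup writes it for you; your status
output below shows cman, so it only applies if you move to corosync 2):

  quorum {
      provider: corosync_votequorum
      two_node: 1
  }
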
>> pcs resource defaults migration-threshold=1
>>
>> pcs property set stonith-enabled=false
>>
>>  
>>
>> On startup all looks well:
>>
>>
>> ________________________________________________________________________________________
>>
>>  
>>
>> Last updated: Thu Mar 10 08:28:03 2016
>>
>> Last change: Thu Mar 10 08:26:14 2016
>>
>> Stack: cman
>>
>> Current DC: ga2.idam.com - partition with quorum
>>
>> Version: 1.1.11-97629de
>>
>> 2 Nodes configured
>>
>> 3 Resources configured
>>
>>  
>>
>>  
>>
>> Online: [ ga1.idam.com ga2.idam.com ]
>>
>>  
>>
>> dirsrv-ip   (ocf::heartbeat:IPaddr2):     Started ga1.idam.com
>>
>>  Clone Set: dirsrv-daemon-clone [dirsrv-daemon]
>>
>>      dirsrv-daemon      (ocf::heartbeat:dirsrv):        Started ga2.idam.com (unmanaged)
>>
>>      dirsrv-daemon      (ocf::heartbeat:dirsrv):        Started ga1.idam.com (unmanaged)
>>
>>  
>>
>>  
>>
>>
>> ________________________________________________________________________________________
>>
>>  
>>
>> Stop dirsrv on ga1:
>>
>>  
>>
>> Last updated: Thu Mar 10 08:28:43 2016
>>
>> Last change: Thu Mar 10 08:26:14 2016
>>
>> Stack: cman
>>
>> Current DC: ga2.idam.com - partition with quorum
>>
>> Version: 1.1.11-97629de
>>
>> 2 Nodes configured
>>
>> 3 Resources configured
>>
>>  
>>
>>  
>>
>> Online: [ ga1.idam.com ga2.idam.com ]
>>
>>  
>>
>> dirsrv-ip   (ocf::heartbeat:IPaddr2):     Started ga2.idam.com
>>
>>  Clone Set: dirsrv-daemon-clone [dirsrv-daemon]
>>
>>      dirsrv-daemon      (ocf::heartbeat:dirsrv):        Started ga2.idam.com (unmanaged)
>>
>>      dirsrv-daemon      (ocf::heartbeat:dirsrv):        FAILED ga1.idam.com (unmanaged)
>>
>>  
>>
>> Failed actions:
>>
>>     dirsrv-daemon_monitor_10000 on ga1.idam.com 'not running' (7): call=12, status=complete, last-rc-change='Thu Mar 10 08:28:41 2016', queued=0ms, exec=0ms
>>
>>  
>>
>> IP fails over to ga2 OK:
>>
>>  
>>
>>
>> ________________________________________________________________________________________
>>
>>  
>>
>> Restart dirsrv on ga1
>>
>>  
>>
>> Last updated: Thu Mar 10 08:30:01 2016
>>
>> Last change: Thu Mar 10 08:26:14 2016
>>
>> Stack: cman
>>
>> Current DC: ga2.idam.com - partition with quorum
>>
>> Version: 1.1.11-97629de
>>
>> 2 Nodes configured
>>
>> 3 Resources configured
>>
>>  
>>
>>  
>>
>> Online: [ ga1.idam.com ga2.idam.com ]
>>
>>  
>>
>> dirsrv-ip   (ocf::heartbeat:IPaddr2):     Started ga2.idam.com
>>
>>  Clone Set: dirsrv-daemon-clone [dirsrv-daemon]
>>
>>      dirsrv-daemon      (ocf::heartbeat:dirsrv):        Started ga2.idam.com (unmanaged)
>>
>>      dirsrv-daemon      (ocf::heartbeat:dirsrv):        Started ga1.idam.com (unmanaged)
>>
>>  
>>
>> Failed actions:
>>
>>     dirsrv-daemon_monitor_10000 on ga1.idam.com 'not running' (7): call=12, status=complete, last-rc-change='Thu Mar 10 08:28:41 2016', queued=0ms, exec=0ms
>>
>>  
>>
>>
>> ________________________________________________________________________________________
>>
>>  
>>
>> Stop dirsrv on ga2:
>>
>>  
>>
>> Last updated: Thu Mar 10 08:31:14 2016
>>
>> Last change: Thu Mar 10 08:26:14 2016
>>
>> Stack: cman
>>
>> Current DC: ga2.idam.com - partition with quorum
>>
>> Version: 1.1.11-97629de
>>
>> 2 Nodes configured
>>
>> 3 Resources configured
>>
>>  
>>
>>  
>>
>> Online: [ ga1.idam.com ga2.idam.com ]
>>
>>  
>>
>> dirsrv-ip   (ocf::heartbeat:IPaddr2):     Started ga2.idam.com
>>
>>  Clone Set: dirsrv-daemon-clone [dirsrv-daemon]
>>
>>      dirsrv-daemon      (ocf::heartbeat:dirsrv):        FAILED ga2.idam.com (unmanaged)
>>
>>      dirsrv-daemon      (ocf::heartbeat:dirsrv):        Started ga1.idam.com (unmanaged)
>>
>>  
>>
>> Failed actions:
>>
>>     dirsrv-daemon_monitor_10000 on ga2.idam.com 'not running' (7): call=11, status=complete, last-rc-change='Thu Mar 10 08:31:12 2016', queued=0ms, exec=0ms
>>
>>     dirsrv-daemon_monitor_10000 on ga1.idam.com 'not running' (7): call=12, status=complete, last-rc-change='Thu Mar 10 08:28:41 2016', queued=0ms, exec=0ms
>>
>>  
>>
>> But IP stays on failed node
>>
>> Looking in the logs it seems that the cluster is not aware that ga1 is
>> available even though the status output shows it is.
>>
>>  
>>
>> If I repeat the tests but with ga2 started up first the behaviour is
>> similar, i.e. it fails over to ga1 but not back to ga2.
>>
>>  
>>
>> Many thanks,
>>
>> Bernie
>



