[ClusterLabs] Multiple OpenSIPS services on one cluster

Mon Nov 2 19:53:16 UTC 2015

On 11/02/2015 01:24 PM, Nuno Pereira wrote:
> Hi all.
> 
>  
> 
> We have one cluster that has 9 nodes and 20 resources.
> 
>  
> 
> Four of those hosts are PSIP-SRV01-active, PSIP-SRV01-passive,
> PSIP-SRV02-active and PSIP-SRV02-passive.
> 
> They should provide an lsb:opensips service, 2 by 2:
> 
> - The SRV01-opensips and SRV01-IP resources should be active on one of
>   PSIP-SRV01-active or PSIP-SRV01-passive;
> 
> - The SRV02-opensips and SRV02-IP resources should be active on one of
>   PSIP-SRV02-active or PSIP-SRV02-passive.
> 
>  
> 
> The relevant configuration is the following:
> 
>  
> 
> Resources:
> 
> Resource: SRV01-IP (class=ocf provider=heartbeat type=IPaddr2)
> 
>   Attributes: ip=10.0.0.1 cidr_netmask=27
> 
>   Meta Attrs: target-role=Started
> 
>   Operations: monitor interval=8s (SRV01-IP-monitor-8s)
> 
> Resource: SRV01-opensips (class=lsb type=opensips)
> 
>   Operations: monitor interval=8s (SRV01-opensips-monitor-8s)
> 
> Resource: SRV02-IP (class=ocf provider=heartbeat type=IPaddr2)
> 
>   Attributes: ip=10.0.0.2 cidr_netmask=27
> 
>   Operations: monitor interval=8s (SRV02-IP-monitor-8s)
> 
> Resource: SRV02-opensips (class=lsb type=opensips)
> 
>   Operations: monitor interval=30 (SRV02-opensips-monitor-30)
> 
>  
> 
> Location Constraints:
> 
>   Resource: SRV01-opensips
> 
>     Enabled on: PSIP-SRV01-active (score:100) (id:prefer1-srv01-active)
> 
>     Enabled on: PSIP-SRV01-passive (score:99) (id:prefer3-srv01-active)
> 
>   Resource: SRV01-IP
> 
>     Enabled on: PSIP-SRV01-active (score:100) (id:prefer-SRV01-ACTIVE)
> 
>     Enabled on: PSIP-SRV01-passive (score:99) (id:prefer-SRV01-PASSIVE)
> 
>   Resource: SRV02-IP
> 
>     Enabled on: PSIP-SRV02-active (score:100) (id:prefer-SRV02-ACTIVE)
> 
>     Enabled on: PSIP-SRV02-passive (score:99) (id:prefer-SRV02-PASSIVE)
> 
>   Resource: SRV02-opensips
> 
>     Enabled on: PSIP-SRV02-active (score:100) (id:prefer-SRV02-ACTIVE)
> 
>     Enabled on: PSIP-SRV02-passive (score:99) (id:prefer-SRV02-PASSIVE)
> 
>  
> 
> Ordering Constraints:
> 
>   SRV01-IP then SRV01-opensips (score:INFINITY) (id:SRV01-opensips-after-ip)
> 
>   SRV02-IP then SRV02-opensips (score:INFINITY) (id:SRV02-opensips-after-ip)
> 
> Colocation Constraints:
> 
>   SRV01-opensips with SRV01-IP (score:INFINITY) (id:SRV01-opensips-with-ip)
> 
>   SRV02-opensips with SRV02-IP (score:INFINITY) (id:SRV02-opensips-with-ip)
> 
>  
> 
> Cluster Properties:
> 
> cluster-infrastructure: cman
> 
> .
> 
> symmetric-cluster: false
> 
>  
> 
>  
> 
> Everything works fine until one of those nodes is rebooted. Most recently,
> the problem occurred after a reboot of PSIP-SRV01-passive, which wasn't
> providing the service at that moment.
> 
>  
> 
> Note that all opensips nodes used to have the opensips service started on
> boot by initd; that has since been removed.
> 
> The problem is that the service SRV01-opensips is detected to be started on
> both PSIP-SRV01-active and PSIP-SRV01-passive, and the SRV02-opensips is
> detected to be started on both PSIP-SRV01-active and PSIP-SRV02-active.
> 
> After that, the cluster performs several operations, including actions to
> stop SRV01-opensips on both PSIP-SRV01-active and PSIP-SRV01-passive, and
> to stop SRV02-opensips on PSIP-SRV01-active and PSIP-SRV02-active. The stop
> fails on PSIP-SRV01-passive, and the resource SRV01-opensips becomes unmanaged.
> 
>  
> 
> Any ideas on how to fix this?
> 
>  
> 
>  
> 
> Nuno Pereira
> 
> G9Telecom

Your configuration looks appropriate, so it sounds like something is
still starting the opensips services outside cluster control. Pacemaker
recovers from multiple running instances by stopping them all, then
starting on the expected node.

You can verify that Pacemaker did not start the extra instances by
looking for start messages in the logs (they will look like "Operation
SRV01-opensips_start_0" etc.).
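
If the logs are large, a small script can do the searching. The following
is only a sketch in Python: it relies on nothing but the operation key
quoted above ("<resource>_start_0"), and the sample log lines are invented
for illustration, since the real line format varies between versions.

```python
import re

def find_operations(lines, resource, action):
    """Return log lines that mention the key <resource>_<action>_0."""
    # Guard both ends so a longer resource name or interval does not match.
    key = re.compile(
        r"(?<![\w.-])" + re.escape(f"{resource}_{action}_0") + r"(?!\d)"
    )
    return [line.rstrip("\n") for line in lines if key.search(line)]

# Hypothetical log lines, invented for illustration; real entries carry
# timestamps, daemon names and other fields that vary by version.
SAMPLE_LOG = [
    "node-a: Operation SRV01-opensips_start_0: ok",
    "node-a: Operation SRV01-opensips_monitor_8000: ok",
    "node-a: Operation SRV02-opensips_start_0: ok",
]

for hit in find_operations(SAMPLE_LOG, "SRV01-opensips", "start"):
    print(hit)
```

To use it on a real node, pass an open file object for whichever log file
your cluster writes to instead of SAMPLE_LOG. If a node shows the service
running but its log has no matching start operation, that would point to
something other than Pacemaker having started it there.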

The other question is why the stop command failed. The logs should
shed some light on that too; look for the equivalent "_stop_0" operation
and the messages around it. The resource agent might have reported an
error, or it might have timed out.
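
To pull out each "_stop_0" operation together with the messages around it,
something like the Python sketch below may help. The function name, the
window sizes and the sample lines are all made up for illustration; it does
not try to interpret the result, it only shows the neighbouring lines so you
can read whether the agent reported an error or timed out.

```python
import re

def stop_context(lines, resource, before=2, after=2):
    """Return (line_number, window) pairs around each <resource>_stop_0 line."""
    key = re.compile(
        r"(?<![\w.-])" + re.escape(f"{resource}_stop_0") + r"(?!\d)"
    )
    lines = [line.rstrip("\n") for line in lines]
    windows = []
    for i, line in enumerate(lines):
        if key.search(line):
            start = max(0, i - before)
            end = min(len(lines), i + after + 1)
            windows.append((i + 1, lines[start:end]))  # 1-based line number
    return windows

# Placeholder log content, not real Pacemaker output.
SAMPLE_LOG = [
    "placeholder line 1",
    "placeholder line 2",
    "node-b: Operation SRV01-opensips_stop_0: (hypothetical result text)",
    "placeholder line 4",
]

for number, window in stop_context(SAMPLE_LOG, "SRV01-opensips", before=1, after=1):
    print(f"--- match at line {number} ---")
    print("\n".join(window))
```

Run it against the log on the node where the stop failed
(PSIP-SRV01-passive in this case), and widen "before" and "after" if the
relevant messages are further away from the operation line.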